Dagster vs. Airflow and Prefect: why asset-based orchestration wins
A pragmatic comparison of Dagster, Airflow, Prefect, and Flyte for modern data and ML pipelines. Why the asset-based model beats task-DAGs, with code, a cost model, and migration notes.
Every data platform I've inherited in the last five years had one thing in common: a graveyard of stale Airflow DAGs, each one re-running a brittle chain of tasks that nobody dared refactor because the lineage existed only in tribal knowledge. The orchestration layer was managing procedures, not data.
Dagster inverts that. Instead of scheduling tasks that happen to produce data as a side effect, you declare the data assets you want to exist and let the orchestrator figure out when to materialize them. After porting two production pipelines and a research workflow away from Airflow, I think this is the single biggest win available in modern data engineering.
TL;DR
- Airflow and classic Prefect schedule tasks: imperative units of work arranged into a DAG.
- Dagster schedules assets: declarative data objects with lineage, partitions, freshness policies, and type-checked I/O.
- The asset model makes lineage, partial re-runs, data-quality checks, and testing first-class — all the things that usually live in README files and tribal knowledge in Airflow shops.
- Local development in Dagster is the closest thing to "run it in a notebook" that serious orchestration tools offer.
- Airflow still wins on breadth of operators, conservative enterprise adoption, and non-data jobs (arbitrary shell, legacy systems).
Tasks vs. assets, in one picture
A task-based DAG says "run extract, then transform, then load". The orchestrator has no idea what data comes out; that's your problem. If the downstream table gets corrupted, you kick off the whole DAG and hope.
An asset-based graph says "the customers_clean table is defined as customers_clean = clean(customers_raw)". The orchestrator now knows:
- Which inputs each asset depends on.
- What partitions (by date, region, tenant) each asset has.
- Which downstream assets become stale when an upstream one is re-materialized.
- What schema, type, and data-quality contract each asset must satisfy.
Formally, a pipeline is a directed acyclic graph $G = (A, E)$ over assets $A$. The cost of bringing the graph to a consistent state from a starting set of invalidated assets $S \subseteq A$ is

$$C(S) = \sum_{a \in \mathrm{desc}(S)} c(a)$$

where $\mathrm{desc}(S)$ is the set of transitive descendants of $S$ (including $S$ itself) and $c(a)$ is the cost of materializing asset $a$. The asset model knows $\mathrm{desc}(S)$ exactly; the task model has to be told, usually by a human copy-pasting task IDs into a backfill command.
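The cost model above can be sketched in a few lines of plain Python. The adjacency map and per-asset costs here are toy values for illustration, not anything an orchestrator exposes directly:

```python
def transitive_descendants(downstream, invalidated):
    """All assets reachable from the invalidated set, the set itself included."""
    stale, stack = set(), list(invalidated)
    while stack:
        a = stack.pop()
        if a not in stale:
            stale.add(a)
            stack.extend(downstream.get(a, ()))
    return stale


def backfill_cost(downstream, cost, invalidated):
    """C(S): total cost of re-materializing everything desc(S) covers."""
    return sum(cost[a] for a in transitive_descendants(downstream, invalidated))


# Toy graph: sales_raw -> sales_clean -> sales_by_region
edges = {"sales_raw": ["sales_clean"], "sales_clean": ["sales_by_region"]}
costs = {"sales_raw": 10, "sales_clean": 2, "sales_by_region": 1}
```

Invalidating `sales_clean` here costs 3 (itself plus `sales_by_region`), not 13: the asset model never touches `sales_raw` because it knows it is still valid.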
The same pipeline in both worlds
A toy ETL: ingest a CSV, clean it, produce a summary table.
Airflow 2.x (TaskFlow API)
```python
from airflow.decorators import dag, task
from datetime import datetime

import pandas as pd


@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
def daily_sales_dag():
    @task
    def extract() -> str:
        df = pd.read_csv("s3://raw/sales.csv")
        path = "/tmp/sales.parquet"
        df.to_parquet(path)
        return path

    @task
    def clean(path: str) -> str:
        df = pd.read_parquet(path).dropna()
        out = "/tmp/sales_clean.parquet"
        df.to_parquet(out)
        return out

    @task
    def summary(path: str):
        df = pd.read_parquet(path)
        df.groupby("region")["amount"].sum().to_csv("/tmp/summary.csv")

    summary(clean(extract()))


daily_sales_dag()
```

The orchestrator sees three tasks connected by string-typed paths. It has no idea what `sales_clean` is, cannot enforce a schema on it, and can only re-run the whole DAG.
Dagster (asset-based)
```python
from dagster import asset, AssetIn, Definitions
import pandas as pd


@asset
def sales_raw() -> pd.DataFrame:
    return pd.read_csv("s3://raw/sales.csv")


@asset(ins={"sales_raw": AssetIn()})
def sales_clean(sales_raw: pd.DataFrame) -> pd.DataFrame:
    return sales_raw.dropna()


@asset(ins={"sales_clean": AssetIn()})
def sales_by_region(sales_clean: pd.DataFrame) -> pd.DataFrame:
    return sales_clean.groupby("region", as_index=False)["amount"].sum()


defs = Definitions(assets=[sales_raw, sales_clean, sales_by_region])
```

Same three steps, but now Dagster knows three things it couldn't before: the outputs are typed `pd.DataFrame`, they have names that match a target storage location (managed by IO managers), and `sales_by_region` becomes stale automatically when `sales_raw` is re-materialized.
Add a partition definition and an asset check, and you get reliable backfills and data-quality enforcement without custom Python:
```python
from dagster import (
    AssetCheckResult,
    DailyPartitionsDefinition,
    asset,
    asset_check,
)
import pandas as pd

daily = DailyPartitionsDefinition(start_date="2026-01-01")


@asset(partitions_def=daily)
def sales_raw(context): ...


@asset_check(asset=sales_raw)
def sales_raw_not_empty(sales_raw: pd.DataFrame) -> AssetCheckResult:
    return AssetCheckResult(passed=len(sales_raw) > 0)
```

Trying to express the equivalent in Airflow means wiring up `ExternalTaskSensor`s, custom XCom serializers, and a separate Great Expectations integration. It's doable, and I've done it, but the fact that you have to do it is exactly the problem.
Comparison I trust
These are real differences I've hit in production, not feature-list trivia.
| Dimension | Airflow | Prefect 2+ | Dagster | Flyte |
|---|---|---|---|---|
| Core model | Task DAG | Flow of tasks | Asset graph | Typed tasks + workflows |
| Lineage | External (OpenLineage) | External | First-class | First-class |
| Partitions / backfill | Per-DAG run dates | Custom | Per-asset, multi-dim | Per-task |
| Data-quality checks | Add-on (Great Expectations) | Add-on | `@asset_check` native | External |
| Local dev UX | Painful (Docker, scheduler) | Good | Excellent (`dagster dev`) | Moderate (K8s-first) |
| Typing | Loose (XCom = pickle) | Loose / optional | Strong (Python types + IO) | Strong (Protobuf / types) |
| Non-data jobs | Excellent (operators) | Good | OK (Ops for raw work) | OK |
| Operator ecosystem | Huge | Moderate | Moderate (growing) | Small |
| Infra footprint | Webserver + scheduler + DB | Server + workers | Code-location + webserver | Kubernetes, mandatory |
| Managed cloud | Astronomer, MWAA | Prefect Cloud | Dagster+ | Union.ai |
Where Dagster hurts today
No tool is free.
- Operator breadth. Airflow has 1,500+ provider packages. Dagster integrations are growing fast but narrower; for niche SaaS connectors you may still have to write an `Op`.
- Kubernetes-native ML. If every workload is a GPU pod and you already live in K8s, Flyte's ergonomics around typed inputs, caching, and pod specs are still best-in-class.
- Arbitrary non-data jobs. Reporting-email senders, cleanup scripts, legacy shell pipelines: Airflow was built for this. Using Dagster only for non-data jobs is using a microscope as a hammer.
- Team inertia. A mature Airflow shop with 200 DAGs, custom sensors, and trained on-call rotations won't benefit from a rewrite unless the data-quality and lineage wins are genuinely urgent.
Migration recipe
If you're starting fresh, pick Dagster. For an existing Airflow codebase, a staged migration is realistic:
- Wrap, don't rewrite. Use Dagster's `@asset` to represent the outputs of your existing Airflow DAGs, and let Airflow trigger Dagster with `DagsterCloudOperator` or a plain HTTP call. You get lineage on what matters, the data, without rebuilding every task.
- Port the high-value tables first. Tables with bad data-quality stories, frequent backfills, or opaque lineage benefit most. Leave the "runs for 2 years without touching it" DAGs alone.
- Adopt asset checks as you go. Every time an on-call incident stems from "nobody noticed the upstream was null", codify that as an `@asset_check`. This is where Dagster pays back the investment.
- Keep Airflow for the long tail. Reporting emails, vendor-API polling, cleanup jobs: don't fight it. Airflow is fine at these, and Dagster isn't trying to be.
For Prefect users, the translation is more direct: Prefect flows map closely to Dagster jobs, and most teams adopt assets incrementally alongside flows rather than replacing them wholesale.
Closing thought
Orchestration has spent a decade optimizing the wrong abstraction. Tasks are implementation details; data is the contract. Dagster is the first mainstream orchestrator that treats assets as the object of interest, and once you've shipped a pipeline with typed I/O, automatic partitioning, and asset-level data checks, going back feels like debugging in print statements after using a real debugger.
It's not a silver bullet — Airflow is still the right answer for some teams and some workloads. But for greenfield data and ML platforms in 2026, asset-based orchestration is the default I'd reach for.
Further reading
- Dagster documentation — Software-Defined Assets
- Airflow 2 TaskFlow API
- Data orchestration: a field guide — Dagster's own landscape survey, still the most honest I've read.
- Flyte vs. Airflow (Union.ai) — useful for the K8s-native angle.