In a previous article, I argued that the tech industry needs to double down on open source to weather the tech downturn, which will be needlessly prolonged by companies bleeding themselves dry on overpriced commercial software.
The open-source ecosystem is sprawling and full of potentially cumbersome solutions. Over the last year, my team and I have packaged dozens of open-source tools and learned their ins and outs.
Based on our learnings, we are sharing our perspective on open-source tools and which ones we have found to be the best.
We plan on making this a recurring series where we compare tools against each other, and hope to aid engineering teams in their decision-making process.
What is Data Orchestration?
The main purpose of a data orchestration tool is to ensure jobs are executed on a schedule only after a set of dependencies is satisfied. Data orchestration tools will do most of the grunt work for you, like managing connection secrets and plumbing job inputs and outputs. An advantage of using data orchestration tools is that they provide a nice user interface to help engineers visualize all work flowing through the system.
What is Airflow and what are popular Airflow alternatives?
Airflow has been the industry standard for data orchestration since Airbnb began the project in 2014. Since then, the project has taken off, and there are almost certainly millions of lines of DAG code scattered across git repos throughout global engineering organizations.
You can check out Apache Airflow on GitHub which as of March 2023 has 29,300 stars.
In its time, Airflow was simply an amazing technology, better than any alternative. Eight years later, however, it's beginning to show its age. Airflow's monolithic architecture, rampant legacy code, and aging interface have engineering teams ditching the once-popular technology for alternatives such as Prefect and Dagster.
What is Dagster?
Dagster is a recently created open-source project targeted at the same problem space as Airflow, but built with a modern cloud-native design in mind. For open-source software (OSS), it has a slick interface, a modular architecture, and a decent SDK for writing DAGs.
You can check out the code base for Dagster on GitHub which as of March 2023 has 6,700 stars.
To be upfront, we are big fans of Dagster and use it as our default orchestrator for our model data stack. But let's go into more detail about these tools and how they compare against each other.
Why you should still use Airflow
To be blunt, Airflow is simply an old technology. Before continuing to use it in production, it is worth considering some of its specific flaws.
As mentioned earlier, the clearest defects of Airflow are its monolithic architecture, legacy server code, and outdated interface.
Airflow’s architecture is fairly simple, consisting of a web tier, a scheduler tier, and a worker tier. The web tier accepts CRUD requests and serves the interface. The scheduler polls its database for jobs that are ready to execute and hands them to the workers, which are either a pool of Celery workers or a dedicated K8s pod per job. DAG code is loaded into all of these processes and executed directly as Python function calls as work dispatches.
While this is a perfectly workable approach for small, single-team use cases, it simply does not scale. As your organization adopts Airflow, you end up with a severe dependency management problem, and it is more common than you might think. Because DAG code must be loaded in-process, multiple teams contributing DAGs with divergent Python dependencies to the same Airflow cluster can make it physically impossible to run your entire organization's DAGs on one cluster. At that point, you have to split the cluster along those incompatible pip dependency sets.
In fact, pip is a very flaky dependency manager. It’s fairly common for a pip install to upgrade the version of Airflow itself, triggering a database migration and leaving your cluster in an unknown state that can only be reconciled manually. This is only exacerbated by the second problem: bad legacy code.
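One mitigation worth knowing: Airflow publishes per-version constraint files, and pinning installs against them reduces the odds of pip silently pulling in an unexpected Airflow upgrade. A sketch, with placeholder version numbers you would substitute for your own:

```shell
# Pin an Airflow install against the constraint file Airflow publishes
# for this Airflow/Python version pair. 2.5.1 and 3.8 are placeholders.
AIRFLOW_VERSION=2.5.1
PYTHON_VERSION=3.8
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```

This helps with fresh installs, but it does not protect you from a teammate later running an unconstrained `pip install` into the same environment, which is exactly the failure mode described above.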
Airflow’s migration system is Alembic, which is fine for simple Flask deployments but is not fit for a repeatedly deployed OSS tool.
I’ve seen numerous cases where Alembic migrations get out of sync because an incorrect version number was persisted (probably from phantom pip upgrades). With most DB ORMs in other languages, migrations are not a concern because they are handled in a far more robust and intelligent fashion; this is not true of Airflow.
Additionally, Airflow’s authentication system uses a legacy package called authlib, which is not a huge issue if you just want username/password auth. However, if you want to do something more interesting, like setting up OIDC, you will need to spend a few hours wading through terrible legacy Python code before ultimately realizing you need to subclass a specific Python class to implement an OAuth handler.
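For the curious, the subclassing in question happens in `webserver_config.py` via Flask-AppBuilder's security manager. The sketch below assumes a generic OIDC provider; the provider name, URLs, credentials, and claim mapping are all placeholders you would adapt to your identity provider:

```python
# webserver_config.py: a sketch of wiring OIDC into Airflow's
# Flask-AppBuilder auth layer. All provider details are placeholders.
from airflow.www.security import AirflowSecurityManager
from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH
AUTH_USER_REGISTRATION = True           # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = "Viewer"  # default role for new users

OAUTH_PROVIDERS = [
    {
        "name": "example-oidc",  # placeholder provider name
        "token_key": "access_token",
        "icon": "fa-key",
        "remote_app": {
            "client_id": "YOUR_CLIENT_ID",
            "client_secret": "YOUR_CLIENT_SECRET",
            "server_metadata_url": "https://idp.example.com/.well-known/openid-configuration",
            "client_kwargs": {"scope": "openid email profile"},
        },
    }
]


class CustomSecurityManager(AirflowSecurityManager):
    # This is the subclassing mentioned above: map the provider's
    # userinfo payload onto Airflow's user model.
    def get_oauth_user_info(self, provider, resp):
        if provider == "example-oidc":
            me = self.appbuilder.sm.oauth_remotes[provider].get("userinfo").json()
            return {"username": me["sub"], "email": me["email"]}
        return {}


SECURITY_MANAGER_CLASS = CustomSecurityManager
```

None of this is discoverable from the docs alone; it is the sort of thing you piece together from Flask-AppBuilder source and GitHub issues, which is exactly the complaint.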
It also has unusual, phantom conventions for how users are registered in the Airflow database that can bite you if any auth provider is misimplemented.
Finally, while aesthetics are in the realm of de gustibus non est disputandum, Airflow’s user interface is considerably out of date. In the world of OSS this is somewhat expected, but there are certainly better user experiences among competing job orchestrators.
That said, we promised to explain why Airflow is here to stay, and there’s a simple answer: there’s a massive amount of existing Airflow code already built.
Given the observations above, that code can at most be classified as tech debt, and the upfront cost of rewriting all of it is rarely worth it if you can just baby your cluster. There are ways to move off if you truly wanted to, and I’d be interested in someone building an API-compatible scheduler to drop in and replace Airflow, but for now, the path dependencies need to be respected in a lot of codebases.
Airflow is going to be around for a while, which is why we’ve invested a lot of effort in supporting it on Plural. We still want Airflow users to have a simple operational experience with their clusters.
Why Dagster is better than Airflow and other alternatives
The huge innovation Dagster has introduced is leveraging containerization to solve Airflow's monolithic dependency issue entirely at the architectural level. It takes an architecture similar to Airflow's and moves the scheduling tier into a gRPC-based microservice that can accept any number of “user deployments.” These register their job types with the scheduler and web server, and jobs are then spawned as isolated Docker containers within K8s jobs, consolidating each team's dependencies and source code into isolated units.
This enables any number of teams to share the same scheduler without the worry of trampling on each other's code, simplifying the operational profile of your setup. This also removes the risk of pip upgrades interfering with Dagster’s core source code and all the database migration headaches that can cause.
On the aesthetic side, Dagster benefits from being built in the 2020s and has a sleek, modern interface, with nice timeline visualization for running jobs alongside more familiar graph visualization.
Like many new OSS projects, Dagster has its warts. The most notable is that its web interface ships with no authentication at all in the OSS version. Plural helps there by using our OAuth proxy infrastructure to inject sidecars that provide authentication with OpenID Connect, but you could also host it on a private network for a measure of security. That said, a mature project with web-facing components really should support authentication as table stakes, so this is a bit disappointing.
This is more of a niche concern, but I also think they should build an operator for provisioning user code deployments against a running Dagster instance. Currently, their creation is wrapped in a Helm chart, which can in theory be deployed independently.
However, in most realistic cases this will involve all dev teams writing code for Dagster having to submit PRs to a single repo managing the installation of that helm chart, instead of creating deployments in the namespaces or git repos in which they naturally work. Using a CRD to instantiate these would be a natural evolution to the more decentralized operating model the product seems to be built for.
Every engineering org will have its own tradeoffs to make in adopting any software, and our preferences will not necessarily be the winning consideration everywhere. Hopefully, we have helped people either learn about a new tool or realize some issues with their assumed favorite before getting too locked in.