Is Apache Airflow good enough for current data engineering needs?
Not so long ago, if you asked any data engineer or data scientist which tools they used for orchestrating and scheduling their data pipelines, the default answer would likely have been Apache Airflow. Even though Airflow can solve many current data engineering problems, I would argue that for some ETL & Data Science use cases it may not be the best choice.
In this article, I will discuss the pros and cons that I experienced while working with Airflow over the last two years and derive from them the use cases for which Airflow is still a great choice. I hope that by the end of this article, you will be able to determine whether it fits your ETL & Data Science needs.
What are Airflow’s strengths?
Undeniably, Apache Airflow has an amazing community. There is a large number of individuals using Airflow and contributing to this open-source project. If you want to solve a particular data engineering problem, the chances are that somebody in the community has already solved that and shared their solution online or even contributed their implementation to the codebase.
Companies placing strategic bets on Airflow
Many companies decided to invest in Apache Airflow and support its growth, among them:
- Google with its Cloud Composer GCP service,
- Astronomer offering enterprise support in deploying Airflow on Kubernetes,
- Polidea heavily contributing to the codebase with many PMC members,
- GoDataDriven offering Apache Airflow training.
The support from those companies ensures that there are people working full-time to further improve the software, which helps ensure long-term stability, support, and training.
The possibility to define your workflows within Python code is incredibly helpful, as it allows you to incorporate almost any custom workflow logic into your orchestration system.
Airflow allows you to extend the functionality by:
- using plugins, e.g. to add extra menu items within the UI,
- adding custom Operators or building on top of the existing ones.
Wide range of Operators
If you look at the number of Operators available in the Airflow GitHub repository, you will find that Airflow supports a wide range of connectors to external systems. This means that in many cases you will find code templates that you can use to interact with a variety of databases, execution engines, and cloud providers without having to implement the code yourself.
The number of connectors to external systems shows that Airflow can be used as a “glue” that ties together data from many different sources.
What are Airflow’s weaknesses?
From the advantages listed above, you can see that, overall, Airflow is a great product for data engineering from the perspective of tying many external systems together. The community has put in an amazing amount of work building a wide range of features and connectors. However, it has several weak spots that prevent me from truly loving working with it. Some of them may be fixed in future releases, so I discuss the issues as they are at the time of writing.
No versioning of your data pipelines
These days, when we have version control systems and different versions of our Docker images stored in the Docker registries, we take versioning for granted — it’s a basic feature that simply should be there, no questions asked. However, Airflow still doesn’t have it. If you delete a task from your DAG code and redeploy it, you will lose the metadata related to that task.
Not intuitive for new users
I have used Airflow long enough to understand its internals and even to extend its functionality by writing custom components. However, teaching a team of data engineers who hadn’t used Airflow before how to use it proved time-consuming, as one needs to learn an entirely new “syntax”. Some data engineers found the entire experience unintuitive.
One prominent example was related to scheduling: many (myself included) found it very confusing that Airflow triggers a job at the end of its schedule interval. This means that the first run doesn’t start at start_date, but only once start_date + schedule_interval has passed, and the execution_date of that run marks the start of the interval rather than the moment the job actually runs. This plays well with batch ETL jobs that run only once per night, but for jobs that run every 10 minutes it's rather confusing and may result in unexpected bugs for new users, especially if the catchup option is not used properly.
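To make the timing rule more concrete, here is a minimal sketch in plain Python (no Airflow import). It assumes the interval-based behavior described above; the function names first_trigger_time and pending_runs are hypothetical helpers made up for illustration and are not part of Airflow's API.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date, schedule_interval):
    """Wall-clock time of the first run: the END of the first interval."""
    return start_date + schedule_interval


def pending_runs(start_date, schedule_interval, now):
    """Execution dates of all intervals that have fully elapsed by `now`.

    Assumption: with catchup enabled, each of these intervals gets its own run.
    """
    runs = []
    execution_date = start_date
    while execution_date + schedule_interval <= now:
        runs.append(execution_date)
        execution_date += schedule_interval
    return runs


start = datetime(2020, 1, 1, 0, 0)
interval = timedelta(minutes=10)

# The run labelled with execution_date 00:00 only starts at 00:10.
first = first_trigger_time(start, interval)
print(first)  # 2020-01-01 00:10:00

# At 00:35, three intervals have fully elapsed; the 00:30 interval is still open.
backlog = pending_runs(start, interval, datetime(2020, 1, 1, 0, 35))
print([d.strftime("%H:%M") for d in backlog])  # ['00:00', '00:10', '00:20']
```

The point of the sketch is that the label of a run (its execution date) and the moment it actually executes differ by one full interval, which is easy to overlook with short intervals or with a start date far in the past.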
Configuration overload right from the start + hard to use locally
In order to start using Airflow on a local machine, a data professional new to the tool needs to learn:
- the scheduling logic built into the product, such as the nuances mentioned above related to start date, execution date, schedule interval, and catchup,
- an entire set of concepts and configuration details: operators vs. tasks, executors, DAGs, default arguments, airflow.cfg, the Airflow metadata DB, the home directory for deploying DAGs, …
Plus, if you are a Windows user, you really can’t use the tool locally unless you use docker-compose files, which are not even part of the official Airflow repository; many people use the puckel/docker-airflow setup. It’s all doable, but I wish it were more intuitive and easier for new users.
I know that Airflow released an official Docker image in the last few months, but what is still missing is an official docker-compose file with which new users (especially Windows users) could get a full basic setup, together with a metadata database container and a bind mount to copy their DAGs into the container. Such a file would make it much easier to run Airflow locally on Windows.
If you use Astronomer’s paid version of Airflow, you can use the astro CLI, which mitigates the problem of local testing to some extent.
Setting up Airflow architecture for production is NOT easy
In order to obtain a production-ready setup, you really have two choices:
- Celery Executor: if you choose this option, you need to know how Celery works, and you need to be familiar with RabbitMQ or Redis as your message broker in order to set up and maintain the worker queues that can execute your Airflow pipelines. To the best of my knowledge, there are no official tutorials or deployment recipes directly from Airflow to make this scale-out process easier for the users. I personally learned it from this blog article (kudos for sharing!). Overall, I wish that this setup were easier for the users, or at least that Airflow provided some official docs on how to set it up properly.
- Kubernetes Executor: this executor is relatively new compared to Celery, but it allows you to leverage the power of Kubernetes to automatically scale your workers (even down to zero!) and to manage all the Python package dependencies in a robust way, because everything must be containerized to work on Kubernetes. However, here too, I did not find much support in the official docs on how to properly set it up and maintain it.
My experience in setting up Airflow on AWS for the company I worked at was that you can either:
- hire some external consultant to do it for you,
- get a paid version from Google (Cloud Composer) or from Astronomer.io,
- or proceed by trial and error, cross your fingers and hope it won’t break.
Overall, a production Airflow setup involves many components and configuration tasks, such as:
- the scheduler,
- metadata database,
- worker nodes,
- message broker + Celery + Flower if you choose the Celery executor,
- possibly some shared volume such as AWS EFS for the common DAGs storage between the worker nodes,
- setting up the values in the airflow.cfg properly,
- configuring the log storage, e.g. to S3, ideally with some lifecycle policy, as you usually don’t need to look at very old logs or pay for their storage,
- registering a domain for the UI,
- adding some monitoring to prevent your metadata database and worker nodes from exceeding their compute capacity and storage,
- adding some Auth layer for the UI + database user management for access to the metadata database.
That is MANY components to maintain and to keep working well together, and it seems that the open-source version of Airflow doesn’t make this setup easy for the users.
From my experience so far, choosing Astronomer seems to be the easiest option if you want to use Airflow in production (especially if you use AWS or Azure and not GCP), as you get plenty of features added on top, such as monitoring of your nodes, pulling logs to one central place, an Auth layer (with Active Directory integration), support, and an SLA. In addition, the team from Astronomer will maintain at least some of the components listed above.
Lack of data sharing between tasks encourages non-atomic tasks
There is currently no natural “Pythonic” way of sharing data between tasks in Airflow other than by using XComs, which were designed to share only small amounts of metadata (there are plans on the roadmap to introduce functional DAGs, so data sharing might get somewhat better in the future).
A task is the basic atomic unit of work in a data pipeline. Because there is no easy way of sharing data between tasks in Airflow, tasks often end up not being atomic, i.e. responsible for only one thing (e.g. only extracting the data). Instead, people tend to use entire scripts as tasks, such as a script doing the entire ETL (triggered with BashOperator, e.g. “python stage_adwords_etl.py”). This in turn makes maintenance more difficult, as you need to debug an entire script (full ETL) instead of a small atomic task (e.g. only the “Extract” part).
If your tasks are not atomic, you can’t just retry the Load part of ETL when it fails — you need to retry the entire ETL.
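To illustrate why atomic tasks matter for retries, here is a minimal sketch in plain Python. It is not Airflow code: FlakyWarehouse, run_with_retries, and the sample rows are hypothetical names and data made up for this example, and the load target is assumed to fail once before succeeding.

```python
# Counts how often the extract step runs, to show it is not repeated on retry.
calls = {"extract": 0}


def extract():
    """Extract step: stands in for pulling rows from a source system."""
    calls["extract"] += 1
    return [{"campaign": "a", "clicks": 10}, {"campaign": "b", "clicks": 5}]


def transform(rows):
    """Transform step: total clicks across all campaigns."""
    return {"total_clicks": sum(row["clicks"] for row in rows)}


class FlakyWarehouse:
    """Hypothetical load target that fails on its first attempt."""

    def __init__(self):
        self.attempts = 0
        self.stored = None

    def load(self, summary):
        self.attempts += 1
        if self.attempts == 1:
            raise ConnectionError("warehouse temporarily unavailable")
        self.stored = summary


def run_with_retries(step, *args, retries=2):
    """Run a single step and retry only that step when it fails."""
    for attempt in range(retries + 1):
        try:
            return step(*args)
        except ConnectionError:
            if attempt == retries:
                raise


warehouse = FlakyWarehouse()
rows = run_with_retries(extract)
summary = run_with_retries(transform, rows)
run_with_retries(warehouse.load, summary)

print(calls["extract"], warehouse.attempts, warehouse.stored)
# 1 2 {'total_clicks': 15}
```

Because each step is its own unit and hands its result to the next one, the failed load is retried without extracting again. If all three steps lived in one script, the whole script would have to be rerun.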
Scheduler as a bottleneck
If you have worked with Airflow before, you may have noticed that after hitting the Trigger DAG button in the UI, you need to wait quite a long time before you can see that the task really starts running.
The scheduler often needs up to several minutes before the task is scheduled and picked up by a worker process to be executed; at least, that was the case when I was using Airflow deployed on EC2 earlier this year. The Airflow community is working on improving the scheduler, so I hope it will become more performant in the next releases, but at the time of writing this bottleneck prevents Airflow from being applied to use cases where this latency is not acceptable.
Use cases for which Airflow is still a good option
In this article, I highlighted several times that Airflow works well when all it needs to do is to schedule jobs that:
- run on external systems such as Spark, Hadoop, Druid, or some external cloud services such as AWS SageMaker, AWS ECS or AWS Batch,
- submit SQL code to some high-performance database.
Airflow was not designed to execute any workflows directly inside of Airflow, but just to schedule them and to keep the execution within external systems.
This implies that Airflow is still a good choice if your task is, for instance, to submit a Spark job and store the data on a Hadoop cluster or to execute some SQL transformation in Snowflake or to trigger a SageMaker training job.
To give you an example: imagine a company where data engineers are creating ETL jobs in Pentaho Data Integration and they are using CeleryExecutor to orchestrate BashOperator tasks on an AWS EC2 instance. Those jobs are not dockerized and the task is just to schedule a bash command to run on a particular server. Airflow works well in this use case.
If all you need to do in your workflow system is to submit some bash command to an external system and your actual data flow is defined within Spark, SageMaker, or, as in the example above, in Pentaho Data Integration, Airflow should work quite well for you, because data dependencies are managed by those external systems and Airflow only needs to manage the state dependencies between the tasks. The same is true if you use high-performance analytical databases such as Snowflake, Exasol, or SAP HANA, where the actual work is executed within those databases and your workflow orchestration system simply submits the query to them.
Use cases for which Airflow may not be the best option
If you want your workflow system to work closely with your execution layer and to be able to pass data between tasks within Python code, then Airflow may not be the best choice.
Airflow is only able to pass the state dependencies between tasks (plus perhaps some metadata through XComs) and NOT data dependencies. This implies that, if you build your workflows mainly in Python and you have a lot of data science use cases, which by their nature rely heavily on data sharing between tasks, other tools, such as Prefect, may work better for you.
These are the use cases for which Prefect may be a better choice than Airflow:
- if you need to share data between tasks
- if you need versioning of your data pipelines → at the time of writing, Airflow doesn’t support that
- if you would like to parallelize your Python code with Dask — Prefect supports Dask Distributed out of the box
- if you need to run dynamic parametrized data pipelines → in theory, you could get around it in Airflow, but by default, dynamic pipelines are not within Airflow’s main scope
- if Airflow’s scheduler latency is not acceptable for your workloads, you may find Prefect more suitable to your needs
- if you want a seamless experience when testing workflow code locally
- lastly, if you prefer an easier setup than maintaining all the Airflow components I mentioned previously, you may opt for Prefect Cloud.
I showed one possible way of setting it up in this article:
Distributed data pipelines made easy with AWS EKS and Prefect
In this article, we discussed the pros and cons of Apache Airflow as a workflow orchestration solution for ETL & Data Science. After analyzing its strengths and weaknesses, we could infer that Airflow is a great choice as long as it is used for the purpose it was designed for, i.e. to orchestrate work that is executed on external systems such as Apache Spark, Hadoop, Druid, cloud services, external servers (e.g. distributed with Celery queues) or when submitting SQL code to high-performance distributed databases such as Snowflake, Exasol or Redshift.
However, Airflow is not designed to execute your data pipelines directly, so if your ETL & Data Science code needs to pass data between the tasks, needs to be dynamic and parametrized, needs to run in parallel by using Dask or requires a low-latency scheduler, then you may prefer other tools such as Prefect.
I hope this article helped you to determine whether Apache Airflow suits your current ETL & Data Science needs. Due to the lack of really good alternatives to Airflow other than Prefect, I tried to determine whether Prefect can fill the gap for the use cases where Airflow may not be good enough. If you wish, I could create a more detailed comparison between the two workflow platforms: let me know in the comments!
Thank you for reading & have fun on your data journey!