Work Flow Management for Big Data: Guide to Airflow (part 2)

Data analytics has been playing a key role in the decision making process at various stages of the business in many industries. In this era of Big Data, the adoption level is going to ONLY increase day by day. It is really overwhelming to see all the Big Data technologies that are popping up every week to cater to various stages of the Big Data Solution implementations. With data being generated at a very fast pace at various sources (applications that automate the business processes), implementing solutions for use cases like “real time data ingestion from various sources”, “processing the data at different levels of the data ingestion” and “preparing the final data for analysis” become challenging. Especially, orchestrating, scheduling, managing and monitoring the pipelines is a very critical task for the Data platform to be stable and reliable. Also, due to the dynamic nature of the data sources, data inflow rate, data schema, processing needs, etc, the work flow management (pipeline generation/maintenance/monitoring) becomes even more challenging.

This is a three part series of which “Overview along with few architectural details of Airflow” was covered as part (1) of the first part. This part covers the Deployment options for Airflow in Production.

Part 2: Deployment view: gives a better picture

Based on the needs, one may have to go with a simple setup or a complex setup of Airflow. There are different ways Airflow can be deployed (especially from an Executor point of view). Below are the deployment options along with the description for each.

Standalone mode of deployment

Description: As mentioned in the above section, the typical installation of Airflow to start with would look something like this.

  • Configuration file (airflow.cfg):  which contains the details of where to pick the DAGs from, what Executor to run, how frequently the scheduler should poll the DAGs folder for new definitions, which port to start the webserver on etc..
  • Metadata Repository: Typically Mysql or postgres database is used for this purpose. All the metadata related to the DAGs, thier metrics, Tasks and thier statuses, SLAs, Variables etc are stored here.
  • Webserver: This renders the beautiful UI that shows up all the DAGs, their current states along with the metrics (which are pulled from the Repo).
  • Scheduler: This reads the DAGs, put the details about the DAGs into Repo. It initiates the Executor.
  • Executor: This is responsible for reading the schedule_interval info and creates the instances for the DAGs and Tasks into Repo.
  • Worker: The worker reads the tasks instances and actually perform the tasks and writes the status back to the Repo.       

Distributed mode of deployment

Description: The description for most of the components mentioned in the Standalone section remain the same except for the Executor part and the workers part.

  • RabbitMQ: RabbitMQ is the distributed messaging service that leveraged by Celery Executor to put the task instances into. This is where the workers would typically read the tasks from for execution. Basically there is a broker URL thats exposed by RabbitMQ for the Celery Executor and Workers to talk to.
  • Executor: Here the executor would be Celery executor (configured in airflow.cfg).The Celery executor is configured to point to the RabbitMQ Broker.
  • Workers: The workers are installed on different nodes (based on the requirement) and they are configured to read the tasks info from the RabbitMQ brokers. The workers are also configured with a Worker_result_backend which typically can be configured to be the MySQL repo itself.

Few important points to be noted here are:

  1. The Worker nodes nothing but the airflow installation it self. Just that only the worker process would be started.
  2. The DAG definitions should be in sync on all the nodes (both the primary airflow installation and the Worker nodes)

Distributed mode of deployment with High Availability set up

Description: As part of the setup for high availability of airflow installation, we are assuming that the MySQL repository is configured to be highly available and RabbitMQ would be highly available. The focus is on how to make the airflow components like the Webserver and Scheduler highly available.

The description for most of the components would remain the same as above. Below is what changes:

  • New airflow instance (standby): There would be another instance of airflow setup as a standby.
    • The ones shown in Green is the primary airflow instance.
    • The one in red is the stand by one.
  • A new DAG has to be put in place. Something called “CounterPart Poller”. The purpose of this DAG would be two fold
    • To continuously poll the counter part scheduler to see if it is up and running.
    • If the counter part instance is not reachable (which means the instance is down),
      • declare the current airflow instance as the Primary
      • move the DAGs of the (previous) primary instance out of DAGs folder and move the Counterpart Poller DAG into the DAGs folder.
      • move the actual DAGs into the DAGs folder and move the Counterpart Poller out of DAGs folder on the standby server (the one which declares itself as primary now).

Note that the declaration as primary server by the instances can be done as a flag in some mysql table.

Here is how it would work:

  1. Initially the primary (say instance 1) and standby (say instance 2) schedulers would be up and running. The instance 1 would be declared as primary airflow server in the mysql table.
  2. The DAGs folder for primary instance (instance 1) would contain the actual DAGs and the DAGs folder and the standby instance (say instance 2) would contain Counterpart poller (say PrimaryServerPoller).
    1. Primary server would be scheduling the actual DAGs as required.
    2. Standby server would be running the PrimaryServerPoller which would continuously poll the Primary Airflow scheduler.
  3. Lets assume, the Primary server has gone down. In that case, the PrimaryServerPoller would detect the same and
    1. Declare itself as the primary airflow server in the Mysql table.
    2. move the actual DAGs out of DAGs folder and moves the PrimaryServerPoller DAG into the DAGs folder on the older primary server (instance 1)
    3. move the actual DAGs into DAGs folder and move the PrimaryServerPoller DAG out of DAGs folder on the older standby (instance 2).
  4. So at this point of time, the Primary and Standby servers have swapped their positions.
  5. Even if the airflow scheduler on the current standby server (instance 1) comes back, since there would be only the CounterServerPoller DAG running on it, there would be no harm. And this server (instance 1) would remain to be standby till the current Primary server (instance 2) goes down.
  6. In case the current primary server goes down, the same process would repeat and the airflow running on instance 1 would become the Primary server.

Hopefully this give a better picture of the deployment options for airflow. Please check out part 3 of this series for steps on how to install airflow along with the commands for most of the steps.