Introducing Airflow SLA Miss Report

Analyzing Airflow SLA misses beyond the regular metrics

Nikhil Manjunatha
Clairvoyant Blog

--

[Diagram: Process Flow Architecture]

Codebase

https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/sla-miss-report

Introduction

Airflow has a built-in SLA alert mechanism. It allows users to define Service Level Agreements (SLAs) at the individual task level. The results are available in the Airflow UI under Browse → SLA Misses. When the scheduler detects an SLA miss for a task, it sends an alert by email and records the miss in the metadata database. However, there are only limited insights you can derive from the email message.

The common pain points that any DAG creator faces on a daily basis are the following:

  • Hard to derive actionable insights from the Web UI representation of Airflow SLA misses
  • Email alerts generated for SLA misses can get cluttered, making the misses difficult to diagnose
  • No performance indicators are available

Hence, there is a need for a better metric system that helps DAG creators analyze their pipelines more efficiently.

Architecture

The process reads data from the Airflow metadata database and calculates SLA misses based on the defined DAG/task-level SLAs. The following metadata tables are utilized:

  • SerializedDag: retrieve defined DAG & task SLAs
  • DagRuns: details about each DAG run
  • TaskInstances: details about each task instance in a DAG run
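
As a rough sketch (not the project's exact code), the same tables can be read through Airflow's ORM models and loaded into pandas for the KPI calculations; SerializedDagModel, DagRun, and TaskInstance are the models backing the three tables above:

    # Illustrative sketch: pull SLA-relevant data from the Airflow metadata
    # database via the ORM models backing the three tables above.
    import pandas as pd
    from airflow.models import DagRun, TaskInstance
    from airflow.models.serialized_dag import SerializedDagModel
    from airflow.utils.session import create_session

    with create_session() as session:
        # Task-level SLAs live inside the serialized DAG definitions.
        task_slas = {}
        for sdm in session.query(SerializedDagModel).all():
            for task in sdm.dag.tasks:
                if task.sla is not None:
                    task_slas[(sdm.dag_id, task.task_id)] = task.sla.total_seconds()

        # Load run- and task-level timing data into DataFrames for analysis.
        dag_runs = pd.read_sql(session.query(DagRun).statement, session.bind)
        task_instances = pd.read_sql(session.query(TaskInstance).statement, session.bind)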

About

The airflow-sla-miss-report DAG consolidates the data from the metadata tables and delivers meaningful insights over email to subscribers, helping ensure SLAs are met. What sets it apart is that it is built on custom KPIs and indicators that are genuinely useful for measuring DAG performance, and it offers a comparative view of them. It also gives users the flexibility to modify the timeframes and email list according to their requirements.
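
For example, the timeframes and recipient list could be surfaced as Airflow Variables; the variable names below are purely illustrative, not necessarily the keys the project uses:

    # Hypothetical configuration sketch; the Variable keys are illustrative.
    from airflow.models import Variable

    SHORT_DAYS = int(Variable.get("sla_report_short_timeframe_days", default_var=1))
    MEDIUM_DAYS = int(Variable.get("sla_report_medium_timeframe_days", default_var=3))
    LONG_DAYS = int(Variable.get("sla_report_long_timeframe_days", default_var=7))
    EMAIL_LIST = Variable.get("sla_report_email_list", default_var="").split(",")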

The DAG utilizes three timeframes (defaults: short: 1 day, medium: 3 days, long: 7 days) to calculate the following KPIs:

Daily SLA Misses (timeframe: long)

The following details are broken down on a daily basis for the provided long timeframe (e.g. 7 days):

SLA Miss (%): percentage of task runs that missed their SLAs out of total task runs
Top Violator (%): task that violated its SLA the most as a percentage of its total runs
Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day

Use Case

  1. This KPI tracks the aggregate SLA miss (%) of all your pipelines on a daily basis, so the DAG creator can make changes and improvements accordingly (a computation sketch follows this list).
  2. The Top Violator (% and absolute) performance indicators help the DAG creator see which DAG/task misses its SLA most often, by percentage and by count, on a daily basis, which helps in identifying and diagnosing the DAGs/tasks that commonly miss their SLAs.
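
A minimal pandas sketch of the daily SLA miss (%) calculation, assuming the task_slas mapping and task_instances DataFrame from the earlier sketch:

    # Sketch: daily SLA miss (%) over the long timeframe, assuming the
    # task_slas dict and task_instances DataFrame built earlier.
    def daily_sla_miss_pct(task_instances, task_slas):
        df = task_instances.copy()
        df["duration"] = (df["end_date"] - df["start_date"]).dt.total_seconds()
        df["sla"] = df.set_index(["dag_id", "task_id"]).index.map(task_slas.get)
        df = df.dropna(subset=["sla", "duration"])
        df["missed"] = df["duration"] > df["sla"]
        df["day"] = df["start_date"].dt.date

        daily = df.groupby("day").agg(total_runs=("missed", "size"),
                                      misses=("missed", "sum"))
        daily["sla_miss_pct"] = 100 * daily["misses"] / daily["total_runs"]
        # The top violators fall out of a similar groupby on
        # (day, dag_id, task_id) over the rows where missed is True.
        return daily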

Hourly SLA Misses (timeframe: short)

The following details are broken down on an hourly basis for the provided short timeframe (e.g. 1 day):

SLA Miss (%): percentage of task runs that missed their SLAs out of total task runs
Top Violator (%): task that violated its SLA the most as a percentage of its total runs
Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the hour
Longest Running Task: task that took the longest time to execute within the hour window
Average Task Queue Time (s): avg time taken for tasks in `queued` state; can be used to detect scheduling bottlenecks

Use Case

  1. These KPIs are broken down on an hourly basis and provide the DAG creator with useful insights that can help them schedule their pipelines better based on the number of DAGs/tasks running every hour, improving efficiency.
  2. Along with the same three indicators as the previous KPI, the Longest Running Task and Average Task Queue Time indicators help identify the longest-running task and the average time a task waits in the queue before it executes. A high value may mean the DAG creator needs to review the task's DAG code and perform some fine-tuning (see the sketch after this list).
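
A short sketch of the queue-time indicator, assuming the TaskInstance data includes the queued_dttm and start_date columns:

    # Sketch: average task queue time (s) per hour; queue time is the gap
    # between a task being queued and actually starting.
    queued = task_instances.dropna(subset=["queued_dttm", "start_date"]).copy()
    queued["queue_time_s"] = (queued["start_date"] - queued["queued_dttm"]).dt.total_seconds()
    queued["hour"] = queued["start_date"].dt.floor("H")
    avg_queue_time = queued.groupby("hour")["queue_time_s"].mean()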

DAG SLA Misses (timeframe: short, medium, long)

The following details are broken down on a task level for all timeframes:

Current SLA (s): current defined SLA for the task
Short, Medium, Long Timeframe SLA miss % (avg execution time): percentage of task runs that missed their SLAs, along with their average execution times, over the respective timeframes

Use Case

  1. This KPI can be treated as a holistic, comparative view of all the DAGs/tasks created by the DAG creator, offering an outlook on how their pipelines have been performing over a period of time (depending on the timeframe values entered). This helps them gauge whether recent code changes have resulted in any improvement (a sketch follows this item).
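
A sketch of the comparative view, reusing the missed and duration columns computed in the daily sketch and assuming timezone-aware start_date values:

    # Sketch: per-task SLA miss (%) and average runtime for one timeframe;
    # running it for the short, medium, and long windows gives the
    # side-by-side comparison shown in the report.
    import pandas as pd

    def task_kpis(df, days):
        cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=days)
        window = df[df["start_date"] >= cutoff]
        return window.groupby(["dag_id", "task_id"]).agg(
            sla_miss_pct=("missed", lambda s: 100.0 * s.mean()),
            avg_execution_s=("duration", "mean"),
        )

    short_kpis, medium_kpis, long_kpis = (task_kpis(df, d) for d in (1, 3, 7))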

Sample Output Email

[Image: sample SLA miss report email]

Challenges Faced

1. Choosing the appropriate metrics

Issue: One challenge was choosing the appropriate metrics and figuring out which ones would be useful to the DAG creator and make their job easier.

Resolution: This involved a lot of brainstorming sessions. The base criterion for choosing any metric was whether its results would create an impact. Since the focus was purely on SLAs, we had to stick to metrics relevant to that. All metadata tables relevant to SLAs were reviewed, and in the end a select few tables, namely DagRuns, TaskInstances, and SerializedDag, were chosen for creating the metrics.

2. Making the code generic with minimal external dependencies

Issue: The second challenge was to make the code as generic as possible with minimal external dependencies. Since the backend database can be MySQL, PostgreSQL, or SQL Server, it would have been easy to write SQL queries directly on top of the metadata database to retrieve the metrics, but the result would not have been generic; it would have been tied to one particular database.

Resolution: Since the code was meant to be open source, it was of utmost importance that it carried no unnecessary packages, imports, or dependencies. Hence, only the numpy and pandas packages were used, which ensures the code runs on almost any machine with minimal chance of breaking.

Maintenance DAG Requirements

  • Python: 3.7 and above
  • Pip packages: pandas
  • Airflow: v2.3 and above
  • Airflow metadata tables: DagRuns, TaskInstances, SerializedDag
  • SMTP details in airflow.cfg for sending emails
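
The [smtp] block in airflow.cfg might look like the following; the host and credentials below are placeholders to be replaced with your own:

    [smtp]
    # Placeholder values; substitute your own SMTP server and credentials.
    smtp_host = smtp.example.com
    smtp_starttls = True
    smtp_ssl = False
    smtp_user = airflow
    smtp_password = change-me
    smtp_port = 587
    smtp_mail_from = airflow-alerts@example.com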

Conclusion

There are a number of other metrics around Airflow SLAs that can be tracked. A few of the main ones are:

  1. Logs of the tasks that missed their SLA: logs for tasks that repeatedly miss their SLA can be tracked. This gives the details of the reason behind each SLA miss, which can be helpful in diagnosing the issue.
  2. Airflow operators that cause the most lag or miss their SLA: the operators that most frequently miss their SLA or have a higher average wait time can be identified. This helps in considering an alternate approach if the lag is too high and does not improve even after optimization.
  3. Comparative analysis between task pool slots and task average run time: Airflow pools allow you to limit parallelism for an arbitrary set of tasks, giving you control over when your tasks run. We can compare a task's average run time with the pool slots assigned to it. If the run time is on the higher end, we can check whether it has been assigned to a pool and assign higher priority weights to it, enabling faster and smoother execution.
  4. User login analysis: a detailed user analysis that tracks the average number of users who log in on a daily basis and how many users are active at a given time. This can help keep track of daily logins and remove inactive users.
  5. Details about failed tasks: since the KPIs in this project only cover SLA misses, a similar analysis can be implemented for task failures. This would give a holistic view of the reasons behind failures and the performance metrics related to them.


Special Mention: I would personally like to thank Prakshal Jain for guiding me through this project and helping me navigate any doubts that I had along the way.
