Aditya Kaushal
8 min read · Oct 4, 2020


A workflow and data pipeline management system to author, schedule, and monitor complex workflows and ETL processes.

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows; it is also known as a workflow scheduler or DAG scheduler.

Similar projects for building workflows:

  • Luigi
  • Oozie

Apache Airflow was developed at Airbnb in 2014 and later joined the Apache Software Foundation's incubator in early 2016.

Brief about Apache Airflow:

  • A workflow management system developed by Airbnb.
  • Executes, schedules, and distributes tasks across worker nodes.
  • Provides a view of present and past runs, plus logging features.
  • Extensible through plugins.
  • Nice UI, with the possibility of defining a REST interface.
  • Interacts well with databases.
  • Used by more than 200 companies, including Airbnb, Yahoo, PayPal, Intel, and Stripe.

What is a workflow?

A typical example of a workflow:

  • A sequence of Tasks.
  • Started on a schedule or triggered by an event.
  • Frequently used to handle big data processing pipelines.

Steps involved in a typical pipeline:

  1. Download Data.
  2. Send Data somewhere else to process.
  3. Monitor when the process is completed.
  4. Get the result and generate the report.
  5. Send the report out by email.

Traditional ETL approach

Example of a naive approach:

  • Write a script to pull data from the database and send it to HDFS for processing.
  • Schedule the script as a cron job.

What are the problems with the Traditional Approach?

  • Failure: retry if a failure happens.
  • Monitoring: did the job succeed or fail, and how long does the process run?
  • Dependencies: data dependencies, e.g. job 2 runs only after job 1 has finished.
  • Scalability: there is no centralized scheduler between different cron machines.
  • Deployment: deploying new changes.
  • Processing historic data: backfilling/rerunning historical data.

Why do we use Airflow?

  1. Configurable (dynamic): configuration as code (global settings are stored in airflow.cfg).
  2. Great UI/UX relative to other workflow tools.
  3. Centralized configuration.
  4. Scalability: workers can be scaled out to run an arbitrarily large number of tasks.
  5. Resource pooling.
  6. Extensibility: supports different types of task execution.
  7. Task parallelization via workflow design.
  8. Task log syncing.

Airflow Workflow:

  • A workflow is represented as a Directed Acyclic Graph (DAG) with multiple tasks that can be executed independently.
  • Airflow DAGs are composed of tasks (a minimal sketch follows below).
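
To make this concrete, here is a minimal sketch of a three-task DAG. The dag_id, task_ids, and bash commands are hypothetical, and the import paths assume Airflow 2.x (Airflow 1.10.x uses airflow.operators.bash_operator):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG is just a container for tasks plus the dependencies between them.
with DAG(
    dag_id="example_workflow",
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",
) as dag:
    download = BashOperator(task_id="download_data", bash_command="echo downloading")
    process = BashOperator(task_id="process_data", bash_command="echo processing")
    report = BashOperator(task_id="send_report", bash_command="echo emailing report")

    # Edges of the graph: download -> process -> report
    download >> process >> report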

What makes Airflow great?

  • Can handle upstream/downstream dependencies (Example: upstream missing tables)
  • Easy to reprocess historical jobs by date, or re-run for specific intervals
  • Jobs can pass parameters to other jobs downstream
  • Handle errors and failures gracefully. Automatically retry when a task fails.
  • Ease of deployment of workflow changes
  • Integration with a lot of infrastructure (Hive, AWS, Google Cloud, etc.)
  • Data Sensor Task to trigger a DAG when data arrives.
  • Job testing through Airflow itself.
  • Accessibility of log files and other meta-data through the web GUI
  • Implement trigger rules for tasks.
  • Monitoring all job status in real-time + Email alerts.
  • Community Support.

Airflow Applications:

  1. Data Warehousing: Cleanse, Organize, Data Quality check, and publish/stream data into our growing data warehouse.
  2. Machine Learning: Automate Machine Learning Workflows
  3. Growth Analytics: Compute growth and engagement metrics.
  4. Experimentation: Compute A/B testing results.
  5. Email Targeting: Apply rules to target and engage users through email campaigns.
  6. Sessionization: Compute clickstream and time spent datasets.
  7. Search: Compute Search ranking related metrics.
  8. Data Infrastructure maintenance: Database scrapes, folder cleanup, applying data retention policies.

Data Science Hierarchy:

  • The Airflow framework puts things into perspective.
  • The bottom area of the pyramid (the data science hierarchy of needs) is where Airflow helps us, i.e. collecting and cleaning data and automating and scheduling workflows.

Airflow Key Concepts

Workflows (DAGs):

  • Workflows are called DAGs (Directed Acyclic Graphs).
  • A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

What is a Directed Acyclic Graph?

A Graph has 2 main parts:

  • The vertices (nodes) where the data is stored.
  • The edges (connections) which connect the nodes.

DAGs continued:

  • Graphs are used to solve many real-life problems because they can represent networks.
  • For example Social Networks, System of Roads, Airline Flights from City to City, How the Internet is connected, etc.

Types of Graphs:

  • Undirected Graphs: The relationships exist in both directions; the edges have no direction. Example: if A is a friend of B, B is likewise a friend of A.
  • Directed Graphs: Direction matters, since the edges in the graph are all one-way.
  • An example graph: the course requirements for a computer science major.
  • The class prerequisites graph is clearly a digraph, since you must take some classes before others.
  • Acyclic Graphs: A graph with no cycles.
  • Cyclic Graphs: A graph with cycles.

DAG Summary:

A Directed Acyclic Graph has no cycles, and the data in each node flows forward in only one direction.

  • It is useful to represent a complex data flow using a graph.
  • Each Node in the graph is a task.
  • The edges represent dependencies amongst tasks.
  • These graphs are called computation graphs or data flows; they transform the data as it flows through the graph and enable very complex computations.
  • Given that the data only needs to be computed once on a given task and the computation then carries forward, the graph is directed and acyclic. This is why Airflow jobs are commonly referred to as DAGs.

DAG Internal: Operator and Task:

DAGs comprise tasks, which are instantiations of various operators, organized in some dependency order.

  • Tasks typically don't communicate with each other (only via XCom, simple key-value attributes).

Operators and Tasks:

  • DAGs (a DAG tells how to execute your jobs) do not perform any actual computation. Instead, operators determine what actually gets done.
  • Task (a task tells what to execute): Once an operator is instantiated, it is referred to as a "task". An operator describes a single task in the workflow.
  • Instantiating a task requires providing a unique task_id and a DAG container (see the sketch below).
  • A DAG is a container that is used to organize tasks and set their execution context.
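
As a minimal, hedged sketch (the dag_id and task_id are made up, and the import path assumes Airflow 2.x), instantiating an operator with a unique task_id and a DAG container looks like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The DAG container that will own the task.
dag = DAG(dag_id="task_demo", start_date=datetime(2020, 10, 1), schedule_interval=None)

# Instantiating the operator turns it into a task: it needs a unique
# task_id and a reference to its DAG container.
t1 = BashOperator(
    task_id="print_date",
    bash_command="date",
    dag=dag,
)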

Operator Categories:

Typically, operators are classified into 3 categories:

  • Sensors
  • Operators
  • Transfers

Sensors: A certain type of operator that will keep running until a certain criterion is met. Examples include waiting for a certain amount of time, an external file, or an upstream data source. (A hedged sensor sketch follows the list below.)

  • HdfsSensor: Waits for a file or folder to land in HDFS.
  • NamedHivePartitionSensor: Checks whether the most recent partition of a Hive table is available for downstream processing.
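
As a hedged illustration, a FileSensor (a local-filesystem cousin of the HDFS sensor) can be configured roughly like this; the dag_id, file path, and intervals are assumptions, and the import path is the Airflow 2.x one:

from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

dag = DAG(dag_id="sensor_demo", start_date=datetime(2020, 10, 1), schedule_interval="@daily")

# Keeps "poking" every 60 seconds until the file shows up (or the timeout is hit),
# and only then lets downstream tasks run.
wait_for_export = FileSensor(
    task_id="wait_for_export",
    filepath="/data/exports/daily.csv",
    poke_interval=60,            # seconds between checks
    timeout=6 * 60 * 60,         # give up after 6 hours
    dag=dag,
)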

Operators: Trigger a certain action (a couple are sketched after this list):

  • BashOperator: Executes a Bash Command.
  • PythonOperator: Calls an arbitrary Python Function
  • HiveOperator: Executes HQL Code or Hive Script in a specific Hive Database.
  • BigQueryOperator: Executes BigQuery SQL queries in a specific BigQuery database.
  • SQL Operator
  • S3 Operator
  • Google Cloud Operators
  • Postgres Operators.
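
For example, a BashOperator and a PythonOperator might be declared as follows. This is a rough sketch: the dag_id, task_ids, command, and callable are hypothetical, with Airflow 2.x import paths.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

dag = DAG(dag_id="operator_demo", start_date=datetime(2020, 10, 1), schedule_interval=None)


def _extract():
    # Placeholder for an arbitrary Python function.
    print("extracting data")


run_bash = BashOperator(
    task_id="run_bash",
    bash_command="echo 'hello from bash'",   # executes a Bash command
    dag=dag,
)

run_python = PythonOperator(
    task_id="run_python",
    python_callable=_extract,                # calls the Python function above
    dag=dag,
)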

Transfers: Move data from one location to another.

  • MySqlToHiveTransfer: Moves data from MySQL to Hive.
  • S3ToRedshiftTransfer: Loads files from S3 into Redshift.

Working with Operators

  • Airflow provides prebuilt operators for many common tasks.
  • There are more operators being added by the community.
  • All operators are derived from BaseOperator and acquire much of their functionality through inheritance. Contributors can extend the BaseOperator class to create custom operators as they see fit (a hedged sketch follows below).
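
A hedged sketch of a custom operator: the class name and its name parameter are invented, and Airflow 1.10.x additionally expects the apply_defaults decorator on __init__.

from airflow.models import BaseOperator


class GreetOperator(BaseOperator):
    """A hypothetical custom operator built by extending BaseOperator."""

    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)   # task_id, retries, etc. are handled by BaseOperator
        self.name = name

    def execute(self, context):
        # execute() is the hook Airflow calls when the task instance runs.
        self.log.info("Hello, %s (execution date %s)", self.name, context["ds"])
        return self.name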

Defining Task Dependencies

After defining a DAG and instantiating all the tasks, you can then set the dependencies, i.e. the order in which the tasks should be executed.

Task dependencies are set using:

  • The set_upstream and set_downstream methods.
  • The bit-shift operators << and >>.

t1.set_upstream(t2)
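
Both styles express the same kind of edge. Here is a small hedged sketch (the dag_id and task_ids are invented, with Airflow 2.x imports):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(dag_id="deps_demo", start_date=datetime(2020, 10, 1), schedule_interval=None)
t1 = BashOperator(task_id="t1", bash_command="echo 1", dag=dag)
t2 = BashOperator(task_id="t2", bash_command="echo 2", dag=dag)
t3 = BashOperator(task_id="t3", bash_command="echo 3", dag=dag)

# These two lines declare the same edge: t1 runs before t2.
t1.set_downstream(t2)
t1 >> t2

# Declaring an edge from the other side: t2 runs before t3.
t3.set_upstream(t2)   # equivalent to t2 >> t3, or t3 << t2

# Fan-out shorthand: a list on the right creates one edge to each task in the list.
t1 >> [t2, t3]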

DAG Runs and TaskInstances

  • A key concept in Airflow is the execution date. Execution dates begin at the DAG's start_date and repeat every schedule_interval.
  • For example, with a start_date of 2020-12-01 and a daily schedule_interval, the scheduled execution dates would be ('2020-12-01 00:00:00', '2020-12-02 00:00:00', …). For each execution date, a DagRun is created and operates under the context of that execution date. A DagRun is simply a DAG that has a specific execution date (a sketch follows below).

DagRuns are DAGs that run at a certain time.

  • TaskInstances are the tasks that belong to those DagRuns.
  • Each DagRun and TaskInstance is associated with an entry in Airflow’s metadata database that logs their state (e.g. “queued”, “running”, “failed”, “skipped”, “up for retry”).
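
A hedged sketch of how execution dates generate DagRuns (the dag_id and task are invented; import paths assume Airflow 2.x):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# With this start_date and a daily schedule_interval, the scheduler creates one
# DagRun per day: execution dates 2020-12-01 00:00:00, 2020-12-02 00:00:00, and so on.
# Every task in the DAG becomes a TaskInstance inside each DagRun, and its state
# (queued, running, failed, ...) is recorded in the metadata database.
dag = DAG(
    dag_id="daily_example",
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",
)

print_run_date = BashOperator(
    task_id="print_run_date",
    bash_command="echo {{ ds }}",   # templated execution date, e.g. 2020-12-01
    dag=dag,
)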

DAG Workflows: Perfect for ETL Paradigm

Some key points:

  • Keep tasks idempotent — preferably dockerize your ETL steps.
  • Add retries on task executions.
  • Enable alerts on failure.

Building an Airflow Pipeline

How to create a DAG / pipeline / DAG definition file:

  • Import Python packages.
  • Configure certain parameters (default_args): owner, start date, end date, alerts, emails, retry policies (a vast list of parameters).
  • Instantiate the DAG object (the parent object that will contain the different tasks).
  • Start defining your tasks (t1, t2, t3).
  • Define the task dependencies.
  • For example: t1 >> t2 >> t3 >> t4, or t1 >> [t2, t3] (see the sketch below).
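
Putting the steps together, here is a hedged sketch of a DAG definition file. The owner, email address, dag_id, commands, and schedule are all made-up values, and the import paths assume Airflow 2.x.

# Step 1: import Python packages.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Step 2: configure default arguments applied to every task in the DAG.
default_args = {
    "owner": "data-team",
    "start_date": datetime(2020, 10, 1),
    "email": ["alerts@example.com"],
    "email_on_failure": True,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Step 3: instantiate the DAG object (the parent container for the tasks).
dag = DAG(
    dag_id="etl_pipeline",
    default_args=default_args,
    schedule_interval="@daily",
)

# Step 4: define the tasks.
t1 = BashOperator(task_id="download", bash_command="echo download", dag=dag)
t2 = BashOperator(task_id="process", bash_command="echo process", dag=dag)
t3 = BashOperator(task_id="report", bash_command="echo report", dag=dag)
t4 = BashOperator(task_id="notify", bash_command="echo notify", dag=dag)

# Step 5: declare the task dependencies.
t1 >> t2 >> [t3, t4]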

Some useful task arguments (illustrated in the sketch after this list):

  • retries
  • email_on_failure
  • on_failure_callback
  • pool
  • queue (Celery only)
  • execution_timeout
  • trigger_rule
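
A hedged example of these arguments on a single task. The task, pool name, and values are invented; on_failure_callback would take a Python callable, and queue only applies with the Celery executor.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(dag_id="task_args_demo", start_date=datetime(2020, 10, 1), schedule_interval="@daily")

cleanup = BashOperator(
    task_id="cleanup",
    bash_command="echo cleaning up",
    retries=3,                                  # retry up to 3 times on failure
    email_on_failure=True,                      # send an alert email if it still fails
    pool="etl_pool",                            # limit concurrency via a named resource pool
    execution_timeout=timedelta(minutes=30),    # fail the task if it runs longer than this
    trigger_rule=TriggerRule.ALL_DONE,          # run even if upstream tasks failed
    dag=dag,
)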

Data Pipelines in Airflow

Executors (Task Execution)

  • All executing tasks are independent of each other and run in separate processes or on different workers.
  • Execution depends entirely on the resources available as part of the Airflow installation.
  • If Airflow is installed on a single machine, tasks run on that machine; if it is installed across a distributed cluster, tasks execute on separate workers.

Types of Resource Managers (Executors)

  • Local Executor: Single Machine.
  • Celery Executor: Multiple Workers.

The Celery executor relies on a message broker, typically one of two technologies:

  • RabbitMQ
  • Redis

Every task can execute on a different machine. Workers communicate with each other using a messaging queue, e.g. a RabbitMQ implementation.

  • Dask: A distributed cluster that helps spin up different tasks on different workers.
  • Mesos: A resource manager that helps manage resources in a distributed cluster.

XCOM

By default, tasks running in the Airflow environment do not share or exchange any information with each other.

Each process runs mutually exclusive of the others.

XCom (cross-communication) lets tasks exchange small key-value pairs via two methods (sketched below):

  • xcom_push
  • xcom_pull
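
A hedged sketch of XCom in action: the dag_id, task_ids, key, and values are invented, and with Airflow 1.10.x you would also pass provide_context=True to the PythonOperator.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG(dag_id="xcom_demo", start_date=datetime(2020, 10, 1), schedule_interval=None)


def _push(**context):
    # Push a small value into XCom under an explicit key.
    context["ti"].xcom_push(key="row_count", value=42)


def _pull(**context):
    # Pull the value pushed by the upstream task.
    row_count = context["ti"].xcom_pull(task_ids="push_task", key="row_count")
    print(f"upstream produced {row_count} rows")


push_task = PythonOperator(task_id="push_task", python_callable=_push, dag=dag)
pull_task = PythonOperator(task_id="pull_task", python_callable=_pull, dag=dag)

push_task >> pull_task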
