Airflow & Spark talk

Airflow and Spark Streaming at Astronomer

June 5, 2017 
By Taylor Edmiston

Overview

  1. Introduction
  2. Data Engineering
  3. Apache Airflow
  4. Apache Spark
  5. Closing

1. Introduction

My Background

  • Software Engineer at Astronomer (core platform)
  • Experience working at and mentoring several Cincy startups
  • 9 years with Python

2. Data Engineering

What is data engineering?

  • “The data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering.” (Maxime Beauchemin)

  • Data engineers exist because companies have troves of data… but they need to be able to extract and manipulate it to glean value
  • Data engineering tools are how we make sense of it all quickly (and at scale)
  • (Which tools? Data from where? At what scale?)

What is Astronomer?

  • Astronomer’s platform connects and centralizes data, making it super simple for anyone from business users to data scientists to quickly create and monitor data pipelines across the entire organization.


3. Apache Airflow

Airflow Intro

  • “Airflow is a platform to programmatically author, schedule and monitor workflows.”
  • Airflow vs. other frameworks - e.g. Luigi (Spotify), Azkaban (LinkedIn)
  • Components are extensible, and there are community-contributed ones

Components

  • DAGs
  • Operators - e.g. PythonOperator
  • Executor
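
To make the three components concrete, here is a toy, standard-library-only sketch of how they relate. It is not Airflow's API: `PythonTask` and `run_sequentially` are hypothetical names invented for this illustration. A DAG is modeled as tasks plus their dependencies, an operator as a wrapper around a Python callable, and an executor as the thing that runs tasks in dependency order. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


class PythonTask:
    """Toy stand-in for an operator: wraps one Python callable as a task."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable
        self.upstream = set()  # task_ids that must finish before this one

    def set_upstream(self, other):
        self.upstream.add(other.task_id)


def run_sequentially(tasks):
    """Toy executor: run each task once, after all of its upstream tasks."""
    by_id = {t.task_id: t for t in tasks}
    # The DAG itself: each task_id mapped to the task_ids it depends on.
    graph = {t.task_id: t.upstream for t in tasks}
    results = {}
    for task_id in TopologicalSorter(graph).static_order():
        results[task_id] = by_id[task_id].python_callable()
    return list(results)  # task_ids in the order they ran


# A three-step pipeline: extract -> transform -> load.
extract = PythonTask("extract", lambda: "raw rows")
transform = PythonTask("transform", lambda: "clean rows")
load = PythonTask("load", lambda: "rows written")
transform.set_upstream(extract)
load.set_upstream(transform)

# Tasks are passed in scrambled order; the dependencies decide the run order.
order = run_sequentially([load, extract, transform])
print(order)
```

In real Airflow the same roles are played by a DAG definition, operators such as PythonOperator, and a configured executor. The sketch only shows how the pieces fit together.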

Airflow at Astronomer

  • Executor