Airflow & Spark talk v2.0

Airflow and Spark Streaming at Astronomer

Prepared for ACM@UC
October 12, 2017
by Taylor Edmiston

This talk is forked from a talk I gave at the Cincinnati Data Science Meetup.

Overview

  1. Introduction
  2. Data Engineering
  3. Apache Airflow
  4. Apache Spark
  5. Wrap up

1/5 - Introduction

My Background

  • Now - Software Engineer at Astronomer
  • BS in CS, Wright State ’12
  • Experience working at and mentoring several Cincy startups
  • 9 years with Python, 5 years as a professional programmer

2/5 - Data Engineering

What is data engineering?

  • “The data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering.”

  • —Maxime Beauchemin, Creator of Airflow

  • Data engineers exist because companies have troves of data… but need to extract and manipulate it to glean value
  • Data engineering tools are how we make sense of it all quickly at scale
  • (Which tools? Data from where? At what scale?)

What is Astronomer?

  • Astronomer is a data engineering platform that connects and centralizes data, making it simple for anyone from business users to data scientists to aggregate streaming data and create data pipelines.


3/5 - Apache Airflow

Airflow Intro

  • “Apache Airflow is a platform to programmatically author, schedule and monitor workflows.”
  • Airflow vs. other frameworks, e.g. Luigi (Spotify), Azkaban (LinkedIn)
  • Components are extensible, and there are community-contributed ones
  • Very widely used - Airbnb, Astronomer, Carbonite, FreshBooks, HBO, IFTTT, Lyft, New Relic, Postmates, Quora, Robinhood, Stripe, Uber (hard fork), Zapier, etc. (Who uses Airflow?)
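
To make "programmatically author workflows" concrete, here is a minimal sketch in plain Python of the underlying idea: a workflow declared in code as a graph of tasks and their dependencies, from which a valid run order is derived. This is not Airflow's API; the task names and the `run_order` helper are hypothetical, and the ordering comes from the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(deps):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A scheduler such as Airflow layers scheduling, retries, and monitoring on top of a dependency graph like this one; the sketch only covers the ordering step.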

Components