Full session (30 minutes)
Engineering
Infrastructure
Data

So you need to move some data from here to there on a schedule - easy! You write a few scripts and let cron run them. Then you realize that you need a more robust solution, with dependencies between tasks and better visibility into them, so you build a few pipelines in Jenkins to handle that. As the data and business needs grow, handling scheduled tasks for different purposes at scale becomes too complex, with requirements such as:

  • Task dependencies (fan-in / fan-out)
  • Task priorities & SLAs
  • Task modularity
  • Multiple data sources
  • Task parallelization
  • Visualization of tasks over time
  • Task logging
  • Task timeouts
  • Retries on failure
  • Notifications on task error

In this talk we will take a deep dive into Apache Airflow and how it helps us manage that complexity.
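
To make two items from the list above concrete (fan-in / fan-out dependencies and retries on failure), here is a minimal plain-Python sketch. It is not Airflow's API and not how Airflow is implemented; it only uses the standard library's graphlib module (Python 3.9+), and the task names, the run_pipeline helper, and the max_retries parameter are all hypothetical, chosen for illustration.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks:    task name -> zero-argument callable
    upstream: task name -> set of task names that must finish first
    """
    # Give every task an entry so tasks without upstream dependencies still run.
    graph = {name: set(upstream.get(name, ())) for name in tasks}
    finished = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break  # task succeeded, stop retrying
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: surface the error
        finished.append(name)
    return finished


attempts = {"load": 0}


def flaky_load():
    # Hypothetical task that fails on its first attempt only.
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("transient failure")


tasks = {
    "extract": lambda: None,
    "clean_users": lambda: None,
    "clean_orders": lambda: None,
    "load": flaky_load,
}
upstream = {
    "clean_users": {"extract"},   # fan-out: extract feeds two tasks
    "clean_orders": {"extract"},
    "load": {"clean_users", "clean_orders"},  # fan-in: load waits for both
}

order = run_pipeline(tasks, upstream)
```

Even this toy version runs tasks one at a time and has no scheduling, timeouts, logging, notifications, or visualization, which is roughly the gap a dedicated workflow scheduler is meant to fill.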

Uri Shamay (cmpxchg16)