Full session (30 minutes)
Data Science
big data

Although Spark is widely used as a map/reduce framework for a variety of data pipelines, it long lacked an intuitive way to define user-defined functions (UDFs). Pandas, by contrast, is one of the most prevalent data-processing Python packages, but while it works well locally, it cannot scale to datasets that do not fit in memory. In this talk I will briefly discuss the introduction of Pandas vectorized UDFs to Spark, why they matter, and how they enabled us to create a scalable and dynamic feature-engineering framework that crunches through terabytes of data with ease.

Asaf Valadarsky

Sefi Itzkovich