Reversim Summit 2019

Full session (30 minutes)

Data Science

Big Data

Engineering

Although Spark is widely used as a map/reduce framework for various data pipelines, it long lacked an intuitive way to define user-defined functions (UDFs). Pandas, by contrast, is one of the most prevalent data-processing Python packages, but while it works well locally, it cannot scale to data that does not fit in memory. In this talk I will briefly discuss the introduction of Pandas vectorized UDFs to Spark, why they matter, and how they enabled us to create a scalable and dynamic feature-engineering framework that crunches through TBs of data with ease.

Spark + Pandas = ♥

Asaf Valadarsky

Sefi Itzkovich