Faster PySpark UDFs using Apache Arrow in Spark 2.3.0

Wes McKinney Mon, 30 Oct 2017 11:06:06 -0700

Hi all,

One of our newest committers, Li Jin, has been driving efforts to
speed up Python UDFs in Spark using Arrow. This was just written about
today:


https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

It's really exciting to see this kind of cross-project collaboration
bear fruit, and it validates our efforts hardening the Arrow
implementations so that such work can be seen through in real-world
analytics applications. We had previously been working with the Spark
community purely on IO / data access by improving the performance of
the toPandas function for Spark data frames in Python
(http://arrow.apache.org/blog/2017/07/26/spark-arrow/).

Congrats to Li and everyone else involved from the Arrow and
Spark communities on their hard work on this! It is surely just the
beginning of much exciting Arrow-related work ahead.

- Wes
