Hi all,

One of our newest committers, Li Jin, has been driving efforts to speed up Python UDFs in Spark using Arrow. This was just written about today:

https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

It's really exciting to see this kind of cross-project collaboration bear fruit, and it validates our efforts to harden the Arrow implementations so that such work can be seen through in real-world analytics applications. We had previously been working with the Spark community purely on IO / data access by improving the performance of the toPandas function for Spark data frames in Python (http://arrow.apache.org/blog/2017/07/26/spark-arrow/).

Congrats to Li and everyone else involved from the Arrow and Spark communities on their hard work! This is surely just the beginning of much exciting Arrow-related work ahead.

- Wes