Hi All,

I have been looking into leveraging the Arrow and Pandas UDF work we have done so far to implement window UDFs in PySpark. I have done some investigation and believe there is a way to implement PySpark window UDFs efficiently.
The basic idea is that instead of passing each window to Python separately, we can pass a "batch of windows" as one Arrow batch of rows plus begin/end indices for each window (the indices are computed on the Java side), then roll over the begin/end indices in Python and apply the UDF to each window.

I have written up my investigation in more detail here: https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#

I think this is pretty promising and hope to get some feedback from the community about this approach. Let's discuss! :)

Li
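To make the Python side of the idea concrete, here is a minimal sketch in plain pandas (the function and variable names are illustrative only, not actual Spark worker code): given one batch of rows and per-window begin/end indices computed on the JVM side, the worker just slices the batch and applies the UDF per window.

```python
import pandas as pd

def apply_window_udf(batch, begins, ends, udf):
    """Apply `udf` to each window batch[b:e) described by the index arrays.

    `batch` stands in for the columns of one Arrow record batch;
    `begins`/`ends` are the per-window boundaries shipped from the JVM.
    """
    return [udf(batch.iloc[b:e]) for b, e in zip(begins, ends)]

# One batch of rows, plus indices for a rolling window of up to 3 rows
# (the indices would come precomputed from the Java side).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
begins = [0, 0, 0, 1, 2]
ends   = [1, 2, 3, 4, 5]

result = apply_window_udf(values, begins, ends, lambda s: s.mean())
# result is one scalar per window, e.g. rolling means
```

The point of this layout is that Python receives a single (de)serialization unit per batch rather than one per window, so the Arrow transfer cost is amortized across all windows in the batch.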