Hi All,

I have been looking into leveraging the Arrow and pandas UDF work we have
done so far to support window UDFs in PySpark. I have done some
investigation and believe there is a way to implement PySpark window UDFs
efficiently.

The basic idea is that instead of passing each window to Python separately,
we can pass a "batch of windows" as an Arrow record batch of rows plus
begin/end indices for each window (the indices are computed on the Java
side). Python then rolls over the begin/end indices and applies the UDF to
each window.
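To illustrate, here is a minimal sketch of what the Python-side loop could
look like. The names here are hypothetical, not the actual implementation:
assume the Arrow batch has been converted to a pandas DataFrame, and that
`begins`/`ends` are the per-window index arrays computed on the Java side:

    import pandas as pd

    def apply_windowed_udf(batch: pd.DataFrame, begins, ends, udf):
        # begins[i] and ends[i] delimit the i-th window within the batch;
        # both arrays were computed on the Java side and shipped with the rows.
        results = []
        for begin, end in zip(begins, ends):
            window = batch.iloc[begin:end]   # rows of the i-th window
            results.append(udf(window))      # apply the UDF once per window
        return pd.Series(results)

This way each batch crosses the JVM/Python boundary once, and the per-window
iteration happens entirely in Python.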

I have written up my investigation in more detail here:
https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#

I think this is a pretty promising approach and hope to get some feedback
from the community. Let's discuss! :)

Li
