Hi All,

I have been looking into leveraging the Arrow and Pandas UDF work we have done so far to implement window UDFs in PySpark. I have done some investigation and believe there is a way to implement PySpark window UDFs efficiently.
The basic idea is that instead of passing each window to Python separately, we can pass a "batch of windows" as one Arrow batch of rows plus begin/end indices for each window (the indices are computed on the Java side), then roll over the begin/end indices in Python and apply the UDF to each window.

I have written up my investigation in more detail here: https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#

I think this is pretty promising and hope to get some feedback from the community about this approach. Let's discuss! :)

Li
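To make the Python side of the idea concrete, here is a minimal sketch in plain pandas (the function and variable names are illustrative only, not actual Spark worker code): given one batch of rows and per-window begin/end indices computed on the JVM side, the worker just slices the batch and applies the UDF per window.

```python
import pandas as pd

def apply_window_udf(batch, begins, ends, udf):
    """Apply `udf` to each window batch[b:e) described by the index arrays.

    `batch` stands in for the columns of one Arrow record batch;
    `begins`/`ends` are the per-window boundaries shipped from the JVM.
    """
    return [udf(batch.iloc[b:e]) for b, e in zip(begins, ends)]

# One batch of rows, plus indices for a rolling window of up to 3 rows
# (the indices would come precomputed from the Java side).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
begins = [0, 0, 0, 1, 2]
ends   = [1, 2, 3, 4, 5]

result = apply_window_udf(values, begins, ends, lambda s: s.mean())
# result is one scalar per window, e.g. rolling means
```

The point of this layout is that Python receives a single (de)serialization unit per batch rather than one per window, so the Arrow transfer cost is amortized across all windows in the batch.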