HeartSaVioR opened a new pull request, #50488:
URL: https://github.com/apache/spark/pull/50488

   ### What changes were proposed in this pull request?
   
   This PR proposes to get rid of usage for Arrow on sending multiple elements 
of ListState and replace it with simple custom protocol.
   
   The custom protocol we are proposing is super simple and widely used.
   
   1. Write the size of the element (in bytes), if there is no more element, 
write -1
   2. Write the element (as bytes)
   3. Go back to 1
   
   Note that this PR only makes change to ListState - we are aware that there 
are more usages of Arrow in other state types or other functionality (timer). 
We want to improve over time via benchmarking and addressing if it shows the 
latency implication.
   
   ### Why are the changes needed?
   
   For small number of elements, Arrow does not perform very well compared to 
the custom protocol. In the benchmark, we have three elements to exchange 
between Python worker and JVM, and replacing Arrow with custom protocol could 
cut the elapsed time on state interaction by 1/3.
   
   Given the natural performance diff between Scala version of 
transformWithState and PySpark version of transformWithStateInPandas, I think 
users must use the Scala version to handle noticeable volume of workloads. We 
can position PySpark version to aim for more lightweight workloads - we can 
revisit if we see the opposite demands.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's an internal change.
   
   ### How was this patch tested?
   
   Existing UT, with modification about mock expectation.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to