HeartSaVioR opened a new pull request, #50488: URL: https://github.com/apache/spark/pull/50488
### What changes were proposed in this pull request? This PR proposes to get rid of usage for Arrow on sending multiple elements of ListState and replace it with simple custom protocol. The custom protocol we are proposing is super simple and widely used. 1. Write the size of the element (in bytes), if there is no more element, write -1 2. Write the element (as bytes) 3. Go back to 1 Note that this PR only makes change to ListState - we are aware that there are more usages of Arrow in other state types or other functionality (timer). We want to improve over time via benchmarking and addressing if it shows the latency implication. ### Why are the changes needed? For small number of elements, Arrow does not perform very well compared to the custom protocol. In the benchmark, we have three elements to exchange between Python worker and JVM, and replacing Arrow with custom protocol could cut the elapsed time on state interaction by 1/3. Given the natural performance diff between Scala version of transformWithState and PySpark version of transformWithStateInPandas, I think users must use the Scala version to handle noticeable volume of workloads. We can position PySpark version to aim for more lightweight workloads - we can revisit if we see the opposite demands. ### Does this PR introduce _any_ user-facing change? No, it's an internal change. ### How was this patch tested? Existing UT, with modification about mock expectation. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org