Jungtaek Lim created SPARK-51891:
------------------------------------
Summary: Squeeze the protocol of ListState GET / PUT / APPENDLIST
for transformWithState in PySpark
Key: SPARK-51891
URL: https://issues.apache.org/jira/browse/SPARK-51891
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 4.1.0
Reporter: Jungtaek Lim
There are further opportunities to optimize the list state operations in
transformWithState in PySpark.
* ListState.get() requires one extra request just to learn that there is no
further data to read. We can remove that request.
** This requires inlining the response data into the proto message.
* ListState.put() / ListState.appendList() require one extra request in every
case. For small lists, inlining the data into the proto message should
definitely help.
* ListState.put() / ListState.appendList() were moved from Arrow to a custom
protocol, but we realized that a pickled Python Row contains the schema
information as a string, which is larger than we anticipated. Depending on the
number of columns and the number of rows, Arrow could be more efficient beyond
some point.
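
To illustrate the first point, here is a minimal Python sketch of how inlining the data together with an end-of-data marker could save the final round trip. It is a toy model, not the actual Spark protocol: GetResponse, FakeStateServer, has_more and both reader functions are hypothetical names invented for this example, and the "stop on an empty batch" loop is only one plausible way the extra request could arise.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GetResponse:
    """Hypothetical response message; not the real Spark proto."""
    rows: List[tuple]  # batch of rows inlined in the response
    has_more: bool     # assumed flag: is there more data after this batch?


class FakeStateServer:
    """Toy in-memory server that counts the requests it receives."""

    def __init__(self, rows, batch_size):
        self.rows = list(rows)
        self.batch_size = batch_size
        self.requests = 0

    def get_batch(self, offset: int) -> GetResponse:
        self.requests += 1
        batch = self.rows[offset:offset + self.batch_size]
        return GetResponse(batch, offset + len(batch) < len(self.rows))


def read_until_empty(server: FakeStateServer) -> List[tuple]:
    """Reader that only learns about the end from an empty batch."""
    out, offset = [], 0
    while True:
        resp = server.get_batch(offset)
        if not resp.rows:
            break
        out.extend(resp.rows)
        offset += len(resp.rows)
    return out


def read_with_flag(server: FakeStateServer) -> List[tuple]:
    """Reader that stops as soon as the response says nothing is left."""
    out, offset = [], 0
    while True:
        resp = server.get_batch(offset)
        out.extend(resp.rows)
        offset += len(resp.rows)
        if not resp.has_more:
            break
    return out
```

In this model both readers return the same rows, but the flag-based reader always issues exactly one request fewer, which matters most for short lists where the terminating request is a large share of the total.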