[jira] [Created] (SPARK-51891) Squeeze the protocol of ListState GET / PUT / APPENDLIST for transformWithState in PySpark

Jungtaek Lim (Jira) Wed, 23 Apr 2025 20:54:30 -0700

Jungtaek Lim created SPARK-51891:
------------------------------------

             Summary: Squeeze the protocol of ListState GET / PUT / APPENDLIST 
for transformWithState in PySpark
                 Key: SPARK-51891
                 URL: https://issues.apache.org/jira/browse/SPARK-51891
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 4.1.0
            Reporter: Jungtaek Lim



There are further opportunities to optimize the list state operations in 
transformWithState in PySpark.

 * ListState.get() requires one extra request just to learn that there is no 
further data to read. We can remove that request.
 ** This needs the response data to be inlined into the proto message.
 * ListState.put() / ListState.appendList() require one extra request in every 
case. For small lists, inlining the data into the proto message should 
definitely help.
 * ListState.put() / ListState.appendList() moved from Arrow to a custom 
protocol, but we realized that a pickled Python Row contains the schema 
information as a string, which is larger than we anticipated. Depending on the 
number of columns and the number of rows, Arrow could be more efficient beyond 
some point.
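
To make the round-trip argument concrete, here is a minimal toy sketch in 
plain Python. It is not Spark code: ToyStateServer, BATCH_SIZE, the has_more 
flag and every method name are hypothetical stand-ins, and the baseline 
behaviour (reading until an empty batch, sending a header request before the 
data) is an assumption chosen to mirror the description above, not the actual 
wire protocol.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BATCH_SIZE = 3  # hypothetical number of rows returned per response


@dataclass
class ToyStateServer:
    """Hypothetical stand-in for the state server; counts round trips."""
    rows: List[int] = field(default_factory=list)
    round_trips: int = 0

    def get_batch(self, offset: int) -> List[int]:
        # Assumed baseline: the response carries rows only, so the client
        # learns the list is exhausted only when it receives an empty batch.
        self.round_trips += 1
        return self.rows[offset:offset + BATCH_SIZE]

    def get_batch_inlined(self, offset: int) -> Tuple[List[int], bool]:
        # Variant: the same response also carries a has_more flag.
        self.round_trips += 1
        batch = self.rows[offset:offset + BATCH_SIZE]
        return batch, offset + len(batch) < len(self.rows)

    def put_header(self) -> None:
        # Assumed baseline: a request that only announces the upcoming data.
        self.round_trips += 1
        self.rows = []

    def put_data(self, rows: List[int]) -> None:
        self.round_trips += 1
        self.rows.extend(rows)

    def put_inlined(self, rows: List[int]) -> None:
        # Variant: a small list travels inside the first message itself.
        self.round_trips += 1
        self.rows = list(rows)


def read_all_baseline(server: ToyStateServer) -> List[int]:
    out, offset = [], 0
    while True:
        batch = server.get_batch(offset)
        if not batch:  # the extra request that returns nothing
            return out
        out.extend(batch)
        offset += len(batch)


def read_all_inlined(server: ToyStateServer) -> List[int]:
    out, offset, has_more = [], 0, True
    while has_more:
        batch, has_more = server.get_batch_inlined(offset)
        out.extend(batch)
        offset += len(batch)
    return out


def put_baseline(server: ToyStateServer, rows: List[int]) -> None:
    server.put_header()
    server.put_data(rows)
```

Under these assumptions, reading 7 rows with a batch size of 3 costs 4 round 
trips in the baseline and 3 with the inlined flag, and a small put costs 2 
round trips in the baseline and 1 when the data is inlined. The saving is a 
constant one request per call, so it matters most for short lists.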

 



