[ https://issues.apache.org/jira/browse/SPARK-51891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-51891:
-----------------------------------
    Labels: pull-request-available  (was: )

> Squeeze the protocol of ListState GET / PUT / APPENDLIST for transformWithState in PySpark
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-51891
>                 URL: https://issues.apache.org/jira/browse/SPARK-51891
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 4.1.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>              Labels: pull-request-available
>
> There are more opportunities to optimize the list state operations further in
> transformWithState in PySpark.
>
> * ListState.get() requires one more request to notice there is no further
> data to read. We can remove that request.
> ** This needs inlining the response data into the proto message.
> * ListState.put() / ListState.appendList() require one more request in
> every case. For small lists, inlining the data into the proto message should
> definitely help.
> * ListState.put() / ListState.appendList() were moved from Arrow to a custom
> protocol, but we realized that a pickled Python Row contains the schema
> information as a string, which is larger than we anticipated. Depending on the
> number of columns and the number of rows, Arrow could be more efficient at
> some point.
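
The first bullet in the description (dropping the extra ListState.get() request) can be illustrated with a minimal sketch. This is not Spark's implementation or wire format: FakeStateServer, get_legacy, get_inlined, and the has_more flag are hypothetical names chosen for illustration. The sketch assumes the older exchange signals end of data with an empty batch, and only compares how many round trips each style needs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeStateServer:
    """Hypothetical stand-in for the state server; it only counts round trips."""
    rows: List[int]
    batch_size: int = 2
    requests: int = 0

    def get_legacy(self, offset: int) -> List[int]:
        # Assumed older style: the response carries only a batch of rows,
        # so an empty batch is the only end-of-data signal.
        self.requests += 1
        return self.rows[offset:offset + self.batch_size]

    def get_inlined(self, offset: int) -> Tuple[List[int], bool]:
        # Assumed squeezed style: the rows are inlined in the response
        # together with a flag telling the client whether to fetch again.
        self.requests += 1
        end = offset + self.batch_size
        return self.rows[offset:end], end < len(self.rows)


def read_all_legacy(server: FakeStateServer) -> List[int]:
    out: List[int] = []
    while True:
        batch = server.get_legacy(len(out))
        if not batch:  # this final empty response costs one extra request
            return out
        out.extend(batch)


def read_all_inlined(server: FakeStateServer) -> List[int]:
    out: List[int] = []
    has_more = True
    while has_more:
        batch, has_more = server.get_inlined(len(out))
        out.extend(batch)
    return out


if __name__ == "__main__":
    legacy = FakeStateServer(rows=[1, 2, 3, 4])
    inlined = FakeStateServer(rows=[1, 2, 3, 4])
    read_all_legacy(legacy)
    read_all_inlined(inlined)
    print("legacy requests:", legacy.requests)
    print("inlined requests:", inlined.requests)
```

In this toy model the inlined variant saves exactly one request per read, which matters most for short lists where the extra round trip is a large share of the total cost.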