Jungtaek Lim created SPARK-51891:
------------------------------------

             Summary: Squeeze the protocol of ListState GET / PUT / APPENDLIST for transformWithState in PySpark
                 Key: SPARK-51891
                 URL: https://issues.apache.org/jira/browse/SPARK-51891
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 4.1.0
            Reporter: Jungtaek Lim

There are more opportunities to further optimize the list state operations in transformWithState in PySpark.

* ListState.get() requires one extra request just to learn that there is no more data to read. We can remove that request.
** This requires inlining the response data into the proto message.
* ListState.put() / ListState.appendList() require one extra request in every case. For small lists, inlining the data into the proto message should definitely help.
* ListState.put() / ListState.appendList() were moved from Arrow to a custom protocol, but we realized that a pickled Python Row contains the schema information as a string, which is larger than we anticipated. Depending on the number of columns and the number of rows, Arrow could become the more efficient choice beyond some point.
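
To illustrate the ListState.get() point, here is a minimal toy model of the round-trip count. It is not the actual PySpark wire protocol: ToyStateServer, fetch, has_more, batch_size, and both client loops are hypothetical names invented for this sketch. It only shows why inlining an end-of-data flag into the response could save the final request.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyStateServer:
    """Hypothetical stand-in for the state server; counts requests received."""
    values: List[int]
    batch_size: int = 2
    requests: int = 0

    def fetch(self, offset: int) -> Tuple[List[int], bool]:
        """Return one batch plus an inlined 'has_more' flag."""
        self.requests += 1
        batch = self.values[offset:offset + self.batch_size]
        has_more = offset + len(batch) < len(self.values)
        return batch, has_more


def get_without_flag(server: ToyStateServer) -> List[int]:
    """Client ignores the flag and stops only when an empty batch arrives."""
    out: List[int] = []
    offset = 0
    while True:
        batch, _ = server.fetch(offset)
        if not batch:
            break  # this extra request existed only to discover the end
        out.extend(batch)
        offset += len(batch)
    return out


def get_with_flag(server: ToyStateServer) -> List[int]:
    """Client stops as soon as the response says there is no more data."""
    out: List[int] = []
    offset = 0
    while True:
        batch, has_more = server.fetch(offset)
        out.extend(batch)
        offset += len(batch)
        if not has_more:
            break
    return out


if __name__ == "__main__":
    old = ToyStateServer([1, 2, 3, 4])
    new = ToyStateServer([1, 2, 3, 4])
    get_without_flag(old)
    get_with_flag(new)
    print(old.requests, new.requests)  # prints "3 2"
```

In this model the saving is exactly one request per get() call on a non-empty list, so it matters most when the list fits in a single batch. The same reasoning would apply to inlining small payloads for put() / appendList(), under the assumption that the payload fits comfortably in one proto message.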