Jungtaek Lim created SPARK-51351:
------------------------------------

             Summary: TWS PySpark implementation materializes the entire output iterator in python worker
                 Key: SPARK-51351
                 URL: https://issues.apache.org/jira/browse/SPARK-51351
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jungtaek Lim
I found that the implementation of dump_stream materializes the output iterator. This means all outputs are only materialized once the JVM signals to the Python worker that there is no further input (at task completion), which brings up two critical issues:

# output is seriously delayed before it is produced for the downstream operator
# memory usage on the Python worker would be problematic

We need to make this lazily evaluated, i.e. driven by an iterator/generator. A sketch of the eager vs. lazy shape is shown below.
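For illustration only, here is a minimal sketch of the difference between an eager dump_stream (materializes the whole output before writing) and a lazy one (streams element by element). The names serialize_row, fake_output, and the two dump_stream_* functions are hypothetical stand-ins, not the actual PySpark serializer API.

{code:python}
# Illustrative sketch only -- not the actual PySpark serializer code.
import io
import pickle
from typing import BinaryIO, Iterator


def serialize_row(row) -> bytes:
    # Stand-in for whatever per-element serialization the worker performs.
    return pickle.dumps(row)


def dump_stream_eager(iterator: Iterator, stream: BinaryIO) -> None:
    # Anti-pattern: list(...) forces every output element to be produced
    # and held in memory before a single byte is written to the stream.
    materialized = list(iterator)
    for row in materialized:
        stream.write(serialize_row(row))


def dump_stream_lazy(iterator: Iterator, stream: BinaryIO) -> None:
    # Desired shape: pull one element, write it, drop the reference.
    # Output reaches the downstream consumer as soon as it is generated,
    # and peak memory stays at roughly one element at a time.
    for row in iterator:
        stream.write(serialize_row(row))


if __name__ == "__main__":
    def fake_output() -> Iterator[dict]:
        # Stand-in for the per-key output produced by transformWithState.
        for i in range(3):
            yield {"key": i, "count": i * 10}

    buf = io.BytesIO()
    dump_stream_lazy(fake_output(), buf)
    print(f"wrote {len(buf.getvalue())} bytes lazily")
{code}

The point is only the consumption pattern: as long as the output iterator is iterated directly (or wrapped in a generator) rather than collected into a list, output flows downstream incrementally and the worker does not accumulate the full result set.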