Jungtaek Lim created SPARK-51351:
------------------------------------

             Summary: TWS PySpark implementation materializes the entire output 
iterator in python worker
                 Key: SPARK-51351
                 URL: https://issues.apache.org/jira/browse/SPARK-51351
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jungtaek Lim


I found that the implementation of dump_stream materializes the entire output 
iterator.

This means all outputs are only materialized once the JVM signals to the Python 
worker that there is no further input (at task completion), which brings up two 
critical issues:
 # output to the downstream operator is seriously delayed
 # memory usage on the Python worker can become problematic

We need to make this lazily evaluated, most likely via an iterator/generator, as sketched below.
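A minimal, hypothetical sketch (not the actual TWS serializer code) contrasting the eager pattern with the desired lazy one; `serialize_batch` and `write_batch` are placeholder helpers standing in for the real serialization and stream writes:

{code:python}
import io


def serialize_batch(rows):
    # Placeholder for converting one chunk of output rows into a serialized batch.
    return str(rows).encode("utf-8")


def write_batch(batch, stream):
    # Placeholder for writing one serialized batch to the worker's output stream.
    stream.write(batch)


def dump_stream_eager(iterator, stream):
    # Problematic pattern: materializes every output element before writing,
    # so nothing reaches the JVM until the iterator is exhausted and memory
    # grows with the total output size.
    batches = [serialize_batch(rows) for rows in iterator]
    for batch in batches:
        write_batch(batch, stream)


def dump_stream_lazy(iterator, stream):
    # Desired pattern: pull one element at a time and write it immediately,
    # keeping output latency low and memory bounded by a single batch.
    for rows in iterator:
        write_batch(serialize_batch(rows), stream)


if __name__ == "__main__":
    # Lazy source of row chunks; with dump_stream_lazy, each chunk is written
    # as soon as it is produced rather than after the generator is drained.
    outputs = (list(range(i, i + 3)) for i in range(0, 9, 3))
    buf = io.BytesIO()
    dump_stream_lazy(outputs, buf)
{code}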


