Yicong Huang created SPARK-56222:
------------------------------------

             Summary: Split ArrowStreamSerializer into GroupSerializer and 
CoGroupSerializer for better type hints
                 Key: SPARK-56222
                 URL: https://issues.apache.org/jira/browse/SPARK-56222
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


Currently, ArrowStreamSerializer uses num_dfs to switch between plain stream 
(num_dfs=0), grouped (num_dfs=1), and cogrouped (num_dfs=2) modes. As a result, 
load_stream returns a different type depending on runtime state:

- num_dfs=0: Iterator[pa.RecordBatch]
- num_dfs=1: Iterator[Iterator[pa.RecordBatch]]
- num_dfs=2: Iterator[Tuple[Iterator[pa.RecordBatch], Iterator[pa.RecordBatch]]]

Because the return type depends on a runtime value, neither the serializer nor 
the func closures in read_udfs() can be given precise type annotations.
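
To illustrate the problem, here is a minimal sketch of a serializer whose 
output shape depends on num_dfs. It is not the actual PySpark implementation: 
the class name ModeSwitchingSerializer and the in-memory "groups" input (a 
sequence of tuples of batch lists) are assumptions made for illustration, and 
any Python object can stand in for a pa.RecordBatch at runtime.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Iterable, Iterator, Sequence, Tuple, Union

if TYPE_CHECKING:  # pyarrow is only needed by the type checker in this sketch
    import pyarrow as pa


class ModeSwitchingSerializer:
    """Hypothetical stand-in for a serializer that switches on num_dfs."""

    def __init__(self, num_dfs: int = 0) -> None:
        self.num_dfs = num_dfs

    def load_stream(
        self, groups: Iterable[Sequence[Sequence[pa.RecordBatch]]]
    ) -> Union[
        Iterator[pa.RecordBatch],
        Iterator[Iterator[pa.RecordBatch]],
        Iterator[Tuple[Iterator[pa.RecordBatch], Iterator[pa.RecordBatch]]],
    ]:
        # The element type of the result is decided at runtime, so the only
        # honest annotation is a Union that every caller has to narrow.
        for group in groups:
            if self.num_dfs == 0:
                yield from group[0]  # plain stream of batches
            elif self.num_dfs == 1:
                yield iter(group[0])  # one iterator of batches per group
            else:
                yield iter(group[0]), iter(group[1])  # (left, right) per group
```

A type checker cannot tell which branch of the Union a given call returns, so 
code consuming load_stream needs casts or isinstance checks.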

Proposal: split into dedicated GroupSerializer and CoGroupSerializer classes so 
each has a single, well-typed load_stream signature.
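
A rough sketch of what the split could look like follows. Only the class 
names GroupSerializer and CoGroupSerializer come from this proposal. The 
method bodies and the in-memory "groups" input are assumptions for 
illustration, and the real classes would presumably read from an Arrow stream 
rather than from Python sequences.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Iterable, Iterator, Sequence, Tuple

if TYPE_CHECKING:  # pyarrow is only needed by the type checker in this sketch
    import pyarrow as pa


class GroupSerializer:
    """Grouped mode: one iterator of batches per group (hypothetical body)."""

    def load_stream(
        self, groups: Iterable[Sequence[pa.RecordBatch]]
    ) -> Iterator[Iterator[pa.RecordBatch]]:
        for batches in groups:
            yield iter(batches)


class CoGroupSerializer:
    """Cogrouped mode: a (left, right) pair of iterators per group."""

    def load_stream(
        self,
        groups: Iterable[
            Tuple[Sequence[pa.RecordBatch], Sequence[pa.RecordBatch]]
        ],
    ) -> Iterator[Tuple[Iterator[pa.RecordBatch], Iterator[pa.RecordBatch]]]:
        for left, right in groups:
            yield iter(left), iter(right)
```

With one return type per class, callers can annotate their closures against 
the specific serializer they receive, without any Union narrowing.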


