Yicong Huang created SPARK-56222:
------------------------------------
Summary: Split ArrowStreamSerializer into GroupSerializer and
CoGroupSerializer for better type hints
Key: SPARK-56222
URL: https://issues.apache.org/jira/browse/SPARK-56222
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Currently, ArrowStreamSerializer uses num_dfs to switch between plain-stream
(num_dfs=0), grouped (num_dfs=1), and cogrouped (num_dfs=2) modes. As a result,
the return type of load_stream depends on a runtime value:
- num_dfs=0: Iterator[pa.RecordBatch]
- num_dfs=1: Iterator[Iterator[pa.RecordBatch]]
- num_dfs=2: Iterator[Tuple[Iterator[pa.RecordBatch], Iterator[pa.RecordBatch]]]
Because no single static return type fits all three modes, this prevents
precise type annotations on both the serializer and the func closures in
read_udfs().
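
To make the typing problem concrete, here is a minimal sketch. It is not the
PySpark implementation: ModeSwitchingSerializer, LoadResult, and the list-based
input are hypothetical, and Batch is a stand-in for pa.RecordBatch so the
example runs without pyarrow. It only shows that one load_stream covering the
three shapes listed above ends up annotated with a Union.

```python
from typing import Iterator, List, Tuple, Union

# Stand-in for pa.RecordBatch (assumption, so the sketch needs no pyarrow).
Batch = List[int]

# The three shapes from the list above, folded into one return annotation.
LoadResult = Union[
    Iterator[Batch],
    Iterator[Iterator[Batch]],
    Iterator[Tuple[Iterator[Batch], Iterator[Batch]]],
]


class ModeSwitchingSerializer:
    """Hypothetical toy: one class whose output shape depends on num_dfs."""

    def __init__(self, num_dfs: int = 0) -> None:
        self.num_dfs = num_dfs

    def load_stream(self, stream: List[Batch]) -> LoadResult:
        if self.num_dfs == 0:
            # Plain stream: yield batches one by one.
            return iter(stream)
        if self.num_dfs == 1:
            # Grouped: in this toy, every batch forms its own group.
            return (iter([batch]) for batch in stream)
        if self.num_dfs == 2:
            # Cogrouped: in this toy, consecutive batches are paired up.
            pairs = zip(stream[0::2], stream[1::2])
            return ((iter([left]), iter([right])) for left, right in pairs)
        raise ValueError(f"unsupported num_dfs: {self.num_dfs}")


data = [[1, 2], [3, 4]]
plain = list(ModeSwitchingSerializer(0).load_stream(data))
grouped = [list(g) for g in ModeSwitchingSerializer(1).load_stream(data)]
```

Callers of such a method have to know which num_dfs was used in order to
interpret the result, and a type checker can only see the Union.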
Proposal: split the grouped and cogrouped modes into dedicated GroupSerializer
and CoGroupSerializer classes so that each has a single, well-typed load_stream
signature.
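
A rough sketch of what the split could look like follows. Only the class names
GroupSerializer and CoGroupSerializer and the return types come from this
issue; the method bodies, the pre-decoded list inputs, and the Batch stand-in
for pa.RecordBatch are assumptions made so the example is self-contained.

```python
from typing import Iterator, List, Tuple

# Stand-in for pa.RecordBatch (assumption, so the sketch needs no pyarrow).
Batch = List[int]


class GroupSerializer:
    """Sketch: load_stream always yields one iterator of batches per group."""

    def load_stream(self, stream: List[List[Batch]]) -> Iterator[Iterator[Batch]]:
        # `stream` is a hypothetical pre-decoded input: one list per group.
        for group in stream:
            yield iter(group)


class CoGroupSerializer:
    """Sketch: load_stream always yields a (left, right) pair per cogroup."""

    def load_stream(
        self, stream: List[Tuple[List[Batch], List[Batch]]]
    ) -> Iterator[Tuple[Iterator[Batch], Iterator[Batch]]]:
        # `stream` is a hypothetical pre-decoded input: one pair per cogroup.
        for left, right in stream:
            yield iter(left), iter(right)


groups = [list(g) for g in GroupSerializer().load_stream([[[1], [2]], [[3]]])]
cogroups = [
    (list(left), list(right))
    for left, right in CoGroupSerializer().load_stream([([[1]], [[2], [3]])])
]
```

With one class per mode, each load_stream has a single return annotation, so
the closures that consume it can be annotated without a Union or a cast.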