[ 
https://issues.apache.org/jira/browse/SPARK-54657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-54657.
-----------------------------------
    Resolution: Won't Do

> Refactor pyspark.sql.pandas.serializers for improved maintainability
> --------------------------------------------------------------------
>
>                 Key: SPARK-54657
>                 URL: https://issues.apache.org/jira/browse/SPARK-54657
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
> The {{serializers.py}} file has grown to ~2200 lines with 25+ serializer 
> classes. Many share duplicated patterns that could be consolidated.
> The main issues:
> 1. *Duplicated load_stream patterns* - The "dataframes_in_group" reading loop 
> is repeated in 6+ classes:
> {code:python}
> # This pattern appears in GroupArrowUDFSerializer, ArrowStreamAggArrowUDFSerializer,
> # ArrowStreamAggPandasUDFSerializer, GroupPandasUDFSerializer,
> # CogroupArrowUDFSerializer, etc.
> dataframes_in_group = None
> while dataframes_in_group is None or dataframes_in_group > 0:
>     dataframes_in_group = read_int(stream)
>     if dataframes_in_group == 1:
>         pass  # process batches...
>     elif dataframes_in_group != 0:
>         raise PySparkValueError(...)
> {code}
> 2. *Duplicated dump_stream patterns* - The START_ARROW_STREAM writing appears 
> in 4+ classes:
> {code:python}
> # Repeated in ArrowStreamUDFSerializer, ArrowStreamPandasUDFSerializer, 
> # ArrowStreamArrowUDFSerializer, ApplyInPandasWithStateSerializer, etc.
> should_write_start_length = True
> for batch in iterator:
>     if should_write_start_length:
>         write_int(SpecialLengths.START_ARROW_STREAM, stream)
>         should_write_start_length = False
>     yield batch
> {code}
> 3. *Cogroup and single group handling are separate* - 
> {{GroupArrowUDFSerializer}} and {{CogroupArrowUDFSerializer}} have nearly 
> identical logic except one reads 1 dataframe per group, the other reads 2.
> 4. *File is too large* to navigate easily.
> Proposed refactoring:
> - Extract common patterns into mixins ({{GroupedLoadStreamMixin}}, 
> {{StartArrowStreamDumpMixin}})
> - Unify cogroup/single group handling logic
> - Split into submodules

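The proposal to unify cogroup and single-group handling could look roughly like the sketch below. It is illustrative only and is not taken from the Spark code base or from the pull request: `iter_groups`, `read_one`, and `expected` are hypothetical names, `read_int` is a local stand-in for the PySpark helper of the same name, and a plain `ValueError` stands in for `PySparkValueError`. The single-group serializers would pass `expected=1` and the cogroup serializer `expected=2`.

```python
import io
import struct
from typing import BinaryIO, Callable, Iterator, List, TypeVar

T = TypeVar("T")


def read_int(stream: BinaryIO) -> int:
    """Local stand-in: read one big-endian 32-bit signed int."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended before a full int was read")
    return struct.unpack("!i", data)[0]


def iter_groups(
    stream: BinaryIO,
    read_one: Callable[[BinaryIO], T],
    expected: int,
) -> Iterator[List[T]]:
    """Yield one list of `expected` payloads per group.

    Each group is preceded by an int giving the number of dataframes in
    it; a value of 0 ends the stream. Any other value is an error.
    """
    while True:
        dataframes_in_group = read_int(stream)
        if dataframes_in_group == 0:
            return
        if dataframes_in_group != expected:
            raise ValueError(
                f"expected {expected} dataframes in group, got {dataframes_in_group}"
            )
        # Hypothetical hook: real code would read Arrow batches here.
        yield [read_one(stream) for _ in range(expected)]


if __name__ == "__main__":
    # Demo with ints as payloads: two cogroups of two items, then the 0 marker.
    buf = io.BytesIO(struct.pack("!7i", 2, 10, 20, 2, 30, 40, 0))
    for group in iter_groups(buf, read_int, expected=2):
        print(group)
```

In this sketch the only difference between the grouped and cogrouped readers is the `expected` argument, which is the duplication the issue description points at.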

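The repeated dump_stream pattern from item 2 could similarly be pulled into one generator wrapper. Again this is only a sketch under assumptions: `with_start_marker` is a hypothetical name, `write_int` is a local stand-in for the PySpark helper, and the marker value is a placeholder for `SpecialLengths.START_ARROW_STREAM` rather than something read from PySpark.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, TypeVar

T = TypeVar("T")

# Placeholder for SpecialLengths.START_ARROW_STREAM (assumed value).
START_ARROW_STREAM = -6


def write_int(value: int, stream: BinaryIO) -> None:
    """Local stand-in: write one big-endian 32-bit signed int."""
    stream.write(struct.pack("!i", value))


def with_start_marker(
    iterator: Iterable[T],
    stream: BinaryIO,
    marker: int = START_ARROW_STREAM,
) -> Iterator[T]:
    """Yield items unchanged, writing `marker` once before the first item.

    Nothing is written if the iterator is empty, matching the
    should_write_start_length flag in the quoted pattern.
    """
    should_write_start_length = True
    for item in iterator:
        if should_write_start_length:
            write_int(marker, stream)
            should_write_start_length = False
        yield item


if __name__ == "__main__":
    out = io.BytesIO()
    batches = list(with_start_marker(["batch-1", "batch-2"], out))
    print(batches, out.getvalue())
```

Because the marker is written lazily, the wrapper only touches the stream once a consumer starts pulling batches, which is the same behavior as the inline flag.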

