[ 
https://issues.apache.org/jira/browse/SPARK-55579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Huang updated SPARK-55579:
---------------------------------
    Description: 
h2. Background

Currently, {{SQL_SCALAR_ARROW_ITER_UDF}} uses Pandas-specific error classes 
(e.g., {{PANDAS_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}}, 
{{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_PANDAS_UDF}}, 
{{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_PANDAS_UDF}}).

Since this is a pure Arrow UDF eval type, it should use Arrow-specific error 
classes for clarity and consistency.

  was:
h2. Background

Currently, {{SQL_SCALAR_ARROW_ITER_UDF}} uses Pandas-specific error classes 
(e.g., {{PANDAS_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}}, 
{{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_PANDAS_UDF}}, 
{{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_PANDAS_UDF}}).

Since this is a pure Arrow UDF eval type, it should use Arrow-specific error 
classes for clarity and consistency.

h2. Proposal

Create three new error classes in both Python and Scala:

1. {{ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}} - Fail-fast check raised as soon as 
the output row count exceeds the input row count
2. {{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF}} - Final check that the 
total output row count matches the total input row count
3. {{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF}} - Verification that 
the UDF consumed the whole input iterator

Update the {{SQL_SCALAR_ARROW_ITER_UDF}} implementation in 
{{python/pyspark/worker.py}} to use these new error classes.
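
The three checks in the proposal can be sketched as follows. This is a minimal illustration, not the code in {{python/pyspark/worker.py}}: {{UdfVerificationError}} and {{verify_scalar_iter_arrow_udf}} are hypothetical names, plain Python sequences stand in for Arrow batches, and the order of the checks is an assumption. Only the three error class names come from the proposal.

```python
from typing import Callable, Iterable, Iterator, Sequence


class UdfVerificationError(RuntimeError):
    """Hypothetical stand-in for an error that carries an error class name."""

    def __init__(self, error_class: str, detail: str) -> None:
        super().__init__(f"[{error_class}] {detail}")
        self.error_class = error_class


def verify_scalar_iter_arrow_udf(
    udf: Callable[[Iterator[Sequence]], Iterable[Sequence]],
    input_batches: Iterable[Sequence],
) -> Iterator[Sequence]:
    """Run an iterator-of-batches UDF and apply the three proposed checks."""
    num_input_rows = 0
    num_output_rows = 0

    def counted_input() -> Iterator[Sequence]:
        # Count rows as the UDF pulls batches from its input iterator.
        nonlocal num_input_rows
        for batch in input_batches:
            num_input_rows += len(batch)
            yield batch

    source = counted_input()
    for out_batch in udf(source):
        num_output_rows += len(out_batch)
        # 1. Fail fast: output may never run ahead of the input seen so far.
        if num_output_rows > num_input_rows:
            raise UdfVerificationError(
                "ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS",
                f"{num_output_rows} output rows > {num_input_rows} input rows",
            )
        yield out_batch

    # 3. Iterator consumption: the UDF must have drained its input.
    try:
        next(source)
    except StopIteration:
        pass
    else:
        raise UdfVerificationError(
            "STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF",
            "the UDF stopped before consuming all input batches",
        )

    # 2. Final row count: totals must match once everything is consumed.
    if num_output_rows != num_input_rows:
        raise UdfVerificationError(
            "RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF",
            f"{num_output_rows} output rows != {num_input_rows} input rows",
        )
```

With this sketch, a UDF that yields each input batch unchanged passes all three checks, while one that drops a row from a batch fails only at the end with {{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF}}.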

h2. Files to modify

- {{python/pyspark/errors/error-conditions.json}} - Add new error class 
definitions
- {{common/utils/src/main/resources/error/error-conditions.json}} - Add 
corresponding Scala definitions
- {{python/pyspark/worker.py}} - Update the {{error_class}} parameters in the 
{{verify_*}} function calls (around lines 3045, 3052, and 3061)
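
As a rough illustration of the first two items, the sketch below builds the three new entries and merges them into an existing mapping. It assumes each entry maps an error class name to an object with a {{message}} list and that the file is kept in alphabetical order; both are assumptions about the file layout. The message strings are placeholders, not proposed wording, and {{merge_error_conditions}} is a hypothetical helper.

```python
import json

# Names come from the proposal; every message string below is a placeholder.
NEW_ARROW_ERROR_CONDITIONS = {
    "ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS": {
        "message": ["<placeholder: output has more rows than input>"]
    },
    "RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF": {
        "message": ["<placeholder: output and input row counts differ>"]
    },
    "STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF": {
        "message": ["<placeholder: input iterator was not fully consumed>"]
    },
}


def merge_error_conditions(existing: dict, new: dict) -> dict:
    """Hypothetical helper: add new entries, reject duplicates, keep keys sorted."""
    duplicates = sorted(set(existing) & set(new))
    if duplicates:
        raise ValueError(f"error classes already defined: {duplicates}")
    merged = {**existing, **new}
    return dict(sorted(merged.items()))


# Example: merge into a mapping that already holds one Pandas-specific class.
existing = {
    "PANDAS_UDF_OUTPUT_EXCEEDS_INPUT_ROWS": {"message": ["<existing message>"]}
}
merged = merge_error_conditions(existing, NEW_ARROW_ERROR_CONDITIONS)
print(json.dumps(merged, indent=2))
```

Rejecting duplicates up front keeps an accidental redefinition from silently overwriting an existing message.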


> Create Arrow-specific error classes for SCALAR_ITER_ARROW_UDF
> -------------------------------------------------------------
>
>                 Key: SPARK-55579
>                 URL: https://issues.apache.org/jira/browse/SPARK-55579
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Minor
>
> h2. Background
> Currently, {{SQL_SCALAR_ARROW_ITER_UDF}} uses Pandas-specific error classes 
> (e.g., {{PANDAS_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}}, 
> {{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_PANDAS_UDF}}, 
> {{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_PANDAS_UDF}}).
> Since this is a pure Arrow UDF eval type, it should use Arrow-specific error 
> classes for clarity and consistency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
