Yicong-Huang opened a new pull request, #54555:
URL: https://github.com/apache/spark/pull/54555

   ### What changes were proposed in this pull request?
   
   Add ASV microbenchmarks for two scalar Arrow UDF eval types in 
`python/benchmarks/bench_eval_type.py`:
   
   - `ScalarArrowUDFBench` (`SQL_SCALAR_ARROW_UDF`)
   - `ScalarArrowIterUDFBench` (`SQL_SCALAR_ARROW_ITER_UDF`)
   
   Each benchmark class constructs the complete binary protocol that 
`worker.py`'s `main(infile, outfile)` expects, then runs the full worker 
round-trip. This exercises the actual serde + UDF execution path without 
needing a live Spark session.
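The overall shape of such a benchmark class can be sketched as follows. This is a simplified stand-in, not Spark's actual worker protocol: `fake_worker_main` and the length-prefixed framing are hypothetical, standing in for `worker.main(infile, outfile)` and the real serde path. Only the structure (build the input stream once in `setup`, replay it per timed run) mirrors the PR.

```python
import io
import struct

# Hypothetical stand-in for pyspark.worker.main(infile, outfile): reads
# length-prefixed payloads and echoes them back. The real worker protocol
# is far richer; this only mirrors the round-trip structure.
def fake_worker_main(infile, outfile):
    while True:
        header = infile.read(4)
        if len(header) < 4:
            break
        (n,) = struct.unpack(">i", header)
        outfile.write(infile.read(n))


class ScalarArrowUDFBenchSketch:
    """ASV-style benchmark: setup() builds the protocol bytes once;
    each timed run replays them through the worker entry point."""

    def setup(self):
        buf = io.BytesIO()
        for _ in range(100):  # 100 batches of a dummy 1 KiB payload
            payload = b"x" * 1024
            buf.write(struct.pack(">i", len(payload)))
            buf.write(payload)
        self.proto = buf.getvalue()

    def time_round_trip(self):
        out = io.BytesIO()
        fake_worker_main(io.BytesIO(self.proto), out)
        return out.getvalue()
```

Building the byte stream in `setup` keeps input construction out of the timed region, so the measurement covers only the worker round-trip.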
   
   **Cases per class** (varying batch size, column count, UDF complexity):
   - `small_batches_few_cols` / `small_batches_many_cols` — small batches with few vs. many columns
   - `large_batches_few_cols` / `large_batches_many_cols` — large batches with few vs. many columns
   - `compute` — arithmetic UDF on two columns
   - `mixed_types` — mixed column types with a string UDF
   - Each case has both `time_` and `peakmem_` variants
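ASV discovers benchmark methods by name prefix: `time_*` methods are timed with the wall clock, `peakmem_*` methods are sampled for peak memory. The pairing per case can be sketched like this (class and method names here are illustrative, not the PR's actual code):

```python
class EvalTypeBenchSketch:
    """Illustrative only: each case appears twice, a time_ and a
    peakmem_ variant, both delegating to one shared workload helper."""

    def setup(self):
        self.rows = list(range(10_000))

    def _run_compute(self):
        # Stand-in for the arithmetic UDF over two columns.
        return [x * 2 + 1 for x in self.rows]

    def time_compute(self):      # ASV measures wall-clock time of this
        self._run_compute()

    def peakmem_compute(self):   # ASV measures peak memory around this
        self._run_compute()
```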
   
   ### Why are the changes needed?
   
   There are currently no microbenchmarks for the PySpark scalar Arrow UDF 
worker pipeline. These benchmarks provide a baseline for measuring serde 
overhead and UDF execution performance, which is essential for:
   
   1. Catching performance regressions in the Arrow IPC path
   2. Comparing eval types (scalar Arrow vs. scalar Arrow iterator) under 
identical workloads
   3. Guiding future optimizations (e.g., iterator overhead)
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This only adds benchmark files under `python/benchmarks/`.
   
   ### How was this patch tested?
   
   ASV benchmark runs (`asv run --python=same`):
   
   **ScalarArrowUDFBench** (`SQL_SCALAR_ARROW_UDF`):
   
   | Benchmark | Time | Peak Memory |
   |-----------|------|-------------|
   | small_batches_few_cols (1k rows, 5 cols, 1500 batches) | 45.2 ms | 2.85 GB |
   | small_batches_many_cols (1k rows, 50 cols, 200 batches) | 17.9 ms | 2.85 GB |
   | large_batches_few_cols (10k rows, 5 cols, 3500 batches) | 361 ms | 2.99 GB |
   | large_batches_many_cols (10k rows, 50 cols, 400 batches) | 162 ms | 2.87 GB |
   | compute (10k rows, 3 cols, 500 batches) | 69.5 ms | 2.89 GB |
   | mixed_types (3 rows, 5 cols, 1300 batches) | 27.4 ms | 2.85 GB |
   
   **ScalarArrowIterUDFBench** (`SQL_SCALAR_ARROW_ITER_UDF`):
   
   | Benchmark | Time | Peak Memory |
   |-----------|------|-------------|
   | small_batches_few_cols (1k rows, 5 cols, 1500 batches) | 42.5 ms | 2.85 GB |
   | small_batches_many_cols (1k rows, 50 cols, 200 batches) | 17.7 ms | 2.85 GB |
   | large_batches_few_cols (10k rows, 5 cols, 3500 batches) | 344 ms | 2.99 GB |
   | large_batches_many_cols (10k rows, 50 cols, 400 batches) | 163 ms | 2.87 GB |
   | compute (10k rows, 3 cols, 500 batches) | 70.3 ms | 2.89 GB |
   | mixed_types (3 rows, 5 cols, 1300 batches) | 25.3 ms | 2.85 GB |
   
   The iterator variant shows near-identical performance to the non-iterator 
variant, confirming that the iterator wrapping overhead is negligible.
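Why the overhead is negligible: both eval types do identical per-row work, and the iterator variant only adds one generator frame per batch. A plain-Python sketch of the two shapes (hypothetical, not Spark's code) makes this concrete:

```python
def process_batches(batches):
    # Non-iterator eval: the UDF is called once per batch.
    return [sum(b) for b in batches]


def process_batches_iter(batches):
    # Iterator eval: the UDF consumes a generator of batches and yields
    # results; per batch this adds only a generator resume, nothing more.
    def udf(it):
        for b in it:
            yield sum(b)

    return list(udf(iter(batches)))


batches = [list(range(1000))] * 50
assert process_batches(batches) == process_batches_iter(batches)
```

The per-batch cost difference is a constant-time function-call/generator hop, which disappears under any realistic per-batch workload.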
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

