Brijesh-Thakkar opened a new pull request, #19581:
URL: https://github.com/apache/datafusion/pull/19581
## Which issue does this PR close?
- Addresses apache/datafusion-comet#2986
## Rationale for this change
The `octet_length` scalar function showed significant performance degradation in Spark workloads when executed via Comet, as reported in the Comet performance EPIC.

The existing implementation relied on the generic Arrow `length` kernel for array inputs, which introduces unnecessary overhead in vectorized execution. Since `octet_length` semantics require computing the number of bytes in UTF-8 strings, this can be implemented more efficiently using Arrow’s concrete string array APIs.

Optimizing this function in DataFusion improves performance for downstream projects such as Comet and Spark without changing behavior or semantics.
## What changes are included in this PR?
- Replaced the use of the generic Arrow `length` kernel for array inputs in `octet_length`
- Added a specialized implementation for:
  - `StringArray`
  - `LargeStringArray`
  - `StringViewArray`
- Computed byte lengths directly using `value_length`, avoiding unnecessary indirection and overhead
- Left the scalar execution path unchanged
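
To illustrate why reading lengths directly can be cheap, here is a minimal, self-contained sketch. It is not the code in this PR and does not use the Arrow crate. `SimpleStringArray`, `from_options`, and the free function `octet_length` are hypothetical stand-ins that mimic an offsets-based string layout, on the assumption that each element's byte length is the difference between two adjacent offsets.

```rust
/// Hypothetical, simplified stand-in for an offsets-based UTF-8 string array.
/// The byte buffer itself is omitted on purpose: computing lengths never reads it.
struct SimpleStringArray {
    /// `len + 1` entries; element `i` occupies bytes `offsets[i]..offsets[i + 1]`.
    offsets: Vec<i32>,
    /// `true` where the element is non-null.
    validity: Vec<bool>,
}

impl SimpleStringArray {
    /// Build the offsets and validity buffers from optional string values.
    fn from_options(values: &[Option<&str>]) -> Self {
        let mut offsets = vec![0i32];
        let mut validity = Vec::with_capacity(values.len());
        let mut end = 0i32;
        for v in values {
            if let Some(s) = v {
                // `str::len` is the length in bytes, not in characters.
                end += s.len() as i32;
            }
            validity.push(v.is_some());
            offsets.push(end);
        }
        Self { offsets, validity }
    }

    /// Number of elements in the array.
    fn len(&self) -> usize {
        self.validity.len()
    }

    /// Byte length of element `i`, derived from the offsets alone.
    fn value_length(&self, i: usize) -> i32 {
        self.offsets[i + 1] - self.offsets[i]
    }
}

/// Sketch of an array-path `octet_length`: one subtraction per row,
/// with nulls propagated from the validity mask.
fn octet_length(array: &SimpleStringArray) -> Vec<Option<i32>> {
    (0..array.len())
        .map(|i| array.validity[i].then(|| array.value_length(i)))
        .collect()
}
```

In this sketch, `"héllo"` yields 6 rather than 5 because `é` takes two bytes in UTF-8, and a null input yields a null output. No string data is decoded or copied along the way.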
## Are these changes tested?
Yes.
- Existing unit tests for `octet_length` were executed and pass successfully
- Core integration tests exercising `octet_length` also pass
- No new tests were required, as existing coverage already validates correctness across scalar and array inputs, including UTF-8 and null handling
## Are there any user-facing changes?
No.
This change is purely a performance optimization and does not affect:
- SQL syntax
- Function semantics
- Return types
- Error behavior