Re: [I] Deduplicate Spark function code with native/default datafusion function code [datafusion]

via GitHub Wed, 17 Dec 2025 04:43:47 -0800


Jefffrey commented on issue #17964:
URL: https://github.com/apache/datafusion/issues/17964#issuecomment-3665182353


   > [@Jefffrey](https://github.com/Jefffrey) I took a look at avg and had
   > some questions. This is the current behaviour:
   >
   >     * Spark avg only handles numeric→Float64, non-distinct, and uses an
   >       i64 count with a Float64 sum. State schema is
   >       [sum: input_type, count: Int64].
   >
   >     * DF avg supports decimals/durations/ints/floats, distinct, u64
   >       counts, and richer accumulators/state.
   >
   >
   > My Thoughts:
   >
   >     1. Extract a configurable/shared avg in `datafusion_functions` (or a
   >        shared helper) that supports a “Spark mode” (i64 counts, state
   >        schema), but otherwise reuses the DF avg implementation (type
   >        coercion, distinct, groups accumulator).
   >
   >     2. Replace the Spark avg implementation with a thin wrapper
   >        (`make_udaf_function!` style) over that shared avg, carrying only
   >        Spark-specific differences (e.g., count type or any ANSI-mode
   >        tweaks).
   >
   >     3. If the count type must stay `i64` (if this is what we want?), we
   >        can make it a small configuration knob in the shared code rather
   >        than a forked accumulator; otherwise align to DF’s u64 to remove
   >        more divergence.
   >
   >
   > I would like to know what your thoughts are on this.
   
   I'm hesitant to make the `i64` count part of the configuration for a shared 
average unless we can find a concrete reason for it other than "Spark does it 
this way". For example, are there execution-related issues with using a `u64` 
in Comet, or is there an edge case that is triggered when the count is `i64` 
instead of `u64` (can we even compute averages over counts that high)?
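
   For a rough sense of scale on the "counts that high" question, here is a 
minimal back-of-the-envelope sketch in plain Rust. It is not DataFusion or 
Spark code: the helper names and the rate of one billion rows per second are 
assumptions made up for illustration.

```rust
/// Whole years needed for a row counter to reach `max`, assuming a constant
/// (hypothetical) ingest rate of `rows_per_sec` and 365-day years.
fn years_to_overflow(max: u128, rows_per_sec: u128) -> u128 {
    const SECS_PER_YEAR: u128 = 365 * 24 * 60 * 60;
    max / rows_per_sec / SECS_PER_YEAR
}

/// Average with a signed count, loosely mirroring an i64-count state.
fn avg_i64(sum: f64, count: i64) -> f64 {
    sum / count as f64
}

/// Average with an unsigned count, loosely mirroring a u64-count state.
fn avg_u64(sum: f64, count: u64) -> f64 {
    sum / count as f64
}

fn main() {
    // Assumed rate, chosen only to make the numbers concrete.
    let rate: u128 = 1_000_000_000;

    println!(
        "i64 count saturates after ~{} years",
        years_to_overflow(i64::MAX as u128, rate)
    );
    println!(
        "u64 count saturates after ~{} years",
        years_to_overflow(u64::MAX as u128, rate)
    );

    // For any count that fits in both types, the division is identical.
    assert_eq!(avg_i64(10.0, 4), avg_u64(10.0, 4));
}
```

   Under that assumed rate, either counter takes centuries to saturate, so any 
difference between the two would have to come from somewhere other than the 
representable range.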
   
   Similarly, for the input types that Spark average supports, we likely need 
to check against Spark itself whether we should be supporting more types 
natively or whether it also just converts input types to float. I can't tell 
whether the Spark average implementation in DataFusion fully follows Spark 
semantics or so far implements only a subset of them.
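
   To illustrate why the "converts input types into float" question can matter 
for results, here is a small standalone Rust sketch. It makes no claim about 
what Spark or DataFusion actually does; both helper functions are hypothetical 
and only contrast casting each value to `f64` before summing with summing in 
an integer type first.

```rust
/// Hypothetical "cast first" average: every input is converted to f64
/// before it is added to the running sum.
fn avg_via_float(values: &[i64]) -> f64 {
    let sum: f64 = values.iter().map(|&v| v as f64).sum();
    sum / values.len() as f64
}

/// Hypothetical "native sum" average: inputs are summed as integers
/// (widened to i128 so the sum cannot overflow here) and converted to
/// f64 only for the final division.
fn avg_via_int_sum(values: &[i64]) -> f64 {
    let sum: i128 = values.iter().map(|&v| v as i128).sum();
    sum as f64 / values.len() as f64
}

fn main() {
    // 2^53 + 1 is not exactly representable as f64, so the cast rounds it.
    let values = [9_007_199_254_740_993_i64, -9_007_199_254_740_992_i64];

    println!("cast first: {}", avg_via_float(&values));
    println!("native sum: {}", avg_via_int_sum(&values));
}
```

   The two approaches disagree on this input, which is the kind of edge case 
worth comparing against Spark before deciding which types to support natively.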


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

