Jefffrey commented on issue #17964: URL: https://github.com/apache/datafusion/issues/17964#issuecomment-3665182353
> [@Jefffrey](https://github.com/Jefffrey) I took a look at avg, and had some questions; This is the current behaviour: > > * Spark avg only handles numeric→Float64, non-distinct, and uses i64 count with Float64 sum. State schema is [sum: input_type, count: Int64]. > > * DF avg supports decimals/durations/ints/floats, distinct, u64 counts, and richer accumulators/state. > > > My Thoughts: > > 1. Extract a configurable/shared avg in `datafusion_functions` (or a shared helper) that supports a “Spark mode” (i64 counts, state schema), but otherwise reuses the DF avg implementation (type coercion, distinct, groups accumulator). > > 2. Replace the Spark avg implementation with a thin wrapper (`make_udaf_function!` style) over that shared avg, carrying only Spark-specific differences (e.g., count type or any ANSI-mode tweaks). > > 3. If count type must stay `i64`(if this is what we want?), we can make it a small configuration knob in the shared code rather than a forked accumulator; otherwise align to DF’s u64 to remove more divergence. > > > I would like to know what are your thoughts on this. I'm hesitant to include the count being `i64` as part of the configuration for a shared average, unless we can find concrete reason for this being the case other than "Spark does it this way". For example, if there are execution related issues with using a `u64` in comet or if there is an edge case that is triggered when count is `i64` instead of `u64` (can we do averages for counts that high?). Similarly for the input types that Spark average supports, we likely need to check against Spark itself if we should be supporting more types natively or if it also just converts input types into float. I can't tell whether the Spark average implementation in DataFusion is completely following Spark semantics, or it so far only implements a subset of the semantics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
