korowa commented on code in PR #13681:
URL: https://github.com/apache/datafusion/pull/13681#discussion_r1932677883

##########
datafusion/functions-aggregate/src/median.rs:
##########
@@ -230,6 +276,212 @@ impl<T: ArrowNumericType> Accumulator for MedianAccumulator<T> {
     }
 }
 
+/// The median groups accumulator accumulates the raw input values
+///
+/// For calculating the accurate medians of groups, we need to store all values
+/// of groups before final evaluation.
+/// So values in each group will be stored in a `Vec<T>`, and the total group values
+/// will be actually organized as a `Vec<Vec<T>>`.
+///
+#[derive(Debug)]
+struct MedianGroupsAccumulator<T: ArrowNumericType + Send> {
+    data_type: DataType,
+    group_values: Vec<Vec<T::Native>>,

Review Comment:
   Just wondering: using `Vec<Vec<>>` as the state storage doesn't seem to differ much from a regular accumulator, yet this PR still introduces a noticeable performance improvement. Are there any other optimizations here that could also be applied to the regular accumulator?

   P.S. I'm asking because when I did roughly the same thing for count distinct ([PR](https://github.com/apache/datafusion/pull/8721)), the performance gain of a GroupsAccumulator with `Vec<HashSet<>>` state was not that significant compared to regular accumulators with `HashSet<>` states.
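
   To make the layout question in the comment concrete, here is a minimal, self-contained sketch of the two state layouts being compared. It is only an illustration: `PerGroupMedian`, `GroupedMedian`, and the `median` helper are hypothetical stand-ins over plain `f64` values, not DataFusion's actual `Accumulator` / `GroupsAccumulator` traits, and the sketch makes no claim about where the measured speedup comes from.

```rust
// Hypothetical, simplified stand-ins; NOT DataFusion's real traits or types.

/// Per-group layout: one accumulator object per group, each owning its values.
#[derive(Default)]
struct PerGroupMedian {
    values: Vec<f64>,
}

impl PerGroupMedian {
    fn update(&mut self, v: f64) {
        self.values.push(v);
    }

    fn evaluate(&mut self) -> Option<f64> {
        median(&mut self.values)
    }
}

/// Grouped layout: a single accumulator owning the values of every group,
/// organized as `Vec<Vec<f64>>` and indexed by group id.
#[derive(Default)]
struct GroupedMedian {
    group_values: Vec<Vec<f64>>,
}

impl GroupedMedian {
    /// `group_indices[i]` is the group that `values[i]` belongs to.
    /// Assumes `total_num_groups` never shrinks between calls.
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize) {
        assert_eq!(values.len(), group_indices.len());
        self.group_values.resize_with(total_num_groups, Vec::new);
        for (&v, &g) in values.iter().zip(group_indices) {
            self.group_values[g].push(v);
        }
    }

    fn evaluate(&mut self) -> Vec<Option<f64>> {
        self.group_values
            .iter_mut()
            .map(|vals| median(vals))
            .collect()
    }
}

/// Median of a slice, sorting it in place. Returns `None` for an empty group.
fn median(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    if values.len() % 2 == 1 {
        Some(values[mid])
    } else {
        Some((values[mid - 1] + values[mid]) / 2.0)
    }
}

fn main() {
    let values = [3.0, 1.0, 4.0, 1.0, 5.0];
    let groups: [usize; 5] = [0, 1, 0, 1, 0];

    // Per-group layout: dispatch every row to its own accumulator object.
    let mut per_group = vec![PerGroupMedian::default(), PerGroupMedian::default()];
    for (&v, &g) in values.iter().zip(&groups) {
        per_group[g].update(v);
    }
    let a: Vec<Option<f64>> = per_group.iter_mut().map(|acc| acc.evaluate()).collect();

    // Grouped layout: one call handles the whole batch.
    let mut grouped = GroupedMedian::default();
    grouped.update_batch(&values, &groups, 2);
    let b = grouped.evaluate();

    assert_eq!(a, b);
    println!("{:?}", b); // prints [Some(4.0), Some(1.0)]
}
```

   Both layouts end up holding the same values and produce the same medians; in this sketch the only difference is who owns the per-group `Vec` and how a batch of rows is routed to it.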