alamb commented on code in PR #13681:
URL: https://github.com/apache/datafusion/pull/13681#discussion_r1932796896


##########
datafusion/functions-aggregate/src/median.rs:
##########
@@ -230,6 +276,212 @@ impl<T: ArrowNumericType> Accumulator for MedianAccumulator<T> {
     }
 }
 
+/// The median groups accumulator accumulates the raw input values
+///
+/// For calculating the accurate medians of groups, we need to store all values
+/// of groups before final evaluation.
+/// So values in each group will be stored in a `Vec<T>`, and the total group values
+/// will be actually organized as a `Vec<Vec<T>>`.
+///
+#[derive(Debug)]
+struct MedianGroupsAccumulator<T: ArrowNumericType + Send> {
+    data_type: DataType,
+    group_values: Vec<Vec<T::Native>>,

Review Comment:
   I think, among other things, the intermediate state management (creating ListArrays directly rather than from ScalarValue) probably helps a lot:
   
   https://github.com/apache/datafusion/blob/6c9355d5be8b6045865fed67cb6d028b2dfc2e06/datafusion/functions-aggregate/src/median.rs#L200-L199
   
   There is also an extra allocation per group when using the `GroupsAccumulatorAdapter`.
   
   That being said, it is a fair question how much better the existing MedianAccumulator could be if it built the ListArrays directly, as this PR does 🤔
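   To make the state-management point concrete, here is a minimal std-only sketch of what "building the list state directly" amounts to: the per-group `Vec<Vec<T>>` is flattened into one contiguous values buffer plus an offsets buffer, which is the general shape of an Arrow list array. `FlatListState` and `flatten_groups` are hypothetical names for illustration, not DataFusion or arrow-rs APIs, and `f64` stands in for `T::Native`.

```rust
/// Hypothetical stand-in for the flat layout of a list array:
/// one contiguous values buffer plus `num_groups + 1` offsets.
/// This is NOT an arrow-rs type; it only illustrates the layout.
#[derive(Debug, PartialEq)]
struct FlatListState {
    /// `offsets[i]..offsets[i + 1]` is the range of group `i` in `values`.
    offsets: Vec<usize>,
    /// All group values, stored back to back.
    values: Vec<f64>,
}

impl FlatListState {
    /// Number of groups stored in this state.
    fn num_groups(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow the values of group `i` without copying.
    fn group(&self, i: usize) -> &[f64] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Flatten per-group values (the `Vec<Vec<T>>` shape from the PR) into
/// one values buffer and one offsets buffer. Both buffers are sized up
/// front, so no per-group wrapper value is created along the way.
fn flatten_groups(group_values: Vec<Vec<f64>>) -> FlatListState {
    let total: usize = group_values.iter().map(Vec::len).sum();
    let mut offsets = Vec::with_capacity(group_values.len() + 1);
    let mut values = Vec::with_capacity(total);

    offsets.push(0);
    for group in group_values {
        values.extend(group);
        // The end of this group is the start of the next one.
        offsets.push(values.len());
    }

    FlatListState { offsets, values }
}
```

   With this layout, emitting the state for all groups costs two buffer allocations in total, instead of one wrapper value per group. An empty group is simply a pair of equal adjacent offsets.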



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

