Re: [PR] Improve speed of `median` by implementing special `GroupsAccumulator` [datafusion]

via GitHub Tue, 28 Jan 2025 20:25:33 -0800


Rachelint commented on code in PR #13681:
URL: https://github.com/apache/datafusion/pull/13681#discussion_r1933233382



##########
datafusion/functions-aggregate/src/median.rs:
##########
@@ -230,6 +276,212 @@ impl<T: ArrowNumericType> Accumulator for MedianAccumulator<T> {
     }
 }
 
+/// The median groups accumulator accumulates the raw input values
+///
+/// For calculating the accurate medians of groups, we need to store all values
+/// of groups before final evaluation.
+/// So values in each group will be stored in a `Vec<T>`, and the total group values
+/// will be actually organized as a `Vec<Vec<T>>`.
+///
+#[derive(Debug)]
+struct MedianGroupsAccumulator<T: ArrowNumericType + Send> {
+    data_type: DataType,
+    group_values: Vec<Vec<T::Native>>,

Review Comment:
   @korowa I think what @alamb mentioned is an important point about the improvement.
   
   Here are some other points from my side:
   - In `GroupsAccumulatorAdapter::update_batch`, we need to reorder the input batch and then use `slice` to split the reordered batch. I think these two operations may not be cheap.
   
https://github.com/apache/datafusion/blob/6c9355d5be8b6045865fed67cb6d028b2dfc2e06/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs#L241-L265
   
   - In `GroupsAccumulatorAdapter::merge_batch`, the same problem may be even more serious, because there we need to reorder a `ListArray`.
   
   - And in `GroupsAccumulatorAdapter::state`, extra allocations exist, as mentioned by @alamb.
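
   To make the first point concrete, here is a rough sketch of the two update strategies. It is a simplified model over plain slices, not the actual DataFusion code: the function names `update_via_reorder_and_slice` and `update_direct` are hypothetical, `i64` stands in for `T::Native`, and the real adapter works on Arrow arrays rather than `&[i64]`.

```rust
/// Simplified model of the adapter-style path (hypothetical helper, not
/// DataFusion's implementation): rows are reordered so that each group's
/// rows are contiguous, then the reordered values are cut into per-group runs.
fn update_via_reorder_and_slice(
    group_values: &mut Vec<Vec<i64>>,
    values: &[i64],
    group_indices: &[usize],
    total_num_groups: usize,
) {
    assert_eq!(values.len(), group_indices.len());
    // Assumes `total_num_groups` never shrinks between calls.
    group_values.resize(total_num_groups, Vec::new());

    // Step 1: compute a permutation that orders rows by group index.
    // `sort_by_key` is stable, so the row order inside a group is preserved.
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by_key(|&row| group_indices[row]);

    // Step 2: materialize the reordered batch (an extra copy of the input).
    let reordered: Vec<i64> = order.iter().map(|&row| values[row]).collect();

    // Step 3: find each group's contiguous run and pass it on as a slice.
    let mut start = 0;
    while start < order.len() {
        let group = group_indices[order[start]];
        let mut end = start + 1;
        while end < order.len() && group_indices[order[end]] == group {
            end += 1;
        }
        group_values[group].extend_from_slice(&reordered[start..end]);
        start = end;
    }
}

/// Direct path in the spirit of a `Vec<Vec<T>>`-backed groups accumulator:
/// a single pass over the batch, with no reordering and no slicing.
fn update_direct(
    group_values: &mut Vec<Vec<i64>>,
    values: &[i64],
    group_indices: &[usize],
    total_num_groups: usize,
) {
    assert_eq!(values.len(), group_indices.len());
    // Assumes `total_num_groups` never shrinks between calls.
    group_values.resize(total_num_groups, Vec::new());
    for (&value, &group) in values.iter().zip(group_indices) {
        group_values[group].push(value);
    }
}
```

   Both functions leave `group_values` in the same state, but the first one pays for a sort, an extra copy of the batch, and one slice per group on every call, while the second one only appends each value to its group.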
   


