Re: [PR] perf: Optimize `array_agg()` using `GroupsAccumulator` [datafusion]

via GitHub Fri, 27 Feb 2026 19:07:56 -0800


neilconway commented on PR #20504:
URL: https://github.com/apache/datafusion/pull/20504#issuecomment-3976212053


   I think my initial approach of organizing the aggregate state by group 
wasn't ideal -- when there are many groups, this leads to a lot of small 
allocations and a more random memory access pattern. I pushed a new version of 
this PR that uses a per-batch organization instead: that is, for each batch we 
keep a reference to the batch contents, plus a Vec of `(group_idx, row_idx)` 
pairs, one for each row. There are far fewer batches than groups (at least in 
the many-groups case), so this can be a significant win. Updated benchmark 
numbers:
   
   ```
   ┌─────────────────────────┬─────────┬────────────────┬──────────────────────┐
   │        Benchmark        │  main   │ feature branch │        Change        │
   ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
   │ few_groups              │ 607 µs  │ 679 µs         │ +11.5% (regression)  │
   ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
   │ mid_groups              │ 3.61 ms │ 789 µs         │ -78.1% (4.6x faster) │
   ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
   │ many_groups             │ 26.0 ms │ 1.04 ms        │ -96.0% (25x faster)  │
   ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
   │ struct_mid_groups (new) │ 9.10 ms │ 996 µs         │ -89.1% (9.1x faster) │
   └─────────────────────────┴─────────┴────────────────┴──────────────────────┘
   ```
   
   So this approach is ~2x faster than my initial approach for the many-groups 
case. It is slightly slower for the few-groups case, but I'd guess that is 
tolerable: `array_agg` on a small number of groups is fast anyway, and the 
regression is relatively modest (+11.5%). But let me know if folks disagree.
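
   For readers who want a concrete picture of the per-batch organization 
described above, here is a minimal sketch. It is not the code in the PR: the 
names `BatchState` and `ArrayAggSketch` are hypothetical, and a plain 
`Vec<i64>` stands in for the reference to the batch contents so the example 
needs no Arrow dependency. The idea shown is only that each update stores one 
batch plus its `(group_idx, row_idx)` pairs, and the per-group lists are 
materialized once at the end.

```rust
/// Hypothetical per-batch state: the batch contents plus one
/// (group_idx, row_idx) pair for each row in the batch.
struct BatchState {
    /// Stand-in for a reference to the batch contents (assumption: i64 values).
    values: Vec<i64>,
    pairs: Vec<(usize, usize)>,
}

/// Hypothetical accumulator that organizes its state by batch, not by group.
#[derive(Default)]
struct ArrayAggSketch {
    batches: Vec<BatchState>,
}

impl ArrayAggSketch {
    /// Record one batch. `group_indices[i]` is the group that row `i` belongs to.
    /// This does one allocation per batch, regardless of the number of groups.
    fn update_batch(&mut self, values: Vec<i64>, group_indices: &[usize]) {
        assert_eq!(values.len(), group_indices.len());
        let pairs = group_indices
            .iter()
            .enumerate()
            .map(|(row_idx, &group_idx)| (group_idx, row_idx))
            .collect();
        self.batches.push(BatchState { values, pairs });
    }

    /// Build the final per-group lists, preserving input order within a group.
    fn evaluate(&self, num_groups: usize) -> Vec<Vec<i64>> {
        // First pass: count rows per group so each output list is sized once.
        let mut counts = vec![0usize; num_groups];
        for batch in &self.batches {
            for &(group_idx, _) in &batch.pairs {
                counts[group_idx] += 1;
            }
        }
        let mut out: Vec<Vec<i64>> =
            counts.iter().map(|&n| Vec::with_capacity(n)).collect();
        // Second pass: scan the batches sequentially and copy each row out.
        for batch in &self.batches {
            for &(group_idx, row_idx) in &batch.pairs {
                out[group_idx].push(batch.values[row_idx]);
            }
        }
        out
    }
}
```

   In this sketch the update path only appends to per-batch vectors, which is 
where the fewer, larger allocations come from; the cost of scattering rows 
into groups is deferred to `evaluate`.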


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


