[PR] perf: Optimize heap handling in TopK operator [datafusion]

via GitHub Wed, 25 Feb 2026 10:14:13 -0800


AdamGS opened a new pull request, #20556:
URL: https://github.com/apache/datafusion/pull/20556


   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   - Closes #.
   
   ## Rationale for this change
   
   This change aims to make a significant performance improvement in the `TopK` 
operator, which is commonly used.
   
   ## What changes are included in this PR?
   
   Instead of doing two operations on the inner heap (pop, then push), we use 
`BinaryHeap::peek_mut`, which allows us to replace the top heap item in place 
and then sift it to its proper location in the heap.
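
As a rough illustration of the two approaches (a minimal sketch, not the actual DataFusion code): the function names `top_k_pop_push` and `top_k_peek_mut` are hypothetical, and plain `i64` values stand in for whatever row type the `TopK` operator stores. Both keep the `k` smallest values in a max-heap; the second overwrites the root through `PeekMut` so the heap is re-sifted only once per replacement.

```rust
use std::collections::BinaryHeap;

/// Keep the `k` smallest values using pop followed by push
/// (two separate heap operations per replacement).
fn top_k_pop_push(values: &[i64], k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);
    for &v in values {
        if heap.len() < k {
            heap.push(v);
        } else if heap.peek().map_or(false, |&max| v < max) {
            heap.pop(); // removes the root and sifts down
            heap.push(v); // appends a leaf and sifts up
        }
    }
    heap.into_sorted_vec()
}

/// Keep the `k` smallest values by overwriting the root in place.
fn top_k_peek_mut(values: &[i64], k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);
    for &v in values {
        if heap.len() < k {
            heap.push(v);
        } else if let Some(mut max) = heap.peek_mut() {
            if v < *max {
                // Overwrite the current maximum; the heap restores its
                // invariant when the `PeekMut` guard is dropped.
                *max = v;
            }
        }
    }
    heap.into_sorted_vec()
}
```

For input `[5, 1, 9, 3, 7, 3, 2]` and `k = 3`, both functions return `[1, 2, 3]`. With distinct keys the two variants retain the same set of values; they differ only in how much sifting work each replacement does.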
   
   Some SLT results change. The only explanation I can find is that pop/push 
and the sift-down that `PeekMut` performs differ subtly in how they resolve 
ties, which ends up producing a slightly different result.
   
   On my MacBook, running the `topk_aggregate` benchmark, most benchmarks do 
not change significantly, aside from the following:
   ```
   distinct 10000000 rows desc [no TopK]
                           time:   [554.69 ms 903.25 ms 1.3318 s]
                           change: [−82.888% −69.587% −47.591%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 17 outliers among 100 measurements (17.00%)
     5 (5.00%) high mild
     12 (12.00%) high severe
   
   Benchmarking distinct 10000000 rows asc [no TopK]: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 113.7s, or reduce sample count to 10.
   distinct 10000000 rows asc [no TopK]
                           time:   [405.87 ms 702.47 ms 1.0583 s]
                           change: [−86.490% −75.215% −51.486%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 17 outliers among 100 measurements (17.00%)
     3 (3.00%) high mild
     14 (14.00%) high severe
   
   distinct 10000000 rows desc [TopK]
                           time:   [6.8372 ms 6.9933 ms 7.1523 ms]
                           change: [−0.5254% +2.2409% +5.0920%] (p = 0.13 > 
0.05)
                           No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   distinct 10000000 rows asc [TopK]
                           time:   [6.8731 ms 6.9952 ms 7.1226 ms]
                           change: [+3.3252% +5.3824% +7.5131%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   ```
   
   ## Are these changes tested?
   
   Existing test suite.
   
   ## Are there any user-facing changes?
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

