Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

via GitHub Tue, 01 Apr 2025 07:35:35 -0700


geoffreyclaude commented on issue #15529:
URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2769593513


   I ran some quick [experiments on my 
fork](https://github.com/geoffreyclaude/datafusion/pull/3) that check for 
early termination after each batch the TopK operator processes. On the example 
TPCH query above:
   - Elapsed time dropped from `16s` to `800ms`: a 20x speedup.
   - The Parquet DataSource `output_rows` metric dropped from `17135217` to 
`81920`: a roughly 200x reduction. (`81920` is 10 partitions each reading one 
batch of `8192` rows in parallel.)
   - The Parquet DataSource `bytes_scanned` metric dropped from `130MB` to 
`23MB`: a roughly 5x reduction, which for some reason doesn't align at all 
with the `output_rows` reduction.
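
   To illustrate the kind of per-batch check described above, here is a 
minimal standalone sketch. It is not the code from the fork and does not use 
DataFusion's API: the function name `top_k_sorted_prefix`, the `(a, b)` tuple 
rows, and the `Vec` batches are all assumptions made for illustration. It 
assumes the query is `ORDER BY a, b LIMIT k` and that rows arrive sorted by `a` 
only. Once the heap holds `k` rows and the last row of a batch has a strictly 
larger `a` than the worst kept row, no later row can enter the top k, so 
reading can stop.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper (not DataFusion's API): returns the `k` smallest
/// `(a, b)` rows and the number of batches that had to be read.
/// Assumption: rows arrive sorted by `a` only, across and within batches.
fn top_k_sorted_prefix(batches: &[Vec<(i64, i64)>], k: usize) -> (Vec<(i64, i64)>, usize) {
    if k == 0 {
        return (Vec::new(), 0);
    }
    // Max-heap: `peek()` returns the worst row currently kept.
    let mut heap: BinaryHeap<(i64, i64)> = BinaryHeap::new();
    let mut batches_read = 0;

    for batch in batches {
        batches_read += 1;
        for &row in batch {
            if heap.len() < k {
                heap.push(row);
            } else if heap.peek().map_or(false, |worst| row < *worst) {
                // The new row beats the worst kept row: swap it in.
                heap.pop();
                heap.push(row);
            }
        }
        // Early-termination check, run once per batch: every later row has
        // `a >= last.0`, so if `last.0 > worst.0` none of them can qualify.
        if heap.len() == k {
            if let (Some(worst), Some(last)) = (heap.peek(), batch.last()) {
                if last.0 > worst.0 {
                    break;
                }
            }
        }
    }
    (heap.into_sorted_vec(), batches_read)
}
```

   The comparison is strict on purpose: a later row with the same `a` as the 
worst kept row could still win on `b`, so the scan has to continue until the 
prefix value actually moves past it.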


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

