geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2769593513
I ran some quick [experiments on my fork](https://github.com/geoffreyclaude/datafusion/pull/3) by checking for early termination after each batch processed in the "topK" on the example TPCH query above:

- Elapsed time dropped from `16s` to `800ms`: a 20x speedup.
- The Parquet DataSource `output_rows` metric dropped from `17135217` to `81920` (81920 because it read 1 batch of 8192 rows in parallel on each of 10 partitions): a 200x reduction.
- The Parquet DataSource `bytes_scanned` metric dropped from `130MB` to `23MB`: a 5x reduction (which doesn't align at all with the `output_rows` reduction for some reason...).
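To illustrate the idea of checking for early termination after each batch, here is a minimal Python sketch. It is not the code from the fork, and the stopping condition is an assumption: the input is taken to be sorted on a prefix column `a` while the query orders by `(a, b)`, so once the heap is full and a batch ends on an `a` value strictly greater than the worst kept row's `a`, no later row can qualify. The names `top_k_with_early_exit`, `Row`, and the toy batches are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical row shape: (a, b). Assumption: batches arrive sorted on `a` only.
Row = Tuple[int, int]


def top_k_with_early_exit(batches: Iterable[List[Row]], k: int) -> Tuple[List[Row], int]:
    """Return the k smallest rows by (a, b) and the number of batches consumed."""
    if k <= 0:
        return [], 0

    # heapq is a min-heap, so keys are negated to keep the worst kept row at heap[0].
    heap: List[Tuple[int, int]] = []
    consumed = 0

    for batch in batches:
        consumed += 1
        for a, b in batch:
            item = (-a, -b)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # The new row sorts before the worst kept row, so it replaces it.
                heapq.heapreplace(heap, item)

        # Early-termination check, run once per batch.
        if len(heap) == k and batch:
            worst_a = -heap[0][0]
            # Later rows have a >= batch[-1][0] > worst_a, so none can enter the heap.
            if batch[-1][0] > worst_a:
                break

    return sorted((-na, -nb) for na, nb in heap), consumed


def example_batches():
    # Toy input sorted on `a`; `b` is unordered within equal `a` values.
    yield [(1, 9), (1, 3), (2, 5)]
    yield [(2, 1), (3, 7), (3, 2)]
    yield [(4, 0), (4, 4), (5, 1)]
    yield [(6, 0), (6, 6), (7, 2)]


if __name__ == "__main__":
    rows, consumed = top_k_with_early_exit(example_batches(), k=2)
    print(rows, consumed)
```

With `k=2` the loop stops after the first of four batches, which mirrors in miniature why each partition in the experiment only needed to read a single batch.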