Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-09 Thread via GitHub
berkaysynnada closed issue #15529: Extend TopK early termination to partially sorted inputs URL: https://github.com/apache/datafusion/issues/15529 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-06 Thread via GitHub
alamb commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2781385443 @NGA-TRAN and @gabotechs can you please help review this PR?

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-05 Thread via GitHub
geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2773087938 > FWIW my view is that https://github.com/apache/datafusion/pull/15301 tries to implement data skipping for partially sorted / globally roughly clustered inputs where the

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-05 Thread via GitHub
NGA-TRAN commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2780700162 Thanks for a nice real life use case and benchmarking numbers, @geoffreyclaude

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-04 Thread via GitHub
geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2779122390 PR should be ready for review. I've included some pretty nice benchmark results from https://github.com/apache/datafusion/pull/15560: ``` > ./bench.sh compare ma

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-04 Thread via GitHub
geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2772276984 @alamb: > This may be some overlap with this work from @adriangb (though I realize you are talking about a different optimization) The two are complementary. @ad

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-02 Thread via GitHub
adriangb commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2772802775 FWIW my view is that #15301 tries to implement data skipping for partially sorted / globally roughly clustered inputs where there is an ORDER BY LIMIT on the sortedish dimensio

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-01 Thread via GitHub
alamb commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2770458720 This may be some overlap with this work from @adriangb (though I realize you are talking about a different optimization) - https://github.com/apache/datafusion/issues/15037

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-01 Thread via GitHub
geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2769593513 I ran some quick [experiments on my fork](https://github.com/geoffreyclaude/datafusion/pull/3) by checking for early termination after each batch processed in the "topK"

[I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-01 Thread via GitHub
geoffreyclaude opened a new issue, #15529: URL: https://github.com/apache/datafusion/issues/15529 ### Is your feature request related to a problem or challenge? DataFusion currently has a "TopK early termination" optimization, which speeds up queries that involve `ORDER BY` and `LIMIT