bharath-techie commented on issue #19216: URL: https://github.com/apache/datafusion/issues/19216#issuecomment-3633472687
Hi @alamb @zhuqi-lucas,

We are running similar experiments: ClickBench queries with DataFusion on lower-memory instances. I'm not sure whether we have an epic that tracks all of these issues in one place.

What we noticed is that `topK` doesn't spill, so the ClickBench `GROUP BY` queries with `ORDER BY` + `LIMIT` fail with an out-of-memory error when less than 8 GB of RAM is allocated in the DataFusion CLI, even with a single target partition. Two examples:

Q13
```
SELECT "SearchPhrase", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY u DESC LIMIT 10;
```

Q33
```
SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;
```

https://github.com/apache/datafusion/issues/9417 might be a relevant issue.

@alchemist51 and I have been looking into improving queries in this area. @alchemist51 is looking into reviving https://github.com/apache/datafusion/pull/15591, and I was able to get a working spill for the `topK` operator in my fork: https://github.com/bharath-techie/datafusion/tree/spilltest

Could you please share your views or suggestions on this?
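
For anyone who wants to try the same setup, a CLI session along these lines should roughly match the conditions described above. This is only a sketch: the Parquet path is a placeholder, and the memory cap is assumed to be applied when the CLI is launched (for example through its `--memory-limit` option) rather than inside the session.

```sql
-- Sketch only: 'hits.parquet' is a placeholder path, not taken from the report.
-- The memory cap (below 8 GB in the report) is assumed to be set at CLI launch.

-- Use a single target partition, as in the report.
SET datafusion.execution.target_partitions = 1;

-- Register the ClickBench data under the table name the queries expect.
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';

-- Show the plan for Q33 to see where the sort with a fetch limit sits
-- relative to the aggregation.
EXPLAIN
SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;
```

Running the query without `EXPLAIN` under the same settings is what would be expected to hit the out-of-memory error described above.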
