alamb commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2779705535
> My current intuition about it is twofold:

I agree with this assessment: we save CPU work with this change, but the total query time doesn't really decrease because we are not fully using all of the cores.

I have also found Samply super awesome, thank you @comphead for showing that.

Something else I have observed when looking at Samply is that, on my laptop at least, there appear to be several times where processing stalls due to parsing Parquet metadata.

I was running this:

```
./datafusion-cli-filter-pushdown -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM hits WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
```

It would be an interesting experiment to see if we can improve performance by caching the metadata. I will file a ticket to investigate this.
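To make the caching idea concrete, here is a minimal sketch of what such an experiment might look like. It is an illustration only, not DataFusion code: `FileMetadata`, `MetadataCache`, `get_or_parse`, and the `hits.parquet` path are all hypothetical names, and the `parse` closure stands in for whatever actually decodes the Parquet footer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for decoded Parquet footer metadata.
struct FileMetadata {
    num_rows: usize,
}

/// Hypothetical cache mapping a file path to its already-parsed metadata.
#[derive(Default)]
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Return the cached metadata for `path`, calling `parse` only on a miss.
    /// The lock is held while parsing, which keeps the sketch simple but
    /// serializes concurrent misses.
    fn get_or_parse<F>(&self, path: &str, parse: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(path) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(parse());
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }
}

fn main() {
    let cache = MetadataCache::default();
    let mut parses = 0;
    // Simulate three partitions opening the same (hypothetical) file.
    for _ in 0..3 {
        let meta = cache.get_or_parse("hits.parquet", || {
            parses += 1; // count how often the expensive parse runs
            FileMetadata { num_rows: 100 }
        });
        assert_eq!(meta.num_rows, 100);
    }
    println!("parsed {parses} time(s)"); // prints "parsed 1 time(s)"
}
```

Whether something like this helps in practice would depend on how often the same file's metadata is re-read during a query, which is what the experiment would need to measure.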