alamb opened a new issue, #15582: URL: https://github.com/apache/datafusion/issues/15582
### Is your feature request related to a problem or challenge?

When looking at some Samply profiles of ClickBench queries on my laptop, it appears that processing stalls several times while parsing parquet metadata.

To reproduce, [profile using Samply](https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/profiling.md#profiling-using-samply-cross-platform-profiler).

First, get the ClickBench dataset:

```shell
cd benchmarks
./bench.sh data clickbench_1
```

Then run

```shell
datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
```

Profile with samply (you must build datafusion-cli with `--profile profiling`):

```
samply record datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
```

### Describe the solution you'd like

I think we should look into caching this metadata.

There is a bunch of prior art, like

- https://github.com/apache/datafusion/issues/11719

Also, in theory this API should allow metadata caching:

- https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/struct.CacheManager.html

But I don't think there is a default implementation, and it isn't hooked up.

### Describe alternatives you've considered

What I would suggest doing first is

1. Do profiling / confirm you see the same thing
2. Make a quick and dirty global parquet metadata cache (just put it into some global variable keyed on filename)

If you see significant performance improvements with 2, then we can figure out how to get it in for real.

### Additional context

_No response_
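As a rough illustration of the "quick and dirty global cache" experiment suggested in the issue, a process-wide map keyed on file name might look like the sketch below. It is not DataFusion code: `FileMetadata`, `METADATA_CACHE`, and `get_or_load` are hypothetical names, `FileMetadata` stands in for whatever parsed footer type the reader actually produces, and the sketch assumes the file does not change while cached (there is no eviction or invalidation).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for the parsed Parquet footer metadata.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Process-wide cache keyed on file name (assumption: names uniquely
/// identify files and the files are never rewritten while cached).
static METADATA_CACHE: OnceLock<Mutex<HashMap<String, Arc<FileMetadata>>>> =
    OnceLock::new();

/// Returns the cached metadata for `path`, calling `load` only on a miss.
///
/// The lock is held while `load` runs, so concurrent readers of other
/// files wait. That is acceptable for a profiling experiment but not
/// for a real implementation.
pub fn get_or_load<F>(path: &str, load: F) -> Arc<FileMetadata>
where
    F: FnOnce() -> FileMetadata,
{
    let cache = METADATA_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    if let Some(hit) = guard.get(path) {
        // Cache hit: hand out another reference to the parsed metadata.
        return Arc::clone(hit);
    }
    // Cache miss: parse once, store, and return the shared value.
    let parsed = Arc::new(load());
    guard.insert(path.to_string(), Arc::clone(&parsed));
    parsed
}
```

In this sketch the `load` closure would wrap the existing metadata-parsing call. Re-profiling the same ClickBench query with the cache in place should then show whether the stalls disappear.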