alamb opened a new issue, #15582:
URL: https://github.com/apache/datafusion/issues/15582

   ### Is your feature request related to a problem or challenge?
   
   When looking at some Samply profiles of ClickBench queries on my laptop, it 
appears there are several points where processing stalls while parsing Parquet 
metadata:
   
   
   ![Screenshot 2025-04-04 at 4 47 14 
PM](https://github.com/user-attachments/assets/0b6d5023-4bda-4fc4-89e6-b5d4f83a39fe)
   
   
   To reproduce, [profile using 
Samply](https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/profiling.md#profiling-using-samply-cross-platform-profiler)
   
   
   First, get the ClickBench dataset:
   ```shell
   cd benchmarks
   ./bench.sh data clickbench_1
   ```
   
   Then run
   
   ```shell
   datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
   ```
   
   Profile with Samply (you must build datafusion-cli with `--profile 
profiling`):
   ```shell
   samply record datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
   ```
   
   
   
   
   ### Describe the solution you'd like
   
   I think we should look into caching this metadata.
   
   There is a bunch of prior art like
   - https://github.com/apache/datafusion/issues/11719
   
   Also in theory this API should allow metadata caching:
   - 
https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/struct.CacheManager.html
   
   But I don't think there is a default implementation, and it isn't hooked up.
   
   
   ### Describe alternatives you've considered
   
   What I would suggest doing first is
   1. Do profiling / confirm you see the same thing
   2. Make a quick and dirty global Parquet metadata cache (just put it into 
some global variable keyed on filename)
   
   If you see significant performance improvements with 2, then we can figure 
out how to get it in for real
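
As a rough illustration of the "global variable keyed on filename" idea in step 2, here is a minimal sketch using only the Rust standard library. `FileMetadata`, `METADATA_CACHE`, and `get_or_load_metadata` are hypothetical names invented for this example; they are not DataFusion or parquet APIs, and a real experiment would presumably store the decoded Parquet footer instead of the placeholder struct.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for decoded Parquet footer metadata.
/// (Assumption: a real experiment would hold the actual parsed metadata.)
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Process-wide cache keyed on file name, created lazily on first use.
static METADATA_CACHE: OnceLock<Mutex<HashMap<String, Arc<FileMetadata>>>> =
    OnceLock::new();

/// Returns the cached metadata for `path`, calling `load` only on a miss.
/// The lock is held while `load` runs, so concurrent misses are serialized;
/// that is acceptable for a throwaway experiment but not for production.
pub fn get_or_load_metadata<F>(path: &str, load: F) -> Arc<FileMetadata>
where
    F: FnOnce() -> FileMetadata,
{
    let cache = METADATA_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    if let Some(hit) = guard.get(path) {
        // Cache hit: hand back a cheap shared pointer, skip parsing entirely.
        return Arc::clone(hit);
    }
    // Cache miss: parse once, remember the result for later callers.
    let loaded = Arc::new(load());
    guard.insert(path.to_string(), Arc::clone(&loaded));
    loaded
}
```

A second call with the same path returns the same `Arc` without invoking the loader again. The sketch never invalidates entries, so it would serve stale metadata if a file changed on disk; that is one of the things to sort out before getting it in for real.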
   
   ### Additional context
   
   _No response_

