alamb commented on PR #15562:
URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2779705535

   > My current intuition about it is twofold:
   
   I agree with this assessment -- this change saves CPU work, but the 
total query time doesn't really decrease because we are not fully using all of 
the cores
   
   I have also found Samply super awesome -- thank you @comphead for showing 
that
   
   Something else I have observed when looking at Samply is that, on my laptop 
at least, processing appears to stall several times while parsing Parquet 
metadata:
   
   ![Screenshot 2025-04-04 at 4 47 14 
PM](https://github.com/user-attachments/assets/0b6d5023-4bda-4fc4-89e6-b5d4f83a39fe)
   
   
   I was running this:
   ```shell
   ./datafusion-cli-filter-pushdown -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM hits WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
   ```
   
   It would be an interesting experiment to see if we can improve performance 
by caching the metadata.
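
As a rough illustration of the caching idea (not DataFusion's or the parquet crate's actual API), here is a minimal sketch that memoizes decoded metadata by file path, so a footer is parsed at most once per file. `FileMetadata`, `MetadataCache`, and `get_or_parse` are hypothetical names invented for this example, and the `parse` closure stands in for whatever really decodes the footer:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for decoded Parquet footer metadata.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache: maps a file path to its already-decoded metadata.
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached metadata for `path`, calling `parse` only on a miss.
    /// The lock is held while parsing, which keeps the sketch simple but
    /// serializes concurrent misses.
    pub fn get_or_parse<F>(&self, path: &str, parse: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(path) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(parse());
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }
}
```

Whether something like this helps in practice would depend on how often the same file's metadata is decoded during a query, which is what the experiment would need to measure.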
   
   I will file a ticket to investigate this
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

