matthewmturner commented on issue #15425:
URL: https://github.com/apache/datafusion/issues/15425#issuecomment-2752619613

   I think this is similar to something I recently asked on Discord, except I 
had in mind using only the metadata stats for queries like `SELECT 
MAX(timestamp) FROM quotes`.
   
   This was my full comment / question:
   
   "I'm doing some data exploration on a table in DataFusion where I'm running 
the following: `SELECT MAX(timestamp) FROM quotes`. The `quotes` table is about 
100GB of data. When I run `EXPLAIN ANALYZE` on this plan, I see 6B+ output rows 
and 30GB+ of bytes scanned from the `ParquetExec`. Given that I'm only 
getting the MAX for the column, shouldn't I be able to get this by doing much 
less work, looking only at the row group metadata stats and not scanning any 
data? That would give me a huge performance improvement (the metadata load 
time is < 1% of the total scan time)."
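
   To make the idea concrete, here is a minimal sketch in Python of answering 
`MAX` from per-row-group statistics alone. It is not how DataFusion implements 
anything. `ColumnChunkStats` and `max_from_stats` are hypothetical names made up 
for illustration, and the sketch assumes the recorded max is exact for the column 
type (reasonable for a timestamp, not guaranteed for every type). If any row 
group is missing statistics, the function reports that a real scan is still needed.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional, Tuple


@dataclass
class ColumnChunkStats:
    """Hypothetical stand-in for one row group's statistics on one column."""
    num_rows: int
    null_count: Optional[int]  # None when the writer did not record it
    max_value: Optional[Any]   # None when no min/max statistics were written


def max_from_stats(chunks: Iterable[ColumnChunkStats]) -> Tuple[bool, Optional[Any]]:
    """Return (exact, value).

    exact=False means the statistics are incomplete, so the caller has to
    fall back to scanning the column data.
    """
    best = None
    for chunk in chunks:
        if chunk.max_value is None:
            # A row group that is entirely NULL contributes nothing to MAX.
            if chunk.null_count is not None and chunk.null_count == chunk.num_rows:
                continue
            # Otherwise the stats are missing and cannot be trusted.
            return (False, None)
        if best is None or chunk.max_value > best:
            best = chunk.max_value
    # best stays None for an empty table, matching SQL's MAX over no rows.
    return (True, best)
```

   With pyarrow, the inputs could be filled in from 
`pq.ParquetFile(path).metadata.row_group(i).column(j).statistics`, which exposes 
`has_min_max`, `max`, and `null_count`. Only the file footers are read, not the 
column data.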


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

