matthewmturner commented on issue #15425: URL: https://github.com/apache/datafusion/issues/15425#issuecomment-2752619613
I think this is similar to something I recently asked on Discord, except I had in mind using only the metadata stats for queries like `SELECT MAX(timestamp) FROM quotes`. This was my full comment / question:

"I'm doing some data exploration on a table in DataFusion where I'm running the following: `SELECT MAX(timestamp) FROM quotes`. The `quotes` table is about 100GB of data. When I run `EXPLAIN ANALYZE` on this plan, I see 6B+ output rows and 30GB+ of bytes scanned reported by the `ParquetExec`. Given that I'm only getting the MAX for the column, shouldn't I be able to get this by doing much less work, looking only at the row group metadata stats and not scanning any data? That would give me a huge performance improvement (the metadata load time is < 1% of the total time scanning)."
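To make the idea concrete, here is a minimal sketch of answering an unfiltered `MAX` from per-row-group statistics alone. It is not DataFusion's implementation: `RowGroupStats` and `max_from_metadata` are hypothetical names invented for illustration, and the sketch assumes each row group may carry an optional max statistic plus a flag saying whether that value is exact.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group's statistics on a single column."""
    num_rows: int
    max_value: Optional[int]  # None when no max statistic was recorded
    is_exact: bool = True     # False when the stored max is only an upper bound


def max_from_metadata(row_groups: Sequence[RowGroupStats]) -> Tuple[bool, Optional[int]]:
    """Try to answer MAX(column) from statistics only, with no filter applied.

    Returns (answered, value). If answered is False, the statistics were not
    sufficient and a caller would have to fall back to scanning the data.
    If answered is True and value is None, there were no rows at all.
    """
    best: Optional[int] = None
    for rg in row_groups:
        if rg.num_rows == 0:
            # An empty row group cannot contribute to the maximum.
            continue
        if rg.max_value is None or not rg.is_exact:
            # Missing or inexact statistics: metadata alone is not enough.
            return (False, None)
        if best is None or rg.max_value > best:
            best = rg.max_value
    return (True, best)


# Example with made-up statistics for three row groups.
groups = [
    RowGroupStats(num_rows=1000, max_value=1_700_000_000),
    RowGroupStats(num_rows=0, max_value=None),
    RowGroupStats(num_rows=500, max_value=1_700_000_500),
]
print(max_from_metadata(groups))
```

The sketch only covers the case in the question, a bare aggregate with no `WHERE` clause. It gives up as soon as any non-empty row group lacks a trustworthy statistic, since a partial answer would be wrong rather than merely slow.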