Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

via GitHub Tue, 27 May 2025 13:08:00 -0700


alamb commented on issue #16200:
URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913885348


   > Yes very neat. I was actually thinking this would be along the other axis: 
loading metadata only for the _columns_ that are needed. My gut feeling is that 
a lot of compute is spent loading metadata for columns that aren't being 
filtered on. But I don't know if that's possible given the structure of the row 
group / page metadata.
   
   I think we could certainly avoid loading page metadata for columns that 
are not needed by the query.
   
   We would probably have to add some sort of new API to 
[`ParquetMetaDataReader`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataReader.html).
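
A minimal sketch of the column-scoped selection such an API might perform, assuming the footer already records where each column chunk's column index and offset index live. `ChunkIndexLocation` and `page_index_ranges` are hypothetical names for illustration only; they are not part of the `parquet` crate.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the footer fields that locate one column
/// chunk's page index structures (this is not the parquet crate's type).
#[derive(Debug, Clone)]
pub struct ChunkIndexLocation {
    /// Byte range of the column index (page-level min/max statistics), if written.
    pub column_index: Option<Range<u64>>,
    /// Byte range of the offset index (page locations), if written.
    pub offset_index: Option<Range<u64>>,
}

/// Collect the page index byte ranges for only the columns in `needed`.
/// `row_groups[i][j]` describes column `j` of row group `i`.
pub fn page_index_ranges(
    row_groups: &[Vec<ChunkIndexLocation>],
    needed: &[usize],
) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    for chunks in row_groups {
        for &col in needed {
            // Skip column positions that do not exist in this row group.
            if let Some(chunk) = chunks.get(col) {
                // `Option` iterates over zero or one item, so absent
                // indexes contribute nothing.
                ranges.extend(chunk.column_index.clone());
                ranges.extend(chunk.offset_index.clone());
            }
        }
    }
    ranges
}
```

Only the returned ranges would need to be fetched and decoded; the page indexes of every other column are never touched.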
   
   One challenge / tradeoff to consider is that doing another async load to 
read more of the metadata will be very bad if it has to actually go to object 
store again. Influx has it all cached in memory so it doesn't matter, but in 
general we need to be careful about adding additional requests.
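
One way to keep the request count down when reading only some of the metadata is to merge nearby byte ranges before fetching them. The sketch below is illustrative only: `coalesce` and `max_gap` are hypothetical names, and the right gap threshold would depend on the latency and cost profile of the store.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or sit at most `max_gap` bytes apart,
/// so each returned range can be served by a single request.
pub fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // starting a new request.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

A larger `max_gap` trades extra bytes read for fewer round trips, which is usually the better deal when every request goes to remote storage.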


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

