alamb commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3025234584
@XiangpengHao has perhaps implemented code that could be useful here: https://github.com/XiangpengHao/liquid-cache/issues/227#issuecomment-3019766010. Basically, it shows how to implement Parquet metadata caching with just the external APIs. It would be great to make this easier.

> [@JigaoLuo](https://github.com/JigaoLuo) has a great write-up of why we need to cache metadata: [#227 (comment)](https://github.com/XiangpengHao/liquid-cache/issues/227#issuecomment-3008600193)
>
> # Metadata cache update
>
> I wrote a metadata cache implementation a while ago, but have procrastinated forever on actually polishing it.
>
> Maybe perfect is the enemy of good, so I'll just show whatever I already have: https://github.com/XiangpengHao/parquet-study. Copying some of the README here:
>
> So you want to cache Parquet metadata in DataFusion (so that each file's Parquet metadata is read/decoded once and only once)?
>
> It's not easy (a blog post is coming soon), but not impossible.
>
> ## Usage
>
> Copy the `src/metedata_cache.rs` to your project, and use it like below.
> ### Option 1
>
> ```rust
> use datafusion::prelude::*;
> use crate::metadata_cache::RegisterParquetWithMetaCache;
>
> let ctx = SessionContext::new();
>
> // Instead of:
> // ctx.register_parquet("table", "file.parquet", ParquetReadOptions::default()).await?;
> ctx.register_parquet_with_meta_cache(
>     "table",
>     "path/to/file.parquet",
>     ParquetReadOptions::default(),
> ).await?;
> ```
>
> ### Option 2
>
> If you are a low-level listing-table user:
>
> ```rust
> use crate::metadata_cache::{ParquetFormatMetadataCacheFactory, ToListingOptionsWithMetaCache};
>
> let parquet_options = ParquetReadOptions::default();
> let listing_options = parquet_options
>     .to_listing_options_with_meta_cache(&ctx.copied_config(), ctx.copied_table_options());
>
> ctx.register_listing_table(
>     "table",
>     "path/to/file.parquet",
>     listing_options,
>     parquet_options.schema.map(|s| Arc::new(s.to_owned())),
>     None,
> ).await?;
> ```
>
> ## More writings
>
> There are basically three places where we read Parquet metadata:
>
> 1. Inferring the schema
> 2. Inferring statistics
> 3. Opening Parquet files
>
> Reading metadata has two costs:
>
> 1. IO cost to read the data. Each metadata read can cost up to 2 network requests: the first loads the Parquet footer, which tells us the metadata size; the second loads the actual metadata. In most cases we only need 1 request, e.g., we optimistically fetch the last 4 MB of the Parquet file, which likely already covers all the metadata.
> 2. CPU cost to decode the metadata. Metadata is not the fastest thing to decode; it's usually not a big problem, but some people complain about it.
>
> (Since I have written so many things here, maybe I should just finish that blog post?)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
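As a footnote to the quoted write-up: its core idea ("each file's metadata is read/decoded once and only once", shared across schema inference, stats inference, and file opening) can be sketched with std-only Rust. This is a minimal sketch under stated assumptions, not the linked implementation: `FileMetadata`, `MetadataCache`, and `get_or_decode` are hypothetical stand-ins for parquet's `ParquetMetaData` and the real async fetch/decode path.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parquet's decoded `ParquetMetaData`.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// A process-wide metadata cache keyed by file path. The expensive
/// fetch-and-decode closure runs at most once per file; later lookups
/// reuse the already-decoded, shared value.
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
    decodes: Mutex<usize>, // instrumentation: how many real decodes happened
}

impl MetadataCache {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
            decodes: Mutex::new(0),
        }
    }

    /// Return the cached metadata for `path`, running `decode` only on
    /// the first access to that path.
    pub fn get_or_decode<F>(&self, path: &str, decode: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(path.to_string())
            .or_insert_with(|| {
                *self.decodes.lock().unwrap() += 1;
                Arc::new(decode())
            })
            .clone()
    }

    pub fn decode_count(&self) -> usize {
        *self.decodes.lock().unwrap()
    }
}

fn main() {
    let cache = MetadataCache::new();
    // Simulate the three metadata readers from the write-up:
    // infer schema, infer stats, open the Parquet file.
    for _ in 0..3 {
        let md = cache.get_or_decode("path/to/file.parquet", || FileMetadata { num_rows: 1_000 });
        assert_eq!(md.num_rows, 1_000);
    }
    // All three readers shared a single decode.
    assert_eq!(cache.decode_count(), 1);
    println!("decodes = {}", cache.decode_count());
}
```

A real cache would likely key on more than the path (e.g. file size or etag) so that a rewritten file invalidates its stale entry, and would bound memory with an eviction policy.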