alamb commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3025234584
@XiangpengHao has perhaps implemented code that could be useful here: https://github.com/XiangpengHao/liquid-cache/issues/227#issuecomment-3019766010. Basically, it shows how to implement Parquet metadata caching with just the external APIs. It would be great to make this easier.

> [@JigaoLuo](https://github.com/JigaoLuo) has a great write-up of why we need to cache metadata: [#227 (comment)](https://github.com/XiangpengHao/liquid-cache/issues/227#issuecomment-3008600193)
>
> # Metadata cache update
>
> I wrote a metadata cache implementation a while ago, but have procrastinated forever on actually polishing it.
>
> Maybe perfect is the enemy of good, so I'll just show whatever I already have: https://github.com/XiangpengHao/parquet-study. Copying some of the README here:
>
> So you want to cache Parquet metadata in DataFusion (so that each file's Parquet metadata is read/decoded once and only once)?
>
> It's not easy (a blog post is coming soon), but not impossible.
>
> ## Usage
>
> Copy the `src/metedata_cache.rs` to your project, and use it like below.
> ### Option 1
>
> ```rust
> use datafusion::prelude::*;
> use crate::metadata_cache::RegisterParquetWithMetaCache;
>
> let ctx = SessionContext::new();
>
> // Instead of:
> // ctx.register_parquet("table", "file.parquet", ParquetReadOptions::default()).await?;
> ctx.register_parquet_with_meta_cache(
>     "table",
>     "path/to/file.parquet",
>     ParquetReadOptions::default(),
> ).await?;
> ```
>
> ### Option 2
>
> If you are a low-level listing-table user:
>
> ```rust
> use crate::metadata_cache::{ParquetFormatMetadataCacheFactory, ToListingOptionsWithMetaCache};
>
> let parquet_options = ParquetReadOptions::default();
> let listing_options = parquet_options
>     .to_listing_options_with_meta_cache(&ctx.copied_config(), ctx.copied_table_options());
>
> ctx.register_listing_table(
>     "table",
>     "path/to/file.parquet",
>     listing_options,
>     parquet_options.schema.map(|s| Arc::new(s.to_owned())),
>     None,
> ).await?;
> ```
>
> ## More writings
>
> There are basically three places where we read Parquet metadata:
>
> 1. Inferring the schema
> 2. Inferring statistics
> 3. Opening Parquet files
>
> Reading metadata has two costs:
>
> 1. IO cost to read the data. Each metadata read can cost up to 2 network requests: the first loads the Parquet footer, which tells us the metadata size; the second loads the actual metadata. In most cases we only need 1 request, e.g., we optimistically fetch the last 4 MB of the Parquet file, which likely already covers all the metadata.
> 2. CPU cost to decode the metadata. Metadata is not the fastest thing to decode; it's usually not a big problem, but some people complain about it.
>
> (Since I have written so many things here, maybe I should just finish that blog post?)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
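As a footnote to the quoted write-up: its core idea ("each file's metadata is read/decoded once and only once", shared across schema inference, stats inference, and file opening) can be sketched with std-only Rust. This is a minimal sketch under stated assumptions, not the linked implementation: `FileMetadata`, `MetadataCache`, and `get_or_decode` are hypothetical stand-ins for parquet's `ParquetMetaData` and the real async fetch/decode path.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parquet's decoded `ParquetMetaData`.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// A process-wide metadata cache keyed by file path. The expensive
/// fetch-and-decode closure runs at most once per file; later lookups
/// reuse the already-decoded, shared value.
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
    decodes: Mutex<usize>, // instrumentation: how many real decodes happened
}

impl MetadataCache {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
            decodes: Mutex::new(0),
        }
    }

    /// Return the cached metadata for `path`, running `decode` only on
    /// the first access to that path.
    pub fn get_or_decode<F>(&self, path: &str, decode: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(path.to_string())
            .or_insert_with(|| {
                *self.decodes.lock().unwrap() += 1;
                Arc::new(decode())
            })
            .clone()
    }

    pub fn decode_count(&self) -> usize {
        *self.decodes.lock().unwrap()
    }
}

fn main() {
    let cache = MetadataCache::new();
    // Simulate the three metadata readers from the write-up:
    // infer schema, infer stats, open the Parquet file.
    for _ in 0..3 {
        let md = cache.get_or_decode("path/to/file.parquet", || FileMetadata { num_rows: 1_000 });
        assert_eq!(md.num_rows, 1_000);
    }
    // All three readers shared a single decode.
    assert_eq!(cache.decode_count(), 1);
    println!("decodes = {}", cache.decode_count());
}
```

A real cache would likely key on more than the path (e.g. file size or etag) so that a rewritten file invalidates its stale entry, and would bound memory with an eviction policy.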