I would actually prefer that we don't rely too much on the analytics accelerator and rather focus on improving the native implementation.
I'm not opposed to the accelerator, but it has a lot of hidden behaviors with tradeoffs in requests and memory usage that aren't necessarily apparent. A solution like this one, which works across multiple implementations, is generally a good improvement. I am interested to see how big the performance difference is, though.

-Dan

On Thu, Jun 25, 2026 at 4:08 AM Steve Loughran <[email protected]> wrote:

> commented on the PR.
>
> you should be benchmarking against the aws accelerator as it is likely to
> show less dramatic speedups, and be more honest in the process.
>
> If you want to do some serious measurement of the cost of s3
> head/get requests in benchmarks,
>
> 1. turn on s3 bucket logging to collect logs for requests
> 2. set the user agent on your test processes to be unique
> 3. grab the logs and count the requests after
>
> tool to take the aws logs, convert to avro record, after which you can
> pull into spark
> https://github.com/apache/hadoop-cloudstore/blob/main/src/site/markdown/auditlogs.md
>
> doing that as a before/after of any change assesses the real savings of
> the work, independent of execution time.
>
>
> On Tue, 23 Jun 2026 at 23:08, Varun Lakhyani <[email protected]>
> wrote:
>
>> I have a PR [1] which doesn't affect current encryption, metrics, or
>> anything else.
>> It just fetches the whole file as a byte array and lets Parquet (or any
>> other format) read from memory rather than from cloud storage; that is
>> the only change here.
>>
>> Also, I will benchmark with the S3 accelerator enabled and will try to
>> understand it further.
>> That said, for small files the approaches are complementary - the
>> accelerator does predictive prefetching, which is valuable for large files,
>> but for small files below a threshold a single whole-file fetch
>> eliminates all prediction overhead entirely with bounded and predictable
>> memory usage (capped at the threshold).
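
A minimal sketch of the below-threshold whole-file fetch described above, written in Python for brevity and not taken from the PR. `CountingStore`, `EagerReader`, and `read_like_parquet` are hypothetical names invented for this illustration, the footer size of 100 bytes is arbitrary, and the file length is assumed to be known before the first read.

```python
class CountingStore:
    """In-memory stand-in for an object store that counts ranged GETs."""

    def __init__(self, data: bytes):
        self._data = data
        self.get_count = 0

    def read_range(self, offset: int, length: int) -> bytes:
        self.get_count += 1
        return self._data[offset:offset + length]


class EagerReader:
    """Serve reads from memory when the file is at or below the threshold."""

    def __init__(self, store, file_size: int, threshold: int):
        self._store = store
        self._size = file_size
        self._eager = file_size <= threshold
        self._buffer = None  # filled on the first read of a small file

    def read_range(self, offset: int, length: int) -> bytes:
        if not self._eager:
            # Large file: pass every read through to the store unchanged.
            return self._store.read_range(offset, length)
        if self._buffer is None:
            # Small file: one GET for the whole object, at most `threshold` bytes.
            self._buffer = self._store.read_range(0, self._size)
        return self._buffer[offset:offset + length]


def read_like_parquet(reader, file_size: int, footer_len: int = 100) -> None:
    """Issue the three reads from the thread: tail, footer, row group data."""
    reader.read_range(file_size - 8, 8)
    reader.read_range(file_size - 8 - footer_len, footer_len)
    reader.read_range(0, file_size - 8 - footer_len)
```

With a 4096-byte threshold, running `read_like_parquet` against a 1024-byte object costs one GET through `EagerReader`, while an 8192-byte object still costs three, so memory held per file never exceeds the threshold.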
>>
>> The implementation is not tied to Parquet or S3 - EagerInputFile wraps
>> any InputFile and works with any format (I haven't tested this, but it
>> should work fine).
>> I benchmarked only Parquet + S3 (60-65% improvement), so I am not
>> completely sure, but the same benefit should be present for ADLS and GCS.
>>
>> [1] https://github.com/apache/iceberg/pull/16729
>>
>> On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I would like to discuss an optimization for Iceberg's Parquet read path,
>>> specifically around reducing S3 GET requests for small file workloads -
>>> Root Manifest, Datafiles, and small file compaction.
>>>
>>> *Problem*
>>> The current Iceberg flow for Spark readers uses parquet-mr. For each
>>> FileScanTask, it issues at least 3 GET requests:
>>>
>>>    1. Footer size discovery - 1 GET reads the last 8 bytes of the
>>>    Parquet file to find the actual footer size (this.currentIterator =
>>>    open(currentTask) in BaseReader.next)
>>>    2. Footer fetch - 1 GET reads the footer (this.currentIterator =
>>>    open(currentTask) in BaseReader.next)
>>>    3. Row group fetch - 1 GET per row group to fetch actual data
>>>    (this.current = currentIterator.next() in BaseReader.next)
>>>
>>>
>>> *Background* - arrow-rs (Parquet Rust implementation)
>>> arrow-rs already addresses the first two calls via
>>> `with_footer_size_hint`. It fetches a hint-sized range from the end of
>>> the file, which contains the actual footer size - if the footer already
>>> falls within that fetched range, 1 GET is eliminated. If not, a second
>>> GET fetches the footer. DataFusion builds on this today.
>>> For our use case, we can go further: since the files are small, instead
>>> of a hint we can fetch the whole file at once in a single GET - there is
>>> no memory concern in parquet-mr - collapsing all 3 calls into one.
>>> As the number of files grows, footer request time starts dominating over
>>> actual data request time - clearly visible in benchmarks below.
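
A minimal sketch of the footer size hint idea described above, assuming only the documented Parquet trailer layout (a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`). It illustrates the technique and is not arrow-rs or parquet-mr code; `read_footer` and the `read_range(offset, length)` callable, which stands for one ranged GET, are hypothetical names.

```python
import struct

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer(read_range, file_size: int, hint: int):
    """Return (footer_bytes, number_of_gets) using a tail-size hint."""
    if file_size < TAIL_LEN:
        raise ValueError("file too small to be Parquet")
    # Clamp the hint so the first GET covers at least the 8-byte trailer
    # and never starts before the beginning of the file.
    hint = max(TAIL_LEN, min(hint, file_size))
    tail = read_range(file_size - hint, hint)
    if tail[-4:] != MAGIC:
        raise ValueError("bad trailing magic")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    if footer_len + TAIL_LEN > file_size:
        raise ValueError("footer length exceeds file size")
    if footer_len + TAIL_LEN <= hint:
        # Footer already inside the fetched tail: one GET in total.
        return tail[len(tail) - TAIL_LEN - footer_len:len(tail) - TAIL_LEN], 1
    # Hint too small: fetch only the footer bytes not yet in memory.
    footer_start = file_size - TAIL_LEN - footer_len
    missing = read_range(footer_start, file_size - hint - footer_start)
    return missing + tail[:-TAIL_LEN], 2
```

Passing a hint equal to the file size turns this into the whole-file fetch proposed for small files: the first GET returns everything, so the footer always costs one request.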
>>>
>>> *Two Approaches*
>>>
>>>    1. Implement directly in Iceberg - I have a high-level PR for this
>>>    implementation - a complete workaround inside the Iceberg codebase. (
>>>    https://github.com/apache/iceberg/pull/16729)
>>>    2. Fix upstream in parquet-mr - The architecturally correct path:
>>>    add this functionality to parquet-mr itself and use it entirely,
>>>    mirroring what the Rust implementation does natively.
>>>
>>>
>>> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
>>> iterations)
>>> Combining S3 GET requests alone gives 60-65% improvement, with further
>>> gains possible by parallelising them.
>>>
>>> [image: image.png]
>>>
>>>
>>> As focus shifts towards Root Manifest, Datafiles in Parquet, and
>>> multiple small file requirements, a dedicated effort here seems worth
>>> pursuing.
>>> I would be happy to hear any thoughts on this. Points to discuss are
>>> which approach seems more convincing - the Iceberg implementation or the
>>> upstream parquet-mr implementation - and further thoughts on the gaps
>>> between parquet-mr and arrow-rs, specifically around fetching the footer.
>>>
>>> [1] PR for high level implementation -
>>> https://github.com/apache/iceberg/pull/16729
>>> --
>>> Lakhyani Varun
>>> Indian Institute of Technology Roorkee
>>> Contact: +91 96246 46174
>>>
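
The before/after request counting suggested earlier in the thread (bucket logging plus a unique user agent per run) could be sketched roughly as follows. This is not the cloudstore tool linked above. The regular expression encodes an assumed layout of S3 server access log records and should be checked against the AWS documentation before use; `count_requests` and the user agent marker are hypothetical.

```python
import re
from collections import Counter

# Assumed record layout: space-delimited fields, timestamp in brackets,
# request-URI, referer and user agent in double quotes.
LOG_LINE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ '  # owner, bucket, time, IP, requester, request id
    r'(?P<operation>\S+) \S+ '           # operation (e.g. REST.GET.OBJECT), key
    r'"[^"]*" '                          # request-URI
    r'(?:\S+ ){6}'                       # status, error, bytes sent, size, two timings
    r'"[^"]*" "(?P<agent>[^"]*)"'        # referer, user agent
)


def count_requests(lines, agent_marker: str) -> Counter:
    """Count operations for records whose user agent contains agent_marker."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_marker in match.group("agent"):
            counts[match.group("operation")] += 1
    return counts
```

Running this once over the logs of a baseline run and once over the logs of a run with the change, each with its own user agent marker, gives the GET and HEAD totals to compare independent of execution time.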
