Re: [DISCUSS] Combine 3 GET calls for parquet reads - Root Manifest, Datafiles and compaction of small files

Steve Loughran Mon, 29 Jun 2026 04:40:13 -0700

On Sun, 28 Jun 2026 at 12:05, Varun Lakhyani <[email protected]>
wrote:


> My overall conclusion: under ideal conditions (benchmark run on EC2 in the
> same region as the S3 bucket), EagerInputFile gives around a 40-42%
> improvement over default Iceberg and around an 8-14% improvement over the AWS
> analytics accelerator (almost similar time).
>
> From the numbers, I agree it's not a great improvement compared to the
> analytics accelerator, but it still saves a HEAD call per file and enables
> this optimization for other object stores and implementations, keeping it
> more visible and controllable, as discussed in previous conversations.
> I would be happy to hear everyone's thoughts on this.
>

These are good numbers; the graph really shows the differences.

What you have shown is that the speedup is there against the AAL library,
just not as dramatic. The remaining gap is probably due to the cost of the
HEAD. FWIW, anything which can be done to cache the file length and preserve
it, rather than ask for it again, saves time and money. What's key there is:


   1. iceberg saves correct file size in manifests
   2. passes it down when opening files (no repeat probes for file length)
   3. passes it down to file format libraries (they need apis to support
   this)

#1 is key. https://github.com/apache/iceberg/pull/15470 has just fixed
delete file sizes, but https://github.com/apache/iceberg/pull/16910 is
still open. If that file length ever gets stale and those HEAD requests get
skipped "because the file length is known", things will fail.
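
As a rough illustration of points 2 and 3, here is a minimal Python sketch of how a length carried down from the manifest lets the open path skip the HEAD probe, and how a stale length surfaces as a failure. This is not Iceberg or AAL code: `FakeStore`, `read_tail` and the path are hypothetical names, and the store is an in-memory stand-in that only counts requests.

```python
class FakeStore:
    """In-memory stand-in for an object store that counts requests (hypothetical)."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.head_calls = 0
        self.get_calls = 0

    def head(self, path):
        # Equivalent of a HEAD request: returns the object length.
        self.head_calls += 1
        return len(self.objects[path])

    def get_range(self, path, start, end):
        # Equivalent of a ranged GET over bytes [start, end).
        self.get_calls += 1
        data = self.objects[path]
        if end > len(data):
            raise EOFError(f"range {start}-{end} is past the end of {path}")
        return data[start:end]


def read_tail(store, path, tail_bytes, known_length=None):
    """Read the last tail_bytes of a file, probing for its length only if needed."""
    length = known_length if known_length is not None else store.head(path)
    return store.get_range(path, length - tail_bytes, length)


store = FakeStore({"data/f1.parquet": b"x" * 100 + b"TAIL"})

# No length supplied: one HEAD plus one GET.
assert read_tail(store, "data/f1.parquet", 4) == b"TAIL"
# Length taken from the manifest: the HEAD is skipped, only a GET is issued.
assert read_tail(store, "data/f1.parquet", 4, known_length=104) == b"TAIL"
print("HEADs:", store.head_calls, "GETs:", store.get_calls)

# Stale length in the manifest: with no HEAD to catch it, the read itself fails.
try:
    read_tail(store, "data/f1.parquet", 4, known_length=200)
except EOFError as err:
    print("stale length:", err)
```

The stale case here fails loudly because the recorded length is too large; a recorded length that is too small would read the wrong bytes instead, which is why the manifest value has to be correct before the probe can be dropped.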

To conclude then

Production systems right now

   - Turn on AAL in production (default in s3a for 3.4.3+; needs a property
   fix in the iceberg S3 client)

Code reviews/merges needed for future releases

   1. Varun's PR for small files everywhere
   2. Danny's PR to wire up AAL vector reads for large parquet files on s3
   3. everything related to caching of filesize in manifests


Longer term improvements

   1. passing length in intermediate results to avoid recalculation, with
   calculation off the critical path where possible.
   2. make sure that length is passed all the way from manifests/workers to
   file opening operations so they can save on the HEAD. AAL has to support
   that too.





> PS
> Just for reference, the per-file Parquet size created in S3 for each
> numFiles condition (keeping total rows constant):
>
>    - 250 files - 594.4 KB/file
>    - 500 files - 301.1 KB/file
>    - 1000 files - 153.2 KB/file
>    - 2000 files - 76.4 KB/file
>
>
>
> On Sun, Jun 28, 2026 at 4:16 PM Varun Lakhyani <[email protected]>
> wrote:
>
>> I ran benchmarks against AAL and also collected S3 bucket logs to get the
>> actual number of requests, as suggested.
>> I tried to follow the best procedure and conditions I could, as described
>> below. Please let me know if anything is not appropriate.
>>
>> *Benchmark Setup (JMH)*
>> Machine: AWS EC2 (same region as the S3 bucket to minimize network
>> latency)
>>
>> Property      | Value
>> --------------+--------------------
>> Instance type | r5.4xlarge
>> vCPUs         | 16
>> Memory        | 128 GB
>> Network       | Up to 10 Gbps
>> Storage       | 50 GB gp3 EBS
>> AMI           | Amazon Linux 2023
>> Region        | ap-south-1 (Mumbai)
>>
>> Benchmark: IcebergDataCompactionBenchmark extended with aalEnabled and
>> eagerThreshold params for this comparison
>>
>>    - Warmup iterations: 5
>>    - Measurement iterations: 10
>>    - Total rows: 20,000,000 (20 million)
>>
>> *Results*
>>
>> (Improvement columns are for EagerInputFile, relative to Default and to AAL.)
>>
>> Number of Files | Default (s) | AAL = true (s) | EagerInputFile = true (s) | Improvement vs Default | Improvement vs AAL
>> ----------------+-------------+----------------+---------------------------+------------------------+-------------------
>> 250             | 45.104      | 29.671         | 27.311                    | 39.45%                 | 7.95%
>> 500             | 77.788      | 54.030         | 46.553                    | 40.15%                 | 13.84%
>> 1000            | 163.238     | 107.138        | 95.429                    | 41.54%                 | 10.93%
>> 2000            | 312.252     | 195.158        | 179.351                   | 42.56%                 | 8.10%
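
For anyone wanting to re-derive the improvement columns, this is a small Python sketch of the arithmetic, using the timings from the results table above. It assumes "improvement" means the relative reduction in run time, (baseline - candidate) / baseline, which reproduces the reported percentages; the function name is only illustrative.

```python
def improvement_pct(baseline_s, candidate_s):
    """Relative reduction in run time of candidate versus baseline, in percent."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


# (number of files, Default (s), AAL = true (s), EagerInputFile = true (s)),
# copied from the results table in the quoted message.
RESULTS = [
    (250, 45.104, 29.671, 27.311),
    (500, 77.788, 54.030, 46.553),
    (1000, 163.238, 107.138, 95.429),
    (2000, 312.252, 195.158, 179.351),
]

for files, default_s, aal_s, eager_s in RESULTS:
    vs_default = improvement_pct(default_s, eager_s)
    vs_aal = improvement_pct(aal_s, eager_s)
    print(f"{files:>5} files: {vs_default:.2f}% vs Default, {vs_aal:.2f}% vs AAL")
```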
>>
>>
>> *Graphical Comparison [1]* (I strongly recommend looking at this, as it
>> makes things clear visually)
>>
>

>
>> *S3 Request Analysis*
>> The number of GET/HEAD requests observed from the S3 bucket logs is
>> computed as:
>> Total Requests / (Total Iterations × Number of Files)
>>
>> Configuration         | GET/file | HEAD/file | Total Calls/file
>> ----------------------+----------+-----------+-----------------
>> Default               | 3        | 0         | 3
>> AAL = true            | 1        | 1         | 2
>> EagerInputFile = true | 1        | 0         | 1
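
The per-file normalisation can be sketched as below. The per-file figures are the ones reported in the table; the iteration count is an assumption (5 warmup + 10 measurement iterations all hitting S3, a single fork), so the totals printed are what the logs would need to show under that assumption, not numbers taken from the actual logs.

```python
def requests_per_file(total_requests, total_iterations, num_files):
    """Total Requests / (Total Iterations x Number of Files)."""
    return total_requests / (total_iterations * num_files)


def expected_total(per_file, total_iterations, num_files):
    """Inverse of requests_per_file: total requests implied by a per-file count."""
    return per_file * total_iterations * num_files


ITERATIONS = 15   # assumption: 5 warmup + 10 measurement iterations
NUM_FILES = 250   # one of the benchmarked file counts

# (GET/file, HEAD/file) as reported in the request analysis table.
PER_FILE = {
    "Default": (3, 0),
    "AAL = true": (1, 1),
    "EagerInputFile = true": (1, 0),
}

for config, (gets, heads) in PER_FILE.items():
    total = expected_total(gets + heads, ITERATIONS, NUM_FILES)
    per_file = requests_per_file(total, ITERATIONS, NUM_FILES)
    print(f"{config}: {total} requests in the logs -> {per_file:.0f} calls/file")
```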
>>
>>
>> Also, attaching all s3 raw logs and jmh outputs [2]
>>
>>
>> [1]
>> http://github.com/varun-lakhyani/iceberg-default-aal-eagerinputfile/blob/main/benchmark-results.png
>> [2]
>> https://github.com/varun-lakhyani/iceberg-default-aal-eagerinputfile/blob/main/README.md
>>
>> On Fri, Jun 26, 2026 at 12:35 AM Jones, Danny <[email protected]>
>> wrote:
>>
>>> I have been meaning to chime in on this thread, I’m part of the S3 team
>>> and caught up with a few folks who have better context than me on the
>>> analytics accelerator (AAL for S3). (I’m usually having fun with
>>> iceberg-rust day-to-day.)
>>>
>>>
>>>
>>> I think it’s great to see optimizations upstream, either to iceberg-java
>>> or to parquet-mr. One of the driving reasons behind AAL was to be able to
>>> deliver a lot of meaningful improvements across different analytics
>>> libraries (primarily S3FileIO and S3A), but ultimately I would second Dan’s
>>> point that it will be great to see these sorts of optimizations made
>>> accessible to all users of iceberg-java (and parquet-mr even!). In the
>>> meantime, users can opt-in to the accelerator for S3-based workloads.
>>>
>>> The changes proposed by Varun sound good. There are a few others we had
>>> in mind – Steve L mentioned integration with vectored IO APIs which would
>>> deliver read optimizations in the right layer without the IO stream needing
>>> to understand the data format.
>>>
>>>
>>>
>>> There are two things I’d recommend as further reading (though this is a
>>> bit beyond the 3 GET optimization that was the original purpose for this
>>> thread):
>>>
>>>
>>>
>>>    - This doc explored the optimizations made in AAL:
>>>    
>>> https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/
>>>    - This e-mail thread proposed making AAL generic, as a central way
>>>    to optimize streams across many Apache projects. There was interesting
>>>    discussion around pushing the optimizations instead into the iceberg or
>>>    parquet layers.
>>>    https://lists.apache.org/thread/cy6y5xf5gg8fr12pg64f77gxdrtv52fn
>>>
>>>
>>>
>>> Danny
>>>
>>>
>>>
>>> *From: *Daniel Weeks <[email protected]>
>>> *Reply to: *"[email protected]" <[email protected]>
>>> *Date: *Thursday, 25 June 2026 at 18:56
>>> *To: *"[email protected]" <[email protected]>
>>> *Subject: *RE: [EXTERNAL] [DISCUSS] Combine 3 GET calls for parquet
>>> reads - Root Manifest, Datafiles and compaction of small files
>>>
>>>
>>>
>>>
>>>
>>>
>>> I would actually prefer that we don't rely too much on the analytics
>>> accelerator and rather focus on improving the native implementation.
>>>
>>>
>>>
>>> I'm not opposed to the accelerator but there's a lot of hidden behaviors
>>> that have other tradeoffs in terms of requests and memory usage that aren't
>>> necessarily apparent.
>>>
>>>
>>>
>>> Something like this where you have a solution that works across multiple
>>> implementations is a generally good improvement.
>>>
>>>
>>>
>>> I am interested to see how big the performance difference is though.
>>>
>>>
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Thu, Jun 25, 2026 at 4:08 AM Steve Loughran <[email protected]>
>>> wrote:
>>>
>>> commented on the PR.
>>>
>>>
>>>
>>> you should be benchmarking against the aws accelerator as it is likely
>>> to show less dramatic speedups, and be more honest in the process.
>>>
>>>
>>>
>>> If you want to do some serious measurement of the cost of S3
>>> HEAD/GET requests in benchmarks:
>>>
>>>    1. turn on s3 bucket logging to collect logs for requests
>>>    2. set the user agent on your test processes to be unique
>>>    3. grab the logs and count the requests after
>>>
>>> There is a tool to take the AWS logs and convert them to Avro records,
>>> after which you can pull them into Spark:
>>> https://github.com/apache/hadoop-cloudstore/blob/main/src/site/markdown/auditlogs.md
>>>
>>>
>>>
>>> doing that as a before/after of any change assesses the real savings of
>>> the work, independent of execution time.
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 23 Jun 2026 at 23:08, Varun Lakhyani <[email protected]>
>>> wrote:
>>>
>>> I have a PR [1] which doesn't affect current encryption, metrics, or
>>> anything else.
>>> It just fetches the whole file as a byte array and lets Parquet (or any
>>> other format) read from memory rather than from cloud storage; that should
>>> be the only change here.
>>>
>>> Also, I will benchmark with the S3 accelerator enabled and will try to
>>> understand it further.
>>> That said, for small files the approaches are complementary - the
>>> accelerator does predictive prefetching which is valuable for large files,
>>> but for small files below a threshold a single whole-file fetch
>>> eliminates all prediction overhead entirely with bounded and predictable
>>> memory usage (capped at the threshold).
>>>
>>> The implementation is not tied to Parquet or S3: EagerInputFile wraps
>>> any InputFile and works with any format (I haven't tested this, but it
>>> should work fine).
>>> I have only benchmarked Parquet + S3 (60-65% improvement), so I am not
>>> fully sure, but the same benefit should be present for ADLS and GCS.
>>>
>>> [1] https://github.com/apache/iceberg/pull/16729
>>>
>>>
>>>
>>> On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]>
>>> wrote:
>>>
>>> Hello everyone,
>>>
>>> I would like to discuss an optimization for Iceberg's Parquet read path,
>>> specifically around reducing S3 GET requests for small file workloads -
>>> Root Manifest, Datafiles, and small file compaction.
>>>
>>> *Problem*
>>> The current Iceberg flow for Spark readers uses parquet-mr. For each
>>> FileScanTask, it issues 3 GET requests:
>>>
>>> 1.      Footer size discovery - 1 GET reads the last 8 bytes of the
>>> Parquet file to find the actual footer size (this.currentIterator =
>>> open(currentTask) in BaseReader.next)
>>>
>>> 2.      Footer fetch - 1 GET reads the footer (this.currentIterator =
>>> open(currentTask) in BaseReader.next)
>>>
>>> 3.      Row group fetch - 1 GET per row group to fetch actual data
>>> (this.current = currentIterator.next() in BaseReader.next)
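
The three reads can be sketched against the Parquet file layout, where the last 8 bytes are a 4-byte little-endian footer length followed by the `PAR1` magic. This is a simplified Python illustration, not parquet-mr code: `CountingStore` and `fake_parquet` are hypothetical, the footer is an opaque blob rather than Thrift metadata, and the file is assumed to hold a single row group occupying everything between the leading magic and the footer.

```python
import struct

MAGIC = b"PAR1"


class CountingStore:
    """In-memory object that counts ranged GETs (hypothetical stand-in for S3)."""

    def __init__(self, blob):
        self.blob = blob
        self.gets = 0

    def get_range(self, start, end):
        self.gets += 1
        return self.blob[start:end]


def fake_parquet(row_group, footer):
    # Layout: magic, row group bytes, footer, 4-byte LE footer length, magic.
    return MAGIC + row_group + footer + struct.pack("<I", len(footer)) + MAGIC


def read_with_three_gets(store, length):
    # GET 1: last 8 bytes, to discover the footer size.
    tail = store.get_range(length - 8, length)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file: bad trailing magic")
    footer_len = struct.unpack("<I", tail[:4])[0]
    footer_start = length - 8 - footer_len
    # GET 2: the footer itself.
    footer = store.get_range(footer_start, length - 8)
    # GET 3: the (single) row group; a real reader takes offsets from the footer.
    row_group = store.get_range(len(MAGIC), footer_start)
    return footer, row_group


blob = fake_parquet(b"rows" * 10, b"schema+offsets")
store = CountingStore(blob)
footer, row_group = read_with_three_gets(store, len(blob))
print("GETs issued:", store.gets)
```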
>>>
>>>
>>> *Background* - arrow-rs (parquet rust implementation)
>>>
>>> arrow-rs already addresses the first two calls via
>>> `with_footer_size_hint`. It fetches a hinted number of bytes from the end
>>> of the file, which includes the actual footer size. If the footer already
>>> falls within that fetched range, 1 GET is eliminated; if not, a second GET
>>> fetches the footer. DataFusion builds on this today.
>>> For our use case, we can go further: since the files are small, instead
>>> of a hint we can fetch the whole file at once in a single GET (no memory
>>> concern in parquet-mr), collapsing all 3 calls into 1.
>>> As the number of files grows, footer request time starts dominating over
>>> actual data request time, which is clearly visible in the benchmarks below.
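
The size-hint idea, and its whole-file extreme, can be sketched as follows. This is a Python illustration of the technique only, not the arrow-rs or EagerInputFile implementation: `CountingStore`, `fake_parquet` and `read_footer_with_hint` are hypothetical names, and the footer is an opaque blob.

```python
import struct

MAGIC = b"PAR1"


class CountingStore:
    """In-memory object that counts ranged GETs (hypothetical stand-in for S3)."""

    def __init__(self, blob):
        self.blob = blob
        self.gets = 0

    def get_range(self, start, end):
        self.gets += 1
        return self.blob[start:end]


def fake_parquet(row_group, footer):
    # Layout: magic, row group bytes, footer, 4-byte LE footer length, magic.
    return MAGIC + row_group + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer_with_hint(store, length, hint):
    """Fetch the last `hint` bytes; issue a second GET only if the footer is larger."""
    if hint < 8:
        raise ValueError("hint must cover the 8-byte tail")
    start = max(0, length - hint)
    tail = store.get_range(start, length)
    footer_len = struct.unpack("<I", tail[-8:-4])[0]
    footer_start = length - 8 - footer_len
    if footer_start >= start:
        # Footer already inside the fetched range: slice it out, no second GET.
        return tail[footer_start - start:-8]
    return store.get_range(footer_start, length - 8)


blob = fake_parquet(b"rows" * 10, b"schema+offsets")

hit = CountingStore(blob)      # hint large enough to contain the footer
footer_hit = read_footer_with_hint(hit, len(blob), hint=64)

miss = CountingStore(blob)     # hint covers only the 8-byte tail
footer_miss = read_footer_with_hint(miss, len(blob), hint=8)

whole = CountingStore(blob)    # hint == file length: the whole file in one GET
footer_whole = read_footer_with_hint(whole, len(blob), hint=len(blob))

print("GETs:", hit.gets, miss.gets, whole.gets)
```

With the hint set to the file length, the row groups arrive in the same response as the footer, which is the small-file case described above.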
>>>
>>> *Two Approaches*
>>>
>>> 1.      Implement directly in Iceberg - I have a high-level PR for this
>>> implementation - complete workaround in Iceberg codebase. (
>>> https://github.com/apache/iceberg/pull/16729)
>>>
>>> 2.      Fix upstream in parquet-mr - The architecturally correct path:
>>> add this functionality to parquet-mr itself and use it entirely, mirroring
>>> what the Rust implementation does natively.
>>>
>>>
>>>
>>> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
>>> iterations)
>>>
>>> Combining S3 GET requests alone gives 60-65% improvement, with further
>>> gains possible by parallelising them.
>>>
>>>
>>>
>>>
>>>
>>> As focus shifts towards Root Manifest, Datafiles in Parquet, and
>>> multiple small file requirements, a dedicated effort here seems worth
>>> pursuing.
>>> I would be happy to hear any thoughts on this. Points to discuss: which
>>> approach seems more convincing (the Iceberg implementation or the upstream
>>> parquet-mr implementation), and the gaps between parquet-mr and arrow-rs,
>>> specifically around fetching the footer.
>>>
>>> [1] PR for high level implementation -
>>> https://github.com/apache/iceberg/pull/16729
>>>
>>> --
>>>
>>> --
>>>
>>> Lakhyani Varun
>>>
>>> Indian Institute of Technology Roorkee
>>>
>>> Contact: +91 96246 46174
>>>
>>>
>>>
>>>
