Re: [DISCUSS] Combine 3 GET calls for parquet reads - Root Manifest, Datafiles and compaction of small files

Daniel Weeks Thu, 25 Jun 2026 10:55:48 -0700

I would actually prefer that we don't rely too much on the analytics
accelerator and rather focus on improving the native implementation.


I'm not opposed to the accelerator but there's a lot of hidden behaviors
that have other tradeoffs in terms of requests and memory usage that aren't
necessarily apparent.

Something like this where you have a solution that works across multiple
implementations is a generally good improvement.

I am interested to see how big the performance difference is though.

-Dan

On Thu, Jun 25, 2026 at 4:08 AM Steve Loughran <[email protected]> wrote:

> commented on the PR.
>
> you should be benchmarking against the aws accelerator as it is likely to
> show less dramatic speedups, and be more honest in the process.
>
> If you want to do some serious measurement of the cost of S3
> HEAD/GET requests in benchmarks,
>
>    1. turn on s3 bucket logging to collect logs for requests
>    2. set the user agent on your test processes to be unique
>    3. grab the logs and count the requests after
>
> There is a tool that takes the AWS logs and converts them to Avro
> records, which you can then pull into Spark:
> https://github.com/apache/hadoop-cloudstore/blob/main/src/site/markdown/auditlogs.md
>
> doing that as a before/after of any change assesses the real savings of
> the work, independent of execution time.
>
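As an illustration of the three log-counting steps above, here is a minimal Java sketch that counts object GET and HEAD requests for one test run. It assumes the standard S3 server access log layout, in which the operation appears as a token such as REST.GET.OBJECT and the user agent appears as text on the same line. The class name S3LogRequestCounter and the marker string are hypothetical, and this is not the tool linked above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts GET/HEAD object requests issued by one tagged test run. */
final class S3LogRequestCounter {
  // Assumed operation token in S3 server access logs, e.g. REST.GET.OBJECT.
  private static final Pattern OPERATION =
      Pattern.compile("\\bREST\\.(GET|HEAD)\\.OBJECT\\b");

  /**
   * Returns a map from HTTP verb ("GET" or "HEAD") to the number of matching
   * log lines that also contain the unique user agent marker.
   */
  static Map<String, Integer> count(List<String> logLines, String userAgentMarker) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : logLines) {
      // Skip requests made by other clients of the same bucket.
      if (!line.contains(userAgentMarker)) {
        continue;
      }
      Matcher m = OPERATION.matcher(line);
      if (m.find()) {
        counts.merge(m.group(1), 1, Integer::sum);
      }
    }
    return counts;
  }
}
```

Running count(...) over the logs from a "before" run and an "after" run, each with its own user agent marker, gives request totals that can be compared independently of wall-clock time.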
>
> On Tue, 23 Jun 2026 at 23:08, Varun Lakhyani <[email protected]>
> wrote:
>
>> I have a PR [1] which doesn't affect current encryption, metrics, or
>> anything else.
>> It just fetches the whole file as a byte array and lets Parquet (or any
>> other format) read from memory rather than from cloud storage; that
>> should be the only change here.
>>
>> Also, I will benchmark with the S3 accelerator enabled and will try to
>> understand it further.
>> That said, for small files the approaches are complementary - the
>> accelerator does predictive prefetching which is valuable for large files,
>> but for small files below a threshold a single whole-file fetch
>> eliminates all prediction overhead entirely with bounded and predictable
>> memory usage (capped at the threshold).
>>
>> The implementation is not tied to Parquet or S3 - EagerInputFile wraps
>> any InputFile and should work with any format (I haven't tested other
>> formats, but it should work fine).
>> I have only benchmarked Parquet + S3 (60-65% improvement), so I can't be
>> certain, but the same benefit should be present for ADLS and GCS.
>>
>> [1] https://github.com/apache/iceberg/pull/16729
>>
>> On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I would like to discuss an optimization for Iceberg's Parquet read path,
>>> specifically around reducing S3 GET requests for small file workloads -
>>> Root Manifest, Datafiles, and small file compaction.
>>>
>>> *Problem*
>>> The current Iceberg flow for Spark readers uses parquet-mr. For each
>>> FileScanTask, it issues 3 GET requests:
>>>
>>>    1. Footer size discovery - 1 GET reads the last 8 bytes of the
>>>    Parquet file to find the actual footer size (this.currentIterator =
>>>    open(currentTask) in BaseReader.next)
>>>    2. Footer fetch - 1 GET reads the footer (this.currentIterator =
>>>    open(currentTask) in BaseReader.next)
>>>    3. Row group fetch - 1 GET per row group to fetch actual data
>>>    (this.current = currentIterator.next() in BaseReader.next)
>>>
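For reference, the first step in the list above relies on the Parquet file tail: a 4-byte little-endian footer length followed by the 4-byte magic "PAR1". A minimal Java sketch of decoding that tail follows, assuming a file with a plain (unencrypted) footer. The class and method names are hypothetical and this is not parquet-mr's own code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes the 8-byte tail of a Parquet file. */
final class ParquetTail {
  static final int TAIL_SIZE = 8;
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  /** Returns the footer length stored in the last 8 bytes of the file. */
  static int footerLength(byte[] tail) {
    if (tail.length != TAIL_SIZE) {
      throw new IllegalArgumentException("expected exactly 8 tail bytes");
    }
    // Bytes 4..7 must be the magic marker.
    for (int i = 0; i < MAGIC.length; i++) {
      if (tail[4 + i] != MAGIC[i]) {
        throw new IllegalArgumentException("bad magic: not a plain Parquet footer");
      }
    }
    // Bytes 0..3 hold the footer length as a little-endian int.
    return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  /** Offset of the first footer byte, given the total file length. */
  static long footerStart(long fileLength, int footerLength) {
    return fileLength - TAIL_SIZE - footerLength;
  }
}
```

The second GET in the list is then a range read of footerLength bytes starting at footerStart, which is why the reader cannot issue it before the first GET returns.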
>>>
>>> *Background* - arrow-rs (Rust Parquet implementation)
>>> arrow-rs already addresses the first two calls via
>>> `with_footer_size_hint`. It fetches a hint-sized range from the end of the
>>> file, which includes the actual footer size - if the footer already falls
>>> within that fetched range, 1 GET is eliminated. If not, a second GET
>>> fetches the footer. DataFusion builds on this today.
>>> For our use case, we can go further: since the files are small, instead
>>> of a hint we can fetch the whole file at once in a single GET - with no
>>> memory concern in parquet-mr - replacing all 3 calls with one.
>>> As the number of files grows, footer request time starts dominating over
>>> actual data request time - clearly visible in benchmarks below.
>>>
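The size-hint idea described above can be sketched in Java as follows. This is an illustrative sketch only, not the arrow-rs or parquet-mr implementation: RangeReader is a hypothetical abstraction in which each read call stands for one GET, and the magic bytes are not validated here to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical sketch of a footer read that uses a tail-size hint. */
final class FooterHintReader {
  /** Hypothetical range-read abstraction; each call represents one GET. */
  interface RangeReader {
    byte[] read(long offset, int length);
  }

  /** Returns the raw footer bytes, using one GET when the hint covers the footer. */
  static byte[] readFooter(RangeReader reader, long fileLength, int hintBytes) {
    if (fileLength < 8) {
      throw new IllegalArgumentException("file too small to hold a Parquet tail");
    }
    // Always fetch at least the 8-byte tail, and never more than the file.
    int tailLen = (int) Math.min(fileLength, Math.max(hintBytes, 8));
    byte[] tail = reader.read(fileLength - tailLen, tailLen); // first GET

    // The footer length sits in the 4 bytes before the 4-byte magic.
    int footerLen =
        ByteBuffer.wrap(tail, tailLen - 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    long needed = (long) footerLen + 8;
    if (footerLen < 0 || needed > fileLength) {
      throw new IllegalArgumentException("corrupt footer length: " + footerLen);
    }
    if (needed <= tailLen) {
      // The hint already covered the footer: slice it out, no second GET.
      return Arrays.copyOfRange(tail, tailLen - (int) needed, tailLen - 8);
    }
    // The hint was too small: fall back to a second GET for the footer.
    return reader.read(fileLength - needed, footerLen);
  }
}
```

Passing a hint equal to the file length turns this into the whole-file fetch: the footer, and every row group, is then already in the bytes returned by the first GET.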
>>> *Two Approaches*
>>>
>>>    1. Implement directly in Iceberg - I have a high-level PR for this
>>>    implementation - complete workaround in Iceberg codebase. (
>>>    https://github.com/apache/iceberg/pull/16729)
>>>    2. Fix upstream in parquet-mr - The architecturally correct path:
>>>    add this functionality to parquet-mr itself and use it entirely,
>>>    mirroring what the Rust implementation does natively.
>>>
>>>
>>> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
>>> iterations)
>>> Combining S3 GET requests alone gives 60-65% improvement, with further
>>> gains possible by parallelising them.
>>>
>>> [image: image.png]
>>>
>>>
>>> As focus shifts towards Root Manifest, Datafiles in Parquet, and
>>> multiple small file requirements, a dedicated effort here seems worth
>>> pursuing.
>>> I would be happy to hear any thoughts on this. The points to discuss are
>>> which approach is more convincing - the Iceberg implementation or the
>>> upstream parquet-mr implementation - and the gaps between parquet-mr and
>>> arrow-rs, specifically around fetching the footer.
>>>
>>> [1] PR for high level implementation -
>>> https://github.com/apache/iceberg/pull/16729
>>> --
>>> Lakhyani Varun
>>> Indian Institute of Technology Roorkee
>>> Contact: +91 96246 46174
>>>
>>>
