+1

On Sat, Mar 21, 2026, 06:18 Neelesh Salian <[email protected]> wrote:

> +1. Thanks for doing this.
>
> On Tue, Mar 17, 2026 at 11:49 Varun Lakhyani <[email protected]>
> wrote:
>
>> Hey All,
>>
>> I previously started a discussion on making Spark readers work in
>> parallel (asynchronously), which helps workloads with large numbers of
>> small files, such as compaction. I have also worked on a POC, a
>> high-level design, an implementation, and benchmarks for various
>> scenarios. I presented my approach and benchmarking results in the
>> Iceberg Spark sync; the recording may be available in the Iceberg Spark
>> Community Sync Notes [0].
>>
>> I am planning to submit a GSoC 2026 proposal based on this work and was
>> advised to seek formal community vetting on the dev mailing list.
>>
>> Previous DISCUSS thread:
>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>
>> Issue:
>> https://github.com/apache/iceberg/issues/15287
>>
>> Prototype implementation:
>> https://github.com/apache/iceberg/pull/15341
>>
>> Design document and benchmarking details:
>>
>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>
>> Initial benchmarking shows noticeable improvements for workloads
>> involving many small files, particularly when IO latency is present
>> (details in the design document).
>>
>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>> I am specifically looking for community consensus on whether this is a
>> viable direction for Iceberg before formalizing the GSoC proposal. The GSoC
>> 2026 proposal deadline is March 31, so early feedback would be especially
>> helpful.
>>
>> [0] Iceberg Spark Community Sync Notes:
>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>> --
>> Lakhyani Varun
>> Indian Institute of Technology Roorkee
>>
>>
