Re: [DISCUSS] GSoC 2026 idea vetting: Parallel scan task execution in Iceberg Spark readers

Yufei Gu Fri, 20 Mar 2026 15:03:40 -0700

+1 from me. This looks like a useful improvement, especially for small
files and IO-heavy workloads.


Yufei


On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <
[email protected]> wrote:

> +1. This will be a good improvement for compaction and other use cases, and
> the initial numbers are already promising. I support this project for GSoC 2026.
>
> ~ Anurag
>
> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <[email protected]>
> wrote:
>
>> +1. I think this would be a great project to work on, and I would be glad
>> to support it for GSoC 2026.
>>
>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <[email protected]>
>> wrote:
>>
>>> I benchmarked it against real cloud storage, AWS S3 (1000 files, 14.6 KB
>>> each):
>>>
>>>    - Sync time = 219.694 s
>>>    - Async time = 51.853 s
>>>
>>> % Improvement = (219.694 - 51.853) / 219.694 = 76.4%
>>> Because cloud storage has high IO overhead, the async flow can be
>>> beneficial for small files.
>>>
>>> I would really appreciate any feedback on this.
>>>
>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <
>>> [email protected]> wrote:
>>>
>>>> Hey All,
>>>>
>>>> I previously started a discussion on making Spark readers work in
>>>> parallel (asynchronously), which is beneficial in cases with large numbers
>>>> of small files such as compaction, and I have worked on a POC, high-level
>>>> design, implementation, and benchmarking for various scenarios. I presented
>>>> my approach and benchmarking results in the Iceberg Spark sync; the
>>>> recording may be available in the Iceberg Spark Community Sync Notes [0].
>>>>
>>>> I am planning to submit this work as a GSoC 2026 proposal based on this
>>>> idea and was advised to seek formal community vetting on the dev mailing
>>>> list.
>>>>
>>>> Previous DISCUSS thread:
>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>
>>>> Issue:
>>>> https://github.com/apache/iceberg/issues/15287
>>>>
>>>> Prototype implementation:
>>>> https://github.com/apache/iceberg/pull/15341
>>>>
>>>> Design document and benchmarking details:
>>>>
>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>
>>>> Initial benchmarking shows noticeable improvements for workloads
>>>> involving many small files, particularly when IO latency is present
>>>> (details in the design document).
>>>>
>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>> I am specifically looking for community consensus on whether this is a
>>>> viable direction for Iceberg before formalizing the GSoC proposal. The GSoC
>>>> 2026 proposal deadline is March 31, so early feedback would be especially
>>>> appreciated.
>>>>
>>>> [0] Iceberg Spark Community Sync Notes:
>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>> --
>>>> Lakhyani Varun
>>>> Indian Institute of Technology Roorkee
>>>>
>>>>
