Re: [DISCUSS] GSoC 2026 idea vetting: Parallel scan task execution in Iceberg Spark readers

Prashant Singh Fri, 20 Mar 2026 19:12:28 -0700

+1 from me too.

On Fri, Mar 20, 2026 at 3:03 PM Yufei Gu <[email protected]> wrote:


> +1 from me. This looks like a useful improvement, especially for small
> files and IO-heavy workloads.
>
> Yufei
>
>
> On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <
> [email protected]> wrote:
>
>> +1. This will be a good improvement for compaction and other use-cases
>> and initial numbers are already promising. I support this project for GSoC
>> 2026.
>>
>> ~ Anurag
>>
>> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> +1. I think this would be a great project to work on, and I would be glad
>>> to support working on it for GSoC 2026.
>>>
>>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <
>>> [email protected]> wrote:
>>>
>>>> Benchmarked it against real cloud storage, AWS S3 (1000 files, 14.6 KB
>>>> each):
>>>>
>>>>    - Sync time = 219.694 s
>>>>    - Async time = 51.853 s
>>>>
>>>> % Improvement = (219.694 - 51.853) / 219.694 = 76.4% (about a 4.2x speedup)
>>>> Because cloud storage has high IO overhead, the async flow can be
>>>> beneficial for small files.
>>>>
>>>> I would really appreciate any feedback on this.
>>>>
>>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <
>>>> [email protected]> wrote:
>>>>
>>>>> Hey All,
>>>>>
>>>>> I previously started a discussion on making Spark readers work in
>>>>> parallel (asynchronously), which is beneficial in cases with large numbers
>>>>> of small files such as compaction, and I have worked on a POC, high-level
>>>>> design, implementation, and benchmarking for various scenarios. I presented
>>>>> my approach and benchmarking results in the Iceberg Spark sync; the
>>>>> recording may be available in the Iceberg Spark Community Sync Notes [0].
>>>>>
>>>>> I am planning to submit this work as a GSoC 2026 proposal based on
>>>>> this idea and was advised to seek formal community vetting on the dev
>>>>> mailing list.
>>>>>
>>>>> Previous DISCUSS thread:
>>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>>
>>>>> Issue:
>>>>> https://github.com/apache/iceberg/issues/15287
>>>>>
>>>>> Prototype implementation:
>>>>> https://github.com/apache/iceberg/pull/15341
>>>>>
>>>>> Design document and benchmarking details:
>>>>>
>>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>>
>>>>> Initial benchmarking shows noticeable improvements for workloads
>>>>> involving many small files, particularly when IO latency is present
>>>>> (details in the design document).
>>>>>
>>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>>> I am specifically looking for community consensus on whether this is a
>>>>> viable direction for Iceberg before formalizing the GSoC proposal. The GSoC
>>>>> 2026 proposal deadline is March 31 - early feedback would be especially
>>>>> appreciated.
>>>>>
>>>>> [0] Iceberg Spark Community Sync Notes:
>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>>> --
>>>>> Lakhyani Varun
>>>>> Indian Institute of Technology Roorkee
>>>>>
>>>>>
