This seems like a great idea to explore.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Fri, Mar 20, 2026 at 7:12 PM Prashant Singh <[email protected]> wrote:

> +1 from me too.
>
> On Fri, Mar 20, 2026 at 3:03 PM Yufei Gu <[email protected]> wrote:
>
>> +1 from me. This looks like a useful improvement, especially for small
>> files and IO-heavy workloads.
>>
>> Yufei
>>
>> On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <[email protected]> wrote:
>>
>>> +1. This will be a good improvement for compaction and other use cases,
>>> and the initial numbers are already promising. I support this project
>>> for GSoC 2026.
>>>
>>> ~ Anurag
>>>
>>> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <[email protected]> wrote:
>>>
>>>> +1. I think this would be a great project to work on, and I would be
>>>> glad to support it for GSoC 2026.
>>>>
>>>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <[email protected]> wrote:
>>>>
>>>>> I benchmarked it against real cloud storage, AWS S3 (1000 files,
>>>>> 14.6 KB each):
>>>>>
>>>>> - Sync time = 219.694 s
>>>>> - Async time = 51.853 s
>>>>> - Improvement = 76.4%
>>>>>
>>>>> Since cloud storage has high IO overhead, the async flow can be
>>>>> especially beneficial for small files.
>>>>>
>>>>> I would really appreciate any feedback on this.
>>>>>
>>>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <[email protected]> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I previously started a discussion on making Spark readers work in
>>>>>> parallel (asynchronously), which is beneficial in cases with large
>>>>>> numbers of small files, such as compaction. I have since worked on a
>>>>>> POC, a high-level design, an implementation, and benchmarking for
>>>>>> various scenarios. I presented my approach and benchmarking results
>>>>>> in the Iceberg Spark sync; the recording may be available in the
>>>>>> Iceberg Spark Community Sync Notes [0].
>>>>>>
>>>>>> I am planning to submit this work as a GSoC 2026 proposal based on
>>>>>> this idea and was advised to seek formal community vetting on the
>>>>>> dev mailing list.
>>>>>>
>>>>>> Previous DISCUSS thread:
>>>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>>>
>>>>>> Issue:
>>>>>> https://github.com/apache/iceberg/issues/15287
>>>>>>
>>>>>> Prototype implementation:
>>>>>> https://github.com/apache/iceberg/pull/15341
>>>>>>
>>>>>> Design document and benchmarking details:
>>>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>>>
>>>>>> Initial benchmarking shows noticeable improvements for workloads
>>>>>> involving many small files, particularly when IO latency is present
>>>>>> (details in the design document).
>>>>>>
>>>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>>>> I am specifically looking for community consensus on whether this is
>>>>>> a viable direction for Iceberg before formalizing the GSoC proposal.
>>>>>> The GSoC 2026 proposal deadline is March 31; early feedback would be
>>>>>> especially appreciated.
>>>>>>
>>>>>> [0] Iceberg Spark Community Sync Notes:
>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>>>>
>>>>>> --
>>>>>> Lakhyani Varun
>>>>>> Indian Institute of Technology Roorkee
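
[Editor's note for readers outside the thread: the intuition behind the benchmark numbers above is that per-file latency dominates when reading many small objects, so overlapping the reads hides most of it. The sketch below is NOT the prototype from the linked PR; it is a minimal, self-contained Java illustration of the general technique, with a hypothetical class name, pool size, and local temp files standing in for object storage.]

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncReadSketch {
    // Read all files concurrently on a fixed-size thread pool. With high
    // per-file latency (e.g. object storage round trips), wall-clock time
    // approaches (latency * files / parallelism) rather than
    // (latency * files), which is the effect reported in the benchmark.
    static List<byte[]> readAll(List<Path> files, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (Path f : files) {
                // Submit each read as an independent task; IO overlaps.
                futures.add(pool.submit(() -> Files.readAllBytes(f)));
            }
            List<byte[]> out = new ArrayList<>();
            for (Future<byte[]> fut : futures) {
                out.add(fut.get()); // preserve submission order in the result
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo with local temp files (a stand-in for remote small files).
        Path dir = Files.createTempDirectory("async-read-demo");
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            files.add(Files.write(dir.resolve("f" + i + ".bin"), new byte[] {(byte) i}));
        }
        List<byte[]> contents = readAll(files, 4);
        System.out.println("read " + contents.size() + " files");
    }
}
```

On local disk the gain is modest; the ~76% improvement quoted above comes from high-latency storage, where overlapping requests pays off most.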
