Re: [DISCUSS] GSoC 2026 idea vetting: Parallel scan task execution in Iceberg Spark readers

huaxin gao Fri, 20 Mar 2026 22:05:19 -0700

+1 from me as well.

On Fri, Mar 20, 2026 at 9:37 PM Holden Karau <[email protected]> wrote:


> This seems like a great idea to explore.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Fri, Mar 20, 2026 at 7:12 PM Prashant Singh <[email protected]>
> wrote:
>
>> +1 from me too.
>>
>> On Fri, Mar 20, 2026 at 3:03 PM Yufei Gu <[email protected]> wrote:
>>
>>> +1 from me. This looks like a useful improvement, especially for small
>>> files and IO-heavy workloads.
>>>
>>> Yufei
>>>
>>>
>>> On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <
>>> [email protected]> wrote:
>>>
>>>> +1. This will be a good improvement for compaction and other use-cases
>>>> and initial numbers are already promising. I support this project for GSoC
>>>> 2026.
>>>>
>>>> ~ Anurag
>>>>
>>>> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <
>>>> [email protected]> wrote:
>>>>
>>>>> +1 I think this would be a great project to work on and I would be
>>>>> glad to support working on it for GSoC 2026
>>>>>
>>>>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Benchmarked it against real cloud storage, AWS S3 (1000 files, 14.6 KB
>>>>>> each):
>>>>>>
>>>>>>    - Sync time = 219.694 s
>>>>>>    - Async time = 51.853 s
>>>>>>
>>>>>> Improvement = 76.4% less wall-clock time (about 4.2x faster).
>>>>>> Cloud storage has high per-request IO overhead, so the async flow can be
>>>>>> beneficial for small files.
>>>>>>
>>>>>> I would really appreciate any feedback on this.
>>>>>>
>>>>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hey All,
>>>>>>>
>>>>>>> I previously started a discussion on making Spark readers work in
>>>>>>> parallel (asynchronously), which is beneficial in cases with large
>>>>>>> numbers of small files, such as compaction. Since then I have worked on
>>>>>>> a POC, high-level design, implementation, and benchmarking for various
>>>>>>> scenarios. I presented my approach and benchmarking results in the
>>>>>>> Iceberg Spark sync; the recording may be available in the Iceberg Spark
>>>>>>> Community Sync Notes [0].
>>>>>>>
>>>>>>> I am planning to submit this work as a GSoC 2026 proposal based on
>>>>>>> this idea and was advised to seek formal community vetting on the dev
>>>>>>> mailing list.
>>>>>>>
>>>>>>> Previous DISCUSS thread:
>>>>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>>>>
>>>>>>> Issue:
>>>>>>> https://github.com/apache/iceberg/issues/15287
>>>>>>>
>>>>>>> Prototype implementation:
>>>>>>> https://github.com/apache/iceberg/pull/15341
>>>>>>>
>>>>>>> Design document and benchmarking details:
>>>>>>>
>>>>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>>>>
>>>>>>> Initial benchmarking shows noticeable improvements for workloads
>>>>>>> involving many small files, particularly when IO latency is present
>>>>>>> (details in the design document).
>>>>>>>
>>>>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>>>>> I am specifically looking for community consensus on whether this is
>>>>>>> a viable direction for Iceberg before formalizing the GSoC proposal. The
>>>>>>> GSoC 2026 proposal deadline is March 31 - early feedback would be
>>>>>>> especially appreciated.
>>>>>>>
>>>>>>> [0] Iceberg Spark Community Sync Notes:
>>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>>>>> --
>>>>>>> Lakhyani Varun
>>>>>>> Indian Institute of Technology Roorkee
>>>>>>>
>>>>>>>

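For readers skimming the thread, the S3 numbers quoted above (219.694 s sync, 51.853 s async) can be checked with a few lines of Python. The sketch below also illustrates why overlapping high-latency reads helps with many small files. It is not the prototype from the linked pull request: the function names are hypothetical, the "file read" is only a simulated sleep, and a thread pool is swapped in as a stand-in for whatever async mechanism the actual design uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def percent_improvement(sync_seconds: float, async_seconds: float) -> float:
    """Relative reduction in wall-clock time, as a percentage."""
    return (sync_seconds - async_seconds) / sync_seconds * 100.0


def speedup(sync_seconds: float, async_seconds: float) -> float:
    """How many times faster the async run was."""
    return sync_seconds / async_seconds


def read_small_file(file_id: int, latency_seconds: float) -> int:
    """Hypothetical stand-in for reading one small file from remote storage."""
    time.sleep(latency_seconds)  # simulated per-request IO overhead (assumption)
    return file_id


def read_sequentially(file_ids, latency_seconds):
    """Read files one after another; total time grows with the file count."""
    return [read_small_file(f, latency_seconds) for f in file_ids]


def read_concurrently(file_ids, latency_seconds, workers):
    """Overlap the waits by issuing reads from a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(
            pool.map(lambda f: read_small_file(f, latency_seconds), file_ids)
        )


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Numbers reported in the thread for 1000 small files on S3.
    print(f"improvement: {percent_improvement(219.694, 51.853):.1f}%")
    print(f"speedup:     {speedup(219.694, 51.853):.2f}x")

    # Simulated comparison; absolute timings depend on the machine.
    files = list(range(20))
    _, seq_time = timed(read_sequentially, files, 0.01)
    _, par_time = timed(read_concurrently, files, 0.01, 10)
    print(f"sequential: {seq_time:.3f} s, concurrent: {par_time:.3f} s")
```

The simulated comparison only shows the general effect of hiding latency; it says nothing about the gains the real Spark reader change would see, which depend on file sizes, storage, and executor configuration.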