+1 from me as well.

On Fri, Mar 20, 2026 at 9:37 PM Holden Karau <[email protected]> wrote:

> This seems like a great idea to explore.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
>
> On Fri, Mar 20, 2026 at 7:12 PM Prashant Singh <[email protected]> wrote:
>
>> +1 from me too.
>>
>> On Fri, Mar 20, 2026 at 3:03 PM Yufei Gu <[email protected]> wrote:
>>
>>> +1 from me. This looks like a useful improvement, especially for small
>>> files and IO-heavy workloads.
>>>
>>> Yufei
>>>
>>>
>>> On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <[email protected]> wrote:
>>>
>>>> +1. This will be a good improvement for compaction and other use cases,
>>>> and the initial numbers are already promising. I support this project
>>>> for GSoC 2026.
>>>>
>>>> ~ Anurag
>>>>
>>>> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <[email protected]> wrote:
>>>>
>>>>> +1. I think this would be a great project to work on, and I would be
>>>>> glad to support working on it for GSoC 2026.
>>>>>
>>>>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <[email protected]> wrote:
>>>>>
>>>>>> Benchmarked it against real cloud storage, AWS S3 (1000 files,
>>>>>> 14.6 KB each):
>>>>>>
>>>>>> - Sync time = 219.694 s
>>>>>> - Async time = 51.853 s
>>>>>>
>>>>>> % Improvement = 76.4%
>>>>>>
>>>>>> Since cloud storage has high IO overhead, the async flow can be
>>>>>> beneficial for small files.
>>>>>>
>>>>>> I would really appreciate any feedback on this.
>>>>>>
>>>>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <[email protected]> wrote:
>>>>>>
>>>>>>> Hey All,
>>>>>>>
>>>>>>> I previously started a discussion on making Spark readers work in
>>>>>>> parallel (asynchronously), which is beneficial in cases with large
>>>>>>> numbers of small files such as compaction, and I have worked on a POC,
>>>>>>> high-level design, implementation, and benchmarking for various
>>>>>>> scenarios. I presented my approach and benchmarking results in the
>>>>>>> Iceberg Spark sync; the recording may be available in the Iceberg
>>>>>>> Spark Community Sync Notes [0].
>>>>>>>
>>>>>>> I am planning to submit this work as a GSoC 2026 proposal based on
>>>>>>> this idea and was advised to seek formal community vetting on the dev
>>>>>>> mailing list.
>>>>>>>
>>>>>>> Previous DISCUSS thread:
>>>>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>>>>
>>>>>>> Issue:
>>>>>>> https://github.com/apache/iceberg/issues/15287
>>>>>>>
>>>>>>> Prototype implementation:
>>>>>>> https://github.com/apache/iceberg/pull/15341
>>>>>>>
>>>>>>> Design document and benchmarking details:
>>>>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>>>>
>>>>>>> Initial benchmarking shows noticeable improvements for workloads
>>>>>>> involving many small files, particularly when IO latency is present
>>>>>>> (details in the design document).
>>>>>>>
>>>>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>>>>> I am specifically looking for community consensus on whether this is
>>>>>>> a viable direction for Iceberg before formalizing the GSoC proposal.
>>>>>>> The GSoC 2026 proposal deadline is March 31 - early feedback would be
>>>>>>> especially appreciated.
>>>>>>>
>>>>>>> [0] Iceberg Spark Community Sync Notes:
>>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>>>>> --
>>>>>>> Lakhyani Varun
>>>>>>> Indian Institute of Technology Roorkee
