+1 from me. This looks like a useful improvement, especially for small files and IO-heavy workloads.
Yufei

On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <[email protected]> wrote:

> +1. This will be a good improvement for compaction and other use-cases and
> initial numbers are already promising. I support this project for GSoC 2026.
>
> ~ Anurag
>
> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <[email protected]>
> wrote:
>
>> +1 I think this would be a great project to work on and I would be glad
>> to support working on it for GSoC 2026
>>
>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <[email protected]>
>> wrote:
>>
>>> Benchmarked it against real cloud storage AWS S3 (1000 files - 14.6 Kb
>>> each) :
>>>
>>> - Sync time = 219.694 s
>>> - Async time = 51.853 s
>>>
>>> % Improvement = 76.4%
>>> It can be seen as cloud storage has high IO overheads so, async flow can
>>> be beneficial for small files.
>>>
>>> I would really appreciate any feedback on this.
>>>
>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <[email protected]> wrote:
>>>
>>>> Hey All,
>>>>
>>>> I previously started a discussion on making Spark readers work in
>>>> parallel (asynchronously), which is beneficial in cases with large numbers
>>>> of small files such as compaction, and I have worked on a POC, high-level
>>>> design, implementation, and benchmarking for various scenarios. I presented
>>>> my approach and benchmarking results in the Iceberg Spark sync; the
>>>> recording may be available in the Iceberg Spark Community Sync Notes [0].
>>>>
>>>> I am planning to submit this work as a GSoC 2026 proposal based on this
>>>> idea and was advised to seek formal community vetting on the dev mailing
>>>> list.
>>>>
>>>> Previous DISCUSS thread:
>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>
>>>> Issue:
>>>> https://github.com/apache/iceberg/issues/15287
>>>>
>>>> Prototype implementation:
>>>> https://github.com/apache/iceberg/pull/15341
>>>>
>>>> Design document and benchmarking details:
>>>>
>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>
>>>> Initial benchmarking shows noticeable improvements for workloads
>>>> involving many small files, particularly when IO latency is present
>>>> (details in the design document).
>>>>
>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>> I am specifically looking for community consensus on whether this is a
>>>> viable direction for Iceberg before formalizing the GSoC proposal. The GSoC
>>>> 2026 proposal deadline is March 31 - early feedback would be especially
>>>> appreciated.
>>>>
>>>> [0] Iceberg Spark Community Sync Notes:
>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>> --
>>>> Lakhyani Varun
>>>> Indian Institute of Technology Roorkee
>>>>
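
For anyone who wants to sanity-check the S3 numbers quoted in this thread, here is a minimal sketch in Python. It recomputes the improvement from the two reported wall-clock times, and it illustrates, with simulated latency only, why overlapping many small reads can help. The names `fake_read`, `read_sequential`, and `read_concurrent` are hypothetical and made up for this illustration; this is not the prototype's code and it does not use any Iceberg or Spark API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Wall-clock times reported in the thread (1000 small files on S3).
SYNC_S = 219.694
ASYNC_S = 51.853


def improvement_pct(sync_s: float, async_s: float) -> float:
    """Percent reduction in wall-clock time relative to the sync run."""
    return (sync_s - async_s) / sync_s * 100.0


def speedup(sync_s: float, async_s: float) -> float:
    """How many times faster the async run finished."""
    return sync_s / async_s


def fake_read(file_id: int, latency_s: float = 0.01) -> int:
    """Hypothetical stand-in for opening one small file.

    The sleep models per-request latency; no real storage is touched.
    """
    time.sleep(latency_s)
    return file_id


def read_sequential(n_files: int) -> list[int]:
    """Read files one after another, paying each latency in turn."""
    return [fake_read(i) for i in range(n_files)]


def read_concurrent(n_files: int, workers: int = 8) -> list[int]:
    """Overlap the simulated latencies using a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order.
        return list(pool.map(fake_read, range(n_files)))


if __name__ == "__main__":
    print(f"improvement: {improvement_pct(SYNC_S, ASYNC_S):.1f}%")
    print(f"speedup: {speedup(SYNC_S, ASYNC_S):.2f}x")
```

With the reported times this gives roughly a 76.4% reduction, which matches the figure in the thread, or about a 4.2x speedup. The simulated readers only show the general idea of hiding per-file latency; real gains depend on the storage backend, file sizes, and how the reader is actually implemented.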
