+1 from me too.

On Fri, Mar 20, 2026 at 3:03 PM Yufei Gu <[email protected]> wrote:
> +1 from me. This looks like a useful improvement, especially for small
> files and IO-heavy workloads.
>
> Yufei
>
>
> On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <
> [email protected]> wrote:
>
>> +1. This will be a good improvement for compaction and other use-cases
>> and initial numbers are already promising. I support this project for GSoC
>> 2026.
>>
>> ~ Anurag
>>
>> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> +1 I think this would be a great project to work on and I would be glad
>>> to support working on it for GSoC 2026
>>>
>>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <
>>> [email protected]> wrote:
>>>
>>>> Benchmarked it against real cloud storage AWS S3 (1000 files - 14.6 Kb
>>>> each) :
>>>>
>>>> - Sync time = 219.694 s
>>>> - Async time = 51.853 s
>>>>
>>>> % Improvement = 76.4%
>>>> It can be seen as cloud storage has high IO overheads so, async flow
>>>> can be beneficial for small files.
>>>>
>>>> I would really appreciate any feedback on this.
>>>>
>>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <
>>>> [email protected]> wrote:
>>>>
>>>>> Hey All,
>>>>>
>>>>> I previously started a discussion on making Spark readers work in
>>>>> parallel (asynchronously), which is beneficial in cases with large
>>>>> numbers of small files such as compaction, and I have worked on a POC,
>>>>> high-level design, implementation, and benchmarking for various
>>>>> scenarios. I presented my approach and benchmarking results in the
>>>>> Iceberg Spark sync; the recording may be available in the Iceberg
>>>>> Spark Community Sync Notes [0].
>>>>>
>>>>> I am planning to submit this work as a GSoC 2026 proposal based on
>>>>> this idea and was advised to seek formal community vetting on the dev
>>>>> mailing list.
>>>>>
>>>>> Previous DISCUSS thread:
>>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>>
>>>>> Issue:
>>>>> https://github.com/apache/iceberg/issues/15287
>>>>>
>>>>> Prototype implementation:
>>>>> https://github.com/apache/iceberg/pull/15341
>>>>>
>>>>> Design document and benchmarking details:
>>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>>
>>>>> Initial benchmarking shows noticeable improvements for workloads
>>>>> involving many small files, particularly when IO latency is present
>>>>> (details in the design document).
>>>>>
>>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>>> I am specifically looking for community consensus on whether this is a
>>>>> viable direction for Iceberg before formalizing the GSoC proposal. The
>>>>> GSoC 2026 proposal deadline is March 31 - early feedback would be
>>>>> especially appreciated.
>>>>>
>>>>> [0] Iceberg Spark Community Sync Notes:
>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>>> --
>>>>> Lakhyani Varun
>>>>> Indian Institute of Technology Roorkee
>>>>>
