+1

On Sat, Mar 21, 2026, 06:18 Neelesh Salian <[email protected]> wrote:

> +1. Thanks for doing this.
>
> On Tue, Mar 17, 2026 at 11:49 Varun Lakhyani <[email protected]> wrote:
>
>> Hey All,
>>
>> I previously started a discussion on making Spark readers work in
>> parallel (asynchronously), which is beneficial in cases with large numbers
>> of small files such as compaction, and I have worked on a POC, high-level
>> design, implementation, and benchmarking for various scenarios. I presented
>> my approach and benchmarking results in the Iceberg Spark sync; the
>> recording may be available in the Iceberg Spark Community Sync Notes [0].
>>
>> I am planning to submit this work as a GSoC 2026 proposal based on this
>> idea and was advised to seek formal community vetting on the dev mailing
>> list.
>>
>> Previous DISCUSS thread:
>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>
>> Issue:
>> https://github.com/apache/iceberg/issues/15287
>>
>> Prototype implementation:
>> https://github.com/apache/iceberg/pull/15341
>>
>> Design document and benchmarking details:
>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>
>> Initial benchmarking shows noticeable improvements for workloads
>> involving many small files, particularly when IO latency is present
>> (details in the design document).
>>
>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>> I am specifically looking for community consensus on whether this is a
>> viable direction for Iceberg before formalizing the GSoC proposal. The GSoC
>> 2026 proposal deadline is March 31 - early feedback would be especially
>> appreciated.
>>
>> [0] Iceberg Spark Community Sync Notes:
>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>> --
>> Lakhyani Varun
>> Indian Institute of Technology Roorkee
