Hey all,

I previously started a discussion on making Spark readers work in parallel (asynchronously), which helps workloads with large numbers of small files, such as compaction. Since then I have worked on a POC, a high-level design, an implementation, and benchmarks for various scenarios. I presented the approach and the benchmark results in the Iceberg Spark sync; the recording may be available in the Iceberg Spark Community Sync Notes [0].

I am planning to submit this work as a GSoC 2026 proposal and was advised to seek formal community vetting on the dev mailing list.

Previous DISCUSS thread: https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
Issue: https://github.com/apache/iceberg/issues/15287
Prototype implementation: https://github.com/apache/iceberg/pull/15341
Design document and benchmarking details: https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing

Initial benchmarking shows noticeable improvements for workloads involving many small files, particularly when IO latency is present (details in the design document).

Any feedback (+1 / concerns / suggestions) would be appreciated. I am specifically looking for community consensus on whether this is a viable direction for Iceberg before formalizing the GSoC proposal. The GSoC 2026 proposal deadline is March 31, so early feedback would be especially appreciated.

[0] Iceberg Spark Community Sync Notes: https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing

--
Lakhyani Varun
Indian Institute of Technology Roorkee
