This seems like a great idea to explore.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Fri, Mar 20, 2026 at 7:12 PM Prashant Singh <[email protected]> wrote:

> +1 from me too.
>
> On Fri, Mar 20, 2026 at 3:03 PM Yufei Gu <[email protected]> wrote:
>
>> +1 from me. This looks like a useful improvement, especially for small
>> files and IO-heavy workloads.
>>
>> Yufei
>>
>> On Fri, Mar 20, 2026 at 2:27 PM Anurag Mantripragada <[email protected]> wrote:
>>
>>> +1. This will be a good improvement for compaction and other use cases,
>>> and the initial numbers are already promising. I support this project
>>> for GSoC 2026.
>>>
>>> ~ Anurag
>>>
>>> On Fri, Mar 20, 2026 at 1:32 PM Russell Spitzer <[email protected]> wrote:
>>>
>>>> +1. I think this would be a great project to work on, and I would be
>>>> glad to support it for GSoC 2026.
>>>>
>>>> On Fri, Mar 20, 2026 at 2:12 PM Varun Lakhyani <[email protected]> wrote:
>>>>
>>>>> I benchmarked it against real cloud storage, AWS S3 (1000 files,
>>>>> 14.6 KB each):
>>>>>
>>>>> - Sync time = 219.694 s
>>>>> - Async time = 51.853 s
>>>>> - Improvement = 76.4%
>>>>>
>>>>> Since cloud storage has high IO overhead, the async flow can be
>>>>> especially beneficial for small files.
>>>>>
>>>>> I would really appreciate any feedback on this.
>>>>>
>>>>> On Wed, Mar 18, 2026 at 12:19 AM Varun Lakhyani <[email protected]> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I previously started a discussion on making Spark readers work in
>>>>>> parallel (asynchronously), which is beneficial in cases with large
>>>>>> numbers of small files, such as compaction. I have since worked on a
>>>>>> POC, a high-level design, an implementation, and benchmarking for
>>>>>> various scenarios. I presented my approach and benchmarking results
>>>>>> in the Iceberg Spark sync; the recording may be available in the
>>>>>> Iceberg Spark Community Sync Notes [0].
>>>>>>
>>>>>> I am planning to submit this work as a GSoC 2026 proposal based on
>>>>>> this idea and was advised to seek formal community vetting on the
>>>>>> dev mailing list.
>>>>>>
>>>>>> Previous DISCUSS thread:
>>>>>> https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn
>>>>>>
>>>>>> Issue:
>>>>>> https://github.com/apache/iceberg/issues/15287
>>>>>>
>>>>>> Prototype implementation:
>>>>>> https://github.com/apache/iceberg/pull/15341
>>>>>>
>>>>>> Design document and benchmarking details:
>>>>>> https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
>>>>>>
>>>>>> Initial benchmarking shows noticeable improvements for workloads
>>>>>> involving many small files, particularly when IO latency is present
>>>>>> (details in the design document).
>>>>>>
>>>>>> Any feedback (+1 / concerns / suggestions) would be appreciated.
>>>>>> I am specifically looking for community consensus on whether this is
>>>>>> a viable direction for Iceberg before formalizing the GSoC proposal.
>>>>>> The GSoC 2026 proposal deadline is March 31; early feedback would be
>>>>>> especially appreciated.
>>>>>>
>>>>>> [0] Iceberg Spark Community Sync Notes:
>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
>>>>>>
>>>>>> --
>>>>>> Lakhyani Varun
>>>>>> Indian Institute of Technology Roorkee
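
[Editor's note for readers outside the thread: the intuition behind the benchmark numbers above is that per-file latency dominates when reading many small objects, so overlapping the reads hides most of it. The sketch below is NOT the prototype from the linked PR; it is a minimal, self-contained Java illustration of the general technique, with a hypothetical class name, pool size, and local temp files standing in for object storage.]

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncReadSketch {
    // Read all files concurrently on a fixed-size thread pool. With high
    // per-file latency (e.g. object storage round trips), wall-clock time
    // approaches (latency * files / parallelism) rather than
    // (latency * files), which is the effect reported in the benchmark.
    static List<byte[]> readAll(List<Path> files, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (Path f : files) {
                // Submit each read as an independent task; IO overlaps.
                futures.add(pool.submit(() -> Files.readAllBytes(f)));
            }
            List<byte[]> out = new ArrayList<>();
            for (Future<byte[]> fut : futures) {
                out.add(fut.get()); // preserve submission order in the result
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo with local temp files (a stand-in for remote small files).
        Path dir = Files.createTempDirectory("async-read-demo");
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            files.add(Files.write(dir.resolve("f" + i + ".bin"), new byte[] {(byte) i}));
        }
        List<byte[]> contents = readAll(files, 4);
        System.out.println("read " + contents.size() + " files");
    }
}
```

On local disk the gain is modest; the ~76% improvement quoted above comes from high-latency storage, where overlapping requests pays off most.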
