This is a long-awaited discussion! BTW, does it make sense to take the metadata JSON file into consideration as well? Currently it is just one large JSON string containing all snapshots. Since it is also on the critical path of a commit, I'm wondering whether we can explore incremental semantics for it together with the manifest list files to reduce the commit overhead.
Best,
Gang

On Fri, May 30, 2025 at 7:10 AM Steven Wu <stevenz...@gmail.com> wrote:

> This will be great for users. metadata can self adapt. Start with a
> compacted one file. As the table grows in size, the metadata can adapt to a
> tree or linked structure.
>
> On Thu, May 29, 2025 at 3:44 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I’m also super excited about this idea
>>
>> On Thu, May 29, 2025 at 3:37 PM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> Thanks for kicking this thread off Ryan, I'm interested in helping out
>>> here! I've been working on a proposal in this area and it would be great to
>>> collaborate with different folks and exchange ideas here, since I think a
>>> lot of people are interested in solving this problem.
>>>
>>> Thanks,
>>> Amogh Jahagirdar
>>>
>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Like Russell’s recent note, I’m starting a thread to connect those of
>>>> us that are interested in the idea of changing Iceberg’s metadata in v4 so
>>>> that in most cases committing a change only requires writing one additional
>>>> metadata file.
>>>>
>>>> *Idea: One-file commits*
>>>>
>>>> The current Iceberg metadata structure requires writing at least one
>>>> manifest and a new manifest list to produce a new snapshot. The goal of
>>>> this work is to allow more flexibility by allowing the manifest list layer
>>>> to store data and delete files. As a result, only one file write would be
>>>> needed before committing the new snapshot. In addition, this work will also
>>>> try to explore:
>>>>
>>>>    - Avoiding small manifests that must be read in parallel and later
>>>>    compacted (metadata maintenance changes)
>>>>    - Extend metadata skipping to use aggregated column ranges that are
>>>>    compatible with geospatial data (manifest metadata)
>>>>    - Using soft deletes to avoid rewriting existing manifests
>>>>    (metadata DVs)
>>>>
>>>> If you’re interested in these problems, please reply!
>>>>
>>>> Ryan