I am also super excited about the idea! I would love to contribute.

On Thu, May 29, 2025 at 6:54 PM Yufei Gu <flyrain...@gmail.com> wrote:
>> BTW, does it make sense to take metadata json file into consideration as
>> well? Currently it is just a large json string containing all snapshots.
>> Since it is also on the critical path of a commit, I'm not sure if we can
>> explore incremental semantics on it together with manifest list files to
>> reduce the commit overhead.
>
> For metadata.json file, the REST APIs have provided an incremental style
> update already via a variety of table update requests. The community is
> also working on the lift of a mandatory physical metadata.json file in the
> storage, in which case, the REST catalog doesn't have to deal with file IO
> anymore. Metadata.json could live within a key-value, RDBMS or even just in
> memory.
>
> Yufei
>
> On Thu, May 29, 2025 at 6:35 PM Gang Wu <ust...@gmail.com> wrote:
>
>> This is a long-awaited discussion!
>>
>> BTW, does it make sense to take metadata json file into consideration as
>> well? Currently it is just a large json string containing all snapshots.
>> Since it is also on the critical path of a commit, I'm not sure if we can
>> explore incremental semantics on it together with manifest list files to
>> reduce the commit overhead.
>>
>> Best,
>> Gang
>>
>> On Fri, May 30, 2025 at 7:10 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> This will be great for users. metadata can self adapt. Start with a
>>> compacted one file. As the table grows in size, the metadata can adapt
>>> to a tree or linked structure.
>>>
>>> On Thu, May 29, 2025 at 3:44 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> I’m also super excited about this idea
>>>>
>>>> On Thu, May 29, 2025 at 3:37 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>
>>>>> Thanks for kicking this thread off Ryan, I'm interested in helping out
>>>>> here! I've been working on a proposal in this area and it would be
>>>>> great to collaborate with different folks and exchange ideas here,
>>>>> since I think a lot of people are interested in solving this problem.
>>>>>
>>>>> Thanks,
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Like Russell’s recent note, I’m starting a thread to connect those of
>>>>>> us that are interested in the idea of changing Iceberg’s metadata in
>>>>>> v4 so that in most cases committing a change only requires writing
>>>>>> one additional metadata file.
>>>>>>
>>>>>> *Idea: One-file commits*
>>>>>>
>>>>>> The current Iceberg metadata structure requires writing at least one
>>>>>> manifest and a new manifest list to produce a new snapshot. The goal
>>>>>> of this work is to allow more flexibility by allowing the manifest
>>>>>> list layer to store data and delete files. As a result, only one file
>>>>>> write would be needed before committing the new snapshot. In
>>>>>> addition, this work will also try to explore:
>>>>>>
>>>>>>    - Avoiding small manifests that must be read in parallel and later
>>>>>>    compacted (metadata maintenance changes)
>>>>>>    - Extending metadata skipping to use aggregated column ranges that
>>>>>>    are compatible with geospatial data (manifest metadata)
>>>>>>    - Using soft deletes to avoid rewriting existing manifests
>>>>>>    (metadata DVs)
>>>>>>
>>>>>> If you’re interested in these problems, please reply!
>>>>>>
>>>>>> Ryan
>>>>>>
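
For anyone skimming the thread, here is a minimal sketch of the write-count difference Ryan describes. It is only a toy model and not Iceberg's API or file format: `Store`, `commit_current`, `commit_one_file`, the file names, and the dictionary layout are all hypothetical, and the catalog or metadata.json step that follows the writes is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory object store that counts file writes."""
    files: dict = field(default_factory=dict)
    writes: int = 0

    def write(self, path, content):
        self.files[path] = content
        self.writes += 1


def commit_current(store, snapshot_id, prior_manifests, new_data_files):
    """Current layout: one new manifest plus one new manifest list."""
    manifest_path = f"manifest-{snapshot_id}.avro"  # hypothetical name
    store.write(manifest_path, list(new_data_files))
    list_path = f"snap-{snapshot_id}.avro"  # hypothetical name
    store.write(list_path, {"manifests": list(prior_manifests) + [manifest_path]})
    return list_path


def commit_one_file(store, snapshot_id, prior_manifests, new_data_files):
    """Proposed idea: the manifest-list layer holds file entries directly."""
    list_path = f"snap-{snapshot_id}.avro"  # hypothetical name
    store.write(list_path, {
        "manifests": list(prior_manifests),
        "data_files": list(new_data_files),  # assumed inline representation
    })
    return list_path


current, proposed = Store(), Store()
commit_current(current, 1, [], ["data-0001.parquet"])
commit_one_file(proposed, 1, [], ["data-0001.parquet"])
print(current.writes, proposed.writes)
```

In this model the current layout needs two file writes before the snapshot can be committed and the proposed layout needs one, which is the saving the "one-file commits" idea is after. How inlined entries would later be moved out into manifests is part of what the thread proposes to explore and is not modeled here.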