Hey everyone,

We put in a feature request / proposal on this topic a few days ago, with the idea of storing the schemas in files that are external to metadata.json: https://github.com/apache/iceberg/issues/9734

We would be really interested in getting some feedback on it and seeing if folks think it's a viable solution!

On Feb 20, 2024, at 4:53 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:

I believe I actually wrote a PR to do some of this a long time ago. Specifically, I wrote a tool for reducing partition specs: https://github.com/apache/iceberg/pull/3462

> On Feb 20, 2024, at 3:26 PM, Jack Ye <yezhao...@gmail.com> wrote:
>
> The feature sounds reasonable to me. If a schema or partition spec is no longer referenced and used for any time travel purpose, then it seems to me that it could be safely pruned through some utility action. If the schema changes frequently and there are many columns, pruning might help reduce the metadata size.
>
> +1 for reopening the issue to discuss further. We should probably also make the title more specific than "The metadata file is too large".
>
> Best,
> Jack Ye
>
> On Tue, Feb 20, 2024 at 9:55 AM Sung Yun (BLOOMBERG/ 120 PARK) <syu...@bloomberg.net> wrote:
>> Hi Barron, we've noticed the same issue as well, ever since this PR was merged to introduce schema versions: https://github.com/apache/iceberg/pull/2096
>>
>> There's a closed issue where folks were discussing options for remediating this problem, which also has links to other related PRs and issues: https://github.com/apache/iceberg/issues/5219. Should we reopen this issue and converge our discussion points on the main discussion thread?
>>
>> Sung
>>
>> From: barron....@twosigma.com At: 02/20/24 12:34:57 UTC-5:00
>> To: dev@iceberg.apache.org
>> Subject: Table Schema History Pruning
>>
>> Hi folks,
>>
>> I have a few questions regarding the schema history of an Iceberg table.
>>
>> The table metadata file keeps track of every table schema version (at least in v2). Depending on the size of the schema, this history can become large in terms of byte size.
>>
>> 1. Is removing a schema from the history correct when the table does not have a snapshot referencing that schema version?
>> 2. Does the Iceberg spec guarantee that the history is not pruned?
>> 3. Are there any plans for Iceberg to support pruning the schema history?
>>
>> Thanks,
>> Barron Wei
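
For reference, the pruning rule discussed in this thread (drop schemas that no snapshot references) could be sketched roughly as below. This is only an illustration operating on a parsed metadata.json dictionary, not an Iceberg API: the function names `referenced_schema_ids` and `prune_unreferenced_schemas` are hypothetical, and the sample metadata is made up. It assumes the v2 metadata field names `schemas`, `current-schema-id`, `snapshots`, and the optional per-snapshot `schema-id`. Whether such pruning is actually permitted by the spec is exactly the open question above.

```python
import copy


def referenced_schema_ids(metadata: dict) -> set:
    """Collect schema IDs that are still in use (hypothetical helper)."""
    ids = {metadata["current-schema-id"]}
    for snapshot in metadata.get("snapshots", []):
        schema_id = snapshot.get("schema-id")
        if schema_id is None:
            # Assumption: a snapshot without a schema-id cannot be mapped to
            # a schema, so be conservative and treat every schema as in use.
            return {s["schema-id"] for s in metadata["schemas"]}
        ids.add(schema_id)
    return ids


def prune_unreferenced_schemas(metadata: dict) -> dict:
    """Return a copy of the metadata without unreferenced schemas."""
    keep = referenced_schema_ids(metadata)
    pruned = copy.deepcopy(metadata)
    pruned["schemas"] = [s for s in pruned["schemas"] if s["schema-id"] in keep]
    return pruned


# Made-up example: schema 0 is no longer referenced by any snapshot.
metadata = {
    "format-version": 2,
    "current-schema-id": 2,
    "schemas": [
        {"schema-id": 0, "type": "struct", "fields": []},
        {"schema-id": 1, "type": "struct", "fields": []},
        {"schema-id": 2, "type": "struct", "fields": []},
    ],
    "snapshots": [
        {"snapshot-id": 100, "schema-id": 1},
        {"snapshot-id": 101, "schema-id": 2},
    ],
}

pruned = prune_unreferenced_schemas(metadata)
remaining_ids = [s["schema-id"] for s in pruned["schemas"]]
```

The sketch returns a new dictionary instead of mutating its input, so the original metadata stays available for comparison. A real utility would also need to consider branch and tag references and would have to commit the new metadata file through the catalog.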