Hi Yufei,

> If one engine started to rely on a tag for certain reasons(like clustering
> algorithm), would data file rewrite(compaction) by another engine remove
> the tag, and break the engine relying on it.

The intent here is that dropping tags should never break an engine, though it could lead to suboptimal operations. For instance, one example I gave in the doc is using tags to cache the Parquet footer size, so that the footer can be fetched in a single I/O. In this case the following would occur:

1. Engine 1 writes file 1 and records its footer size in tags.
2. Engine 2 does a rewrite/compaction and produces file 2 without tags.
3. Engine 1 then tries to read file 2. The tag for footer length is missing, so it falls back to reading a reasonable number of bytes from the end of the Parquet file, hoping the entire footer is retrieved (if it isn't, a second I/O is necessary).

Similarly for clustering algorithms, I think the result could be a sub-optimally clustered table, or perhaps redundant clustering operations, but it shouldn't break anything. This is no worse than the case today, though, if engine 1 and engine 2 have different clustering algorithms and they are run in an interleaved fashion on the same table. In that case it is highly likely that some amount of duplicate compaction is happening.

In the current proposal, any metadata that is required for proper functioning should never be put in tags.

Thanks,
Micah

On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]> wrote:

> Thanks for the proposal!
>
> If one engine started to rely on a tag for certain reasons(like clustering
> algorithm), would data file rewrite(compaction) by another engine remove
> the tag, and break the engine relying on it.
>
> Yufei
>
>
> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Iceberg Dev,
>> I added a proposal [1] to add a key-value tags field for files in V4
>> metadata [2]. More details are in the document but the intent is to allow
>> engines to store optional metadata associated with these files:
>>
>> 1. The proposed field is optional and cannot be used for metadata
>> required for reading the table correctly.
>> 2. It also proposes guard-rails for not letting tags cause metadata
>> bloat.
>>
>> Looking forward to hearing everyone's thoughts and feedback.
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/iceberg/issues/14815
>> [2]
>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>
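
P.S. To make the footer-size fallback from my reply concrete, here is a rough Python sketch of how a reader might treat the tag as a pure hint. Everything named here is an assumption for illustration only: the tag key "footer-size", its meaning (serialized footer metadata length plus the 8-byte trailer), the 64 KiB fallback guess, and the CountingReader helper that stands in for ranged object-store reads. The only thing taken from the Parquet format itself is the file tail layout: a 4-byte little-endian metadata length followed by the "PAR1" magic.

```python
import struct

MAGIC = b"PAR1"
FOOTER_SIZE_TAG = "footer-size"   # hypothetical tag key, not from the proposal
DEFAULT_TAIL_GUESS = 64 * 1024    # assumed fallback read size


class CountingReader:
    """Wraps bytes and counts tail reads, standing in for object-store I/O."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read_tail(self, n):
        self.reads += 1
        n = min(n, len(self.data))
        return self.data[len(self.data) - n:]


def read_footer(reader, tags=None, guess=DEFAULT_TAIL_GUESS):
    """Return the serialized footer metadata bytes.

    The tag only sizes the first read. The length stored in the file tail
    is what is trusted, so a missing, malformed, or stale tag costs at most
    one extra read and never affects correctness.
    """
    hint = None
    if tags and FOOTER_SIZE_TAG in tags:
        try:
            hint = int(tags[FOOTER_SIZE_TAG])
        except ValueError:
            hint = None  # treat a malformed tag like a missing one
    first = hint if hint is not None and hint >= 8 else guess
    tail = reader.read_tail(first)
    if len(tail) < 8 or tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (meta_len,) = struct.unpack("<I", tail[-8:-4])
    needed = meta_len + 8
    if needed > len(tail):
        tail = reader.read_tail(needed)  # the second I/O
        if len(tail) < needed:
            raise ValueError("file is shorter than its declared footer")
    return tail[-needed:-8]


def make_fake_file(meta_len):
    """Build bytes shaped like a Parquet file tail (contents are dummy data)."""
    meta = b"m" * meta_len
    return MAGIC + b"d" * 1000 + meta + struct.pack("<I", meta_len) + MAGIC
```

With a footer larger than the guess, a reader holding the tag needs one read, while a reader without it (for example after another engine's compaction dropped the tag) needs two. Both get the same bytes back, which is the property the proposal relies on.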
