Re: [DISCUSS] Adding Tags field to Iceberg V4

Micah Kornfield Mon, 15 Dec 2025 16:28:10 -0800

Hi Yufei,

> If one engine started to rely on a tag for certain reasons (like a
> clustering algorithm), would a data file rewrite (compaction) by another
> engine remove the tag and break the engine relying on it?

The intent here is that dropping tags should never break an engine, although
it could make some operations suboptimal.  For instance, one example I
brought up in the doc is using tags to cache the Parquet footer size, so
that the footer can be fetched in a single I/O.

In this case, the following would occur:

1.  Engine 1 writes File 1 and records its footer size in tags.
2.  Engine 2 does a rewrite/compaction and produces File 2 without tags.
3.  Engine 1 then tries to read File 2.  The tag for footer length is
missing, so it falls back to reading a reasonable number of bytes from the
end of the Parquet file, hoping the entire footer is retrieved (and if it
isn't, a second I/O is necessary).
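
The read path in step 3 can be sketched roughly as follows.  This is only
an illustration, not part of the proposal: the tag key "footer-size", the
64 KiB fallback guess, and the CountingReader helper are all hypothetical
names.  The only format fact it relies on is that a Parquet file ends with
a 4-byte little-endian footer length followed by the magic bytes "PAR1".

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
DEFAULT_TAIL_GUESS = 64 * 1024  # hypothetical fallback read size


class CountingReader:
    """Hypothetical helper: wraps a seekable binary file and counts tail reads."""

    def __init__(self, raw):
        self.raw = raw
        self.reads = 0
        self.size = raw.seek(0, io.SEEK_END)

    def read_tail(self, n):
        # Read the last n bytes of the file (or the whole file if shorter).
        n = min(n, self.size)
        self.raw.seek(self.size - n)
        self.reads += 1
        return self.raw.read(n)


def read_footer(reader, tags=None, tail_guess=DEFAULT_TAIL_GUESS):
    """Return the raw footer bytes, using the tag only as an optional hint."""
    hint = (tags or {}).get("footer-size")  # hypothetical tag key
    try:
        # Footer plus the 8 trailing bytes (length + magic).
        want = int(hint) + 8 if hint is not None else tail_guess
    except ValueError:
        want = tail_guess  # an unparseable tag is treated like a missing one
    tail = reader.read_tail(max(want, 8))

    if len(tail) < 8 or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    if footer_len + 8 > reader.size:
        raise ValueError("corrupt footer length")

    # Missing or stale hint: the first read was too short, so read again.
    if footer_len + 8 > len(tail):
        tail = reader.read_tail(footer_len + 8)
    return tail[-(footer_len + 8):-8]
```

With a correct tag the footer arrives in one read; with no tag (or a wrong
one) the result is the same and only the number of reads changes.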

Similarly, for clustering algorithms, I think the result could be a
sub-optimally clustered table, or perhaps redundant clustering operations,
but it shouldn't break anything.  This is no worse than the case today,
though, if engine 1 and engine 2 have different clustering algorithms and
they are run in interleaved fashion on the same table.  In that case it is
highly likely that some amount of duplicate compaction is already happening.
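
As a rough sketch of why a dropped clustering tag costs work but not
correctness: a planner can treat a missing tag the same as "not clustered
by me" and simply re-select the file.  The tag key, the algorithm name, and
the DataFile class below are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field

CLUSTER_TAG = "engine1.clustering"  # hypothetical tag key
CURRENT_ALGO = "zorder-v2"          # hypothetical algorithm identifier


@dataclass
class DataFile:
    path: str
    tags: dict = field(default_factory=dict)


def files_to_recluster(files):
    """Select files whose tag is missing or names a different algorithm."""
    return [f for f in files if f.tags.get(CLUSTER_TAG) != CURRENT_ALGO]
```

A file rewritten by another engine without tags is picked up again, which
may be redundant, but reads of the table are unaffected.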

In the current proposal, any metadata that is required for proper
functioning should never be put in tags.

Thanks,
Micah

On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <[email protected]> wrote:

> Thanks for the proposal!
>
> If one engine started to rely on a tag for certain reasons (like a
> clustering algorithm), would a data file rewrite (compaction) by another
> engine remove the tag and break the engine relying on it?
>
> Yufei
>
>
> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Iceberg Dev,
>> I added a proposal [1] to add a key-value tags field for files in V4
>> metadata [2].  More details are in the document, but the intent is to
>> allow engines to store optional metadata associated with these files:
>>
>> 1.  The proposed field is optional and cannot be used for metadata
>> required for reading the table correctly.
>> 2.  It also proposes guardrails to keep tags from causing metadata
>> bloat.
>>
>> Looking forward to hearing everyone's thoughts and feedback.
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/iceberg/issues/14815
>> [2]
>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>
>>
