Re: [DISCUSS] Adding Tags field to Iceberg V4

Russell Spitzer Fri, 27 Mar 2026 13:43:14 -0700

I'm also leaning towards not adding this to the spec. The more I think
about it, the more it feels like it will just be a way to "fork" Iceberg
with vendor-specific functionality. If someone wants to do that, they can
always just add fields to the metadata they generate, but I'm not sure we
should explicitly bless it.


The more I think about the copy behavior, the less I like the idea of
having an "outside definition" of the field. If we always drop the field,
then what's the point of having it in the spec? If we do copy it, how can
we ensure the copy won't invalidate the value stored? It just feels like we
aren't really getting anything out of this change.
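
To make the drop-or-copy dilemma concrete, here is a minimal Python sketch.
It is an illustration only: `DataFile`, `rewrite`, and the way `footer_size`
is stored are hypothetical names chosen for this example, not Iceberg APIs
and not anything defined in the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical stand-in for a manifest entry describing one data file."""
    path: str
    size_bytes: int
    # Hypothetical optional key-value field whose meaning is defined outside the spec.
    attributes: dict = field(default_factory=dict)


def rewrite(source: DataFile, new_path: str, new_size: int, copy_attributes: bool) -> DataFile:
    """Model an engine rewriting a file without understanding its attributes."""
    attrs = dict(source.attributes) if copy_attributes else {}
    return DataFile(new_path, new_size, attrs)


original = DataFile("a.parquet", 1000, {"footer_size": "120"})

# Option 1: always drop. The rewritten file is safe, but the value never survives.
dropped = rewrite(original, "b.parquet", 1800, copy_attributes=False)

# Option 2: blindly copy. The value survives, but it still describes a.parquet;
# the rewriting engine has no way to know whether it is valid for b.parquet.
copied = rewrite(original, "b.parquet", 1800, copy_attributes=True)

print(dropped.attributes)
print(copied.attributes)
```

Neither branch needs the rewriting engine to understand the key, which is
the point: dropping loses the value, and copying may carry a stale one.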

On Thu, Mar 26, 2026 at 6:14 PM Micah Kornfield <[email protected]>
wrote:

> Hi Prashant,
> Unfortunately, I have conflicts on Wednesdays for the foreseeable future
> at that time.  Hopefully between the sync and mailing list we can figure
> out a path forward.  If anybody else has feedback please add it to the
> Google doc or reply to the thread and I can address it.
>
> Thanks,
> Micah
>
> On Thursday, March 26, 2026, Prashant Singh <[email protected]>
> wrote:
>
>> Thank you for being flexible, Micah. How about we add this as an agenda
>> item for the Iceberg community sync, which is just a day after at 9 pm? A
>> lot of folks join, so we will have better participation. It seems like we
>> would have time to talk, since I see the agenda is still open. If we can't
>> conclude, we can have a dedicated sync for it.
>>
>> Best,
>> Prashant Singh
>>
>> On Thu, Mar 26, 2026 at 3:23 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Thanks Kevin for accepting.  Thanks for your feedback Prashant, since
>>> you have been active reviewing, I moved the event to Tuesday at a time that
>>> you mentioned you would be available, hopefully this doesn't exclude
>>> anybody else who wants to join the conversation.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Mar 26, 2026 at 9:52 AM Prashant Singh <[email protected]>
>>> wrote:
>>>
>>>> Thanks for bumping this thread, Micah, and thank you for all the work! I
>>>> missed this thread completely, apologies for that. I have so far been
>>>> responding on the design docs (it would be nice to link the ML to the doc too).
>>>>
>>>> As for feedback, I am not supportive of this proposal, and I am looking
>>>> forward to hearing from other community members on why, despite these
>>>> severe cons, we should be doing it, especially given that we have a clear,
>>>> aligned path for introducing these fields in a backward-compatible way.
>>>>
>>>> Here are my reservations:
>>>> 1/ While the proposal says one can limit the size (512B by default), it
>>>> also says the limit is configurable. This can severely impact the number
>>>> of entries we can have in a manifest file. We went through the whole
>>>> exercise of whether we should have inline manifest DVs or not, and based
>>>> on the tradeoffs we concluded one over the other. Allowing this much
>>>> space, in the worst case per data file inside the manifest, can severely
>>>> impact the query planning time and query execution cost (more I/O) of
>>>> Iceberg readers, which may be different from whoever produced the
>>>> Iceberg data set.
>>>> 2/ It works on the assumption that we need a spec version bump to add new
>>>> fields, which I think is not completely true. We added things like the
>>>> partition stats / statistics field as optional, and I don't understand
>>>> why we can't do the same, especially for things like schema_id and
>>>> footer_size, which are mentioned as motivation. I think the community
>>>> was pretty aligned on having schema_id as an optional field for writer
>>>> backward compatibility, with all new writers taking advantage of it [1].
>>>> 3/ One of the stated motivations is to support vendors' proprietary
>>>> metadata for their proprietary clustering algorithms. To me this looks
>>>> like a way to work around the spec and let the Iceberg metadata layout
>>>> carry information that doesn't mean anything to the Iceberg ecosystem and
>>>> can compromise interoperability.
>>>> Also think of a case where Vendor A starts producing something in
>>>> partnership with Vendor B and, to make things worse, encrypts it so that
>>>> Vendor C, who is not in this partnership, cannot see it. IMHO we should
>>>> not open up new ways that hurt interop.
>>>>
>>>> I also want to thank you for proposing the meeting. Unfortunately, the
>>>> proposed time doesn't work for me, as I have a conflicting meeting. Please
>>>> feel free to proceed without me; I can watch the recording later as well.
>>>> As far as my support is concerned, I look forward to answers that strongly
>>>> support this use case and explain why we should be OK accepting these
>>>> cons, given that we already had a well-thought-out path to move forward.
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/4898
>>>>
>>>> Best,
>>>> Prashant Singh
>>>>
>>>>
>>>>
>>>> On Wed, Mar 25, 2026 at 3:22 PM Kevin Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> I added/accepted on the dev calendar. Looking forward to it!
>>>>>
>>>>> On Tue, Mar 24, 2026 at 5:34 PM Micah Kornfield <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> It seems like we might not have full alignment on this proposal, so I
>>>>>> tentatively scheduled a sync for next Monday (added to the Iceberg dev
>>>>>> events calendar).  Please let me know if you are interested in joining
>>>>>> and the time doesn't work for you (we can reschedule accordingly).
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On 2026/02/09 23:15:49 Micah Kornfield wrote:
>>>>>> > As an update I've made the proposal to add this field to the Single
>>>>>> file
>>>>>> > commits doc.
>>>>>> >
>>>>>> > Please let me know if there is any additional feedback.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Micah
>>>>>> >
>>>>>> > On Wed, Jan 21, 2026 at 5:16 PM Micah Kornfield <
>>>>>> [email protected]>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > Thanks Manu, that is the right doc.
>>>>>> > >
>>>>>> > > As an update, I've incorporated feedback from the community to the
>>>>>> > > document:
>>>>>> > >
>>>>>> > > At a high level the changes are:
>>>>>> > > - Renamed the field from "tags" to "attributes"
>>>>>> > > - Clarified limits on attributes should only be enforced for new
>>>>>> data.
>>>>>> > > Existing tags must always be carried through.
>>>>>> > > - Added more details on enforcing size of tags.
>>>>>> > >
>>>>>> > > Are there any objections to folding the proposal into the V4
>>>>>> metadata
>>>>>> > > proposal?  Again, the reasons for doing so are mostly around
>>>>>> ensuring
>>>>>> > > consistent field numbering and making the spec update easier.
>>>>>> > >
>>>>>> > > If people want further discussion on this I'd be happy to discuss
>>>>>> at the
>>>>>> > > next V4 metadata sync or create a one-off meeting.  Please let me
>>>>>> know.
>>>>>> > >
>>>>>> > > Thanks,
>>>>>> > > Micah
>>>>>> > >
>>>>>> > > On Mon, Jan 5, 2026 at 5:48 PM Manu Zhang <
>>>>>> [email protected]> wrote:
>>>>>> > >
>>>>>> > >> Happy new year Micah. Are you linking the wrong doc (Iceberg
>>>>>> Single File
>>>>>> > >> Commits) ?
>>>>>> > >> I think you are referring to
>>>>>> > >>
>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>> > >>
>>>>>> > >> Best,
>>>>>> > >> Manu
>>>>>> > >>
>>>>>> > >> On Tue, Jan 6, 2026 at 2:19 AM Micah Kornfield <
>>>>>> [email protected]>
>>>>>> > >> wrote:
>>>>>> > >>
>>>>>> > >>> Happy new year everyone, I just wanted to bump this thread (most
>>>>>> > >>> discussion has been happening on the doc [1]) in case it was
>>>>>> missed over
>>>>>> > >>> the holidays.
>>>>>> > >>>
>>>>>> > >>> Thanks,
>>>>>> > >>> Micah
>>>>>> > >>>
>>>>>> > >>> [1]
>>>>>> > >>>
>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>> > >>>
>>>>>> > >>> On Fri, Dec 19, 2025 at 2:14 PM Micah Kornfield <
>>>>>> [email protected]>
>>>>>> > >>> wrote:
>>>>>> > >>>
>>>>>> > >>>> Sounds good, will wait until next year.
>>>>>> > >>>>
>>>>>> > >>>> On Fri, Dec 19, 2025 at 2:13 PM Steven Wu <
>>>>>> [email protected]> wrote:
>>>>>> > >>>>
>>>>>> > >>>>> Micah, many people will be OOO in the next two weeks. Can we
>>>>>> extend
>>>>>> > >>>>> the feedback deadline to at least 1-2 weeks after the new
>>>>>> year?
>>>>>> > >>>>>
>>>>>> > >>>>> On Fri, Dec 19, 2025 at 8:45 AM Micah Kornfield <
>>>>>> [email protected]>
>>>>>> > >>>>> wrote:
>>>>>> > >>>>>
>>>>>> > >>>>>> > I have no problem with adding this discussion to the
>>>>>> single file
>>>>>> > >>>>>> work, but I'm not sure that would speed it up? Seems like
>>>>>> this is a pretty
>>>>>> > >>>>>> independent addition to the metadata layout?
>>>>>> > >>>>>>
>>>>>> > >>>>>> Yes, it is fairly independent.  The main reason I wanted to
>>>>>> > >>>>>> consolidate in the doc is that there appears to be a bit of
>>>>>> metadata
>>>>>> > >>>>>> re-arrangement and new fields.  I wanted to make sure that:
>>>>>> > >>>>>>
>>>>>> > >>>>>> 1.  We avoid field ID conflicts.
>>>>>> > >>>>>> 2.  When writing up the final spec changes it is easy to
>>>>>> manage and
>>>>>> > >>>>>> not create a dependency one way or another between the two
>>>>>> of these.
>>>>>> > >>>>>>
>>>>>> > >>>>>> Happy to keep the implementation of the guard-rails as a
>>>>>> separate
>>>>>> > >>>>>> piece of work.
>>>>>> > >>>>>>
>>>>>> > >>>>>> Cheers,
>>>>>> > >>>>>> Micah
>>>>>> > >>>>>>
>>>>>> > >>>>>> On Fri, Dec 19, 2025 at 7:31 AM Russell Spitzer <
>>>>>> > >>>>>> [email protected]> wrote:
>>>>>> > >>>>>>
>>>>>> > >>>>>>> I have no problem with adding this discussion to the single
>>>>>> file
>>>>>> > >>>>>>> work, but I'm not sure that would speed it up? Seems like
>>>>>> this is a pretty
>>>>>> > >>>>>>> independent addition to the metadata layout?
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> On Thu, Dec 18, 2025 at 6:28 PM Micah Kornfield <
>>>>>> > >>>>>>> [email protected]> wrote:
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly
>>>>>> call out
>>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags
>>>>>> must be strictly
>>>>>> > >>>>>>>>> optional and never required for correctness or basic
>>>>>> functionality. Engines
>>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags
>>>>>> without breaking reads
>>>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal
>>>>>> behavior (e.g.,
>>>>>> > >>>>>>>>> extra I/O), as you described.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> 100% I will also add this summary to the bottom of the
>>>>>> requirements
>>>>>> > >>>>>>>> section.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Based on mailing list discussion and doc comments (or lack
>>>>>> > >>>>>>>> thereof), it does not seem like there are strong
>>>>>> objections to adding this
>>>>>> > >>>>>>>> for V4.  Prashant seemed to maybe have concerns, so I'd
>>>>>> like to understand
>>>>>> > >>>>>>>> if they are blockers.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> If there isn't additional feedback by the end of next
>>>>>> week, I'd
>>>>>> > >>>>>>>> like to assume a lazy consensus and consolidate this with
>>>>>> the single file
>>>>>> > >>>>>>>> improvement work, which has already reorganized the
>>>>>> metadata schema [1].
>>>>>> > >>>>>>>> Please let me know if there is a different process.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Thanks,
>>>>>> > >>>>>>>> Micah
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> [1]
>>>>>> > >>>>>>>>
>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> On Wed, Dec 17, 2025 at 5:38 PM Yufei Gu <
>>>>>> [email protected]>
>>>>>> > >>>>>>>> wrote:
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>>> Thanks for the clarification, Micah! I want to explicitly
>>>>>> call out
>>>>>> > >>>>>>>>> (and double-confirm) the key principle here: all tags
>>>>>> must be strictly
>>>>>> > >>>>>>>>> optional and never required for correctness or basic
>>>>>> functionality. Engines
>>>>>> > >>>>>>>>> should always be able to safely drop or ignore tags
>>>>>> without breaking reads
>>>>>> > >>>>>>>>> or writes, with the only possible impact being suboptimal
>>>>>> behavior (e.g.,
>>>>>> > >>>>>>>>> extra I/O), as you described.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> As long as this constraint is clearly stated and
>>>>>> enforced, the
>>>>>> > >>>>>>>>> trade-off feels reasonable to me.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Yufei
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> On Mon, Dec 15, 2025 at 4:28 PM Micah Kornfield <
>>>>>> > >>>>>>>>> [email protected]> wrote:
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>>> Hi Yufei,
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain
>>>>>> reasons (like
>>>>>> > >>>>>>>>>>> clustering algorithm), would a data file
>>>>>> rewrite (compaction) by another
>>>>>> > >>>>>>>>>>> engine remove the tag and break the engine relying on
>>>>>> it?
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> The intent here is that dropping tags should never break
>>>>>> an
>>>>>> > >>>>>>>>>> engine.  But it could cause suboptimal operations.  For
>>>>>> instance, one
>>>>>> > >>>>>>>>>> example I brought up in the doc is using tags to cache
>>>>>> parquet footer size,
>>>>>> > >>>>>>>>>> to make sure it is fetched in 1 I/O.
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> In this case the following would occur.
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> 1.  Engine 1 does a write to file 1 and records its
>>>>>> footer size
>>>>>> > >>>>>>>>>> in tags.
>>>>>> > >>>>>>>>>> 2.  Engine 2 does a rewrite/compaction and produces
>>>>>> File 2
>>>>>> > >>>>>>>>>> without tags.
>>>>>> > >>>>>>>>>> 3.  Engine 1 then tries to read file 2.  The tag for
>>>>>> footer
>>>>>> > >>>>>>>>>> length is missing so it falls back to reading a reasonable
>>>>>> number of bytes
>>>>>> > >>>>>>>>>> from the end of the parquet file, hoping the entire
>>>>>> footer is retrieved
>>>>>> > >>>>>>>>>> (and if it isn't a second I/O is necessary).
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> Similarly for clustering algorithms, I think the result
>>>>>> could
>>>>>> > >>>>>>>>>> yield a sub-optimally clustered table, or perhaps
>>>>>> redundant clustering
>>>>>> > >>>>>>>>>> operations but shouldn't break anything. This is no
>>>>>> worse than the case
>>>>>> > >>>>>>>>>> today though if engine 1 and engine 2 have different
>>>>>> clustering algorithms
>>>>>> > >>>>>>>>>> and they are being run in interleaved fashion on the
>>>>>> same table.  In this
>>>>>> > >>>>>>>>>> case it is highly likely that some amount of duplicate
>>>>>> compaction is
>>>>>> > >>>>>>>>>> happening.
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> In the current proposal, any metadata that is required
>>>>>> for proper
>>>>>> > >>>>>>>>>> functioning should never be put in tags.
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> Thanks,
>>>>>> > >>>>>>>>>> Micah
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> On Mon, Dec 15, 2025 at 4:02 PM Yufei Gu <
>>>>>> [email protected]>
>>>>>> > >>>>>>>>>> wrote:
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>>> Thanks for the proposal!
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> If one engine started to rely on a tag for certain
>>>>>> reasons (like
>>>>>> > >>>>>>>>>>> clustering algorithm), would a data file
>>>>>> rewrite (compaction) by another
>>>>>> > >>>>>>>>>>> engine remove the tag and break the engine relying on
>>>>>> it?
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> Yufei
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> On Wed, Dec 10, 2025 at 2:58 PM Micah Kornfield <
>>>>>> > >>>>>>>>>>> [email protected]> wrote:
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> Hi Iceberg Dev,
>>>>>> > >>>>>>>>>>>> I added a proposal [1] to add a key-value tags field
>>>>>> for files
>>>>>> > >>>>>>>>>>>> in V4 metadata [2].  More details are in the document
>>>>>> but the intent is to
>>>>>> > >>>>>>>>>>>> allow engines to store optional metadata associated
>>>>>> with these files:
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> 1.  The proposed field is optional and cannot be used
>>>>>> for
>>>>>> > >>>>>>>>>>>> metadata required for reading the table correctly.
>>>>>> > >>>>>>>>>>>> 2.  It also proposes guard-rails for not letting tags
>>>>>> cause
>>>>>> > >>>>>>>>>>>> metadata bloat.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> Looking forward to hearing everyone's thoughts and
>>>>>> feedback.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> Thanks,
>>>>>> > >>>>>>>>>>>> Micah
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> [1] https://github.com/apache/iceberg/issues/14815
>>>>>> > >>>>>>>>>>>> [2]
>>>>>> > >>>>>>>>>>>>
>>>>>> https://docs.google.com/document/d/16flxDXjpBiAs_cF3sjCsa7GlvSHQ0Mmm74c8yvYQlSA/edit?tab=t.0#heading=h.cnpb2lth3egz
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>>
>>>>>> >
>>>>>>
>>>>>
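
The footer-size scenario from the quoted 15 Dec 2025 message (a tag lets
the reader fetch the Parquet footer in one I/O, and a missing tag only costs
a possible second I/O) can be sketched roughly as follows. This is a hedged
illustration, not the proposal's implementation: `read_footer`,
`footer_size_hint`, `fake_parquet`, and the 64 KiB default guess are
assumptions made for this example. The only format fact relied on is that a
Parquet file ends with the footer, a 4-byte little-endian footer length, and
the magic bytes "PAR1".

```python
import io
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8              # 4-byte little-endian footer length + 4-byte magic
DEFAULT_GUESS = 64 * 1024    # assumed speculative tail read size when no hint exists


def read_tail(f, file_size, n):
    """Read the last n bytes of the file (capped at the file size)."""
    n = min(n, file_size)
    f.seek(file_size - n)
    return f.read(n)


def read_footer(f, file_size, footer_size_hint=None, guess=DEFAULT_GUESS):
    """Return (footer_bytes, number_of_reads).

    footer_size_hint models the optional tag; None means it was dropped.
    """
    want = footer_size_hint + TRAILER_LEN if footer_size_hint is not None else guess
    tail = read_tail(f, file_size, want)
    reads = 1
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    footer_len = struct.unpack("<I", tail[-8:-4])[0]
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("corrupt footer length")
    if needed > len(tail):
        # Missing or too-small hint: correctness is kept, at the cost of one more read.
        tail = read_tail(f, file_size, needed)
        reads += 1
    return tail[-needed:-TRAILER_LEN], reads


def fake_parquet(footer, body=b"x" * 100):
    """Build bytes with a Parquet-style trailer, for demonstration only."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


data = fake_parquet(b"F" * 200)
with_hint = read_footer(io.BytesIO(data), len(data), footer_size_hint=200)
without_hint = read_footer(io.BytesIO(data), len(data), guess=64)
print(with_hint[1], without_hint[1])
```

In this sketch a hint that is too small is handled by the same fallback
path as a missing one, so the reader never depends on the tag for
correctness, only for saving a read.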
