Re: [DISCUSS] Table and Column Label Metadata in Iceberg REST Catalog

Prashant Singh Mon, 29 Jun 2026 08:32:37 -0700

Hey Folks,
Is this meeting not open to everyone? I have been waiting in the waiting
room for some time.
Can someone let me in?


On Fri, Jun 26, 2026 at 7:59 AM Andrei Tserakhau via dev <
[email protected]> wrote:

> Hi all,
>
> Next labels sync is scheduled for Monday, June 29, 2026, 8am PT /
> 5pm CET. Earlier US-time slot to accommodate European attendees.
>
> Meet link: [https://meet.google.com/ezz-bvff-wjt]
>
> Agenda:
>
>    - Updates since last sync (May 28)
>    - Labels / Tag boundary, topic from EJ Wang
>    - Write-path sketch walk-through
>    - ETag-based caching and concurrency (new topic, surfaced at the
>    recent IRC catalog community sync)
>    - Path to VOTE on the read API
>
> Concrete changes in the proposal [1] and spec PR [2] since the
> May 28 recap:
>
>    - Schema consolidated: labels and column-labels merged into a
>    single Labels object with table and columns sub-properties
>    (per Daniel Weeks' May 28 review). Resolves the "two fields
>    representing the same thing" feedback on the PR. JSON example
>    and appendix updated.
>    - Hot-Path Discipline added: LoadTableResponse latency MUST NOT
>    increase due to labels; how catalogs meet this is
>    implementation-defined (caching, freshness trade-offs,
>    filtering). Added in response to Christian Thiel's design doc
>    comment.
>    - Governance scope clarified: labels carry context, not
>    enforcement decisions. Enforcement semantics and policy vocabulary
>    live in other spec layers (Read Restrictions, PR #13879 [3]).
>    - Open Questions cleaned up: write path summarized with ETag-based
>    optimistic concurrency + two-class distinction (catalog-managed
>    vs externally-managed keys); structured classification layer
>    reframed as the future authoring companion to labels as the
>    read surface.
>
> Goal on Monday: walk through the updated proposal, work through
> remaining concerns on the read API, and identify what's needed to
> move toward a VOTE.
>
> Thanks,
> Andrei
>
> [1]
> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
> [2] https://github.com/apache/iceberg/pull/15750
> [3] https://github.com/apache/iceberg/pull/13879
>
> On Fri, Jun 12, 2026 at 3:56 AM EJ Wang <[email protected]>
> wrote:
>
>> Thanks Andrei for the recap. I want to clarify one point on the
>> labels-vs-Tag boundary, mostly to avoid having the labels discussion
>> pre-decide questions that belong in the parallel Tag discussion.
>>
>> I agree with the use cases motivating labels: exposing catalog-managed
>> context such as ownership, domain, cost attribution, classification hints,
>> and semantic hints in REST responses can be useful for engines and clients.
>> The part I am less sure about is whether those use cases require a separate
>> Labels concept in the spec, or whether they should be modeled as projected
>> metadata from a structured Tag/classification model.
>>
>> My concern is dependency direction. If we introduce labels as a flat
>> generic primitive first, and later add structure for identity, lifecycle,
>> allowed values, inheritance, field-id attachment, visibility, and reverse
>> lookup, then we may end up reconstructing a Tag model around labels. That
>> feels less clear than defining the structured model directly and allowing
>> catalogs to project the relevant assignments into REST responses where
>> useful.
>>
>> In other words, I don't think the interesting question is only whether
>> labels should be flat or structured. I think the question is whether labels
>> should be a separate primitive at all, or whether the read-response use
>> cases can be covered by a projected view of structured tag/classification
>> assignments.
>>
>> Where I'd be especially careful is the phrase that tags are
>> "catalog-internal structured concepts." I agree that the full Tag
>> discussion is outside the scope of Labels V1, but I would not want Labels
>> V1 to pre-decide that structured tagging/classification semantics are only
>> catalog-internal and not an IRC concept. That is exactly the separate
>> question being explored in the Tag thread.
>>
>> The factoring I'd prefer to evaluate is:
>>
>>    - Tag (structured) classification: authoring, lifecycle, identity,
>>    field-id attachment, inheritance, visibility, and lookup semantics
>>    - REST response projection: optional metadata returned to clients,
>>    potentially derived from structured tag assignments
>>    - read-restrictions: enforcement result delivered to engines
>>
>> That framing may reduce the need for a separate Labels primitive while
>> still preserving the read-response use cases that motivated the labels
>> proposal.
>>
>> I realize this may be a bigger factoring question than Labels V1 intended
>> to answer, but I think it is worth making explicit before the two threads
>> diverge. If the community wants one logical concept rather than both labels
>> and tags, I think we should at least evaluate the direction where the
>> structured Tag/classification model is the source of truth and lightweight
>> REST response metadata is a projection from it, before standardizing labels
>> as an independent primitive.
>>
>> -ej
>>
>> On Thu, Jun 11, 2026 at 11:53 AM Andrei Tserakhau via dev <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Recap from the dedicated labels sync held on May 28, 2026
>>> (recording [1]).
>>>
>>> Summary of the discussion:
>>>
>>>    - Strong consensus to land the read API first, with the write API
>>>    as a separate follow-up proposal (Ryan, Sung, Kevin, Christian
>>>    aligned). Christian raised a concern that the write half could
>>>    lag behind (Trino views precedent); to address this, the
>>>    proposal will document the write-path direction alongside the
>>>    read API.
>>>    - Labels remain flat key-value pairs, no internal structure.
>>>    Kubernetes labels precedent invoked: flat shape, conventions
>>>    via well-known prefixes, no spec-defined vocabulary.
>>>    Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris)
>>>    discussed and set aside in favor of prefix conventions.
>>>    - Labels-vs-Tag boundary: labels are the wire-protocol mechanism
>>>    for cross-catalog metadata exchange (this proposal); tags are
>>>    catalog-internal structured concepts (Snowflake, UC, Polaris
>>>    each have their own shape). Standardizing Tag itself as a
>>>    first-class spec entity is a separate effort, not in scope for
>>>    V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that
>>>    direction.
>>>    - Governance scope: Prashant Singh raised concerns about
>>>    positioning labels as a governance protocol: provenance,
>>>    identity mapping across IDPs, inheritance semantics. Room
>>>    aligned that labels are broader than governance: semantic
>>>    metadata exchange is the load-bearing case; governance remains
>>>    a valid use case among many, and whether to use labels for
>>>    governance is a catalog-level decision rather than a spec
>>>    mandate. Policy decisions and enforcement live in read
>>>    restrictions (PR #13879 [3]), a parallel and complementary
>>>    track.
>>>    - Write API shape converging on an independent CRUD endpoint
>>>    (UpdateLabels-style verb) with a transactional path for atomic
>>>    table+label operations at create/alter time. Two-class
>>>    distinction (catalog-authored vs externally-managed labels)
>>>    reaffirmed; Ryan noted not all labels should be editable via
>>>    CRUD since many are produced by the catalog through inheritance,
>>>    classification, or automated paths.
>>>    - Bulk APIs surfaced as a real need for both read (inverted
>>>    index: finding tables/columns matching given labels) and write
>>>    (applying labels at scale, classifier batch operations). Scoped
>>>    for inclusion in the write API proposal.
>>>    - Pattern for adding new first-class REST concepts (labels, UDFs,
>>>    indexes, etc.): independent CRUD endpoint per concept, paired
>>>    with a transactional path for atomic operations alongside table
>>>    create/alter. Useful reference shape for future spec additions.
>>>
>>> Post-sync follow-ups already in motion:
>>>
>>>    - Hot-path discipline added to the proposal in response to
>>>    Christian Thiel's doc comment: LoadTableResponse latency MUST
>>>    NOT increase due to labels; how catalogs meet this is
>>>    implementation-defined (caching, freshness trade-offs,
>>>    filtering). Capability negotiation, parallel to the work on
>>>    PR #13879 [3], is a future direction.
>>>    - Use case split (high-confidence cross-catalog: semantic, domain,
>>>    classification, sensitivity vs platform-specific: owner,
>>>    principals, anything identity-bound) agreed after offline
>>>    follow-up with Prashant; will be reflected in the next proposal
>>>    revision.
>>>    - A separate [DISCUSS] thread will land the substrate framing
>>>    publicly.
>>>
>>> Next sync approximately three to four weeks out. Tentative agenda:
>>> labels/Tag boundary update, write-path sketch walk-through, path
>>> to VOTE on the read API.
>>>
>>> Thanks to everyone who joined and to those continuing to engage
>>> on the design doc [4] and spec PR #15750 [5].
>>>
>>> Best,
>>> Andrei
>>>
>>> [1] https://youtu.be/P4NOQASNtPA
>>> [2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
>>> [3] https://github.com/apache/iceberg/pull/13879
>>> [4]
>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>> [5] https://github.com/apache/iceberg/pull/15750
>>>
>>> On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau <
>>> [email protected]> wrote:
>>>
>>>> Quick update on this:
>>>> we'll cover it at the Dedicated Sync this Thursday (10-11am US /
>>>> 7-8pm CET). Thanks to Daniel Weeks for getting it on the calendar.
>>>>
>>>> The last time labels were on the sync was 2026-04-15. Plenty of productive
>>>> offline discussion since then, mostly in the gdoc comment threads. Thanks
>>>> to everyone who engaged:
>>>>
>>>>    - *Daniel Weeks* — for the IRC-spec-vs-table-spec framing that now
>>>>    anchors the Alternatives section
>>>>    - *Fokko Driesprong* — for challenging motivation on the cost-based
>>>>    defense and driving the ownership reframe
>>>>    - *Yufei Gu* — for the structure debate that landed us on the split
>>>>    shape
>>>>    - *Sung Yun* — for the early consumption-pattern and addressing
>>>>    questions
>>>>    - *Maninder Parmar* — for the properties-relationship probing
>>>>    - *Christian Thiel* — for pushing on the write API direction
>>>>
>>>> Concrete changes in-doc since April:
>>>>
>>>>    - Problem Statement reframed around catalog-owned metainformation
>>>>    as the load-bearing concept.
>>>>    - Alternatives Considered rewritten with the IRC-spec-vs-table-spec
>>>>    boundary instead of cost arguments.
>>>>    - Structure debate closed on a split shape: labels (flat k/v at the
>>>>    table level, k8s-style) + column-labels (array with field-id).
>>>>    Labels type itself is flat — no internal structure. Same shape
>>>>    applies on LoadViewResponse and namespaces.
>>>>    - CRUD companion as a second tab in the same gdoc: UpdateLabels
>>>>    REST verb, two-class distinction for catalog-managed vs
>>>>    externally-managed keys, optimistic concurrency with ETags.
>>>>    - Working Trino prototype at
>>>>    https://github.com/laskoviymishka/irc-labels/pull/1 — native ALTER
>>>>    TABLE ... SET LABEL DDL translating end-to-end.
>>>>
>>>> Parallel work to flag: EJ Wang's first-class Tag concept
>>>> <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b>
>>>> proposal on dev@. We've agreed to coordinate as paired proposals — Tag
>>>> as a separate first-class REST concept, labels as the lower-level
>>>> attachment substrate. Both efforts share the cross-cutting interop
>>>> question.
>>>>
>>>> Goal on Thursday is to walk through the current state, confirm the
>>>> split-shape lands cleanly, and identify what's needed to move toward a VOTE
>>>> on the read API. Anyone reading along is welcome to join.
>>>>
>>>> Doc (current state):
>>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>>>
>>>> Thanks,
>>>> Andrei
>>>>
>>>> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Ryan!
>>>>>
>>>>> Your point about avoiding first-class metadata requirements is exactly
>>>>> the design principle here. Labels let each catalog surface what it knows
>>>>> without the spec dictating what catalogs must track.
>>>>>
>>>>> To build on this, I put together a POC showing the approach works
>>>>> across the ecosystem.
>>>>>
>>>>> Key design principles that held up in practice:
>>>>>
>>>>> - No new requirements on catalogs. Labels are optional in the
>>>>> response. A catalog that doesn't serve labels returns the same response as
>>>>> today.
>>>>>
>>>>> - Catalog-scoped, not table state. Every catalog we tried already has
>>>>> internal metadata separate from Iceberg properties — Polaris has
>>>>> internalProperties, UC has uc_properties, Lakekeeper has namespace
>>>>> properties in PostgreSQL. Labels just give this existing metadata a
>>>>> standard way through the protocol.
>>>>>
>>>>> - No property overriding. Labels are explicitly separate from table
>>>>> properties. Properties configure behavior; labels describe context.
>>>>> Engines know which is which.
>>>>>
>>>>> What I built:
>>>>>
>>>>> - Spec change: https://github.com/apache/iceberg/pull/15750
>>>>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>>>>
>>>>> Catalog implementations:
>>>>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from
>>>>> internalProperties)
>>>>> - Unity Catalog OSS:
>>>>> https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from
>>>>> uc_properties)
>>>>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676
>>>>> (labels from namespace properties)
>>>>>
>>>>> Full demo: https://github.com/laskoviymishka/irc-labels
>>>>>
>>>>> Three catalogs, two languages (Java + Rust), 40-95 lines each. The
>>>>> pattern is the same everywhere: each catalog already has internal metadata
>>>>> that doesn't belong in table properties. Labels give it a standard way out
>>>>> through the protocol.
>>>>>
>>>>> The Polaris implementation also addresses
>>>>> https://github.com/apache/polaris/issues/3222 - the community has
>>>>> been asking for a way to surface business metadata alongside table loads.
>>>>> Labels solve this without adding any requirements beyond an optional
>>>>> field.
>>>>>
>>>>> Beyond ownership and classification, the demo also shows labels
>>>>> enabling AI agent table selection (agents reason about tables using
>>>>> semantic labels instead of guessing from column names) and governance via
>>>>> trusted engine (ClickHouse reading sensitivity labels to auto-generate
>>>>> masking policies).
>>>>>
>>>>> Happy to discuss the spec design or any of the implementation details.
>>>>>
>>>>> Andrei
>>>>>
>>>>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> I think that this is a reasonable way to solve some persistent issues
>>>>>> that we've seen.
>>>>>>
>>>>>> Many catalogs track additional metadata that is not part of the table
>>>>>> spec (or others) like "owner", and right now there is no way to exchange
>>>>>> or share that information. I'm also hesitant to start including it as
>>>>>> first-class metadata because that puts additional requirements on
>>>>>> catalogs that may not align. For instance, Tabular had no concept of a
>>>>>> table "owner" and instead used default grants at the schema level. I like
>>>>>> that this solution allows catalogs to provide information in a generic way
>>>>>> that doesn't add requirements in the REST spec. And it is an alternative to
>>>>>> overriding table properties with catalog-managed information, which I
>>>>>> think is an anti-pattern.
>>>>>>
>>>>>> Thanks, Andrei! I think this is a good idea.
>>>>>>
>>>>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> `LoadTableResponse` returns table metadata (schema, snapshots, file
>>>>>>> locations), but catalogs have operational context about tables that has
>>>>>>> no standard place to go: cost attribution, ownership, governance hints,
>>>>>>> semantic metadata. Right now catalogs have two options:
>>>>>>>
>>>>>>> 1. Properties: durable, commit-versioned table state. Good for
>>>>>>> persistent metadata; wrong for ephemeral catalog context.
>>>>>>> 2. Custom fields: catalog-specific extensions with no
>>>>>>> interoperability. Each catalog invents its own structure; engines have
>>>>>>> no basis to read them.
>>>>>>>
>>>>>>> The community has already identified this gap. Polaris opened an
>>>>>>> issue [1] requesting a standard extension point in the IRC protocol for
>>>>>>> catalog-managed metadata. Two earlier threads [2][3] explored
>>>>>>> column-level metadata, though in the context of table format changes.
>>>>>>>
>>>>>>> We propose adding an optional `labels` field to `LoadTableResponse`
>>>>>>> for catalog-managed metadata. Labels are string key-value pairs
>>>>>>> generated per-request from the catalog's internal systems; nothing is
>>>>>>> written to table files. Engines may use or ignore them entirely. Labels
>>>>>>> give catalog providers a standard channel to surface context to any client
>>>>>>> without bilateral custom integrations for every catalog-engine pair.
>>>>>>>
>>>>>>> Details:
>>>>>>> - GitHub Issue: apache/iceberg#15521
>>>>>>> - Design Document: [4]
>>>>>>>
>>>>>>> Please review the proposal and share your feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrei
>>>>>>>
>>>>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>>>>> [2]:
>>>>>>> https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>>>>> [3]:
>>>>>>> https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>>>>> [4]:
>>>>>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
>>>>>>>
>>>>>>
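
For readers following along, here is a minimal sketch of how a client might consume the optional labels described in this thread. It assumes the consolidated shape mentioned in the June 26 update (a single Labels object with table and columns sub-properties), but the exact JSON keys used here ("labels", "table", "columns", "field-id"), the label keys, and the helper name extract_labels are illustrative assumptions, not the wire format from the spec PR.

```python
import json
from typing import Any, Dict, Tuple


def extract_labels(
    load_table_response: Dict[str, Any],
) -> Tuple[Dict[str, str], Dict[int, Dict[str, str]]]:
    """Return (table_labels, column_labels_by_field_id).

    The JSON layout is an assumption for illustration only. A response
    without labels yields two empty dicts, since labels are optional.
    """
    labels = load_table_response.get("labels") or {}
    table_labels = dict(labels.get("table") or {})
    column_labels: Dict[int, Dict[str, str]] = {}
    for entry in labels.get("columns") or []:
        column_labels[int(entry["field-id"])] = dict(entry.get("labels") or {})
    return table_labels, column_labels


# Hypothetical response bodies; every key and value below is made up.
with_labels = json.loads("""
{
  "metadata-location": "s3://bucket/path/metadata.json",
  "labels": {
    "table": {"example.com/domain": "sales"},
    "columns": [
      {"field-id": 3, "labels": {"example.com/sensitivity": "pii"}}
    ]
  }
}
""")
without_labels = {"metadata-location": "s3://bucket/path/metadata.json"}

print(extract_labels(with_labels))
print(extract_labels(without_labels))
```

The second call illustrates the "no new requirements" principle from the thread: a catalog that serves no labels returns the same response as today, and the client simply sees empty label sets.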

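The write-path ideas mentioned in the thread (ETag-based optimistic concurrency plus the two-class distinction between catalog-managed and externally-managed keys) can be sketched with an in-memory store. This is only an illustration under stated assumptions: the class LabelStore, the "catalog.example/" prefix, the content-hash ETag, and the exception names are all hypothetical and do not come from the proposal or any catalog implementation.

```python
import hashlib
import json


class PreconditionFailed(Exception):
    """Raised when the caller's ETag no longer matches the stored labels."""


class LabelStore:
    # Hypothetical prefix marking keys that only the catalog may write.
    CATALOG_MANAGED_PREFIX = "catalog.example/"

    def __init__(self):
        self._labels = {}

    def etag(self):
        # Assumption: the ETag is a hash of the current label set.
        payload = json.dumps(self._labels, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def read(self):
        return dict(self._labels), self.etag()

    def update(self, changes, if_match):
        # Optimistic concurrency: reject writes based on a stale read.
        if if_match != self.etag():
            raise PreconditionFailed("labels changed since they were read")
        # Two-class distinction: validate every key before applying any.
        for key in changes:
            if key.startswith(self.CATALOG_MANAGED_PREFIX):
                raise PermissionError(f"{key} is catalog-managed")
        self._labels.update(changes)
        return self.etag()


store = LabelStore()
_, tag = store.read()
new_tag = store.update({"team.example/owner": "data-platform"}, if_match=tag)

try:
    # Reusing the old ETag simulates a concurrent writer with a stale view.
    store.update({"team.example/owner": "someone-else"}, if_match=tag)
    stale_rejected = False
except PreconditionFailed:
    stale_rejected = True
print(stale_rejected)
```

A real catalog would more likely derive the ETag from a version counter, since a content hash repeats if the labels return to an earlier state; the hash is used here only to keep the sketch short.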