Re: [DISCUSS] Table and Column Label Metadata in Iceberg REST Catalog

Andrei Tserakhau via dev Thu, 11 Jun 2026 11:52:29 -0700

Hi all,

Recap from the dedicated labels sync held on May 28, 2026
(recording [1]).

Summary of the discussion:

   - Strong consensus to land the read API first, with the write API
   as a separate follow-up proposal (Ryan, Sung, Kevin, Christian
   aligned). Christian raised a concern that the write half could
   lag behind (Trino views precedent); to address this, the
   proposal will document the write-path direction alongside the
   read API.
   - Labels remain flat key-value pairs, no internal structure.
   Kubernetes labels precedent invoked — flat shape, conventions
   via well-known prefixes, no spec-defined vocabulary.
   Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris)
   discussed and set aside in favor of prefix conventions.
   - Labels-vs-Tag boundary: labels are the wire-protocol mechanism
   for cross-catalog metadata exchange (this proposal); tags are
   catalog-internal structured concepts (Snowflake, UC, Polaris
   each have their own shape). Standardizing Tag itself as a
   first-class spec entity is a separate effort, not in scope for
   V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that
   direction.
   - Governance scope: Prashant Singh raised concerns about
   positioning labels as a governance protocol — provenance,
   identity mapping across IDPs, inheritance semantics. Room
   aligned that labels are broader than governance — semantic
   metadata exchange is the load-bearing case; governance remains
   a valid use case among many, and whether to use labels for
   governance is a catalog-level decision rather than a spec
   mandate. Policy decisions and enforcement live in read
   restrictions (PR #13879 [3]) — a parallel and complementary
   track.
   - Write API shape converging on an independent CRUD endpoint
   (UpdateLabels-style verb) with a transactional path for atomic
   table+label operations at create/alter time. Two-class
   distinction (catalog-authored vs externally-managed labels)
   reaffirmed; Ryan noted not all labels should be editable via
   CRUD since many are produced by the catalog through inheritance,
   classification, or automated paths.
   - Bulk APIs surfaced as a real need for both read (inverted index
   — finding tables/columns matching given labels) and write
   (applying labels at scale, classifier batch operations). Scoped
   for inclusion in the write API proposal.
   - Pattern for adding new first-class REST concepts (labels, UDFs,
   indexes, etc.): independent CRUD endpoint per concept, paired
   with a transactional path for atomic operations alongside table
   create/alter. Useful reference shape for future spec additions.
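
To make the flat shape, the two-class distinction, and the inverted-index
read concrete, here is a minimal Python sketch. It is illustrative only and
not the proposal's API: the `catalog.` prefix, the function names, and the
table identifiers are assumptions made up for this example, and the actual
write path is still to be defined in the follow-up proposal.

```python
from collections import defaultdict

# Hypothetical convention: keys under this prefix are catalog-authored
# (inheritance, classification, automated paths) and not editable via CRUD.
CATALOG_MANAGED_PREFIX = "catalog."


def update_labels(current, updates, removals=()):
    """Return a new flat label map after an UpdateLabels-style change.

    Labels are plain string key-value pairs with no nesting. Any attempt
    to set or remove a catalog-managed key is rejected.
    """
    removals = set(removals)
    for key, value in updates.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("labels must be flat string key-value pairs")
    touched = set(updates) | removals
    blocked = sorted(k for k in touched if k.startswith(CATALOG_MANAGED_PREFIX))
    if blocked:
        raise PermissionError(f"catalog-managed labels are read-only: {blocked}")
    result = {k: v for k, v in current.items() if k not in removals}
    result.update(updates)
    return result


def build_label_index(tables):
    """Build an inverted index: (key, value) -> set of table identifiers."""
    index = defaultdict(set)
    for table_id, labels in tables.items():
        for pair in labels.items():
            index[pair].add(table_id)
    return index


def find_tables(index, wanted):
    """Return the tables that carry every (key, value) pair in `wanted`."""
    matches = [index.get(pair, set()) for pair in wanted.items()]
    return set.intersection(*matches) if matches else set()
```

Keeping the key space flat means the managed-versus-external check is a
plain prefix test, and the bulk read reduces to a set intersection over
exact (key, value) pairs.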

Post-sync follow-ups already in motion:

   - Hot-path discipline added to the proposal in response to
   Christian Thiel's doc comment: LoadTableResponse latency MUST
   NOT increase due to labels; how catalogs meet this is
   implementation-defined (caching, freshness trade-offs,
   filtering). Capability negotiation — parallel to the work on
   PR #13879 [3] — is a future direction.
   - Use case split (high-confidence cross-catalog: semantic, domain,
   classification, sensitivity vs platform-specific: owner,
   principals, anything identity-bound) agreed after offline
   follow-up with Prashant; will be reflected in the next proposal
   revision.
   - A separate [DISCUSS] thread will land the substrate framing
   publicly.
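
One possible way a catalog could meet the hot-path constraint is to serve
labels only from an in-memory cache and refresh them off the request path.
The Python sketch below is a hypothetical illustration of that caching and
freshness trade-off; the class name, the TTL, and the `fetch` callback are
assumptions for this example, not part of the proposal or of any catalog's
implementation.

```python
import time


class LabelCache:
    """Serve labels from memory only; never call the label backend on load."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical slow lookup: table_id -> dict
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # table_id -> (labels, fetched_at)
        self._stale = set()      # table ids awaiting an off-path refresh

    def labels_for_load(self, table_id):
        """Hot path: return cached labels (possibly stale) or None, no I/O."""
        entry = self._entries.get(table_id)
        if entry is None:
            self._stale.add(table_id)
            return None  # labels are optional, so the response can omit them
        labels, fetched_at = entry
        if self._clock() - fetched_at > self._ttl:
            self._stale.add(table_id)  # serve stale now, refresh later
        return labels

    def refresh_pending(self):
        """Off the request path: refresh every table marked stale."""
        pending, self._stale = self._stale, set()
        for table_id in pending:
            self._entries[table_id] = (self._fetch(table_id), self._clock())
        return len(pending)
```

The trade-off in this sketch is freshness for latency: a load never waits
on the label backend, at the cost of returning no labels on a cold entry
and stale labels until the next refresh.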

Next sync is approximately three to four weeks out. Tentative agenda:
labels/Tag boundary update, write-path sketch walk-through, path
to VOTE on the read API.

Thanks to everyone who joined and to those continuing to engage
on the design doc [4] and spec PR #15750 [5].

Best,
Andrei

[1] https://youtu.be/P4NOQASNtPA
[2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
[3] https://github.com/apache/iceberg/pull/13879
[4] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
[5] https://github.com/apache/iceberg/pull/15750

On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau wrote:

> Quick update on this one:
> we'll cover this on the Dedicated Sync this Thursday (10-11am US / 7-8pm
> CET). Thanks to Daniel Weeks for getting it on the calendar.
>
> Labels were last on the sync agenda on 2026-04-15. Plenty of productive
> offline discussion since then, mostly in the gdoc comment threads. Thanks
> to everyone who engaged:
>
>    - *Daniel Weeks* — for the IRC-spec-vs-table-spec framing that now
>    anchors the Alternatives section
>    - *Fokko Driesprong* — for challenging motivation on the cost-based
>    defense and driving the ownership reframe
>    - *Yufei Gu* — for the structure debate that landed us on the split
>    shape
>    - *Sung Yun* — for the early consumption-pattern and addressing
>    questions
>    - *Maninder Parmar* — for the properties-relationship probing
>    - *Christian Thiel* — for pushing on the write API direction
>
> Concrete changes in-doc since April:
>
>    - Problem Statement reframed around catalog-owned metainformation as
>    the load-bearing concept.
>    - Alternatives Considered rewritten with the IRC-spec-vs-table-spec
>    boundary instead of cost arguments.
>    - Structure debate closed on a split shape: labels (flat k/v at the
>    table level, k8s-style) + column-labels (array with field-id). Labels
>    type itself is flat — no internal structure. Same shape applies on
>    LoadViewResponse and namespaces.
>    - CRUD companion as a second tab in the same gdoc — UpdateLabels REST
>    verb, two-class distinction for catalog-managed vs externally-managed keys,
>    optimistic concurrency with ETags.
>    - Working Trino prototype at
>    https://github.com/laskoviymishka/irc-labels/pull/1 — native ALTER
>    TABLE ... SET LABEL DDL translating end-to-end.
>
> Parallel work to flag: EJ Wang's first-class Tag concept
> <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b>
> proposal on dev@. We've agreed to coordinate as paired proposals — Tag as
> a separate first-class REST concept, labels as the lower-level attachment
> substrate. Both efforts share the cross-cutting interop question.
>
> Goal on Thursday is to walk through the current state, confirm the
> split-shape lands cleanly, and identify what's needed to move toward a VOTE
> on the read API. Anyone reading along is welcome to join.
>
> Doc (current state):
> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>
> Thanks,
> Andrei
>
> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau wrote:
>
>> Thanks Ryan!
>>
>> Your point about avoiding first-class metadata requirements is exactly
>> the design principle here. Labels let each catalog surface what it knows
>> without the spec dictating what catalogs must track.
>>
>> To build on this, I put together a POC showing the approach works across
>> the ecosystem.
>>
>> Key design principles that held up in practice:
>>
>> - No new requirements on catalogs. Labels are optional in the response. A
>> catalog that doesn't serve labels returns the same response as today.
>>
>> - Catalog-scoped, not table state. Every catalog we tried already has
>> internal metadata separate from Iceberg properties — Polaris has
>> internalProperties, UC has uc_properties, Lakekeeper has namespace
>> properties in PostgreSQL. Labels just give this existing metadata a
>> standard way through the protocol.
>>
>> - No property overriding. Labels are explicitly separate from table
>> properties. Properties configure behavior, labels describe context. Engines
>> know which is which.
>>
>> What was built:
>>
>> - Spec change: https://github.com/apache/iceberg/pull/15750
>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>
>> Catalog implementations:
>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from
>> internalProperties)
>> - Unity Catalog OSS:
>> https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from
>> uc_properties)
>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676 (labels
>> from namespace properties)
>>
>> Full demo: https://github.com/laskoviymishka/irc-labels
>>
>> Three catalogs, two languages (Java + Rust), 40-95 lines each. The
>> pattern is the same everywhere: each catalog already has internal metadata
>> that doesn't belong in table properties. Labels give it a standard way out
>> through the protocol.
>>
>> The Polaris implementation also addresses
>> https://github.com/apache/polaris/issues/3222 - the community has been
>> asking for a way to surface business metadata alongside table loads. Labels
>> solve this without adding any requirements beyond an optional field.
>>
>> Beyond ownership and classification, the demo also shows labels enabling
>> AI agent table selection (agents reason about tables using semantic labels
>> instead of guessing from column names) and governance via trusted engine
>> (ClickHouse reading sensitivity labels to auto-generate masking policies).
>>
>> Happy to discuss the spec design or any of the implementation details.
>>
>> Andrei
>>
>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue wrote:
>>
>>> I think that this is a reasonable way to solve some persistent issues
>>> that we've seen.
>>>
>>> Many catalogs track additional metadata that is not part of the table
>>> spec (or others) like "owner", and right now there is no way to exchange or
>>> share that information. I'm also hesitant to start including it as
>>> first-class metadata because that puts additional requirements on catalogs
>>> that may not align. For instance, Tabular had no concept of a table "owner"
>>> and instead used default grants at the schema level. I like that this
>>> solution allows catalogs to provide information in a generic way that
>>> doesn't add requirements in the REST spec. And it is an alternative to
>>> overriding table properties with catalog-managed information, which I think
>>> is an anti-pattern.
>>>
>>> Thanks, Andrei! I think this is a good idea.
>>>
>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev wrote:
>>>
>>>> Hi all,
>>>>
>>>> `LoadTableResponse` returns table metadata — schema, snapshots, file
>>>> locations — but catalogs have operational context about tables that has no
>>>> standard place to go: cost attribution, ownership, governance hints,
>>>> semantic metadata. Right now catalogs have two options:
>>>>
>>>> 1. Properties — durable, commit-versioned table state. Good for
>>>> persistent metadata; wrong for ephemeral catalog context.
>>>> 2. Custom fields — catalog-specific extensions with no
>>>> interoperability. Each catalog invents its own structure; engines have no
>>>> basis to read them.
>>>>
>>>> The community has already identified this gap. Polaris opened an issue
>>>> [1] requesting a standard extension point in the IRC protocol for
>>>> catalog-managed metadata. Two earlier threads [2][3] explored column-level
>>>> metadata, though in the context of table format changes.
>>>>
>>>> We propose adding an optional `labels` field to `LoadTableResponse` for
>>>> catalog-managed metadata. Labels are string key-value pairs generated
>>>> per-request from the catalog's internal systems; nothing is written to
>>>> table files. Engines may use or ignore them entirely. Labels give catalog
>>>> providers a standard channel to surface context to any client without
>>>> bilateral custom integrations for every catalog-engine pair.
>>>>
>>>> Details:
>>>> - GitHub Issue: apache/iceberg#15521
>>>> - Design Document: [4]
>>>>
>>>> Please review the proposal and share your feedback.
>>>>
>>>> Thanks,
>>>> Andrei
>>>>
>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>> [4]: https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
>>>>
>>>
