Thanks Andrei for the recap. I want to clarify one point on the labels-vs-Tag boundary, mostly to avoid having the labels discussion pre-decide questions that belong in the parallel Tag discussion.
I agree with the use cases motivating labels: exposing catalog-managed context such as ownership, domain, cost attribution, classification hints, and semantic hints in REST responses can be useful for engines and clients. The part I am less sure about is whether those use cases require a separate Labels concept in the spec, or whether they should be modeled as projected metadata from a structured Tag/classification model.

My concern is dependency direction. If we introduce labels as a flat generic primitive first, and later add structure for identity, lifecycle, allowed values, inheritance, field-id attachment, visibility, and reverse lookup, then we may end up reconstructing a Tag model around labels. That feels less clear than defining the structured model directly and allowing catalogs to project the relevant assignments into REST responses where useful.

In other words, I don't think the interesting question is only whether labels should be flat or structured. I think the question is whether labels should be a separate primitive at all, or whether the read-response use cases can be covered by a projected view of structured tag/classification assignments.

Where I'd be especially careful is the phrase that tags are "catalog-internal structured concepts." I agree that the full Tag discussion is outside the scope of Labels V1, but I would not want Labels V1 to pre-decide that structured tagging/classification semantics are only catalog-internal and not an IRC concept. That is exactly the separate question being explored in the Tag thread.
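To make the projection direction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: `TagAssignment`, `client_visible`, `project_labels`, and the exact JSON shape of `column-labels` are hypothetical and are not defined by the labels proposal or the Tag thread. It only shows the dependency direction, where structured assignments are the source of truth and the flat labels in a read response are derived from them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TagAssignment:
    """Hypothetical structured assignment; not a spec-defined type."""
    tag: str                        # tag identity, e.g. "sensitivity"
    value: str
    field_id: Optional[int] = None  # None means the assignment targets the table
    client_visible: bool = True     # visibility is decided by the catalog


def project_labels(assignments: list[TagAssignment]) -> dict:
    """Project structured assignments into a flat, read-only response view."""
    table_labels: dict[str, str] = {}
    column_labels: dict[int, dict[str, str]] = {}
    for a in assignments:
        if not a.client_visible:
            continue  # assignments the catalog hides never reach the response
        if a.field_id is None:
            table_labels[a.tag] = a.value
        else:
            column_labels.setdefault(a.field_id, {})[a.tag] = a.value
    return {
        "labels": table_labels,
        # Assumed shape: one entry per field-id, ordered by field-id.
        "column-labels": [
            {"field-id": fid, "labels": labels}
            for fid, labels in sorted(column_labels.items())
        ],
    }


assignments = [
    TagAssignment("domain", "payments"),
    TagAssignment("sensitivity", "pii", field_id=3),
    TagAssignment("internal.review-state", "pending", client_visible=False),
]
view = project_labels(assignments)
```

In this sketch the client only ever sees the flat view, while identity, lifecycle, and visibility stay with the structured model behind it.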
The factoring I'd prefer to evaluate is:

- Tag (structured) classification: authoring, lifecycle, identity, field-id attachment, inheritance, visibility, and lookup semantics
- REST response projection: optional metadata returned to clients, potentially derived from structured tag assignments
- read-restrictions: enforcement result delivered to engines

That framing may reduce the need for a separate Labels primitive while still preserving the read-response use cases that motivated the labels proposal.

I realize this may be a bigger factoring question than Labels V1 intended to answer, but I think it is worth making explicit before the two threads diverge.

If the community wants one logical concept rather than both labels and tags, I think we should at least evaluate the direction where the structured Tag/classification model is the source of truth and lightweight REST response metadata is a projection from it, before standardizing labels as an independent primitive.

-ej

On Thu, Jun 11, 2026 at 11:53 AM Andrei Tserakhau via dev <[email protected]> wrote:

> Hi all,
>
> Recap from the dedicated labels sync held on May 28, 2026 (recording [1]).
>
> Summary of the discussion:
>
> - Strong consensus to land the read API first, with the write API as a separate follow-up proposal (Ryan, Sung, Kevin, Christian aligned). Christian raised a concern that the write half could lag behind (Trino views precedent); to address this, the proposal will document the write-path direction alongside the read API.
> - Labels remain flat key-value pairs, no internal structure. Kubernetes labels precedent invoked — flat shape, conventions via well-known prefixes, no spec-defined vocabulary. Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris) discussed and set aside in favor of prefix conventions.
> - Labels-vs-Tag boundary: labels are the wire-protocol mechanism for cross-catalog metadata exchange (this proposal); tags are catalog-internal structured concepts (Snowflake, UC, Polaris each have their own shape). Standardizing Tag itself as a first-class spec entity is a separate effort, not in scope for V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that direction.
> - Governance scope: Prashant Singh raised concerns about positioning labels as a governance protocol — provenance, identity mapping across IDPs, inheritance semantics. Room aligned that labels are broader than governance — semantic metadata exchange is the load-bearing case; governance remains a valid use case among many, and whether to use labels for governance is a catalog-level decision rather than a spec mandate. Policy decisions and enforcement live in read restrictions (PR #13879 [3]) — a parallel and complementary track.
> - Write API shape converging on an independent CRUD endpoint (UpdateLabels-style verb) with a transactional path for atomic table+label operations at create/alter time. Two-class distinction (catalog-authored vs externally-managed labels) reaffirmed; Ryan noted not all labels should be editable via CRUD since many are produced by the catalog through inheritance, classification, or automated paths.
> - Bulk APIs surfaced as a real need for both read (inverted index — finding tables/columns matching given labels) and write (applying labels at scale, classifier batch operations). Scoped for inclusion in the write API proposal.
> - Pattern for adding new first-class REST concepts (labels, UDFs, indexes, etc.): independent CRUD endpoint per concept, paired with a transactional path for atomic operations alongside table create/alter. Useful reference shape for future spec additions.
>
> Post-sync follow-ups already in motion:
>
> - Hot-path discipline added to the proposal in response to Christian Thiel's doc comment: LoadTableResponse latency MUST NOT increase due to labels; how catalogs meet this is implementation-defined (caching, freshness trade-offs, filtering). Capability negotiation — parallel to the work on PR #13879 [3] — is a future direction.
> - Use case split (high-confidence cross-catalog: semantic, domain, classification, sensitivity vs platform-specific: owner, principals, anything identity-bound) agreed after offline follow-up with Prashant; will be reflected in the next proposal revision.
> - A separate [DISCUSS] thread will land the substrate framing publicly.
>
> Next sync approximately three / four weeks out. Tentative agenda: labels/Tag boundary update, write-path sketch walk-through, path to VOTE on the read API.
>
> Thanks to everyone who joined and to those continuing to engage on the design doc [4] and spec PR #15750 [5].
>
> Best,
> Andrei
>
> [1] https://youtu.be/P4NOQASNtPA
> [2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
> [3] https://github.com/apache/iceberg/pull/13879
> [4] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
> [5] https://github.com/apache/iceberg/pull/15750
>
> On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau <[email protected]> wrote:
>
>> Quick update on this one: we'll cover this on the Dedicated Sync this Thursday (10-11am US / 7-8pm CET). Thanks to Daniel Weeks for getting it on the calendar.
>>
>> Last time labels was on the sync was 2026-04-15. Plenty of productive offline discussion since then, mostly in the gdoc comment threads.
>> Thanks to everyone who engaged:
>>
>> - *Daniel Weeks* — for the IRC-spec-vs-table-spec framing that now anchors the Alternatives section
>> - *Fokko Driesprong* — for challenging motivation on the cost-based defense and driving the ownership reframe
>> - *Yufei Gu* — for the structure debate that landed us on the split shape
>> - *Sung Yun* — for the early consumption-pattern and addressing questions
>> - *Maninder Parmar* — for the properties-relationship probing
>> - *Christian Thiel* — for pushing on the write API direction
>>
>> Concrete changes in-doc since April:
>>
>> - Problem Statement reframed around catalog-owned metainformation as the load-bearing concept.
>> - Alternatives Considered rewritten with the IRC-spec-vs-table-spec boundary instead of cost arguments.
>> - Structure debate closed on a split shape: labels (flat k/v at the table level, k8s-style) + column-labels (array with field-id). Labels type itself is flat — no internal structure. Same shape applies on LoadViewResponse and namespaces.
>> - CRUD companion as a second tab in the same gdoc — UpdateLabels REST verb, two-class distinction for catalog-managed vs externally-managed keys, optimistic concurrency with ETags.
>> - Working Trino prototype at https://github.com/laskoviymishka/irc-labels/pull/1 — native ALTER TABLE ... SET LABEL DDL translating end-to-end.
>>
>> Parallel work to flag: EJ Wang's first-class Tag concept <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b> proposal on dev@. We've agreed to coordinate as paired proposals — Tag as a separate first-class REST concept, labels as the lower-level attachment substrate. Both efforts share the cross-cutting interop question.
>>
>> Goal on Thursday is to walk through the current state, confirm the split-shape lands cleanly, and identify what's needed to move toward a VOTE on the read API. Anyone reading along is welcome to join.
>>
>> Doc (current state): https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>
>> Thanks,
>> Andrei
>>
>> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau <[email protected]> wrote:
>>
>>> Thanks Ryan!
>>>
>>> Your point about avoiding first-class metadata requirements is exactly the design principle here. Labels let each catalog surface what it knows without the spec dictating what catalogs must track.
>>>
>>> To build on this, I put together a POC showing the approach works across the ecosystem.
>>>
>>> Key design principles that held up in practice:
>>>
>>> - No new requirements on catalogs. Labels are optional in the response. A catalog that doesn't serve labels returns the same response as today.
>>>
>>> - Catalog-scoped, not table state. Every catalog we tried already has internal metadata separate from Iceberg properties — Polaris has internalProperties, UC has uc_properties, Lakekeeper has namespace properties in PostgreSQL. Labels just give this existing metadata a standard way through the protocol.
>>>
>>> - No property overriding. Labels are explicitly separate from table properties. Properties configure behavior, labels describe context. Engines know which is which.
>>>
>>> What we built:
>>>
>>> - Spec change: https://github.com/apache/iceberg/pull/15750
>>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>>
>>> Catalog implementations:
>>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from internalProperties)
>>> - Unity Catalog OSS: https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from uc_properties)
>>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676 (labels from namespace properties)
>>>
>>> Full demo: https://github.com/laskoviymishka/irc-labels
>>>
>>> Three catalogs, two languages (Java + Rust), 40-95 lines each.
>>> The pattern is the same everywhere, each catalog already has internal metadata that doesn't belong in table properties. Labels give it a standard way out through the protocol.
>>>
>>> The Polaris implementation also addresses https://github.com/apache/polaris/issues/3222 - the community has been asking for a way to surface business metadata alongside table loads. Labels solve this without adding any requirements beyond an optional field.
>>>
>>> Beyond ownership and classification, the demo also shows labels enabling AI agent table selection (agents reason about tables using semantic labels instead of guessing from column names) and governance via trusted engine (ClickHouse reading sensitivity labels to auto-generate masking policies).
>>>
>>> Happy to discuss the spec design or any of the implementation details.
>>>
>>> Andrei
>>>
>>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:
>>>
>>>> I think that this is a reasonable way to solve some persistent issues that we've seen.
>>>>
>>>> Many catalogs track additional metadata that is not part of the table spec (or others) like "owner", and right now there is no way to exchange or share that information. I'm also hesitant to start including it as first-class metadata because that puts additional requirements on catalogs that may not align. For instance, Tabular had no concept of a table "owner" and instead used default grants at the schema level. I like that this solution allows catalogs to provide information in a generic way that doesn't add requirements in the REST spec. And it is an alternative to overriding table properties with catalog-managed information, which I think is an anti-pattern.
>>>>
>>>> Thanks, Andrei! I think this is a good idea.
>>>>
>>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> `LoadTableResponse` returns table metadata — schema, snapshots, file locations — but catalogs have operational context about tables that has no standard place to go: cost attribution, ownership, governance hints, semantic metadata. Right now catalogs have two options:
>>>>>
>>>>> 1. Properties — durable, commit-versioned table state. Good for persistent metadata; wrong for ephemeral catalog context.
>>>>> 2. Custom fields — catalog-specific extensions with no interoperability. Each catalog invents its own structure; engines have no basis to read them.
>>>>>
>>>>> The community has already identified this gap. Polaris opened an issue [1] requesting a standard extension point in the IRC protocol for catalog-managed metadata. Two earlier threads [2][3] explored column-level metadata, though in the context of table format changes.
>>>>>
>>>>> We propose adding an optional `labels` field to `LoadTableResponse` for catalog-managed metadata. Labels are string key-value pairs generated per-request from the catalog's internal systems; nothing is written to table files. Engines may use or ignore them entirely. Labels give catalog providers a standard channel to surface context to any client without bilateral custom integrations for every catalog-engine pair.
>>>>>
>>>>> Details:
>>>>> - GitHub Issue: apache/iceberg#15521
>>>>> - Design Document: [4]
>>>>>
>>>>> Please review the proposal and share your feedback.
>>>>>
>>>>> Thanks,
>>>>> Andrei
>>>>>
>>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>>> [4]: https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
