Hi all,

Next labels sync is scheduled for Monday, June 29, 2026, 8am PT / 5pm CET.
Earlier US-time slot to accommodate European attendees.

Meet link: [https://meet.google.com/ezz-bvff-wjt]

Agenda:
- Updates since last sync (May 28)
- Labels / Tag boundary, topic from EJ Wang
- Write-path sketch walk-through
- ETag-based caching and concurrency (new topic, surfaced at the recent
  IRC catalog community sync)
- Path to VOTE on the read API

Concrete changes in the proposal [1] and spec PR [2] since the May 28 recap:
- Schema consolidated: labels and column-labels merged into a single
  Labels object with table and columns sub-properties (per Daniel Weeks'
  May 28 review). Resolves the "two fields representing the same thing"
  feedback on the PR. JSON example and appendix updated.
- Hot-Path Discipline added: LoadTableResponse latency MUST NOT increase
  due to labels; how catalogs meet this is implementation-defined
  (caching, freshness trade-offs, filtering). Added in response to
  Christian Thiel's design doc comment.
- Governance scope clarified: labels carry context, not enforcement
  decisions. Enforcement semantics and policy vocabulary live in other
  spec layers (Read Restrictions, PR #13879 [3]).
- Open Questions cleaned up: write path summarized with ETag-based
  optimistic concurrency + two-class distinction (catalog-managed vs
  externally-managed keys); structured classification layer reframed as
  the future authoring companion to labels as the read surface.

Goal on Monday: walk through the updated proposal, work through remaining
concerns on the read API, and identify what's needed to move toward a VOTE.

Thanks,
Andrei

[1] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
[2] https://github.com/apache/iceberg/pull/15750
[3] https://github.com/apache/iceberg/pull/13879

On Fri, Jun 12, 2026 at 3:56 AM EJ Wang <[email protected]> wrote:

> Thanks Andrei for the recap. I want to clarify one point on the
> labels-vs-Tag boundary, mostly to avoid having the labels discussion
> pre-decide questions that belong in the parallel Tag discussion.
>
> I agree with the use cases motivating labels: exposing catalog-managed
> context such as ownership, domain, cost attribution, classification hints,
> and semantic hints in REST responses can be useful for engines and clients.
> The part I am less sure about is whether those use cases require a separate
> Labels concept in the spec, or whether they should be modeled as projected
> metadata from a structured Tag/classification model.
>
> My concern is dependency direction. If we introduce labels as a flat
> generic primitive first, and later add structure for identity, lifecycle,
> allowed values, inheritance, field-id attachment, visibility, and reverse
> lookup, then we may end up reconstructing a Tag model around labels. That
> feels less clear than defining the structured model directly and allowing
> catalogs to project the relevant assignments into REST responses where
> useful.
>
> In other words, I don't think the interesting question is only whether
> labels should be flat or structured. I think the question is whether labels
> should be a separate primitive at all, or whether the read-response use
> cases can be covered by a projected view of structured tag/classification
> assignments.
>
> Where I'd be especially careful is the phrase that tags are
> "catalog-internal structured concepts." I agree that the full Tag
> discussion is outside the scope of Labels V1, but I would not want Labels
> V1 to pre-decide that structured tagging/classification semantics are only
> catalog-internal and not an IRC concept. That is exactly the separate
> question being explored in the Tag thread.
>
> The factoring I'd prefer to evaluate is:
>
> - Tag (structured) classification: authoring, lifecycle, identity,
>   field-id attachment, inheritance, visibility, and lookup semantics
> - REST response projection: optional metadata returned to clients,
>   potentially derived from structured tag assignments
> - read-restrictions: enforcement result delivered to engines
>
> That framing may reduce the need for a separate Labels primitive while
> still preserving the read-response use cases that motivated the labels
> proposal.
>
> I realize this may be a bigger factoring question than Labels V1 intended
> to answer, but I think it is worth making explicit before the two threads
> diverge. If the community wants one logical concept rather than both labels
> and tags, I think we should at least evaluate the direction where the
> structured Tag/classification model is the source of truth and lightweight
> REST response metadata is a projection from it, before standardizing labels
> as an independent primitive.
>
> -ej
>
> On Thu, Jun 11, 2026 at 11:53 AM Andrei Tserakhau via dev <[email protected]> wrote:
>
>> Hi all,
>>
>> Recap from the dedicated labels sync held on May 28, 2026
>> (recording [1]).
>>
>> Summary of the discussion:
>>
>> - Strong consensus to land the read API first, with the write API
>>   as a separate follow-up proposal (Ryan, Sung, Kevin, Christian
>>   aligned). Christian raised a concern that the write half could
>>   lag behind (Trino views precedent); to address this, the
>>   proposal will document the write-path direction alongside the
>>   read API.
>> - Labels remain flat key-value pairs, no internal structure.
>>   Kubernetes labels precedent invoked: flat shape, conventions
>>   via well-known prefixes, no spec-defined vocabulary.
>>   Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris)
>>   discussed and set aside in favor of prefix conventions.
>> - Labels-vs-Tag boundary: labels are the wire-protocol mechanism
>>   for cross-catalog metadata exchange (this proposal); tags are
>>   catalog-internal structured concepts (Snowflake, UC, Polaris
>>   each have their own shape). Standardizing Tag itself as a
>>   first-class spec entity is a separate effort, not in scope for
>>   V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that
>>   direction.
>> - Governance scope: Prashant Singh raised concerns about
>>   positioning labels as a governance protocol (provenance,
>>   identity mapping across IDPs, inheritance semantics). Room
>>   aligned that labels are broader than governance: semantic
>>   metadata exchange is the load-bearing case; governance remains
>>   a valid use case among many, and whether to use labels for
>>   governance is a catalog-level decision rather than a spec
>>   mandate. Policy decisions and enforcement live in read
>>   restrictions (PR #13879 [3]), a parallel and complementary
>>   track.
>> - Write API shape converging on an independent CRUD endpoint
>>   (UpdateLabels-style verb) with a transactional path for atomic
>>   table+label operations at create/alter time. Two-class
>>   distinction (catalog-authored vs externally-managed labels)
>>   reaffirmed; Ryan noted not all labels should be editable via
>>   CRUD since many are produced by the catalog through inheritance,
>>   classification, or automated paths.
>> - Bulk APIs surfaced as a real need for both read (inverted index:
>>   finding tables/columns matching given labels) and write
>>   (applying labels at scale, classifier batch operations). Scoped
>>   for inclusion in the write API proposal.
>> - Pattern for adding new first-class REST concepts (labels, UDFs,
>>   indexes, etc.): independent CRUD endpoint per concept, paired
>>   with a transactional path for atomic operations alongside table
>>   create/alter. Useful reference shape for future spec additions.
>>
>> Post-sync follow-ups already in motion:
>>
>> - Hot-path discipline added to the proposal in response to
>>   Christian Thiel's doc comment: LoadTableResponse latency MUST
>>   NOT increase due to labels; how catalogs meet this is
>>   implementation-defined (caching, freshness trade-offs,
>>   filtering). Capability negotiation, parallel to the work on
>>   PR #13879 [3], is a future direction.
>> - Use case split (high-confidence cross-catalog: semantic, domain,
>>   classification, sensitivity vs platform-specific: owner,
>>   principals, anything identity-bound) agreed after offline
>>   follow-up with Prashant; will be reflected in the next proposal
>>   revision.
>> - A separate [DISCUSS] thread will land the substrate framing
>>   publicly.
>>
>> Next sync approximately three to four weeks out. Tentative agenda:
>> labels/Tag boundary update, write-path sketch walk-through, path
>> to VOTE on the read API.
>>
>> Thanks to everyone who joined and to those continuing to engage
>> on the design doc [4] and spec PR #15750 [5].
>>
>> Best,
>> Andrei
>>
>> [1] https://youtu.be/P4NOQASNtPA
>> [2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
>> [3] https://github.com/apache/iceberg/pull/13879
>> [4] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>> [5] https://github.com/apache/iceberg/pull/15750
>>
>> On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau <[email protected]> wrote:
>>
>>> Quick update on this one:
>>> we'll cover this on the Dedicated Sync this Thursday (10-11am US / 7-8pm
>>> CET). Thanks to Daniel Weeks for getting it on the calendar.
>>>
>>> Last time labels was on the sync was 2026-04-15. Plenty of productive
>>> offline discussion since then, mostly in the gdoc comment threads.
>>> Thanks to everyone who engaged:
>>>
>>> - *Daniel Weeks*: for the IRC-spec-vs-table-spec framing that now
>>>   anchors the Alternatives section
>>> - *Fokko Driesprong*: for challenging motivation on the cost-based
>>>   defense and driving the ownership reframe
>>> - *Yufei Gu*: for the structure debate that landed us on the split
>>>   shape
>>> - *Sung Yun*: for the early consumption-pattern and addressing
>>>   questions
>>> - *Maninder Parmar*: for the properties-relationship probing
>>> - *Christian Thiel*: for pushing on the write API direction
>>>
>>> Concrete changes in-doc since April:
>>>
>>> - Problem Statement reframed around catalog-owned metainformation as
>>>   the load-bearing concept.
>>> - Alternatives Considered rewritten with the IRC-spec-vs-table-spec
>>>   boundary instead of cost arguments.
>>> - Structure debate closed on a split shape: labels (flat k/v at the
>>>   table level, k8s-style) + column-labels (array with field-id). Labels
>>>   type itself is flat, with no internal structure. Same shape applies on
>>>   LoadViewResponse and namespaces.
>>> - CRUD companion as a second tab in the same gdoc: UpdateLabels
>>>   REST verb, two-class distinction for catalog-managed vs
>>>   externally-managed keys, optimistic concurrency with ETags.
>>> - Working Trino prototype at
>>>   https://github.com/laskoviymishka/irc-labels/pull/1 (native ALTER
>>>   TABLE ... SET LABEL DDL translating end-to-end).
>>>
>>> Parallel work to flag: EJ Wang's first-class Tag concept
>>> <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b>
>>> proposal on dev@. We've agreed to coordinate as paired proposals: Tag
>>> as a separate first-class REST concept, labels as the lower-level
>>> attachment substrate. Both efforts share the cross-cutting interop question.
>>>
>>> Goal on Thursday is to walk through the current state, confirm the
>>> split-shape lands cleanly, and identify what's needed to move toward a VOTE
>>> on the read API.
>>> Anyone reading along is welcome to join.
>>>
>>> Doc (current state):
>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>>
>>> Thanks,
>>> Andrei
>>>
>>> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau <[email protected]> wrote:
>>>
>>>> Thanks Ryan!
>>>>
>>>> Your point about avoiding first-class metadata requirements is exactly
>>>> the design principle here. Labels let each catalog surface what it knows
>>>> without the spec dictating what catalogs must track.
>>>>
>>>> To build on this, I put together a POC showing the approach works
>>>> across the ecosystem.
>>>>
>>>> Key design principles that held up in practice:
>>>>
>>>> - No new requirements on catalogs. Labels are optional in the response.
>>>>   A catalog that doesn't serve labels returns the same response as today.
>>>>
>>>> - Catalog-scoped, not table state. Every catalog we tried already has
>>>>   internal metadata separate from Iceberg properties: Polaris has
>>>>   internalProperties, UC has uc_properties, Lakekeeper has namespace
>>>>   properties in PostgreSQL. Labels just give this existing metadata a
>>>>   standard way through the protocol.
>>>>
>>>> - No property overriding. Labels are explicitly separate from table
>>>>   properties. Properties configure behavior, labels describe context.
>>>>   Engines know which is which.
>>>>
>>>> What we built:
>>>>
>>>> - Spec change: https://github.com/apache/iceberg/pull/15750
>>>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>>>
>>>> Catalog implementations:
>>>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from
>>>>   internalProperties)
>>>> - Unity Catalog OSS:
>>>>   https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from
>>>>   uc_properties)
>>>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676
>>>>   (labels from namespace properties)
>>>>
>>>> Full demo: https://github.com/laskoviymishka/irc-labels
>>>>
>>>> Three catalogs, two languages (Java + Rust), 40-95 lines each. The
>>>> pattern is the same everywhere: each catalog already has internal metadata
>>>> that doesn't belong in table properties. Labels give it a standard way out
>>>> through the protocol.
>>>>
>>>> The Polaris implementation also addresses
>>>> https://github.com/apache/polaris/issues/3222 - the community has been
>>>> asking for a way to surface business metadata alongside table loads. Labels
>>>> solve this without adding any requirements beyond an optional field.
>>>>
>>>> Beyond ownership and classification, the demo also shows labels
>>>> enabling AI agent table selection (agents reason about tables using
>>>> semantic labels instead of guessing from column names) and governance via
>>>> a trusted engine (ClickHouse reading sensitivity labels to auto-generate
>>>> masking policies).
>>>>
>>>> Happy to discuss the spec design or any of the implementation details.
>>>>
>>>> Andrei
>>>>
>>>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> I think that this is a reasonable way to solve some persistent issues
>>>>> that we've seen.
>>>>>
>>>>> Many catalogs track additional metadata that is not part of the table
>>>>> spec (or others) like "owner", and right now there is no way to exchange
>>>>> or share that information.
>>>>> I'm also hesitant to start including it as
>>>>> first-class metadata because that puts additional requirements on catalogs
>>>>> that may not align. For instance, Tabular had no concept of a table
>>>>> "owner" and instead used default grants at the schema level. I like that
>>>>> this solution allows catalogs to provide information in a generic way that
>>>>> doesn't add requirements in the REST spec. And it is an alternative to
>>>>> overriding table properties with catalog-managed information, which I
>>>>> think is an anti-pattern.
>>>>>
>>>>> Thanks, Andrei! I think this is a good idea.
>>>>>
>>>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> `LoadTableResponse` returns table metadata (schema, snapshots, file
>>>>>> locations), but catalogs have operational context about tables that has
>>>>>> no standard place to go: cost attribution, ownership, governance hints,
>>>>>> semantic metadata. Right now catalogs have two options:
>>>>>>
>>>>>> 1. Properties: durable, commit-versioned table state. Good for
>>>>>>    persistent metadata; wrong for ephemeral catalog context.
>>>>>> 2. Custom fields: catalog-specific extensions with no
>>>>>>    interoperability. Each catalog invents its own structure; engines
>>>>>>    have no basis to read them.
>>>>>>
>>>>>> The community has already identified this gap. Polaris opened an
>>>>>> issue [1] requesting a standard extension point in the IRC protocol for
>>>>>> catalog-managed metadata. Two earlier threads [2][3] explored
>>>>>> column-level metadata, though in the context of table format changes.
>>>>>>
>>>>>> We propose adding an optional `labels` field to `LoadTableResponse`
>>>>>> for catalog-managed metadata. Labels are string key-value pairs generated
>>>>>> per-request from the catalog's internal systems; nothing is written to
>>>>>> table files. Engines may use or ignore them entirely.
>>>>>> Labels give catalog
>>>>>> providers a standard channel to surface context to any client without
>>>>>> bilateral custom integrations for every catalog-engine pair.
>>>>>>
>>>>>> Details:
>>>>>> - GitHub Issue: apache/iceberg#15521
>>>>>> - Design Document: [4]
>>>>>>
>>>>>> Please review the proposal and share your feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrei
>>>>>>
>>>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>>>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>>>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>>>> [4]: https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
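
For readers skimming the thread, here is a minimal client-side sketch of the read path as the original proposal describes it: an optional `labels` field on `LoadTableResponse` holding string key-value pairs that an engine may use or ignore. It is an illustration under stated assumptions, not the spec or the PyIceberg implementation. The helper names (`extract_labels`, `labels_with_prefix`), the label keys, and the payload values are all hypothetical. It models only the flat shape from the original email; the later consolidated Labels object (table and columns sub-properties) is not modeled because its exact JSON is not quoted in this thread.

```python
import json
from typing import Any


def extract_labels(load_table_response: dict[str, Any]) -> dict[str, str]:
    """Return the optional flat labels map, or {} when the catalog sends none.

    Hypothetical helper: assumes the flat string-to-string shape described
    in the original proposal email.
    """
    raw = load_table_response.get("labels")
    if not isinstance(raw, dict):
        # Catalogs that do not serve labels return today's response unchanged.
        return {}
    # Keep only string values; skip anything else rather than trusting it.
    return {k: v for k, v in raw.items() if isinstance(v, str)}


def labels_with_prefix(labels: dict[str, str], prefix: str) -> dict[str, str]:
    """Select labels by key prefix (prefix conventions, not a spec vocabulary)."""
    return {k: v for k, v in labels.items() if k.startswith(prefix)}


# Hypothetical payloads: keys and values are invented for illustration.
with_labels = json.loads(
    """
    {
      "metadata-location": "s3://example-bucket/tbl/metadata/v1.json",
      "labels": {"owner": "data-platform", "cost.team": "analytics"}
    }
    """
)
without_labels = {"metadata-location": "s3://example-bucket/tbl/metadata/v1.json"}

print(extract_labels(with_labels))
print(extract_labels(without_labels))
print(labels_with_prefix(extract_labels(with_labels), "cost."))
```

The point of the sketch is the "optional, ignorable" contract: a client written this way behaves identically against catalogs that do and do not serve labels.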
