Hey folks,

Is this meeting not open to everyone? I have been waiting in the waiting room for some time. Can someone let me in?
On Fri, Jun 26, 2026 at 7:59 AM Andrei Tserakhau via dev <[email protected]> wrote:

> Hi all,
>
> Next labels sync is scheduled for Monday, June 29, 2026, 8am PT /
> 5pm CET. Earlier US-time slot to accommodate European attendees.
>
> Meet link: [https://meet.google.com/ezz-bvff-wjt]
>
> Agenda:
>
> - Updates since last sync (May 28)
> - Labels / Tag boundary, topic from EJ Wang
> - Write-path sketch walk-through
> - ETag-based caching and concurrency (new topic, surfaced at the
>   recent IRC catalog community sync)
> - Path to VOTE on the read API
>
> Concrete changes in the proposal [1] and spec PR [2] since the
> May 28 recap:
>
> - Schema consolidated: labels and column-labels merged into a
>   single Labels object with table and columns sub-properties
>   (per Daniel Weeks' May 28 review). Resolves the "two fields
>   representing the same thing" feedback on the PR. JSON example
>   and appendix updated.
> - Hot-Path Discipline added: LoadTableResponse latency MUST NOT
>   increase due to labels; how catalogs meet this is
>   implementation-defined (caching, freshness trade-offs,
>   filtering). Added in response to Christian Thiel's design doc
>   comment.
> - Governance scope clarified: labels carry context, not
>   enforcement decisions. Enforcement semantics and policy
>   vocabulary live in other spec layers (Read Restrictions,
>   PR #13879 [3]).
> - Open Questions cleaned up: write path summarized with ETag-based
>   optimistic concurrency + two-class distinction (catalog-managed
>   vs externally-managed keys); structured classification layer
>   reframed as the future authoring companion to labels as the
>   read surface.
>
> Goal on Monday: walk through the updated proposal, work through
> remaining concerns on the read API, and identify what's needed to
> move toward a VOTE.
>
> Thanks,
> Andrei
>
> [1] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
> [2] https://github.com/apache/iceberg/pull/15750
> [3] https://github.com/apache/iceberg/pull/13879
>
> On Fri, Jun 12, 2026 at 3:56 AM EJ Wang <[email protected]> wrote:
>
>> Thanks Andrei for the recap. I want to clarify one point on the
>> labels-vs-Tag boundary, mostly to avoid having the labels discussion
>> pre-decide questions that belong in the parallel Tag discussion.
>>
>> I agree with the use cases motivating labels: exposing catalog-managed
>> context such as ownership, domain, cost attribution, classification
>> hints, and semantic hints in REST responses can be useful for engines
>> and clients. The part I am less sure about is whether those use cases
>> require a separate Labels concept in the spec, or whether they should
>> be modeled as projected metadata from a structured Tag/classification
>> model.
>>
>> My concern is dependency direction. If we introduce labels as a flat
>> generic primitive first, and later add structure for identity,
>> lifecycle, allowed values, inheritance, field-id attachment,
>> visibility, and reverse lookup, then we may end up reconstructing a
>> Tag model around labels. That feels less clear than defining the
>> structured model directly and allowing catalogs to project the
>> relevant assignments into REST responses where useful.
>>
>> In other words, I don't think the interesting question is only whether
>> labels should be flat or structured. I think the question is whether
>> labels should be a separate primitive at all, or whether the
>> read-response use cases can be covered by a projected view of
>> structured tag/classification assignments.
>>
>> Where I'd be especially careful is the phrase that tags are
>> "catalog-internal structured concepts."
>> I agree that the full Tag discussion is outside the scope of Labels
>> V1, but I would not want Labels V1 to pre-decide that structured
>> tagging/classification semantics are only catalog-internal and not an
>> IRC concept. That is exactly the separate question being explored in
>> the Tag thread.
>>
>> The factoring I'd prefer to evaluate is:
>>
>> - Tag (structured) classification: authoring, lifecycle, identity,
>>   field-id attachment, inheritance, visibility, and lookup semantics
>> - REST response projection: optional metadata returned to clients,
>>   potentially derived from structured tag assignments
>> - read-restrictions: enforcement result delivered to engines
>>
>> That framing may reduce the need for a separate Labels primitive while
>> still preserving the read-response use cases that motivated the labels
>> proposal.
>>
>> I realize this may be a bigger factoring question than Labels V1
>> intended to answer, but I think it is worth making explicit before the
>> two threads diverge. If the community wants one logical concept rather
>> than both labels and tags, I think we should at least evaluate the
>> direction where the structured Tag/classification model is the source
>> of truth and lightweight REST response metadata is a projection from
>> it, before standardizing labels as an independent primitive.
>>
>> -ej
>>
>> On Thu, Jun 11, 2026 at 11:53 AM Andrei Tserakhau via dev <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Recap from the dedicated labels sync held on May 28, 2026
>>> (recording [1]).
>>>
>>> Summary of the discussion:
>>>
>>> - Strong consensus to land the read API first, with the write API
>>>   as a separate follow-up proposal (Ryan, Sung, Kevin, Christian
>>>   aligned). Christian raised a concern that the write half could
>>>   lag behind (Trino views precedent); to address this, the
>>>   proposal will document the write-path direction alongside the
>>>   read API.
>>> - Labels remain flat key-value pairs, no internal structure.
>>>   Kubernetes labels precedent invoked: flat shape, conventions
>>>   via well-known prefixes, no spec-defined vocabulary.
>>>   Namespace-as-attribute (raised by Uladzimir Makaranka, Polaris)
>>>   discussed and set aside in favor of prefix conventions.
>>> - Labels-vs-Tag boundary: labels are the wire-protocol mechanism
>>>   for cross-catalog metadata exchange (this proposal); tags are
>>>   catalog-internal structured concepts (Snowflake, UC, Polaris
>>>   each have their own shape). Standardizing Tag itself as a
>>>   first-class spec entity is a separate effort, not in scope for
>>>   V1. EJ Wang's parallel Tag proposal on dev@ [2] is in that
>>>   direction.
>>> - Governance scope: Prashant Singh raised concerns about
>>>   positioning labels as a governance protocol: provenance,
>>>   identity mapping across IDPs, inheritance semantics. Room
>>>   aligned that labels are broader than governance: semantic
>>>   metadata exchange is the load-bearing case; governance remains
>>>   a valid use case among many, and whether to use labels for
>>>   governance is a catalog-level decision rather than a spec
>>>   mandate. Policy decisions and enforcement live in read
>>>   restrictions (PR #13879 [3]), a parallel and complementary
>>>   track.
>>> - Write API shape converging on an independent CRUD endpoint
>>>   (UpdateLabels-style verb) with a transactional path for atomic
>>>   table+label operations at create/alter time. Two-class
>>>   distinction (catalog-authored vs externally-managed labels)
>>>   reaffirmed; Ryan noted not all labels should be editable via
>>>   CRUD since many are produced by the catalog through inheritance,
>>>   classification, or automated paths.
>>> - Bulk APIs surfaced as a real need for both read (inverted index:
>>>   finding tables/columns matching given labels) and write
>>>   (applying labels at scale, classifier batch operations).
>>>   Scoped for inclusion in the write API proposal.
>>> - Pattern for adding new first-class REST concepts (labels, UDFs,
>>>   indexes, etc.): independent CRUD endpoint per concept, paired
>>>   with a transactional path for atomic operations alongside table
>>>   create/alter. Useful reference shape for future spec additions.
>>>
>>> Post-sync follow-ups already in motion:
>>>
>>> - Hot-path discipline added to the proposal in response to
>>>   Christian Thiel's doc comment: LoadTableResponse latency MUST
>>>   NOT increase due to labels; how catalogs meet this is
>>>   implementation-defined (caching, freshness trade-offs,
>>>   filtering). Capability negotiation, parallel to the work on
>>>   PR #13879 [3], is a future direction.
>>> - Use case split (high-confidence cross-catalog: semantic, domain,
>>>   classification, sensitivity vs platform-specific: owner,
>>>   principals, anything identity-bound) agreed after offline
>>>   follow-up with Prashant; will be reflected in the next proposal
>>>   revision.
>>> - A separate [DISCUSS] thread will land the substrate framing
>>>   publicly.
>>>
>>> Next sync approximately three to four weeks out. Tentative agenda:
>>> labels/Tag boundary update, write-path sketch walk-through, path
>>> to VOTE on the read API.
>>>
>>> Thanks to everyone who joined and to those continuing to engage
>>> on the design doc [4] and spec PR #15750 [5].
>>>
>>> Best,
>>> Andrei
>>>
>>> [1] https://youtu.be/P4NOQASNtPA
>>> [2] https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b
>>> [3] https://github.com/apache/iceberg/pull/13879
>>> [4] https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>> [5] https://github.com/apache/iceberg/pull/15750
>>>
>>> On Wed, May 27, 2026 at 12:12 AM Andrei Tserakhau <[email protected]> wrote:
>>>
>>>> Quick update on this one:
>>>> we'll cover this on the Dedicated Sync this Thursday (10-11am US /
>>>> 7-8pm CET).
>>>> Thanks to Daniel Weeks for getting it on the calendar.
>>>>
>>>> Labels was last on the sync agenda on 2026-04-15. Plenty of
>>>> productive offline discussion since then, mostly in the gdoc comment
>>>> threads. Thanks to everyone who engaged:
>>>>
>>>> - *Daniel Weeks*: for the IRC-spec-vs-table-spec framing that now
>>>>   anchors the Alternatives section
>>>> - *Fokko Driesprong*: for challenging motivation on the cost-based
>>>>   defense and driving the ownership reframe
>>>> - *Yufei Gu*: for the structure debate that landed us on the split
>>>>   shape
>>>> - *Sung Yun*: for the early consumption-pattern and addressing
>>>>   questions
>>>> - *Maninder Parmar*: for the properties-relationship probing
>>>> - *Christian Thiel*: for pushing on the write API direction
>>>>
>>>> Concrete changes in-doc since April:
>>>>
>>>> - Problem Statement reframed around catalog-owned metainformation
>>>>   as the load-bearing concept.
>>>> - Alternatives Considered rewritten with the IRC-spec-vs-table-spec
>>>>   boundary instead of cost arguments.
>>>> - Structure debate closed on a split shape: labels (flat k/v at the
>>>>   table level, k8s-style) + column-labels (array with field-id).
>>>>   Labels type itself is flat, with no internal structure. Same shape
>>>>   applies on LoadViewResponse and namespaces.
>>>> - CRUD companion as a second tab in the same gdoc: UpdateLabels
>>>>   REST verb, two-class distinction for catalog-managed vs
>>>>   externally-managed keys, optimistic concurrency with ETags.
>>>> - Working Trino prototype at
>>>>   https://github.com/laskoviymishka/irc-labels/pull/1 (native ALTER
>>>>   TABLE ... SET LABEL DDL translating end-to-end).
>>>>
>>>> Parallel work to flag: EJ Wang's first-class Tag concept
>>>> <https://lists.apache.org/thread/r5r3vpmrfy9wmmb4sdybwcjz1c4wld5b>
>>>> proposal on dev@. We've agreed to coordinate as paired proposals: Tag
>>>> as a separate first-class REST concept, labels as the lower-level
>>>> attachment substrate.
>>>> Both efforts share the cross-cutting interop question.
>>>>
>>>> Goal on Thursday is to walk through the current state, confirm the
>>>> split-shape lands cleanly, and identify what's needed to move toward
>>>> a VOTE on the read API. Anyone reading along is welcome to join.
>>>>
>>>> Doc (current state):
>>>> https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit
>>>>
>>>> Thanks,
>>>> Andrei
>>>>
>>>> On Tue, Mar 24, 2026 at 9:35 PM Andrei Tserakhau <[email protected]> wrote:
>>>>
>>>>> Thanks Ryan!
>>>>>
>>>>> Your point about avoiding first-class metadata requirements is
>>>>> exactly the design principle here. Labels let each catalog surface
>>>>> what it knows without the spec dictating what catalogs must track.
>>>>>
>>>>> To build on this, I put together a POC showing the approach works
>>>>> across the ecosystem.
>>>>>
>>>>> Key design principles that held up in practice:
>>>>>
>>>>> - No new requirements on catalogs. Labels are optional in the
>>>>>   response. A catalog that doesn't serve labels returns the same
>>>>>   response as today.
>>>>>
>>>>> - Catalog-scoped, not table state. Every catalog we tried already
>>>>>   has internal metadata separate from Iceberg properties: Polaris
>>>>>   has internalProperties, UC has uc_properties, Lakekeeper has
>>>>>   namespace properties in PostgreSQL. Labels just give this existing
>>>>>   metadata a standard way through the protocol.
>>>>>
>>>>> - No property overriding. Labels are explicitly separate from table
>>>>>   properties. Properties configure behavior, labels describe
>>>>>   context. Engines know which is which.
>>>>>
>>>>> What we built:
>>>>>
>>>>> - Spec change: https://github.com/apache/iceberg/pull/15750
>>>>> - PyIceberg client: https://github.com/apache/iceberg-python/pull/3191
>>>>>
>>>>> Catalog implementations:
>>>>> - Polaris: https://github.com/apache/polaris/pull/4048 (labels from
>>>>>   internalProperties)
>>>>> - Unity Catalog OSS:
>>>>>   https://github.com/unitycatalog/unitycatalog/pull/1417 (labels from
>>>>>   uc_properties)
>>>>> - Lakekeeper: https://github.com/lakekeeper/lakekeeper/pull/1676
>>>>>   (labels from namespace properties)
>>>>>
>>>>> Full demo: https://github.com/laskoviymishka/irc-labels
>>>>>
>>>>> Three catalogs, two languages (Java + Rust), 40-95 lines each. The
>>>>> pattern is the same everywhere: each catalog already has internal
>>>>> metadata that doesn't belong in table properties. Labels give it a
>>>>> standard way out through the protocol.
>>>>>
>>>>> The Polaris implementation also addresses
>>>>> https://github.com/apache/polaris/issues/3222 - the community has
>>>>> been asking for a way to surface business metadata alongside table
>>>>> loads. Labels solve this without adding any requirements beyond an
>>>>> optional field.
>>>>>
>>>>> Beyond ownership and classification, the demo also shows labels
>>>>> enabling AI agent table selection (agents reason about tables using
>>>>> semantic labels instead of guessing from column names) and
>>>>> governance via trusted engine (ClickHouse reading sensitivity labels
>>>>> to auto-generate masking policies).
>>>>>
>>>>> Happy to discuss the spec design or any of the implementation
>>>>> details.
>>>>>
>>>>> Andrei
>>>>>
>>>>> On Fri, Mar 6, 2026 at 11:25 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> I think that this is a reasonable way to solve some persistent
>>>>>> issues that we've seen.
>>>>>>
>>>>>> Many catalogs track additional metadata that is not part of the
>>>>>> table spec (or others) like "owner", and right now there is no way
>>>>>> to exchange or share that information. I'm also hesitant to start
>>>>>> including it as first-class metadata because that puts additional
>>>>>> requirements on catalogs that may not align. For instance, Tabular
>>>>>> had no concept of a table "owner" and instead used default grants
>>>>>> at the schema level. I like that this solution allows catalogs to
>>>>>> provide information in a generic way that doesn't add requirements
>>>>>> in the REST spec. And it is an alternative to overriding table
>>>>>> properties with catalog-managed information, which I think is an
>>>>>> anti-pattern.
>>>>>>
>>>>>> Thanks, Andrei! I think this is a good idea.
>>>>>>
>>>>>> On Thu, Mar 5, 2026 at 2:04 PM Andrei Tserakhau via dev <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> `LoadTableResponse` returns table metadata (schema, snapshots,
>>>>>>> file locations), but catalogs have operational context about
>>>>>>> tables that has no standard place to go: cost attribution,
>>>>>>> ownership, governance hints, semantic metadata. Right now catalogs
>>>>>>> have two options:
>>>>>>>
>>>>>>> 1. Properties: durable, commit-versioned table state. Good for
>>>>>>>    persistent metadata; wrong for ephemeral catalog context.
>>>>>>> 2. Custom fields: catalog-specific extensions with no
>>>>>>>    interoperability. Each catalog invents its own structure;
>>>>>>>    engines have no basis to read them.
>>>>>>>
>>>>>>> The community has already identified this gap. Polaris opened an
>>>>>>> issue [1] requesting a standard extension point in the IRC
>>>>>>> protocol for catalog-managed metadata. Two earlier threads [2][3]
>>>>>>> explored column-level metadata, though in the context of table
>>>>>>> format changes.
>>>>>>>
>>>>>>> We propose adding an optional `labels` field to
>>>>>>> `LoadTableResponse` for catalog-managed metadata. Labels are
>>>>>>> string key-value pairs generated per-request from the catalog's
>>>>>>> internal systems; nothing is written to table files. Engines may
>>>>>>> use or ignore them entirely. Labels give catalog providers a
>>>>>>> standard channel to surface context to any client without
>>>>>>> bilateral custom integrations for every catalog-engine pair.
>>>>>>>
>>>>>>> Details:
>>>>>>> - GitHub Issue: apache/iceberg#15521
>>>>>>> - Design Document: [4]
>>>>>>>
>>>>>>> Please review the proposal and share your feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrei
>>>>>>>
>>>>>>> [1]: https://github.com/apache/polaris/issues/3222
>>>>>>> [2]: https://lists.apache.org/thread/vwrc3m534gfyfjnsfflwtgkg158yzrb4
>>>>>>> [3]: https://lists.apache.org/thread/yflg8w1h87qgwc4s3qtog4l8nx8nk8m0
>>>>>>> [4]: https://docs.google.com/document/d/1aj-6JlfBiMYEEVtNuh5WLMOrRQiMCcyYUGbouPM4hXI/edit?usp=sharing
>>>>>>>
