Hi Matt,

Thank you for the feedback. I will go through the discussion and see where my idea fits.

Thanks,
Chandra Sekhar

On Wed, 24 Jun 2026 at 8:32 PM, Matt Butrovich <[email protected]> wrote:

> Hi Chandra,
>
> There has been recent discussion (and community calls) on adding
> constraint support (including PRIMARY KEY). Could you take a look at the
> proposal and see where your ideas fit within and maybe conflict and/or
> extend it?
> https://docs.google.com/document/d/1re65fx3uqC7I_tJuS79IxLiB7HEN2Grt5qRIDjd3p-4/edit?tab=t.0#heading=h.o38ny2ndrd79
>
> It would be great to bring your ideas to that venue.
>
> Thanks,
>
> Matt
>
> On Sat, Jun 20, 2026 at 12:34 AM chandra sekhar k <
> [email protected]> wrote:
>
>> Hi Iceberg Community,
>>
>> We would like to start a discussion about introducing native primary-key
>> table support in Apache Iceberg.
>>
>> Background
>> ==========
>>
>> Apache Iceberg has become a widely adopted table format for large-scale
>> analytic datasets and provides strong support for schema evolution,
>> partition evolution, row-level operations, and incremental processing.
>>
>> At the same time, an increasing number of users are building CDC-driven
>> and operational analytics workloads where data is naturally organized
>> around primary keys and continuously updated through inserts, updates, and
>> deletes.
>>
>> While Iceberg provides important building blocks such as identifier
>> fields, equality deletes, position deletes, and MERGE operations, there is
>> currently no standardized primary-key table abstraction within the Iceberg
>> specification.
>>
>> Motivation
>> ==========
>>
>> Many modern data lake workloads rely on:
>>
>> * Database CDC ingestion
>> * Streaming upsert pipelines
>> * Data synchronization between transactional systems and data lakes
>> * Near real-time operational analytics
>> * Incremental changelog consumption
>>
>> These workloads often require:
>>
>> * Primary-key based update semantics
>> * Efficient handling of high-frequency updates and deletes
>> * Storage layouts optimized for mutable data
>> * Efficient compaction strategies
>> * Standardized changelog generation and consumption
>>
>> Today, users typically implement these capabilities through
>> engine-specific solutions or custom ingestion frameworks, which can lead to
>> inconsistent behavior across engines and increased operational complexity.
>>
>> Existing Iceberg Capabilities and Gaps
>> ======================================
>>
>> Iceberg already provides several important capabilities for mutable
>> datasets:
>>
>> * Identifier fields
>> * Equality deletes
>> * Position deletes
>> * MERGE INTO support through compute engines
>> * Incremental snapshot processing
>>
>> However, these features primarily serve as low-level primitives and do
>> not provide a complete primary-key table model.
>>
>> For example:
>>
>> * Identifier fields define row identity but do not provide write
>> semantics.
>> * MERGE operations are engine-specific and may behave differently across
>> engines.
>> * Equality deletes can become expensive for heavy CDC workloads.
>> * There is currently no standard mechanism for organizing data around
>> primary keys or exposing changelog semantics.
>>
>> As a result, users building CDC and streaming upsert workloads often need
>> significant custom infrastructure on top of Iceberg.
>>
>> Industry Context
>> ================
>>
>> Several lakehouse systems have introduced native support for
>> primary-key-oriented workloads.
>>
>> For example, Apache Paimon provides primary-key tables with built-in
>> support for upserts, changelog production, and storage layouts optimized
>> for mutable data. These capabilities have proven useful for streaming and
>> CDC scenarios.
>>
>> At the same time, many organizations have already standardized on Iceberg
>> as their table format and would benefit from similar capabilities without
>> requiring adoption of a separate table format.
>>
>> This raises the question of whether a standardized primary-key table
>> abstraction should be part of Iceberg itself.
>>
>> Initial Proposal
>> ================
>>
>> We would like to discuss introducing a first-class primary-key table
>> abstraction in Iceberg.
>>
>> Conceptually, users could define tables such as:
>>
>> CREATE TABLE orders (
>>     order_id BIGINT PRIMARY KEY,
>>     customer_id BIGINT,
>>     amount DECIMAL(18,2),
>>     updated_at TIMESTAMP
>> );
>>
>> The intent is not to provide OLTP-style uniqueness enforcement or
>> database constraints.
>>
>> Instead, the goal is to provide a standard storage and processing model
>> for mutable datasets organized around primary keys.
>>
>> Potential capabilities could include:
>>
>> * Primary-key metadata stored as part of table metadata
>> * Standardized primary-key write semantics
>> * Primary-key aware compaction and maintenance
>> * Efficient changelog generation for downstream consumers
>> * Optimized storage organization for mutable workloads
>> * Consistent behavior across engines
>>
>> The feature would be optional and would not affect existing Iceberg
>> tables or workloads.
>>
>> Open Questions
>> ==============
>>
>> We would appreciate feedback from the community on the following topics:
>>
>> 1. Is a native primary-key table abstraction within the scope and vision
>> of Iceberg?
>>
>> 2. Are existing Iceberg features sufficient to address these use cases?
>>
>> 3. What are the advantages or disadvantages of introducing primary-key
>> semantics at the table-format level?
>>
>> 4. Should Iceberg standardize changelog and mutable-data handling for CDC
>> workloads?
>>
>> 5. What compatibility or interoperability concerns should be considered?
>>
>> 6. Would the community be interested in reviewing a detailed design
>> proposal if there is agreement on the problem statement?
>>
>> At Huawei, we have been experimenting with primary-key table semantics in
>> production environments for CDC-driven and mutable-data workloads. The
>> experience has highlighted both the demand for these capabilities and the
>> challenges of building them consistently on top of existing primitives.
>> Based on these experiences, we would like to discuss whether a standardized
>> approach belongs in Iceberg.
>>
>> If there is interest from the community, we would be happy to share a
>> detailed design proposal covering metadata representation, write/read
>> semantics, compaction strategies, changelog support, and engine
>> integrations.
>>
>> Looking forward to hearing the community's thoughts.
>>
>> Thank you for your consideration,
>> Chandra Sekhar
>>
>
