Hi all,

I wanted to follow up on some discussions that came up in one of the
Iceberg Catalog community syncs a while back about tables that can be
registered in an Iceberg REST Catalog but whose "source of truth" lives
in some external Catalog.

The original context was that Apache Polaris currently adds a
Polaris-specific method "sendNotification" on top of the otherwise standard
Iceberg REST API (
https://github.com/apache/polaris/blob/0547e8b3a9e38fedc466348d05f3d448f4a03930/spec/rest-catalog-open-api.yaml#L977)
but the goal is to come up with something that the broader community can
align on to ensure standardization long term.

This relates closely to a couple of other, more ambitious areas of
discussion that have also come up in community syncs:

   1. Catalog Federation - defining the protocol(s) by which all our
   different Iceberg REST Catalog implementations can talk to each other
   cooperatively, where entity metadata might be read-through, pushed, or
   pulled in various ways
   2. Generalized events and notifications - beyond serving the purpose of
   federation, folks have proposed a generalized model that could also be
   applied to things like workflow triggering

In the narrowest formulation there are two building blocks to consider:

   1. Expressing the concept of an "externally owned table" in an Iceberg
   REST Catalog
      1. At the most basic level, this could just mean that the target REST
      Catalog should refuse to perform mutations on the table (i.e. reject
      updateTable/commitTransaction calls on such tables) because it knows
      there's an external "source of truth" and wants to avoid causing a
      split-brain problem
   2. Endpoint for doing a "simple" register/update of a table by "forcing"
   the table metadata to the latest incarnation
      1. Instead of updates being something for this target REST Catalog to
      perform a transaction protocol for, the semantic is that the "source of
      truth" transaction is already committed in the external source, so this
      target catalog's job is simply to "trust" the latest metadata (modulo
      some watermark semantics to deal with transient errors and
      out-of-order deliveries)
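
To make the two building blocks above a bit more concrete, here is a rough
sketch in Python of how they might fit together. Everything in it is
hypothetical and only for illustration: the names SketchCatalog,
force_metadata, and ExternallyOwnedTableError, the integer watermark, and the
in-memory dict are my own assumptions, not anything in the Iceberg REST spec
or in Polaris. The watermark is assumed to be a monotonically increasing
value supplied by the external "source of truth".

```python
from dataclasses import dataclass
from typing import Dict


class ExternallyOwnedTableError(Exception):
    """Hypothetical error: a mutation was attempted on an externally owned table."""


@dataclass
class TableEntry:
    metadata_location: str
    externally_owned: bool = False
    # Assumed to be a monotonically increasing value set by the external source.
    watermark: int = 0


class SketchCatalog:
    """In-memory stand-in for a target REST Catalog (illustration only)."""

    def __init__(self) -> None:
        self.tables: Dict[str, TableEntry] = {}

    def update_table(self, name: str, new_metadata_location: str) -> None:
        # Building block 1: refuse normal commits on externally owned tables
        # so the target catalog cannot diverge from the source of truth.
        entry = self.tables[name]
        if entry.externally_owned:
            raise ExternallyOwnedTableError(
                f"{name} is externally owned; commits must go to the source catalog"
            )
        entry.metadata_location = new_metadata_location

    def force_metadata(self, name: str, metadata_location: str, watermark: int) -> bool:
        # Building block 2: trust the latest metadata from the external source.
        # Returns True if the metadata was applied, False if it was ignored.
        entry = self.tables.get(name)
        if entry is None:
            # First delivery registers the table as externally owned.
            self.tables[name] = TableEntry(metadata_location, True, watermark)
            return True
        if not entry.externally_owned:
            raise ValueError(f"{name} is owned by this catalog; refusing to overwrite")
        if watermark <= entry.watermark:
            # Stale or duplicate delivery (retry or out-of-order): ignore it.
            return False
        entry.metadata_location = metadata_location
        entry.watermark = watermark
        return True
```

With this sketch, delivering watermark 3 and then a late watermark 2 leaves
the table pointing at the metadata from watermark 3, and a retried delivery of
watermark 3 is a no-op, which is roughly the behavior I'd expect the
watermark semantics to give us.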

Interestingly, it appears a GitHub issue was filed a while back for
some formulation of (2), but it was closed silently:
https://github.com/apache/iceberg/issues/7261

It seems like there's an opportunity to find a good balance between breadth
of scope, generalizability and practicality in terms of which building
blocks can be defined in the core spec and which broader, more ambitious
features can be built on top of them.

Would love to hear everyone's thoughts on this.

Cheers,
Dennis
