Hi Dennis,

Thanks for your initiative!
I believe externally owned Tables / Federation would enable a variety of new 
use-cases in Data Mesh scenarios.

Personally, I currently favor pull-based over push-based approaches (think 
/changes endpoint) for the following reasons:

  *   Less ambiguity around failures / missing messages. What does the sender 
do if a POST fails? How often is it retried? What is the fallback behavior? If 
a message is missed, how would the reflecting catalog ever get back to the 
correct state? In contrast, a pull-based approach is quite clear: the 
reflecting catalog is responsible for storing a pointer and can handle retries 
internally (see the sketch after this list).
  *   Changes are not only relevant for other catalogs, but for a variety of 
other systems that might want to act on them. Those systems might not expose a 
REST API of their own and certainly don’t want to implement the Iceberg REST 
protocol (e.g. /config) just to receive pushed changes.
  *   Pull-based approaches need less configuration – only the reflecting 
catalog needs to be configured. This follows the behavior the other endpoints 
already have with their clients. I don’t think the “owning” catalog needs to 
know where it’s federated to – very much like it doesn’t need to know which 
query engines access it.
  *   The "Push" feature itself not part of spec, thus making it easier for 
Catalogs to just implement the receiving end without the actual "push" and 
still be 100% spec compliant - without being fully integrable with other 
catalogs. This is also a problem regarding my first point: push & receive 
behaviors and expectations must match between sender and receiver – and we 
don’t have a good place to document the “push” part.
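
To make the pull side concrete, here is a rough sketch (in Java) of what a 
reflecting catalog's poll loop could look like. The /v1/changes path, the 
"after" parameter, and the token-store helpers are all made up for 
illustration; nothing like this exists in the spec today.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical poll loop of a reflecting catalog. It owns its pointer (the
// last change token it has applied) and retries simply by polling again.
public class ChangesPoller {

    private final HttpClient http = HttpClient.newHttpClient();
    private final URI owningCatalog = URI.create("https://owning-catalog.example.com");

    // The pointer lives with the reflecting catalog, e.g. in its own metastore.
    private String lastSeenToken = loadTokenFromLocalStore();

    public void pollOnce() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                owningCatalog.resolve("/v1/changes?after=" + lastSeenToken))
            .GET()
            .build();

        HttpResponse<String> response =
            http.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() != 200) {
            return; // transient failure: keep the old pointer, poll again later
        }

        // Parse the (hypothetical) response, refresh the affected tables via
        // the regular loadTable endpoint, then advance and persist the pointer.
        lastSeenToken = applyChangesAndReturnToken(response.body());
        saveTokenToLocalStore(lastSeenToken);
    }

    private static String loadTokenFromLocalStore() { return "0"; }
    private static void saveTokenToLocalStore(String token) { /* persist locally */ }
    private String applyChangesAndReturnToken(String body) { return lastSeenToken; }
}

Failure handling collapses to "poll again with the old pointer", which is 
exactly the simplification I am after.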

I would design a /changes endpoint to only contain the information THAT 
something changed, not WHAT changed – to keep it lightweight. For full change 
tracking I believe event queues / streaming solutions such as Kafka or NATS 
are better suited. Standardizing events could be a second step. In our catalog 
we are just using CloudEvents wrapped around `TableUpdates`, enriched with a 
few extensions.
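
Just to illustrate the “THAT, not WHAT” idea, a single /changes entry could be 
little more than a table identifier plus a monotonically increasing token. The 
record and field names below are invented for illustration, not a proposal for 
concrete spec text.

// Hypothetical shape of a single /changes entry: enough to know THAT a table
// changed and in which order, but no TableMetadata or TableUpdates payload.
public record TableChange(
    String namespaceName,   // e.g. "analytics"
    String tableName,       // e.g. "orders"
    long changeToken,       // monotonically increasing per owning catalog
    String changeType) {    // e.g. "UPDATED", "DROPPED", "RENAMED"
}

The reflecting catalog would then call the existing loadTable endpoint for 
each entry, while the full `TableUpdates` only travel over the streaming 
channel (CloudEvents on Kafka/NATS in our case) for consumers that really need 
them.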

For both pull- and push-based approaches, your building block 1) is needed 
anyway – so that’s surely common ground.

I would be interested to hear some more of the motivation on your side, 
@Dennis, for choosing the push-based approach – maybe I am looking at this too 
specifically for my own use-case.

Thanks!
Christian




From: Dennis Huo <huoi...@gmail.com>
Date: Thursday, 19. September 2024 at 05:46
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: [DISCUSS] Defining a concept of "externally owned" tables in the REST 
spec
Hi all,

I wanted to follow up on some discussions that came up in one of the Iceberg 
Catalog community syncs a while back relating to the concept of tables that 
can be registered in an Iceberg REST Catalog but which have their "source of 
truth" in some external Catalog.

The original context was that Apache Polaris currently adds a Polaris-specific 
method "sendNotification" on top of the otherwise standard Iceberg REST API 
(https://github.com/apache/polaris/blob/0547e8b3a9e38fedc466348d05f3d448f4a03930/spec/rest-catalog-open-api.yaml#L977), 
but the goal is to come up with something that the broader community can align 
on to ensure standardization long term.

This relates closely to a couple other more ambitious areas of discussion that 
have also come up in community syncs:

  1.  Catalog Federation - defining the protocol(s) by which all our different 
Iceberg REST Catalog implementations can talk to each other cooperatively, 
where entity metadata might be read-through, pushed, or pulled in various ways
  2.  Generalized events and notifications - beyond serving the purpose of 
federation, folks have proposed a generalized model that could also be applied 
to things like workflow triggering

In the narrowest formulation there are two building blocks to consider:

  1.  Expressing the concept of an "externally owned table" in an Iceberg REST 
Catalog

     *   At the most basic level, this could just mean that the target REST 
Catalog should refuse to perform mutation dances on the table (i.e. reject 
updateTable/commitTransaction calls on such tables) because it knows there's an 
external "source of truth" and wants to avoid causing a split-brain problem

  2.  Endpoint for doing a "simple" register/update of a table by "forcing" 
the table metadata to the latest incarnation

     *   Instead of updates being something for this target REST Catalog to 
perform a transaction protocol for, the semantic is that the "source of truth" 
transaction is already committed in the external source, so this target 
catalog's job is simply to "trust" the latest metadata (modulo some watermark 
semantics to deal with transient errors and out-of-order deliveries; see the 
sketch below)
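
To make (2) a bit more concrete, here is a minimal sketch of the "trust the 
latest metadata" semantics, where the watermark is just a per-table sequence 
number; all names are invented for illustration rather than a proposed API.

// Illustrative only: how a target catalog might apply a pushed metadata update
// while guarding against out-of-order or duplicate deliveries. The
// MetadataStore interface and field names are invented for this sketch.
public class ExternalTableReceiver {

    interface MetadataStore {
        long currentWatermark(String tableIdent);  // last applied sequence number
        void put(String tableIdent, long watermark, String metadataLocation);
    }

    private final MetadataStore store;

    ExternalTableReceiver(MetadataStore store) {
        this.store = store;
    }

    // The "source of truth" has already committed; the target catalog only
    // decides whether this delivery is newer than what it has, and otherwise
    // drops it silently.
    public boolean apply(String tableIdent, long watermark, String metadataLocation) {
        if (watermark <= store.currentWatermark(tableIdent)) {
            return false; // stale or duplicate delivery: ignore it
        }
        store.put(tableIdent, watermark, metadataLocation);
        return true;
    }
}

The key property is that the target catalog never runs its own commit protocol 
for such tables; it only decides whether a delivery is newer than what it 
already has.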

Interestingly, it appears there was a GitHub issue filed a while back for some 
formulation of (2) that was closed silently: 
https://github.com/apache/iceberg/issues/7261

It seems like there's an opportunity to find a good balance between breadth of 
scope, generalizability and practicality in terms of what building blocks can 
be defined in the core spec and what broader/ambitious features can be built on 
top of it.

Would love to hear everyone's thoughts on this.

Cheers,
Dennis
