Upserts in Iceberg

Jacques Nadeau Tue, 21 May 2019 13:39:06 -0700

> It would be useful to describe the types of concurrent operations that
> would be supported (i.e., failed snapshotting could easily be recovered,
> vs. the whole operation needing to be re-executed) vs. those that wouldn't.
> Solving for unlimited concurrency cases may create way more complexity than
> is necessary.
>


I'd like to restate my comment a little bit. We need unique keys to make
things work. They can be synthetic or not, but they should not have any
retrievable Iceberg-related data in them.

The main thing I'm talking about is how you target a deletion across time.
If you have a file A, and you want to delete record X in A, you define
delete A.X. At the same time, another process may be compacting A into A'.
In so doing, the position of A.X in A' is something other than X. At this
point, the deletion needs to be rerun against A' so that we can ensure that
the deletion is propagated forward. If the only thing you have is A.X, you
need to have a way of getting to the same location in A'. You should be
able to take the delta file that lists the delete of A.X and apply it
directly to A' without having to also consult A. If you didn't need to
solve this problem, then you could simply use A.X as opposed to the key of
A.X in your delta files.
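
To make the compaction problem concrete, here is a minimal sketch in Python. It is not Iceberg code: a data file is modeled as a plain list of (key, value) records, and the helpers `compact`, `apply_positional_delete`, and `apply_key_delete` are hypothetical names invented for this illustration. It assumes compaction drops a previously deleted record, which shifts the positions of the survivors.

```python
# Hypothetical model: a data file is a list of (key, value) records.
# None of these helpers are Iceberg APIs; they only illustrate the argument.

def compact(records, dead_keys):
    """Rewrite a file without the records in dead_keys; survivors shift position."""
    return [r for r in records if r[0] not in dead_keys]

def apply_positional_delete(records, position):
    """Apply a delete recorded as a position in the file."""
    return [r for i, r in enumerate(records) if i != position]

def apply_key_delete(records, key):
    """Apply a delete recorded as a unique record key."""
    return [r for r in records if r[0] != key]

# File A. A writer wants to delete the record at position 2, whose key is "k2".
file_a = [("k0", "a"), ("k1", "b"), ("k2", "c"), ("k3", "d")]

# Concurrently, compaction rewrites A into A', dropping the earlier-deleted "k0".
file_a_prime = compact(file_a, {"k0"})

# Replaying the positional delete against A' hits the wrong record.
by_position = apply_positional_delete(file_a_prime, 2)

# Replaying the key-based delete against A' needs no knowledge of A.
by_key = apply_key_delete(file_a_prime, "k2")

print(by_position)  # record "k2" is still present, "k3" was removed instead
print(by_key)       # record "k2" is gone
```

In this toy model the positional delete can only be carried forward by consulting A (or a mapping from A to A'), whereas the key-based delta applies to A' directly.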

> Synthetic seems relative. If the synthetic key is client-supplied, in what
> way is it relevant to Iceberg whether it is synthetic vs. natural? By
> calling it synthetic within Iceberg there is a strong implication that it
> is the implementation that generates it (the filename/position key suggests
> that). If it's the client that supplies it, it _may_ be synthetic (from the
> point of view of the overall data model; i.e. a customer key in a database
> vs. a customer ID that shows up on a bill) but from Iceberg's case that
> doesn't matter. Only the unicity constraint does.
>

I agree with the main statement here: the only real requirement is that
keys need to be unique across all existing snapshots. There could be two
generators: one that uses Iceberg-internal behavior to generate keys and
one that is user-definable. While there could be a third that uses an
existing field (or set of fields) to define the key, I think we should
probably avoid implementing it, as it has a whole other set of problems
that are best left outside of Iceberg's area of concern.
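
As a rough sketch of the two generators, assuming nothing about how Iceberg would actually implement them: below, the "internal" generator is stood in for by a random UUID (an assumption, chosen only because it is opaque and carries no file or position data), and the "user-definable" generator is any callable the client supplies. The names `internal_key` and `assign_keys` are hypothetical.

```python
import itertools
import uuid

def internal_key(record):
    """Stand-in for an implementation-generated key (assumed: a random UUID)."""
    return uuid.uuid4().hex

def assign_keys(records, generator, existing_keys):
    """Attach a key to each record, rejecting any key that is already in use."""
    seen = set(existing_keys)
    keyed = []
    for record in records:
        key = generator(record)
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen.add(key)
        keyed.append((key, record))
    return keyed

rows = [{"name": "a"}, {"name": "b"}]

# Generator 1: keys produced internally.
internal = assign_keys(rows, internal_key, existing_keys=set())

# Generator 2: keys produced by a client-supplied function.
counter = itertools.count()
user = assign_keys(rows, lambda r: f"client-{next(counter)}", existing_keys=set())
```

Either way, the only property the sketch enforces is the one that matters here: a key is never reused among the keys already known.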
