> On May 21, 2019, at 1:31 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> 
> The main thing I'm talking about is how you target a deletion across time. If 
> you have a file A, and you want to delete record X in A, you define delete 
> A.X. At the same time, another process may be compacting A into A'. In so 
> doing, the position of A.X in A' is something other than X.

I would argue that this is backwards. This argues that compactions need a lock 
so that the delete either happens before or after the compaction. If it happens 
before, the delete is incorporated into the new version of the file. If it 
happens afterwards, the delete is using the new version of the file.
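
As a rough sketch of that ordering argument (names here are hypothetical, not
Iceberg's actual commit API): a position delete that references file A is only
valid against a snapshot that still contains A, so a commit-time check can
serialize the two operations. Either the delete lands first and the compaction
incorporates it, or the delete detects that A was rewritten and is recomputed
against A'.

    # Hypothetical sketch, not real Iceberg code.
    class ConflictError(Exception):
        pass

    def commit_position_deletes(current_snapshot_files, deletes):
        """deletes: list of (data_file, row_position) pairs."""
        referenced = {f for f, _ in deletes}
        missing = referenced - set(current_snapshot_files)
        if missing:
            # The referenced file was compacted away (A -> A') after the
            # deletes were computed; recompute the positions against the
            # new file and retry, i.e. the delete happens "after".
            raise ConflictError(
                "files rewritten since deletes were computed: %s" % sorted(missing))
        # Otherwise the delete commits "before" any later compaction,
        # which then incorporates it when rewriting the file.
        return referenced

    # Delete raced with a compaction of A into A-prime: must be redone.
    try:
        commit_position_deletes({"A-prime"}, [("A", 2)])
    except ConflictError as e:
        print(e)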

> At this point, the deletion needs to be rerun against A' so that we can 
> ensure that the deletion is propagated forward. If the only thing you have is 
> A.X, you need to have a way of getting to the same location in A'. You 
> should be able to take the delta file that lists the delete of A.2 and apply 
> it directly to A' without having to also consult A. If you didn't need to 
> solve this problem, then you could simply use A.X as opposed to the key of A.X 
> in your delta files.

I’d much prefer using file/row# as the reference for the synthetic key for the 
deletes. Thus, it would be file 1: rows 100, 200, and 300. That makes it clear 
that the delta can only be applied to a given version of the file. This has the 
added advantage that you know how many rows are left in the file.
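
As a rough illustration of the shape I have in mind (names and layout are
purely illustrative, not a proposed file format), the delta names one data
file and the row ordinals deleted from that exact version of it:

    # Illustrative sketch of a position-delete delta keyed by (file, row position).
    delete_delta = {
        "data_file": "file1.parquet",   # hypothetical file name
        "positions": [100, 200, 300],   # row ordinals within that file version
    }

    def live_row_count(total_rows_in_file, delta):
        # Because the delta enumerates positions, the remaining row
        # count falls out directly.
        return total_rows_in_file - len(set(delta["positions"]))

    print(live_row_count(1000, delete_delta))  # -> 997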

.. Owen

> 
> Synthetic seems relative. If the synthetic key is client-supplied, in what 
> way is it relevant to Iceberg whether it is synthetic vs. natural? By calling 
> it synthetic within Iceberg there is a strong implication that it is the 
> implementation that generates it (the filename/position key suggests that). 
> If it's the client that supplies it, it _may_ be synthetic (from the point of 
> view of the overall data model, e.g. a customer key in a database vs. a 
> customer ID that shows up on a bill), but in Iceberg's case that doesn't 
> matter. Only the uniqueness constraint does.
> 
> I agree with the main statement here: the only real requirement is that keys 
> need to be unique across all existing snapshots. There could be two generators: 
> one that uses an Iceberg-internal behavior to generate keys and one that is 
> user definable. While there could be a third that uses an existing field (or 
> set of fields) to define the key, I think we should probably avoid 
> implementing it, as it has a whole other set of problems that are best left 
> outside of Iceberg's area of concern.
> 
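
To make the two-generator split concrete, here is a rough sketch (purely
illustrative names; neither of these is an existing Iceberg interface). The
only contract either flavor has to honor is that generated keys stay unique
across all existing snapshots:

    # Hypothetical sketch of the two generator flavors discussed above.
    import itertools
    import uuid

    class FilePositionKeyGenerator:
        """Internal flavor: key derived from the data file name plus row ordinal."""
        def __init__(self, file_name):
            self.file_name = file_name
            self._positions = itertools.count()

        def next_key(self):
            return (self.file_name, next(self._positions))

    class UserSuppliedKeyGenerator:
        """User-definable flavor: any callable that promises globally unique values."""
        def __init__(self, fn=lambda: uuid.uuid4().hex):
            self._fn = fn

        def next_key(self):
            return self._fn()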
