On Thu, Jun 20, 2019 at 12:57 PM Ryan Blue <rb...@netflix.com> wrote:

> Sounds like we’re in agreement on the direction! Let’s have a sync up
> sometime next week to make sure we are agreed and plan some of this work.
> What days work best for everyone?
>
Tuesday until 2PM PT
Wednesday 11AM -> 2PM PT
Thursday 9AM -> 10:30AM / 12PM -> 2PM PT
Friday until 2PM PT


> Here’s a quick summary of the approach from my perspective:
>
>    - Will define two options for writing delete files that identify rows
>    by:
>       1. File and offset for MERGE INTO or similar operations, and
>       2. Column equality for encoding deletes without reading data
>    - Will add a delete file type to manifests
>    - Will add sequence numbers to existing metadata to track when to
>    apply delete files to data files (in time), used by both delete file
>    options
>
I don't see that we need these for file/offset-deletes, since they apply
to a specific file. They're not harmful, but they don't seem relevant.

>
>    - Will add a tracking sequence number to manifest file metadata in the
>    manifest list, as well as a min sequence number for each manifest (min data
>    seq as well?)
>
I don't understand the purpose of the min sequence number, nor what the
"min data seq" is.

>
>    - For column equality deletes, the partition of the file in the
>    manifest will track the partition that it applies to (stats filtering to be
>    done at read time)
>
> I’d also like to have a way to encode deletes across the entire dataset,
> for GDPR deletes (e.g. delete a user from all sessions partitioned by
> time). Any ideas about how to encode this?
>
Off the top of my head, this requires adding information to the manifest
file indicating the columns that are used for the deletion. Only equality
would be supported; if multiple columns were used, they would be combined
with a boolean AND. I don't see anything too tricky about it.
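
To make that concrete, here is a rough sketch of the matching rule in Java.
The names (EqualityDelete, equalityColumns, and so on) are made up for
illustration; this is not a proposal for the metadata layout:

import java.util.List;
import java.util.Map;
import java.util.Objects;

public class EqualityDeleteSketch {
  // Columns used for matching, e.g. ["user_id"], recorded with the delete file.
  record EqualityDelete(List<String> equalityColumns, Map<String, Object> values) {
    // A row is deleted when every equality column matches (boolean AND).
    boolean matches(Map<String, Object> row) {
      return equalityColumns.stream()
          .allMatch(col -> Objects.equals(values.get(col), row.get(col)));
    }
  }

  public static void main(String[] args) {
    EqualityDelete gdprDelete =
        new EqualityDelete(List.of("user_id"), Map.of("user_id", 42L));
    System.out.println(gdprDelete.matches(Map.of("user_id", 42L, "ts", 99L))); // true
    System.out.println(gdprDelete.matches(Map.of("user_id", 7L, "ts", 99L)));  // false
  }
}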

More replies to specific points:
>
> If there is a monotonically increasing value per snapshot, should we
> simply update the requirement for the snapshot ID to require it to be
> monotonically increasing?
>
> I was thinking about this and I don’t think it’s a good idea, but I could
> be convinced otherwise.
>
> When compacting data and delete files, the sequence number of the
> compacted result is set to the sequence of the latest delete file applied
> to the data, but the compacted file is actually added in the replace
> snapshot. So we wouldn’t be able to identify the files that were added by a
> compaction without running a diff with another snapshot. I think that we
> don’t want to lose that information because we want to be able to recover
> what operation created any given file, if the snapshot is still around with
> metadata.
>
You've convinced me, below, that we need to be able to commit new files in
a new snapshot with a past sequence number.

So far, we will have the following information per file:

   - The snapshot that added it.
   - The snapshot that deleted it.
   - Its sequence number (which is, indirectly, a snapshot ID, but not
   necessarily the one in which the file was added)

Regardless of whether the sequence number and the snapshot are the same or
simply have a 1:1 mapping, these three properties need to be tracked. It's
true that using different names (and different value spaces) for them
reduces some potential confusion.
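
As a rough illustration of those three properties (hypothetical names only,
not a suggestion for how the manifest entry should actually look):

import java.util.OptionalLong;

public class FileTrackingSketch {
  // The three per-file properties listed above; names are illustrative.
  record TrackedFile(
      String path,
      long addedSnapshotId,            // snapshot that added the file
      OptionalLong deletedSnapshotId,  // snapshot that deleted it, if any
      long sequenceNumber) {}          // may be older than the adding snapshot

  public static void main(String[] args) {
    // A file written by a compaction in snapshot 9002 that keeps the older
    // sequence number (12) of the latest delete it applied.
    TrackedFile compacted = new TrackedFile(
        "s3://bucket/data/00001.parquet", 9002L, OptionalLong.empty(), 12L);
    System.out.println(compacted);
  }
}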

I would assume that the maximum sequence number should appear in the table
metadata, as the alternative requires scanning manifests. Furthermore, I
could imagine some cases where the number goes backwards according to the
manifests (because the deletes from 13 are applied, but not all of the
deletes from 12, so the compacted files are given sequence number 12 and
there are no remaining files from number 13).

Given that, would you make it optional to assign a sequence number to a
snapshot? "Replace" snapshots would not need one.
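
To make the question concrete, a purely illustrative sketch with made-up
names (lastSequenceNumber standing in for a field in the table metadata):

import java.util.OptionalLong;

public class SequenceAssignmentSketch {
  enum Operation { APPEND, DELETE, OVERWRITE, REPLACE }

  // Stands in for a last-sequence-number field in the table metadata file.
  static long lastSequenceNumber = 13L;

  // Returns the sequence number for a new snapshot, or empty if none is assigned.
  static OptionalLong assignSequence(Operation op) {
    if (op == Operation.REPLACE) {
      return OptionalLong.empty();  // a replace/compaction snapshot would not need its own
    }
    return OptionalLong.of(++lastSequenceNumber);
  }

  public static void main(String[] args) {
    System.out.println(assignSequence(Operation.APPEND));   // OptionalLong[14]
    System.out.println(assignSequence(Operation.REPLACE));  // OptionalLong.empty
  }
}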

> The sequence number assigned to the resulting files is irrelevant if *all*
> relevant deletes have been applied.
>
> I agree, but we can also use the sequence number to avoid commit
> conflicts. If a delete is committed just before a compaction, we don’t want
> to re-run the compaction. Using the highest sequence number from the delete
> files applied to the data, we can confidently commit the compaction,
> knowing that any new deltas will be applied.
>
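A minimal sketch of the property you're describing, with hypothetical names,
just to check my understanding:

import java.util.List;

public class CompactionConflictSketch {
  record DeleteFile(long sequenceNumber) {}

  // The compacted file keeps the highest sequence number among the deletes it applied.
  static long compactedSequence(long dataFileSeq, List<DeleteFile> appliedDeletes) {
    return appliedDeletes.stream()
        .mapToLong(DeleteFile::sequenceNumber)
        .max()
        .orElse(dataFileSeq);
  }

  // Read-time rule: a delete applies to data files with a lower sequence number.
  static boolean applies(DeleteFile delete, long dataFileSeq) {
    return delete.sequenceNumber() > dataFileSeq;
  }

  public static void main(String[] args) {
    long compactedSeq = compactedSequence(10L, List.of(new DeleteFile(11L), new DeleteFile(12L)));
    DeleteFile concurrent = new DeleteFile(13L);  // committed while the compaction ran
    System.out.println(compactedSeq);                       // 12
    System.out.println(applies(concurrent, compactedSeq));  // true: still applied on read
  }
}

If that's right, a delete committed just before the compaction simply gets a
higher sequence number and is still applied to the compacted files on read.
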
> Ultimately, all files would appear to be eligible for compaction in any
> rewrite snapshot.
>
> Yes, which I think is a benefit, right?
>
Absolutely.

> I don’t think that the other approach inhibits aging off data, it just
> represents the deletion of the data differently.
>
> True, but if we can preserve the existing functionality then this is a
> quick operation. Usually, these file-level deletes align with partitioning
> so they are quick to run. And, we can prune any delete files that no longer
> have data files after it happens.
>
> On Thu, Jun 20, 2019 at 8:06 AM Erik Wright <erik.wri...@shopify.com>
> wrote:
>
>>
>>
>> On Wed, Jun 19, 2019 at 6:39 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Replies inline.
>>>
>>> On Mon, Jun 17, 2019 at 10:59 AM Erik Wright
>>> <erik.wri...@shopify.com.invalid> wrote:
>>>
>>>>> Because snapshots and versions are basically the same idea, we don’t
>>>>> need both. If we were to add versions, they should replace snapshots.
>>>>>
>>>>
>>>> I'm a little confused by this. You mention sequence numbers quite a
>>>> bit, but then say that we should not introduce versions. From my
>>>> understanding, sequence numbers and versions are essentially identical.
>>>> What has changed is where they are encoded (in file metadata vs. in
>>>> snapshot metadata). It would also seem necessary to keep them distinct from
>>>> snapshots in order to be able to collapse/compact files older than N
>>>> (oldest valid version) without losing the ability to incrementally apply
>>>> the changes from N + 1.
>>>>
>>> Snapshots are versions of the table, but Iceberg does not use a
>>> monotonically increasing identifier to track them. Your proposal added
>>> table versions in addition to snapshots to get a monotonically increasing
>>> ID, but my point is that we don’t need to duplicate the “table state at
>>> some time” idea by adding a layer of versions inside snapshots. We just
>>> need a monotonically increasing ID associated with each snapshot. That ID
>>> is what I’m calling a sequence number. Each snapshot will have a new sequence
>>> number used to track the order of files for applying changes that are
>>> introduced by that snapshot.
>>>
>> If there is a monotonically increasing value per snapshot, should we
>> simply update the requirement for the snapshot ID to require it to be
>> monotonically increasing?
>>
>>
>>> With sequence numbers embedded in metadata, you can still compact. If
>>> you have a file with sequence number N, then updated by sequence numbers
>>> N+2 and N+3, you can compact the changes from N and N+2 by writing the
>>> compacted file with sequence N+2. Rewriting files in N+2 is still possible,
>>> just like with the scheme you suggested, where the version N+2 would be
>>> rewritten.
>>>
>>> A key difference is that because the approach using sequence numbers
>>> doesn’t require keeping state around in the root metadata file, tables are
>>> not required to rewrite or compact data files just to keep the table
>>> functioning. If the data won’t be read, then why bother changing it?
>>>
>>>> Because I am having a really hard time seeing how things would work
>>>> without these sequence numbers (versions), and given their prominence in
>>>> your reply, my remaining comments assume they are present, distinct from
>>>> snapshots, and monotonically increasing.
>>>>
>>> Yes, a sequence number would be added to the existing metadata. Each
>>> snapshot would produce a new sequence number.
>>>
>>>>> Another way of thinking about it: if versions are stored as data/delete
>>>>> file metadata and not in the table metadata file, then the complete history
>>>>> can be kept.
>>>>
>>>>
>>>> Sure. Although version expiry would still be an important process to
>>>> improve read performance and reduce storage overhead.
>>>>
>>> Maybe. If it is unlikely that the data will be read, the best option may
>>> be to leave the original data file and associated delete file.
>>>
>> It's safe to say that it is optional for any given use case, but that it is
>> a required capability.
>>
>>>> Based on your proposal, it would seem that we can basically collapse any
>>>> given data file with its corresponding deletions. It's important to
>>>> consider how this will affect incremental readers. The following make sense
>>>> to me:
>>>>
>>>>    1. We should track an oldest-valid-version in the table metadata.
>>>>    No changes newer than this version should be collapsed, as a reader who has
>>>>    consumed up to that version must be able to continue reading.
>>>>
>>> Table versions are tracked by snapshots, which are kept for a window of
>>> time. The oldest valid version is the oldest snapshot in the table. New
>>> snapshots can always compact or rewrite older data and incremental
>>> consumers ignore those changes by skipping snapshots where the operation
>>> was replace because the data has not changed.
>>>
>>> Incremental consumers operate on append, delete, and overwrite snapshots
>>> and consume data files that are added or deleted. With delete files, these
>>> would also consume delete files that were added in a snapshot.
>>>
>> Sure, when a whole file is deleted there is already a column for
>> indicating that, so we do not need to represent it a second time.
>>
>>
>>> It is unlikely that the sequence number would be used by incremental
>>> consumers because the changes are already available for each snapshot. This
>>> distinction is the main reason why I’m calling these sequence numbers and
>>> not “version”. The sequence number for a file indicates what delete files
>>> need to be applied; files can be rewritten and the new copy gets a different
>>> sequence number as you describe just below.
>>>
>> Interesting. This explains the difference in the way we are looking at
>> this. Yes, I agree that this ought to work and seems more consistent with
>> the way that Iceberg works currently.
>>
>>>
>>>>    1. Any data file strictly older than the oldest-valid-version is
>>>>    eligible for delete collapsing. During delete collapsing, all deletions up
>>>>    to version N (any version <= the oldest-valid-version) are applied to the
>>>>    file. *The newly created file should be assigned a sequence number
>>>>    of N.*
>>>>
>>> First, a data file is kept until the last snapshot that references it
>>> is expired. That enables incremental consumers without a restriction on
>>> rewriting the file in a later replace snapshot.
>>>
>>> Second, I agree with your description of rewriting the file. That’s what
>>> I was trying to say above.
>>>
>> Given that incremental consumers will skip "replace" snapshots and will
>> not attempt to incrementally read the changes introduced in version "N"
>> using the manifest list file for version "N+1", there would appear to be no
>> restrictions on how much data can be collapsed in any rewrite operation.
>> The sequence number assigned to the resulting files is irrelevant if _all_
>> relevant deletes have been applied.
>>
>>>
>>>>    1. Any delete file that has been completely applied is eligible for
>>>>    expiry. A delete file is completely applied if its sequence number is <=
>>>>    the sequence numbers of all data files to which it may apply (based on
>>>>    partitioning data).
>>>>
>>> Yes, I agree.
>>>
>>>
>>>>    1. Any data file equal to or older than the oldest-valid-version is
>>>>    eligible for compaction. During compaction, some or all data files up to
>>>>    version N (any version <= the oldest-valid version) are selected. There may
>>>>    not be any deletions applicable to any of the selected data files. The
>>>>    selected data files are read, rewritten as desired (partitioning,
>>>>    compaction, sorting, etc.). The new files are included in the new snapshot
>>>>    (*with sequence number N*) while the old files are dropped.
>>>>
>>> Yes, I agree.
>>>
>> Ultimately, all files would appear to be eligible for compaction in any
>> rewrite snapshot.
>>
>>>
>>>>    1. Delete collapsing, delete expiry, and compaction may be
>>>>    performed in a single commit. The oldest-valid-version may be advanced
>>>>    during this process as well. The outcome must logically be consistent with
>>>>    applying these steps independently.
>>>>
>>> Because snapshots (table versions) are independent, they can always be
>>> expired. Incremental consumers must process a new snapshot before it is old
>>> enough to expire. The time period after which snapshots are expired is up
>>> to the process running ExpireSnapshots.
>>>
>> Agreed.
>>
>>> Important to note in the above is that, during collapsing/compaction,
>>>>    new files may be emitted with a version number N that is older than the
>>>> most recent version. This is important to ensure that all deltas newer than
>>>> N are appropriately applied, and that incremental readers are able to
>>>> continue processing the dataset.
>>>>
>>> Because incremental consumers operate using snapshots and not sequence
>>> numbers, I think this is decoupled in the approach I’m proposing.
>>>
>> Agreed.
>>
>>>>> It is also compatible with file-level deletes used to age off old data
>>>>> (e.g. delete where day(ts) < '2019-01-01').
>>>>
>>>>
>>>> This particular operation is not compatible with incremental
>>>> consumption.
>>>>
>>> I disagree. This just encodes deletes for an entire file's worth of rows
>>> in the manifest file instead of in a delete file. Current metadata tracks
>>> when files are deleted and we would also associate a sequence number with
>>> those changes, if we needed to. Incremental consumers will work like they
>>> do today: they will consume snapshots that contain changes
>>> (append/delete/overwrite) and will get the files that were changed in that
>>> snapshot.
>>>
>> Agreed.
>>
>>
>>> Also, being able to cleanly age off data is a hard requirement for
>>> Iceberg. We all need to comply with data policies with age limits and we
>>> need to ensure that we can cleanly apply those changes.
>>>
>> I don't think that the other approach inhibits aging off data, it just
>> represents the deletion of the data differently. In any case, I agree that
>> we can reuse the existing "deleted in snapshot" mechanism.
>>
>>>> One must still generate a deletion that is associated with a new
>>>> sequence number. I could see a reasonable workaround where, in order to
>>>> delete an entire file (insertion file `x.dat`, sequence number X) one would
>>>> reference the same file a second time (deletion file `x.dat`, sequence
>>>> number Y > X). Since an insertion file and a deletion file have compatible
>>>> formats, there is no need to rewrite this file simply to mark each row in
>>>> it as deleted. A savvy consumer would be able to tell that this is a
>>>> whole-file deletion while a naïve consumer would be able to apply them
>>>> using the basic algorithm.
>>>>
>>> With file-level deletes, the files are no longer processed by readers,
>>> unless an incremental reader needs to read all of the deletes from the file.
>>>
>>>> If it seems like we are roughly on the same page I will take a stab at
>>>> updating that document to go to the same level of detail that it does now
>>>> but use the sequence-number approach.
>>>>
>>> Sure, that sounds good.
>>>
>> Phew. I think we are nearly there!
>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
