I'm available Thursday and Friday this week as well, but it's a holiday in the US, so some people may be out. If there are no objections from anyone who would like to attend, then I'm up for that.
On Wed, Jul 3, 2019 at 10:40 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:

> I apologize for the delay on my side. I'll still have to go through the
> last emails. I am available on Thursday/Friday this week and it would be
> great to sync.
>
> Thanks,
> Anton
>
> On 3 Jul 2019, at 01:29, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Sorry I didn't get back to this thread last week. Let's try to have a
> video call to sync up on this next week. What days would work for
> everyone?
>
> rb
>
> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <erik.wri...@shopify.com> wrote:
>
>> With regards to operation values, currently they are:
>>
>> - append: data files were added and no files were removed.
>> - replace: data files were rewritten with the same data; i.e.,
>>   compaction, changing the data file format, or relocating data files.
>> - overwrite: data files were deleted and added in a logical overwrite
>>   operation.
>> - delete: data files were removed and their contents logically deleted.
>>
>> If deletion files (with or without data files) are appended to the
>> dataset, will we consider that an `append` operation? If so, and if
>> deletion and/or data files are appended while whole files are also
>> deleted, will we consider that an `overwrite`?
>>
>> Given that the only apparent purpose of the operation field is to
>> optimize snapshot expiration, the above seems to meet its needs. An
>> incremental reader can also skip `replace` snapshots, but no others.
>> Once it decides to read a snapshot, I don't think there's any difference
>> in how it processes the data for the append/overwrite/delete cases.
>>
>> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I don't see that we need [sequence numbers] for file/offset-deletes,
>>> since they apply to a specific file. They're not harmful, but they
>>> don't seem relevant.
>>>
>>> These delete files will probably contain a path and an offset and
>>> could contain deletes for multiple files. In that case, the sequence
>>> number can be used to eliminate delete files that don't need to be
>>> applied to a particular data file, just like the column equality
>>> deletes. Likewise, it can be used to drop the delete files when there
>>> are no data files with an older sequence number.
>>>
>>> I don't understand the purpose of the min sequence number, nor what
>>> the "min data seq" is.
>>>
>>> Min sequence number would be used for pruning delete files without
>>> reading all the manifests to find out if there are old data files. If
>>> no manifest with data for a partition contains a file older than some
>>> sequence number N, then any delete file with a sequence number < N can
>>> be removed.
>>>
>> OK, so the minimum sequence number is an attribute of manifest files.
>> Sounds good. It can likely permit us to optimize compaction operations
>> as well (i.e., you can easily limit the operation to a subset of
>> manifest files as long as they are the oldest ones).
>>
>>> The "min data seq" is the minimum sequence number of a data file. That
>>> seems like what we actually want for the pruning I described above.
>>>
>> I would expect a data file (appended rows or deletions by column value)
>> to have a single sequence number that applies to the whole file. Even a
>> delete-by-file-and-offset file can do with only a single sequence
>> number (which must be larger than the sequence numbers of all deleted
>> files). Why do we need a "minimum" data sequence per file?
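To make the sequence-number pruning described in this exchange concrete, here is a minimal Python sketch. It assumes every data file and delete file carries a single sequence number, and that a delete file applies only to data files with a strictly older sequence number; the class and function names are illustrative, not Iceberg APIs.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    sequence_number: int

@dataclass
class DeleteFile:
    path: str
    sequence_number: int

def applicable_deletes(data_file: DataFile, delete_files: list) -> list:
    # A delete file applies only to data files written before it, i.e.
    # data files with a strictly older sequence number (assumption).
    return [d for d in delete_files
            if d.sequence_number > data_file.sequence_number]

def prune_delete_files(delete_files: list, min_data_seq: int) -> list:
    # If no manifest in the partition holds a data file older than
    # min_data_seq, a delete file with sequence number < min_data_seq
    # matches nothing and can be dropped without reading the manifests.
    return [d for d in delete_files
            if d.sequence_number >= min_data_seq]
```

For example, if every live data file in a partition has sequence number >= 5, `prune_delete_files(deletes, 5)` drops a delete file written at sequence number 3: every data file it could have applied to has already been rewritten or removed.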
>>> Off the top of my head, [supporting non-key deletes] requires adding
>>> additional information to the manifest file, indicating the columns
>>> that are used for the deletion. Only equality would be supported; if
>>> multiple columns were used, they would be combined with boolean-and. I
>>> don't see anything too tricky about it.
>>>
>>> Yes, exactly. I actually phrased it wrong initially. I think it would
>>> be simple to extend the equality deletes to do this. We just need a
>>> way to have global scope, not just partition scope.
>>>
>> I don't think anything special needs to be done with regards to
>> scoping/partitioning of delete files. When scanning one or more data
>> files, one must also consider any and all deletion files that could
>> apply to them. The only way to prune deletion files from consideration
>> is:
>>
>> 1. All of your data files have at least one partition column in common.
>> 2. The deletion file is also partitioned on that column (at least).
>> 3. The value sets of the data files do not overlap the value sets of
>>    the deletion files in that column.
>>
>> So given a dataset of sessions that is partitioned by device form
>> factor and date, for example, you could have a delete (user_id=9876) in
>> a deletion file that is not partitioned. It would be "in scope" for all
>> of those data files.
>>
>> If you had the same dataset partitioned by hash(user_id) and your
>> deletes were _also_ partitioned by hash(user_id), you would be able to
>> prune those deletes while scanning the sessions.
>>
>>> If we add this on a per-deletion-file basis, it is not clear whether
>>> there is any relevance in preserving the concept of a unique row ID.
>>>
>>> Agreed. That's why I've been steering us away from the debate about
>>> whether keys are unique or not. Either way, a natural key delete must
>>> delete all of the records it matches.
>>>
>>> I would assume that the maximum sequence number should appear in the
>>> table metadata.
>>>
>>> Agreed.
>>>
>>> [W]ould you make it optional to assign a sequence number to a
>>> snapshot? "Replace" snapshots would not need one.
>>>
>>> The only requirement is that it is monotonically increasing. If one
>>> isn't used, we don't have to increment. I'd say it is up to the
>>> implementation to decide. I would probably increment it every time to
>>> avoid errors.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
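As a rough illustration of the three pruning conditions Erik lists above, here is a Python sketch. It assumes each file's partition is represented as a simple column-to-value mapping, with an unpartitioned delete file as an empty mapping; `can_prune_delete_file` is a hypothetical helper, not an Iceberg API.

```python
def can_prune_delete_file(data_partitions: list, delete_partition: dict) -> bool:
    """True only when all three conditions hold: the data files and the
    delete file share a partition column, and the value sets for that
    column are disjoint."""
    if not data_partitions:
        return True  # nothing to scan, so no delete can apply
    # Columns that partition every data file *and* the delete file.
    shared = set(delete_partition).intersection(*(set(p) for p in data_partitions))
    for col in shared:
        data_values = {p[col] for p in data_partitions}
        if delete_partition[col] not in data_values:
            return True  # disjoint value sets on a shared partition column
    return False  # the delete file stays in scope for this scan

# An unpartitioned delete (e.g. user_id=9876) shares no partition column
# with session files partitioned by form factor and date, so it is
# always in scope:
assert not can_prune_delete_file(
    [{"form_factor": "phone", "date": "2019-06-21"}], {})

# When both sides are partitioned by hash(user_id), disjoint buckets let
# the scan skip the delete file entirely:
assert can_prune_delete_file([{"user_hash": 3}], {"user_hash": 7})
```

As the hash(user_id) case suggests, partitioning the deletes the same way as the data is what makes pruning effective; an unpartitioned, globally scoped delete must be considered by every scan.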