Re: Updates/Deletes/Upserts in Iceberg

Ryan Blue Wed, 03 Jul 2019 11:01:54 -0700

How about 9AM PDT on Friday, 5 July then?

On Wed, Jul 3, 2019 at 10:55 AM Owen O'Malley <[email protected]>
wrote:


> I'd like to call in, but I'm out Thursday. Friday would work except 11am
> to 1pm pdt.
>
> .. Owen
>
> On Wed, Jul 3, 2019 at 10:42 AM Ryan Blue <[email protected]>
> wrote:
>
>> I'm available Thursday and Friday this week as well, but it's a holiday
>> in the US so some people may be out. If there are no objections from anyone
>> that would like to attend, then I'm up for that.
>>
>> On Wed, Jul 3, 2019 at 10:40 AM Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> I apologize for the delay on my side. I’ll still have to go through the
>>> last emails. I am available on Thursday/Friday this week and it would be
>>> great to sync.
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 3 Jul 2019, at 01:29, Ryan Blue <[email protected]> wrote:
>>>
>>> Sorry I didn't get back to this thread last week. Let's try to have a
>>> video call to sync up on this next week. What days would work for everyone?
>>>
>>> rb
>>>
>>> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <[email protected]>
>>> wrote:
>>>
>>>> With regard to operation values, currently they are:
>>>>
>>>>    - append: data files were added and no files were removed.
>>>>    - replace: data files were rewritten with the same data; i.e.,
>>>>    compaction, changing the data file format, or relocating data files.
>>>>    - overwrite: data files were deleted and added in a logical
>>>>    overwrite operation.
>>>>    - delete: data files were removed and their contents logically
>>>>    deleted.
>>>>
>>>> If deletion files (with or without data files) are appended to the
>>>> dataset, will we consider that an `append` operation? If so, if deletion
>>>> and/or data files are appended, and whole files are also deleted, will we
>>>> consider that an `overwrite`?
>>>>
>>>> Given that the only apparent purpose of the operation field is to
>>>> optimize snapshot expiration, the above seems to meet its needs. An
>>>> incremental reader can also skip `replace` snapshots but no others. Once it
>>>> decides to read a snapshot I don't think there's any difference in how it
>>>> processes the data for append/overwrite/delete cases.
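
The incremental-reader rule described above can be sketched roughly as follows. This is only an illustration, not Iceberg's API: `Snapshot` and `snapshots_to_read` are hypothetical names, and only the four operation values come from the list in the message.

```python
from dataclasses import dataclass

# The four operation values listed in the message.
OPERATIONS = ("append", "replace", "overwrite", "delete")


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not Iceberg's class)."""
    snapshot_id: int
    operation: str


def snapshots_to_read(snapshots):
    """Return the snapshots an incremental reader has to process.

    `replace` snapshots rewrite files without changing the data, so they
    are skipped; every other operation is read the same way.
    """
    for snap in snapshots:
        if snap.operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {snap.operation!r}")
    return [snap for snap in snapshots if snap.operation != "replace"]


history = [
    Snapshot(1, "append"),
    Snapshot(2, "replace"),
    Snapshot(3, "overwrite"),
    Snapshot(4, "delete"),
]
print([s.snapshot_id for s in snapshots_to_read(history)])
```
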
>>>>
>>>> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> I don’t see that we need [sequence numbers] for file/offset-deletes,
>>>>> since they apply to a specific file. They’re not harmful, but they don’t
>>>>> seem relevant.
>>>>>
>>>>> These delete files will probably contain a path and an offset and
>>>>> could contain deletes for multiple files. In that case, the sequence number
>>>>> can be used to eliminate delete files that don’t need to be applied to a
>>>>> particular data file, just like the column equality deletes. Likewise, it
>>>>> can be used to drop the delete files when there are no data files with an
>>>>> older sequence number.
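
A minimal sketch of that filtering, assuming a delete file applies only to data files with a strictly smaller sequence number (whether equal sequence numbers should also match is not settled in this thread). The function name and the file names are hypothetical.

```python
from typing import Dict, List


def deletes_to_apply(data_file_seq: int, delete_files: Dict[str, int]) -> List[str]:
    """Return the names of delete files that may affect one data file.

    `delete_files` maps a delete file name to its sequence number.
    Assumption: only delete files written after the data file (larger
    sequence number) can contain deletes for it.
    """
    return sorted(
        name for name, seq in delete_files.items() if seq > data_file_seq
    )


delete_files = {"del-a": 3, "del-b": 7}
print(deletes_to_apply(5, delete_files))  # only the newer delete file remains
```
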
>>>>>
>>>>> I don’t understand the purpose of the min sequence number, nor what
>>>>> the “min data seq” is.
>>>>>
>>>>> Min sequence number would be used for pruning delete files without
>>>>> reading all the manifests to find out if there are old data files. If no
>>>>> manifest with data for a partition contains a file older than some sequence
>>>>> number N, then any delete file with a sequence number < N can be removed.
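
The pruning rule can be illustrated with a short sketch. It assumes each manifest records the minimum sequence number of its data files, so N can be computed without opening the manifests; `removable_deletes` and the sample values are hypothetical.

```python
from typing import Dict, List


def removable_deletes(
    manifest_min_seqs: List[int], delete_files: Dict[str, int]
) -> List[str]:
    """Return delete files that no remaining data file can need.

    `manifest_min_seqs` holds the per-manifest minimum data sequence
    number for one partition. min() raises ValueError on an empty list;
    the case with no data manifests is left out of this sketch.
    """
    n = min(manifest_min_seqs)  # no data file is older than N
    return sorted(name for name, seq in delete_files.items() if seq < n)


manifest_min_seqs = [4, 6, 9]
delete_files = {"del-a": 2, "del-b": 4, "del-c": 8}
print(removable_deletes(manifest_min_seqs, delete_files))
```
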
>>>>>
>>>> OK, so the minimum sequence number is an attribute of manifest files.
>>>> Sounds good. It can likely permit us to optimize compaction operations as
>>>> well (i.e., you can easily limit the operation to a subset of manifest
>>>> files as long as they are the oldest ones).
>>>>
>>>>
>>>>> The “min data seq” is the minimum sequence number of a data file. That
>>>>> seems like what we actually want for the pruning I described above.
>>>>>
>>>> I would expect a data file (appended rows or deletions by column value)
>>>> to have a single sequence number that applies to the whole file. Even a
>>>> delete-by-file-and-offset file can do with only a single sequence number
>>>> (which must be larger than the sequence numbers of all deleted files). Why
>>>> do we need a "minimum" data sequence per file?
>>>>
>>>>> Off the top of my head [supporting non-key delete] requires adding
>>>>> additional information to the manifest file, indicating the columns that
>>>>> are used for the deletion. Only equality would be supported; if multiple
>>>>> columns were used, they would be combined with boolean-and. I don’t see
>>>>> anything too tricky about it.
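
As a rough sketch of the semantics described here, each delete can be modelled as a mapping from column name to value, with the columns combined by boolean-and. Every matching row is removed, whether or not the columns form a unique key. The names below are hypothetical, and treating an empty delete as matching nothing is an assumption of this sketch.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def matches(row: Row, delete: Row) -> bool:
    """True if the row equals the delete on every listed column (AND)."""
    # Assumption: a delete with no columns matches nothing.
    return bool(delete) and all(
        col in row and row[col] == val for col, val in delete.items()
    )


def apply_equality_deletes(rows: List[Row], deletes: List[Row]) -> List[Row]:
    """Drop every row matched by at least one equality delete."""
    return [row for row in rows if not any(matches(row, d) for d in deletes)]


rows = [
    {"user_id": 9876, "device": "phone"},
    {"user_id": 9876, "device": "tablet"},
    {"user_id": 1234, "device": "phone"},
]
# A single-column delete removes both rows for user 9876.
print(apply_equality_deletes(rows, [{"user_id": 9876}]))
```
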
>>>>>
>>>>> Yes, exactly. I actually phrased it wrong initially. I think it would
>>>>> be simple to extend the equality deletes to do this. We just need a way to
>>>>> have global scope, not just partition scope.
>>>>>
>>>> I don't think anything special needs to be done with regard to
>>>> scoping/partitioning of delete files. When scanning one or more data files,
>>>> one must also consider any and all deletion files that could apply to them.
>>>> The only way to prune deletion files from consideration is:
>>>>
>>>>    1. All of your data files have at least one partition column in
>>>>    common.
>>>>    2. The deletion file is also partitioned on that column (at least).
>>>>    3. The value sets of the data files do not overlap the value sets
>>>>    of the deletion files in that column.
>>>>
>>>> So given a dataset of sessions that is partitioned by device form
>>>> factor and date, for example, you could have a delete (user_id=9876) in a
>>>> deletion file that is not partitioned. And it would be "in scope" for all
>>>> of those data files.
>>>>
>>>> If you had the same dataset partitioned by hash(user_id) and your
>>>> deletes were _also_ partitioned by hash(user_id) you would be able to prune
>>>> those deletes while scanning the sessions.
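
The three pruning conditions and the two examples above can be sketched as follows. Partition metadata is modelled as a mapping from partition column to the set of values present; this representation and all names in the block are assumptions made for illustration.

```python
from typing import Any, Dict, Set

PartitionValues = Dict[str, Set[Any]]  # partition column -> values present


def can_skip_deletion_file(
    data_values: PartitionValues, delete_values: PartitionValues
) -> bool:
    """True if a deletion file cannot apply to the scanned data files.

    Pruning needs a partition column shared by the data files and the
    deletion file whose value sets do not overlap. An unpartitioned
    deletion file (empty mapping) is always in scope.
    """
    for column, data_set in data_values.items():
        delete_set = delete_values.get(column)
        if delete_set is not None and data_set.isdisjoint(delete_set):
            return True
    return False


# Sessions partitioned by form factor and date; the delete is unpartitioned.
sessions = {"form_factor": {"phone"}, "date": {"2019-06-21"}}
print(can_skip_deletion_file(sessions, {}))

# Both sides partitioned by a hash bucket of user_id, in different buckets.
print(can_skip_deletion_file({"user_id_bucket": {3}}, {"user_id_bucket": {7}}))
```
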
>>>>
>>>>> If we add this on a per-deletion file basis it is not clear if there
>>>>> is any relevance in preserving the concept of a unique row ID.
>>>>>
>>>>> Agreed. That’s why I’ve been steering us away from the debate about
>>>>> whether keys are unique or not. Either way, a natural key delete must
>>>>> delete all of the records it matches.
>>>>>
>>>>> I would assume that the maximum sequence number should appear in the
>>>>> table metadata
>>>>>
>>>>> Agreed.
>>>>>
>>>>> [W]ould you make it optional to assign a sequence number to a
>>>>> snapshot? “Replace” snapshots would not need one.
>>>>>
>>>>> The only requirement is that it is monotonically increasing. If one
>>>>> isn’t used, we don’t have to increment. I’d say it is up to the
>>>>> implementation to decide. I would probably increment it every time to avoid
>>>>> errors.
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
