Upserts in Iceberg

Ryan Blue Tue, 02 Jul 2019 17:30:50 -0700

Sorry I didn't get back to this thread last week. Let's try to have a video
call to sync up on this next week. What days would work for everyone?


rb

On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <erik.wri...@shopify.com> wrote:

> With regard to the operation values, currently they are:
>
>    - append: data files were added and no files were removed.
>    - replace: data files were rewritten with the same data; i.e.,
>    compaction, changing the data file format, or relocating data files.
>    - overwrite: data files were deleted and added in a logical overwrite
>    operation.
>    - delete: data files were removed and their contents logically deleted.
>
> If deletion files (with or without data files) are appended to the
> dataset, will we consider that an `append` operation? If so, if deletion
> and/or data files are appended, and whole files are also deleted, will we
> consider that an `overwrite`?
>
> Given that the only apparent purpose of the operation field is to optimize
> snapshot expiration, the above seems to meet its needs. An incremental
> reader can also skip `replace` snapshots but no others. Once it decides to
> read a snapshot I don't think there's any difference in how it processes
> the data for append/overwrite/delete cases.
>
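As a rough illustration of the incremental-reader point above, here is a minimal Python sketch. The `Snapshot` class and `snapshots_to_read` function are hypothetical names invented for this example and are not part of any Iceberg API; the only rule taken from the thread is that `replace` snapshots can be skipped and every other operation is read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for a table snapshot; not an Iceberg class.
    snapshot_id: int
    operation: str  # one of "append", "replace", "overwrite", "delete"


def snapshots_to_read(snapshots):
    """Return the snapshots an incremental reader still has to process.

    A `replace` snapshot rewrites files without changing the data, so it is
    skipped. All other operations are kept and processed the same way.
    """
    return [s for s in snapshots if s.operation != "replace"]
```

Calling `snapshots_to_read` on a history that mixes appends, a compaction, and a delete keeps everything except the compaction.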
> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I don’t see that we need [sequence numbers] for file/offset-deletes,
>> since they apply to a specific file. They’re not harmful, but they don’t
>> seem relevant.
>>
>> These delete files will probably contain a path and an offset and could
>> contain deletes for multiple files. In that case, the sequence number can
>> be used to eliminate delete files that don’t need to be applied to a
>> particular data file, just like the column equality deletes. Likewise, it
>> can be used to drop the delete files when there are no data files with an
>> older sequence number.
>>
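The two uses of the sequence number described above can be sketched in Python as follows. `DeleteFile`, `deletes_to_apply`, and `droppable_deletes` are hypothetical names for this example only. The sketch assumes a delete file affects only data files with a strictly older sequence number; the thread does not say how equal sequence numbers should be treated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteFile:
    # Hypothetical stand-in for a delete file; not an Iceberg class.
    path: str
    seq: int  # sequence number assigned when the delete file was committed


def deletes_to_apply(data_file_seq, delete_files):
    """Keep only the delete files that can affect a given data file.

    Assumption: a delete applies only to data files that are strictly older
    than it, so delete files with seq <= data_file_seq are eliminated.
    """
    return [d for d in delete_files if d.seq > data_file_seq]


def droppable_deletes(data_file_seqs, delete_files):
    """Return delete files for which no data file has an older sequence number."""
    return [
        d for d in delete_files
        if not any(s < d.seq for s in data_file_seqs)
    ]
```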
>> I don’t understand the purpose of the min sequence number, nor what the
>> “min data seq” is.
>>
>> Min sequence number would be used for pruning delete files without
>> reading all the manifests to find out if there are old data files. If no
>> manifest with data for a partition contains a file older than some sequence
>> number N, then any delete file with a sequence number < N can be removed.
>>
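A minimal Python sketch of the min-sequence-number pruning rule stated above. `ManifestSummary`, `ScopedDeleteFile`, and `removable_delete_files` are hypothetical names, and modeling the minimum sequence number as a per-manifest, per-partition attribute is an assumption of this example rather than a settled design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestSummary:
    # Hypothetical per-manifest summary; not an Iceberg class.
    partition: str
    min_data_seq: int  # smallest sequence number of any data file it lists


@dataclass(frozen=True)
class ScopedDeleteFile:
    # Hypothetical delete file scoped to one partition.
    path: str
    partition: str
    seq: int


def removable_delete_files(manifests, delete_files):
    """Find delete files that can be removed without reading data files.

    For each partition, N is the smallest min_data_seq across its manifests.
    No data file in that partition is older than N, so any delete file with
    seq < N can be removed.
    """
    oldest = {}
    for m in manifests:
        current = oldest.get(m.partition, m.min_data_seq)
        oldest[m.partition] = min(current, m.min_data_seq)
    # Assumption: partitions with no data manifests are left alone, since
    # the thread does not cover that case.
    return [
        d for d in delete_files
        if d.partition in oldest and d.seq < oldest[d.partition]
    ]
```

Only the manifest summaries are consulted here, which is the point of the proposal: the individual data file entries never have to be read.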
> OK, so the minimum sequence number is an attribute of manifest files.
> Sounds good. It can likely permit us to optimize compaction operations as
> well (i.e., you can easily limit the operation to a subset of manifest
> files as long as they are the oldest ones).
>
>
>> The “min data seq” is the minimum sequence number of a data file. That
>> seems like what we actually want for the pruning I described above.
>>
> I would expect a data file (appended rows or deletions by column value) to
> have a single sequence number that applies to the whole file. Even a
> delete-by-file-and-offset file can do with only a single sequence number
> (which must be larger than the sequence numbers of all deleted files). Why
> do we need a "minimum" data sequence per file?
>
>> Off the top of my head [supporting non-key delete] requires adding
>> additional information to the manifest file, indicating the columns that
>> are used for the deletion. Only equality would be supported; if multiple
>> columns were used, they would be combined with boolean-and. I don’t see
>> anything too tricky about it.
>>
>> Yes, exactly. I actually phrased it wrong initially. I think it would be
>> simple to extend the equality deletes to do this. We just need a way to
>> have global scope, not just partition scope.
>>
> I don't think anything special needs to be done with regard to
> scoping/partitioning of delete files. When scanning one or more data files,
> one must also consider any and all deletion files that could apply to them.
> The only way to prune deletion files from consideration is:
>
>    1. All of your data files have at least one partition column in common.
>    2. The deletion file is also partitioned on that column (at least).
>    3. The value sets of the data files do not overlap the value sets of
>    the deletion files in that column.
>
> So given a dataset of sessions that is partitioned by device form factor
> and date, for example, you could have a delete (user_id=9876) in a deletion
> file that is not partitioned. And it would be "in scope" for all of those
> data files.
>
> If you had the same dataset partitioned by hash(user_id) and your deletes
> were _also_ partitioned by hash(user_id) you would be able to prune those
> deletes while scanning the sessions.
>
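The three pruning conditions listed above can be sketched in Python as follows. Representing each file's partition information as a dict from column name to the set of values it covers is an assumption of this example, and `can_prune_delete_file` is a hypothetical name. The sketch assumes at least one data file is being scanned.

```python
def can_prune_delete_file(delete_partition, data_partitions):
    """Decide whether a deletion file can be ignored for a scan.

    delete_partition: dict mapping partition column -> set of values covered
                      by the deletion file ({} if it is not partitioned).
    data_partitions:  list of such dicts, one per data file being scanned.
    """
    # Conditions 1 and 2: columns shared by every data file and the deletion file.
    common = set(delete_partition)
    for partition in data_partitions:
        common &= set(partition)

    # Condition 3: on some shared column, the value sets do not overlap.
    for column in common:
        data_values = set()
        for partition in data_partitions:
            data_values |= partition[column]
        if data_values.isdisjoint(delete_partition[column]):
            return True
    return False
```

With this model, an unpartitioned delete such as `user_id=9876` has an empty `delete_partition`, so it can never be pruned and stays in scope for every data file, matching the sessions example above.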
>> If we add this on a per-deletion file basis it is not clear if there is
>> any relevance in preserving the concept of a unique row ID.
>>
>> Agreed. That’s why I’ve been steering us away from the debate about
>> whether keys are unique or not. Either way, a natural key delete must
>> delete all of the records it matches.
>>
>> I would assume that the maximum sequence number should appear in the
>> table metadata
>>
>> Agreed.
>>
>> [W]ould you make it optional to assign a sequence number to a snapshot?
>> “Replace” snapshots would not need one.
>>
>> The only requirement is that it is monotonically increasing. If one isn’t
>> used, we don’t have to increment. I’d say it is up to the implementation to
>> decide. I would probably increment it every time to avoid errors.
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
