I think we just need to have further discussion about keys. Ryan said:

3. Synthetic keys should be based on filename and position


But I'm not clear that there is consensus around that. I'm also not sure
whether he means lossless inclusion, simply derived-from, or something else.
My thinking has been that you must have synthetic keys in many cases, since
solving concurrency becomes challenging otherwise.

Erik said:

> C might not be partitioned or sorted by the natural key, either, which
> means that finding the synthetic key can be expensive.


But I'm not sure why this needs to be true. If we give control over the
creation of the synthetic key, wouldn't that resolve this issue?

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, May 21, 2019 at 7:54 AM Erik Wright <erik.wri...@shopify.com.invalid>
wrote:

> On Thu, May 16, 2019 at 4:13 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Replies inline.
>>
>> On Thu, May 16, 2019 at 10:07 AM Erik Wright <erik.wri...@shopify.com>
>> wrote:
>>
>>> I would be happy to participate. Iceberg with merge-on-read capabilities
>>> is a technology choice that my team is actively considering. It appears
>>> that our scenario differs meaningfully from the one that Anton and Miguel
>>> are considering. It would be great to take the time to compare the two and
>>> see if there is a single implementation that can meet the needs of each
>>> scenario.
>>>
>>
>> Can you be more specific about where the use cases differ meaningfully? I
>> think that if we agree that operations on natural keys can be implemented
>> using synthetic keys to encode deletes (#2), then everyone is aligned on
>> the core parts of a design. We can figure out the implications of how
>> synthetic keys are encoded, but I don't see that issue (#3) having a huge
>> impact on use cases. So is #2 the main disagreement?
>>
>
> We are mainly interested in upserts and deletes by natural key. We are not
> interested in more powerful queries of the types mentioned in the doc.
>
> On the other hand, in our case, the upserts and deletes are generated
> upstream. In other words, I have an incremental transformation job J1 with
> inputs A and B, producing an output C. I can generate the upserts and
> deletes directly from my inputs, without referring to the current state of
> C.
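To make the pattern concrete, here is a minimal Python sketch of an
incremental job that derives upserts and deletes for C purely from its
inputs, never reading C itself. All names, the join semantics, and the
"missing in B means delete" rule are invented for illustration.

def incremental_step(new_a_rows, new_b_rows):
    """Produce (op, natural_key, row) changes for C from new input rows only."""
    changes = []
    b_by_key = {r["key"]: r for r in new_b_rows}
    for a in new_a_rows:
        b = b_by_key.get(a["key"])
        if b is None:
            # Hypothetical rule: no matching B row means the key is deleted from C.
            changes.append(("delete", a["key"], None))
        else:
            # Join A and B and upsert the combined row into C by natural key.
            changes.append(("upsert", a["key"], {**a, **b}))
    return changes

# The job appends these changes to C as a new delta, without scanning C
# to locate where the affected rows currently live.
print(incremental_step(
    [{"key": 1, "x": 10}, {"key": 2, "x": 20}],
    [{"key": 1, "y": "a"}],
))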
>
> C might not be partitioned or sorted by the natural key, either, which
> means that finding the synthetic key can be expensive.
>
> In our bespoke model, we are able to generate a new delta (and append it
> to the dataset) without referring to the existing base or existing deltas.
>
> Another apparent divergence is that my jobs are incremental and need to be
> able to track the state they have seen so far in order to consume only new
> deltas in their next execution. So if job J2 consumes C, it needs to know
> that the deltas since its last read will not have been compacted into the
> base. This implies support for multiple generations of deltas, as different
> consumers could be at different points in the stream of deltas.
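A rough sketch of the "generations of deltas" idea, with invented
structures: each consumer remembers the last delta generation it has
processed, and deltas newer than the slowest consumer's position must not
be folded into the base, or incremental readers would lose them.

dataset_c = {
    "base_generation": 3,                # deltas <= 3 are folded into the base
    "delta_generations": [4, 5, 6],      # still available as separate deltas
}

consumer_positions = {"J2": 4, "J3": 6}  # last generation each consumer has read

def max_compactable_generation(consumers):
    # Compaction may only fold in deltas up to the slowest consumer's position.
    return min(consumers.values())

def new_deltas_for(consumer, dataset, consumers):
    last_seen = consumers[consumer]
    return [g for g in dataset["delta_generations"] if g > last_seen]

print(max_compactable_generation(consumer_positions))       # 4
print(new_deltas_for("J2", dataset_c, consumer_positions))  # [5, 6]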
>
>
>> On Wed, May 15, 2019 at 3:55 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> *2. Iceberg diff files should use synthetic keys*
>>>>
>>>> A lot of the discussion on the doc is about whether natural keys are
>>>> practical or what assumptions we can make or trade about them. In my
>>>> opinion, Iceberg tables will absolutely need natural keys for reasonable
>>>> use cases. And those natural keys will need to be unique. And Iceberg will
>>>> need to rely on engines to enforce that uniqueness.
>>>>
>>>> But, there is a difference between table behavior and implementation.
>>>> We can use synthetic keys to implement the requirements of natural keys.
>>>> Each row should be identified by its file and position in a file. When
>>>> deleting by a natural key, we just need to find out what the synthetic key
>>>> is and encode that in the delete diff.
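For illustration, a minimal Python sketch of this idea, using hypothetical
in-memory structures: rows are addressed by a synthetic key of (data file,
position), and a delete by natural key is encoded by first resolving the
natural key to that synthetic key.

data_files = {
    "data/part-00000.parquet": [
        {"id": "u1", "name": "alice"},   # position 0
        {"id": "u2", "name": "bob"},     # position 1
    ],
}

def resolve_synthetic_key(natural_key):
    # In practice this requires scanning or indexing the data files.
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if row["id"] == natural_key:
                return (path, pos)
    raise KeyError(natural_key)

def encode_delete(natural_key):
    # The delete diff stores only the synthetic key, not the natural key.
    file_path, position = resolve_synthetic_key(natural_key)
    return {"file": file_path, "pos": position}

print(encode_delete("u2"))  # {'file': 'data/part-00000.parquet', 'pos': 1}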
>>>>
>>> This comment has important implications for the effort required to
>>> generate delete diff files. I've tried to cover why in comments I added
>>> today to the doc, but it could also be a topic of the hangout.
>>>
>>
>> Do you mean that you can't encode a delete without reading data to locate
>> the affected rows?
>>
>
> Yes.
>
>
>>
>>>> *3. Synthetic keys should be based on filename and position*
>>>>
>>>> I think identifying the file in a synthetic key makes a lot of sense.
>>>> This would allow for delta file reuse as individual files are rewritten by
>>>> a “major” compaction and provides nice flexibility that fits with the
>>>> format. We will need to think through all the impacts, like how file
>>>> relocation works (e.g., moving files between regions) and the requirements
>>>> rewrites (must apply the delta when rewriting).
>>>>
>>> I'm confused. I feel like specifying the filename has the opposite
>>> effect. One of the biggest advantages of Iceberg is the decoupling of a
>>> dataset from the physical location of its constituent files. If a delta
>>> file encodes the filename of the row that it updates/deletes, you are
>>> putting a significant constraint on the way that an implementation can
>>> manipulate those files later.
>>>
>>
>> If I understand your concern, it is that we are encoding a file location
>> in the delete diff. We could solve this with a level of indirection like an
>> ID for data files in table metadata. So there are options to make sure we
>> can still move files and data around.
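A sketch of that indirection, with invented names: delete diffs reference a
stable file ID, and table metadata maps that ID to the file's current
physical location, so files can be relocated without rewriting the diffs.

table_metadata = {
    "files": {101: "s3://bucket-a/data/part-00000.parquet"},
}

delete_diff = [{"file_id": 101, "pos": 7}]   # references the ID, not the path

def relocate_file(metadata, file_id, new_path):
    # Moving data (e.g., between regions) only touches the metadata mapping.
    metadata["files"][file_id] = new_path

def resolve(metadata, delete_entry):
    return metadata["files"][delete_entry["file_id"]], delete_entry["pos"]

relocate_file(table_metadata, 101, "s3://bucket-eu/data/part-00000.parquet")
print(resolve(table_metadata, delete_diff[0]))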
>>
>
>> What I like about using a filename or file-specific identifier is that
>> the deltas are tied to a particular file. When that file is deleted, the
>> delta no longer needs to be carried forward. So if I have a maintenance job
>> that is compacting small files, it must apply deletes when rewriting all of
>> those files. But we don't have to rewrite or replace the delete diff file
>> because its deletes were scoped to a file that no longer exists.
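A rough sketch of that lifecycle, again with invented structures: a
compaction that rewrites data files applies the deletes scoped to them, and
once those files are gone their delete diffs are simply dropped rather than
carried forward.

def compact(data_files, delete_diffs):
    """data_files: {path: [rows]}, delete_diffs: {path: set of positions}."""
    merged_rows = []
    for path, rows in data_files.items():
        deleted_positions = delete_diffs.get(path, set())
        merged_rows.extend(
            row for pos, row in enumerate(rows) if pos not in deleted_positions
        )
    # The rewritten file has no pending deletes, and the old diffs are not
    # copied over because the files they were scoped to no longer exist.
    new_files = {"data/compacted-00000.parquet": merged_rows}
    remaining_diffs = {}
    return new_files, remaining_diffs

files = {
    "data/part-00000.parquet": ["r0", "r1", "r2"],
    "data/part-00001.parquet": ["r3"],
}
diffs = {"data/part-00000.parquet": {1}}
print(compact(files, diffs))  # "r1" removed, no diffs remain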
>>
>
> My intuition is that you overestimate the complexity of maintaining a
> sequence of deltas and underestimate the constraints that would be imposed
> by the "level of indirection like an ID for data files in table metadata".
> Prior to this proposal, there were many legitimate and useful things one
> could do to the files making up a dataset between two snapshots. For
> example, one could split a file into multiple files using a new partition
> scheme. How does the level of indirection work when what was previously one
> file becomes five?
>
> This is a good way around an ugly problem of knowing when to apply a
>> particular delete diff. Say we are using a UUID column for a natural key.
>> Then deletes just need to encode a set of UUIDs. But when a row is
>> upserted, it gets deleted and then re-appended with the same (natural) UUID
>> key. So we have to scope the delete to just the file that contained the
>> original row and not the inserted data file. It is tempting to use snapshot
>> ordering for this (delete diff applies to versions < V) but we eventually
>> lose the order of snapshots when they expire. We could also use insert diff
>> files, but then replacing the same row more than once has the same problem
>> (how does a delete apply to an insert diff?). Then to correctly encode
>> state, we would have to either solve the order problem or require a minor
>> compaction.
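A small worked example of that scoping problem, using made-up files: an
upsert of natural key "u1" is a delete plus a re-append of the same key, so
a delete that is not scoped to the original file would also hit the new row.

files = {
    "part-0.parquet": [{"id": "u1", "v": 1}],   # original row
    "part-1.parquet": [{"id": "u1", "v": 2}],   # re-appended (upserted) row
}

# Unscoped: deleting by natural key alone removes both versions -- wrong.
unscoped_delete = {"id": "u1"}
survivors_unscoped = [
    r for rows in files.values() for r in rows if r["id"] != unscoped_delete["id"]
]
print(survivors_unscoped)   # [] -- the upserted value is lost

# File-scoped: the delete applies only to the file holding the old row.
scoped_delete = {"file": "part-0.parquet", "id": "u1"}
survivors_scoped = [
    r
    for path, rows in files.items()
    for r in rows
    if not (path == scoped_delete["file"] and r["id"] == scoped_delete["id"])
]
print(survivors_scoped)     # [{'id': 'u1', 'v': 2}] -- only the old row is gone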
>>
>
> > lose the order of snapshots when they expire
>
> Generations of data can be considered distinct from snapshots. Within a
> given snapshot, the files would be grouped by generation. The rules for
> expiring snapshots would be orthogonal to the rules for compacting
> generations. And the rules for GC'ing individual files would be unaffected.
>
> Scoping the delete diffs using a file solves this problem by tying the
>> lifecycle of deltas to the lifecycle of files. If a file is live in the
>> table, then all deletes for that file must be applied when reading or
>> compacting it. When compacting, you can either write a delete diff for the
>> new file or remove the deleted records. That requirement seems pretty
>> reasonable to me.
>>
>
> Iceberg has always had to consider which files need to be accessed in
> order to find all data relevant to a query. Applying deltas to a base
> (whether for merge-on-read or for compaction) is no different. One first
> translates filter predicates to partition predicates, picks up all
> partitions that might contain relevant data, applies the remaining filter
> predicates, and then collapses the base data with subsequent deltas on the
> fly.
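A condensed Python sketch of that read path, with invented structures:
prune partitions with the filter, then collapse base rows with any later
deltas for those partitions on the fly.

base = {   # partition value -> base rows
    "2019-05-20": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "2019-05-21": [{"id": 3, "v": "c"}],
}
deltas = [  # ordered deltas: (partition, op, row)
    ("2019-05-20", "delete", {"id": 2}),
    ("2019-05-20", "upsert", {"id": 1, "v": "a2"}),
]

def read(partition_predicate, row_predicate):
    results = []
    for partition, rows in base.items():
        if not partition_predicate(partition):       # partition pruning
            continue
        merged = {r["id"]: r for r in rows}
        for part, op, row in deltas:                 # apply deltas on the fly
            if part != partition:
                continue
            if op == "delete":
                merged.pop(row["id"], None)
            else:
                merged[row["id"]] = row
        results.extend(r for r in merged.values() if row_predicate(r))
    return results

print(read(lambda p: p == "2019-05-20", lambda r: True))
# [{'id': 1, 'v': 'a2'}]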
>
