First, I want to clear up some of the confusion with a bit of background on
how Iceberg works, but I have inline replies below as well.

Conceptually, a snapshot is a table version. Each snapshot is the complete
table state at some point in time and each snapshot is independent.
Independence is a key property of snapshots because it is what makes it
possible to maintain table data and metadata without rewriting other
snapshots.

Snapshots track additions and deletions using per-file status (ADDED/
EXISTING/DELETED) and the snapshot ID in which each file was added or
deleted.
This can be efficiently loaded and used to process changes incrementally.
Snapshots are also annotated with the operation that created each snapshot.
This is used to identify snapshots with no logical changes (replace) and
snapshots with no deletions (append) to speed up incremental consumption
and operations like snapshot expiration.
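
To make that concrete, here is a rough Java sketch of an incremental
consumer. The addedFiles/deletedFiles accessors and the DataOperations
constants are what I'm assuming from the Snapshot API; ordering by commit
timestamp and the process stubs are only for illustration.

// A minimal sketch of incremental consumption, assuming the Snapshot API's
// addedFiles()/deletedFiles() accessors; treat the method names as
// illustrative rather than definitive.
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class IncrementalConsumer {
  // Process every snapshot committed after the last one this consumer saw.
  // Ordering by commit timestamp is a simplification for the sketch.
  public static void consume(Table table, long lastProcessedTimestampMillis) {
    for (Snapshot snap : table.snapshots()) {
      if (snap.timestampMillis() <= lastProcessedTimestampMillis) {
        continue; // already consumed
      }

      if (DataOperations.REPLACE.equals(snap.operation())) {
        continue; // compaction or rewrite: no logical changes to consume
      }

      for (DataFile added : snap.addedFiles()) {
        process(added); // rows added by this snapshot
      }

      if (!DataOperations.APPEND.equals(snap.operation())) {
        for (DataFile removed : snap.deletedFiles()) {
          processDelete(removed); // rows removed by this snapshot
        }
      }
    }
  }

  private static void process(DataFile file) { /* application-specific */ }
  private static void processDelete(DataFile file) { /* application-specific */ }
}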

Readers use the current snapshot for isolation from writes. Because readers
may be using an old snapshot, Iceberg keeps snapshots around and only
physically deletes files when snapshots expire — data files referenced in a
valid snapshot cannot be deleted. Independence is important so that
snapshots can expire and get cleaned up without any other actions on the
table data.
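
For example, a maintenance job might expire snapshots with something like
the sketch below. The 7-day cutoff is only an example, not a
recommendation.

// A minimal sketch of snapshot expiration. Expiring removes old snapshots
// from table metadata and physically deletes files that are no longer
// referenced by any remaining snapshot.
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;

public class ExpireOldSnapshots {
  public static void expire(Table table) {
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
    table.expireSnapshots()
        .expireOlderThan(cutoff)
        .commit();
  }
}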

Consider the alternative: if each snapshot depended on the previous one,
then Iceberg could not expire snapshots and physically delete data. Here’s
an example snapshot history:

S1
- add file_a.parquet
- add file_b.parquet
S2
- delete file_a.parquet
S3
- add file_c.parquet

To delete file_a.parquet, S1 and S2 would need to be rewritten into an S2’
that contains only file_b.parquet. That would work, but it takes more time
than simply dropping references and deleting files that are no longer used.
An expensive metadata rewrite just to delete data isn’t ideal, and skipping
the rewrite makes metadata stack up in the table metadata file, which gets
rewritten often.

Finally, Iceberg relies on regular metadata maintenance — manifest
compaction — to reduce both write volume and the number of file reads
needed to plan a scan. Snapshots also reuse metadata files from previous
snapshots to reduce write volume.
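
As a rough illustration, manifest merging on commit can be tuned through
table properties. The property names below are the ones I recall from
TableProperties; the values are examples only.

// A hedged sketch of tuning manifest merging through table properties.
import org.apache.iceberg.Table;

public class TuneManifestMerging {
  public static void configure(Table table) {
    table.updateProperties()
        // merge once this many small manifests accumulate (example value)
        .set("commit.manifest.min-count-to-merge", "10")
        // target size for merged manifest files (example value)
        .set("commit.manifest.target-size-bytes",
            String.valueOf(8 * 1024 * 1024))
        .commit();
  }
}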

On Wed, Jun 12, 2019 at 9:12 AM Erik Wright <erik.wri...@shopify.com> wrote:


>>    - Maintaining separate manifest lists for each version requires
>>    keeping versions in separate metadata trees, so we can’t merge manifest
>>    files together across versions. We initially had this problem when we used
>>    only fast appends for our streaming writes. We ended up with way too many
>>    metadata files and job planning took a really long time.
>>
> If I understand correctly, the main concern here is for scenarios where
> versions are essentially micro-batches. They accumulate rapidly, each one
> is relatively small, and there may be many valid versions at a time even if
> expiry happens regularly. The negative impact is many small versions cannot
> be combined together. Even the manifest list files are many, and all of
> them need to be read.
>
I’m not sure how you define micro-batch, but this happens when commits are
on the order of every 10 minutes and consist of large data files. Manifest
compaction is required to avoid thousands of small metadata files impacting
performance.


>>    - Although snapshots are independent and can be aged off, this
>>    requires rewriting the base version to expire versions. That means we
>>    can’t delete old data files until versions are rewritten, so the
>>    problem affects versions instead of snapshots. We would still have the
>>    problem of not being able to delete old data until after a compaction.
>>
> I don't quite understand this point, so forgive me if the following is
> irrelevant or obvious.
>
> You are correct that snapshots become less useful as a tool, compared to
> versions, for expiring data. The process now goes through (1) expire a
> version and then (2) invalidate the last snapshot containing that version.
> Of course, fundamentally you need to keep version X around until you are
> willing to compact delta Y (or more) into it. And this is (I believe) our
> common purpose in this discussion: permit a trade-off between write
> amplification and storage/read costs.
>
Hopefully the background above is useful. Requiring expensive data-file
compaction in order to compact version metadata or delete files is a big
problem.

And you’re right that snapshots are less useful. Because snapshots and
versions are basically the same idea, we don’t need both. If we were to add
versions, they should replace snapshots. But I don’t think it is a good
idea to give up the independence.

The good news is that we can use snapshots exactly like versions and avoid
these problems by adding sequence numbers.

> The value of decoupling versions from snapshots is:
>
> 1. It makes the process of correctly applying multiple deltas easy to
> reason about.
> 2. It allows us to issue updates that don't introduce new versions (i.e.,
> those that merely expire old versions, or that perform other optimizations).
> 3. It enables clients to cleanly consume new deltas when those clients are
> sophisticated enough to incrementally update their own outputs in function
> of those changes.
>
Points 2 and 3 are already available, and point 1 can be done with sequence
numbers in snapshot metadata.

>> Here are some benefits to this approach:
>>
>>    - There is no explicit version that must be expired so compaction is
>>    only required to optimize reads — partition or file-level delete can
>>    coexist with deltas
>>
> Can you expand on this?
>
By adding a sequence number instead of explicit versions, we embed the
essential version information in the existing metadata files. Those files
are reused across snapshots, so there is no need to compact deltas into
data files to reduce write volume.

Another way of thinking about it: if versions are stored as data/delete
file metadata and not in the table metadata file, then the complete history
can be kept.

It is also compatible with file-level deletes used to age off old data
(e.g. delete where day(ts) < '2019-01-01'). When files match a delete, they
are removed from metadata and physically deleted when the delete snapshot
is expired. This is independent of the delete files that get applied on
read. In fact, if a delete file’s sequence number is less than the
sequence numbers of all data files in a partition, it matches nothing and
can be removed without ever being compacted.
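
To make the sequence number idea concrete, here is a hypothetical model
(invented names, not anything in Iceberg today) of the two rules: a delete
applies to data files with a lower sequence number, and a delete that is
older than every data file in its partition can be dropped.

// A hypothetical model of sequence-number rules; field and method names
// are invented for illustration only.
import java.util.List;

public class SequenceNumberRules {
  static class FileEntry {
    final String path;
    final long sequenceNumber; // assigned by the commit that added the file

    FileEntry(String path, long sequenceNumber) {
      this.path = path;
      this.sequenceNumber = sequenceNumber;
    }
  }

  // A delete applies only to data files committed before it.
  static boolean applies(FileEntry deleteFile, FileEntry dataFile) {
    return dataFile.sequenceNumber < deleteFile.sequenceNumber;
  }

  // If no data file in the partition is older than the delete, the delete
  // matches nothing and can be dropped without compacting any data.
  static boolean canPrune(FileEntry deleteFile, List<FileEntry> partitionDataFiles) {
    return partitionDataFiles.stream()
        .noneMatch(data -> applies(deleteFile, data));
  }
}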


>>    - When a delete’s sequence number is lower than all data files, it
>>    can be pruned
>>
> In combination with the previous comment, I have the impression that we
> might be losing the distinction between snapshots and versions. In
> particular, what I'm concerned about is that, from a consumer's point of
> view, it may be difficult to figure out, from one snapshot to the next,
> what new deltas I need to process to update my view.
>
We don’t need both and should go with either snapshots (independent) or
versions (dependent). I don’t think it is possible to make dependent
versions work.

But as I noted above, the changes that occurred in a snapshot are available
as data file additions and data file deletions. Delete file additions and
deletions would be tracked the same way. Consumers would have full access
to the changes.


>>    - Compaction can translate a natural key delete to a synthetic key
>>    delete by adding a synthetic key delete file and updating the data file’s
>>    sequence number
>>
> I don't understand what this means.
>
Synthetic keys are better than natural keys in several ways. They are
cheaper to apply at read time, they are scoped to a single data file, they
compress well, etc. If both are supported, compaction could rewrite a
natural key delete as a synthetic key delete to avoid rewriting the entire
data file to delete just a few rows.

In addition, if a natural key delete is applied to an entire table, it
would be possible to change sequence numbers to apply this process
incrementally. But this would change history, just as rewriting the base
version of a table loses the changes in that version, so we would need to
be careful about when changing a sequence number is allowed.
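
Here is a hypothetical sketch of what that compaction-time translation
could look like. All of the names are invented for illustration; the point
is only that a small position-based (synthetic key) delete plus a sequence
number update can stand in for rewriting the data file.

// A hypothetical sketch (invented names, not the Iceberg API) of translating
// a natural key delete into a synthetic key delete during compaction.
import java.util.List;

public class TranslateDelete {
  interface DataFileHandle {
    // positions of rows in this data file that match the natural key predicate
    List<Long> findMatchingPositions(String naturalKeyPredicate);
    long sequenceNumber();
    void setSequenceNumber(long sequenceNumber);
  }

  interface DeleteWriter {
    // write a position-based (synthetic key) delete file scoped to one data file
    void writePositionDeletes(DataFileHandle dataFile, List<Long> positions,
        long sequenceNumber);
  }

  // Instead of rewriting the data file to remove a few rows, write a small
  // synthetic key delete and bump the data file's sequence number so the
  // original natural key delete no longer applies to it.
  static void translate(DataFileHandle dataFile, String naturalKeyPredicate,
      long naturalDeleteSequenceNumber, DeleteWriter writer) {
    List<Long> positions = dataFile.findMatchingPositions(naturalKeyPredicate);
    long newSequence = naturalDeleteSequenceNumber; // at or above the natural delete
    writer.writePositionDeletes(dataFile, positions, newSequence + 1);
    dataFile.setSequenceNumber(newSequence);
  }
}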


>>    - Synthetic and natural key approaches can coexist
>>
> Can you elaborate on what the synthetic key approach would imply, and
> what use-cases it would serve?
>
Some use cases, like MERGE INTO, already require reading the data, so we
can write synthetic key delete files without additional work. That means
Iceberg should support not just natural key deletes, but also synthetic
ones.

> I agree we are getting closer. I am confused by some things, and I suspect
> that the difference lies in the use-cases I wish to support for incremental
> consumers of these datasets. I have the impression that most of you folks
> are primarily considering incremental writes that end up serving analytical
> use cases (that want to see the full effective dataset as of the most
> recent snapshot).
>
We have both use cases and I want to make sure Iceberg can be used for both.
-- 
Ryan Blue
Software Engineer
Netflix
