Hi, everyone,

I know several people that are planning to attend the sync tomorrow are
interested in the row-level delete work, so I wanted to share some of the
progress and my current thinking ahead of time.

The codebase now supports a new version number, 2. Tables must be manually
upgraded to version 2 in order to use any of the metadata changes we are
making; v1 readers cannot read v2 tables. When a write takes place, the
version number is now passed to the manifest writer, manifest list writer,
etc. and the right schema for the table's current version is used. We've
also frozen the v1 schemas and added wrappers to ensure that even as the
internal classes, like DataFile, evolve, the exact same data is written to
v1.

Next, we've added sequence numbers and the proposed inheritance scheme to
v2, along with tests to ensure that v1 is written without sequence numbers
and that when reading v1 metadata, the sequence numbers are all 0. This
gives us the ability to track "when" a row-level delete occurred in a v2
table.

The next steps are to start making larger changes to metadata files.

One change that I've been considering is getting rid of manifest_entry. In
v1, a manifest stored a manifest_entry that wrapped a data_file. The intent
was to separate data that API users needed to supply -- fields in data_file
-- from data that was tracked internally by Iceberg -- the snapshot_id and
status fields of manifest_entry. If we want to combine these so that a
manifest stores one top-level data_file struct, then now is the time to
make that change. I've prototyped this in #963
<https://github.com/apache/incubator-iceberg/pull/963>. The benefit is that
the schema is flatter so we wouldn't need two metadata tables (entries and
files). The main drawback is that we aren't going to stop using v1 tables,
so we would effectively have two different manifest schemas instead of v2
as an evolution of v1. I'd love to hear more opinions on whether to do
this. I'm leaning toward not merging the two.

Another change is to start adding tracking fields for delete files and
updating the APIs. The metadata for this is fairly simple: an enum that
stores whether the file is data, position deletes, or equality deletes. The
main decision point is whether to allow mixing data files and delete files
together in manifests. I don't think that we should allow manifests with
both delete files and data files. The reason is job planning: we want to
start emitting splits immediately so that we can stream them, instead of
holding them all in memory. That means we need some way to guarantee that
we know all of the delete files to apply to a data file before we encounter
the data file.

OpenInx suggested sorting by sequence number to see delete files before
data files, but it still requires holding all splits in memory in the worst
case due to overlapping sequence number ranges. I think Iceberg should plan
a scan in two phases: one to find matching delete files (held in memory)
and one to find matching data files. That solves the problem of having all
deletes available so a split can be immediately emitted, and also allows
parallelizing both phases without coordination across threads.

For the two-phase approach, mixing delete files and data files in a
manifest would require reading that manifest twice, once in each phase. I
think it makes the most sense to keep delete files and data files in
separate manifests. But the trade-off is that Iceberg will need to track
the content of a manifest (deletes or data) and perform actions on separate
manifest groups.

Also, because with separate delete and data manifests we _could_ use
separate manifest schemas, I went through and wrote out a schema for a
delete file manifest. That schema was so similar to the current data file
schema that I think it's simpler to use the same one for both.

In summary, here are the things that we need to decide and what I think we
should do:

* Merge manifest_entry and data_file? I think we should not, to
avoid additional complexity.
* How should planning with delete files work? The two-phase approach is the
only one I think is viable.
* Mix delete files and data files in manifests? I think we should not, to
support the two-phase planning approach.
* If delete files and data files are separate, should manifests use the
same schema? Yes, because it is simpler.

Let's plan on talking about these questions in tomorrow's sync. And if you
have other topics, please send them to me!

rb

-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to