Re: [DISCUSS] Changes for row-level deletes

Ryan Blue Mon, 11 May 2020 15:54:00 -0700

I opened more issues under the row-level delete milestone
<https://github.com/apache/incubator-iceberg/milestone/4>. Hopefully that's
more helpful for tracking tasks and for everyone that would like to
contribute!


On Thu, May 7, 2020 at 11:52 AM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> I am going to write down a short doc with the current idea on how to do
> job planning based on what we discussed yesterday. In addition, I would
> like to cover what minor/major compaction could look like. I want this to be
> quite detailed and focused on the implementation so that we can review it
> properly and catch any issues early. I think we had a consensus on the
> conceptual approach yesterday.
>
> It would be great to update the milestone for row-level deletes and add
> more granular tasks so that we can parallelize the work among the community.
> https://github.com/apache/incubator-iceberg/milestone/4
>
> - Anton
>
> On 7 May 2020, at 07:12, Ryan Murray <rym...@dremio.com> wrote:
>
> fwiw i agree with Gautam on the changes. Keeping complexity down and
> easing transition to V2 should be a goal for this work.
>
> Is there a list of items that need to be finished for V2 schema/row level
> deletes to be ready? I would love to help but am not sure what is
> missing/in-progress.
>
> Best,
> Ryan
>
> On Thu, May 7, 2020 at 2:52 AM Gautam <gautamkows...@gmail.com> wrote:
>
>> My 2 cents :
>>
>>
>> >  * Merge manifest_entry and data_file?
>>
>>  -1. My preference is to keep the difference between v1 and v2 metadata
>> to a minimum by keeping manifest_entries the same in both v1 and v2. People
>> using either flow will want to modify and contribute, and shouldn't have to
>> worry about porting things over every time.
>>
>> >  * How should planning with delete files work?
>>
>>  +1 on keeping these independent and in two phases, as you mentioned.
>> This allows processing in parallel. Could we make this a SparkAction too at
>> some point?
>>
>>
>> >  * Mix delete files and data files in manifests? I think we should not,
>> to support the two-phase planning approach.
>>
>>   -1  .. We should not for the reason you mention.
>>
>>
>> >  * If delete files and data files are separate, should manifests use
>> the same schema?
>>
>> +1.
>>
>> On Wed, May 6, 2020 at 10:39 AM Anton Okolnychyi <
>> aokolnyc...@apple.com.invalid> wrote:
>>
>>> We won’t have to rewrite V1 metadata when migrating to V2. The format is
>>> backward compatible and we can read V1 manifests just fine in V2. For
>>> example, V1 metadata will not have sequence numbers, and V2 would
>>> interpret that as sequence number = 0. The only thing we need to prohibit
>>> is V1 writers writing to V2 tables. That check is already in place and such
>>> attempts will fail. Recent changes that went in ensure that V1 and V2
>>> co-exist in the same codebase. As of now, we have a format version in
>>> TableMetadata. I think the manual change Ryan was referring to would
>>> simply mean updating that version flag, not rewriting the metadata.
>>> That change can be done via TableOperations.
>>>
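To make the upgrade path described above concrete, here is a minimal sketch, assuming a simplified metadata object. `TableMetadataSketch`, `upgrade_format_version`, and `check_writer_allowed` are hypothetical names for illustration only and are not Iceberg's API. The sketch shows that an upgrade only bumps the version flag and leaves the existing manifests untouched, and that an older writer is rejected on a newer table.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TableMetadataSketch:
    """Hypothetical, simplified stand-in for table metadata."""
    format_version: int
    manifest_paths: Tuple[str, ...]


def upgrade_format_version(metadata: TableMetadataSketch,
                           new_version: int = 2) -> TableMetadataSketch:
    """Return metadata with a higher version flag; manifests are not rewritten."""
    if new_version < metadata.format_version:
        raise ValueError("cannot downgrade a table's format version")
    return replace(metadata, format_version=new_version)


def check_writer_allowed(metadata: TableMetadataSketch, writer_version: int) -> None:
    """Reject writers that are older than the table's format version."""
    if writer_version < metadata.format_version:
        raise ValueError(
            f"v{writer_version} writer cannot write to a "
            f"v{metadata.format_version} table"
        )
```

Under these assumptions, a v1 table upgraded to v2 keeps exactly the same manifest paths, while a v1 writer attempting to commit to it fails the check.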
>>>> One change that I've been considering is getting rid of manifest_entry.
>>>> In v1, a manifest stored a manifest_entry that wrapped a data_file. The
>>>> intent was to separate data that API users needed to supply -- fields in
>>>> data_file -- from data that was tracked internally by Iceberg -- the
>>>> snapshot_id and status fields of manifest_entry. If we want to combine
>>>> these so that a manifest stores one top-level data_file struct, then now is
>>>> the time to make that change. I've prototyped this in #963
>>>> <https://github.com/apache/incubator-iceberg/pull/963>. The benefit is
>>>> that the schema is flatter so we wouldn't need two metadata tables (entries
>>>> and files). The main drawback is that we aren't going to stop using v1
>>>> tables, so we would effectively have two different manifest schemas instead
>>>> of v2 as an evolution of v1. I'd love to hear more opinions on whether to
>>>> do this. I'm leaning toward not merging the two.
>>>
>>>
>>> As mentioned earlier, I’d rather keep ManifestEntry to reduce the number
>>> of changes we have in V1 and V2. I feel it will be easier for other people
>>> who want to contribute to the core metadata management to follow it. That
>>> being said, I do get the intention of merging the two.
>>>
>>>> Another change is to start adding tracking fields for delete files and
>>>> updating the APIs. The metadata for this is fairly simple: an enum that
>>>> stores whether the file is data, position deletes, or equality deletes. The
>>>> main decision point is whether to allow mixing data files and delete files
>>>> together in manifests. I don't think that we should allow manifests with
>>>> both delete files and data files. The reason is job planning: we want to
>>>> start emitting splits immediately so that we can stream them, instead of
>>>> holding them all in memory. That means we need some way to guarantee that
>>>> we know all of the delete files to apply to a data file before we encounter
>>>> the data file.
>>>
>>>
>>> I don’t see a good reason to mix delete and data files in a single
>>> manifest now. In our original idea, we wanted to keep deletes separate, as
>>> that seemed likely to make it easier to come up with an efficient job
>>> planning approach later on. I think once we know the approach we want to
>>> take for planning input splits and doing compaction, we can revisit this
>>> point.
>>>
>>> - Anton
>>>
>>> On 6 May 2020, at 09:04, Junjie Chen <chenjunjied...@gmail.com> wrote:
>>>
>>> Hi Ryan
>>>
>>> Besides the reading and merging of delete files, can we talk a bit about
>>> the write side of delete files? For example, generating delete files in a
>>> Spark action, metadata column support, a service to convert equality
>>> delete files to position delete files, etc.
>>>
>>> On Wed, May 6, 2020 at 1:34 PM Miao Wang <miw...@adobe.com.invalid>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>>
>>>>
>>>> “Tables must be manually upgraded to version 2 in order to use any of
>>>> the metadata changes we are making” If I understand correctly, for an
>>>> existing Iceberg table in v1, we have to run some CLI/script to rewrite
>>>> the metadata.
>>>>
>>>>
>>>>
>>>> “Next, we've added sequence numbers and the proposed inheritance scheme
>>>> to v2, along with tests to ensure that v1 is written without sequence
>>>> numbers and that when reading v1 metadata, the sequence numbers are all 0.”
>>>> To me, this means a V2 reader should be able to read V1 table metadata.
>>>> Therefore, the rewrite step above is not required; we only need to use a
>>>> V2 reader on a V1 table.
>>>>
>>>>
>>>>
>>>> However, if a table has been written in V1 and we want to save it as V2, I
>>>> expect only the metadata will be rewritten into V2, and the V1 metadata
>>>> will be vacuumed once the V2 write succeeds.
>>>>
>>>>
>>>>
>>>> Is my understanding correct?
>>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> Miao
>>>>
>>>> *From: *Ryan Blue <rb...@netflix.com.INVALID>
>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>, "
>>>> rb...@netflix.com" <rb...@netflix.com>
>>>> *Date: *Tuesday, May 5, 2020 at 5:03 PM
>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>> *Subject: *[DISCUSS] Changes for row-level deletes
>>>>
>>>>
>>>>
>>>> Hi, everyone,
>>>>
>>>>
>>>>
>>>> I know several people that are planning to attend the sync tomorrow are
>>>> interested in the row-level delete work, so I wanted to share some of the
>>>> progress and my current thinking ahead of time.
>>>>
>>>>
>>>>
>>>> The codebase now supports a new version number, 2. Tables must be
>>>> manually upgraded to version 2 in order to use any of the metadata changes
>>>> we are making; v1 readers cannot read v2 tables. When a write takes place,
>>>> the version number is now passed to the manifest writer, manifest list
>>>> writer, etc. and the right schema for the table's current version is used.
>>>> We've also frozen the v1 schemas and added wrappers to ensure that even as
>>>> the internal classes, like DataFile, evolve, the exact same data is written
>>>> to v1.
>>>>
>>>>
>>>>
>>>> Next, we've added sequence numbers and the proposed inheritance scheme
>>>> to v2, along with tests to ensure that v1 is written without sequence
>>>> numbers and that when reading v1 metadata, the sequence numbers are all 0.
>>>> This gives us the ability to track "when" a row-level delete occurred in a
>>>> v2 table.
>>>>
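As a rough illustration of the sequence-number behavior described above, the following sketch resolves the sequence number a reader would use for an entry. It assumes a simplified entry type; `ManifestEntrySketch` and `effective_sequence_number` are hypothetical names, and the inheritance rule shown (a missing value falls back to the enclosing manifest's sequence number) is an assumption about the proposed scheme, not a statement of the final design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ManifestEntrySketch:
    """Hypothetical, simplified manifest entry."""
    path: str
    # None means no sequence number was written for this entry.
    sequence_number: Optional[int] = None


def effective_sequence_number(entry: ManifestEntrySketch,
                              manifest_format_version: int,
                              manifest_sequence_number: int = 0) -> int:
    """Resolve the sequence number a reader should use for an entry."""
    if manifest_format_version == 1:
        # v1 metadata carries no sequence numbers, so everything reads as 0.
        return 0
    if entry.sequence_number is None:
        # Assumed inheritance rule: fall back to the manifest's own value.
        return manifest_sequence_number
    return entry.sequence_number
```

With this shape, entries read from v1 manifests always sort before any v2 delete, which is what lets sequence numbers express "when" a delete happened relative to a data file.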
>>>>
>>>>
>>>> The next steps are to start making larger changes to metadata files.
>>>>
>>>>
>>>>
>>>> One change that I've been considering is getting rid of manifest_entry.
>>>> In v1, a manifest stored a manifest_entry that wrapped a data_file. The
>>>> intent was to separate data that API users needed to supply -- fields in
>>>> data_file -- from data that was tracked internally by Iceberg -- the
>>>> snapshot_id and status fields of manifest_entry. If we want to combine
>>>> these so that a manifest stores one top-level data_file struct, then now is
>>>> the time to make that change. I've prototyped this in #963
>>>> <https://github.com/apache/incubator-iceberg/pull/963>.
>>>> The benefit is that the schema is flatter so we wouldn't need two metadata
>>>> tables (entries and files). The main drawback is that we aren't going to
>>>> stop using v1 tables, so we would effectively have two different manifest
>>>> schemas instead of v2 as an evolution of v1. I'd love to hear more opinions
>>>> on whether to do this. I'm leaning toward not merging the two.
>>>>
>>>>
>>>>
>>>> Another change is to start adding tracking fields for delete files and
>>>> updating the APIs. The metadata for this is fairly simple: an enum that
>>>> stores whether the file is data, position deletes, or equality deletes. The
>>>> main decision point is whether to allow mixing data files and delete files
>>>> together in manifests. I don't think that we should allow manifests with
>>>> both delete files and data files. The reason is job planning: we want to
>>>> start emitting splits immediately so that we can stream them, instead of
>>>> holding them all in memory. That means we need some way to guarantee that
>>>> we know all of the delete files to apply to a data file before we encounter
>>>> the data file.
>>>>
>>>>
>>>>
>>>> OpenInx suggested sorting by sequence number to see delete files before
>>>> data files, but it still requires holding all splits in memory in the worst
>>>> case due to overlapping sequence number ranges. I think Iceberg should plan
>>>> a scan in two phases: one to find matching delete files (held in memory)
>>>> and one to find matching data files. That solves the problem of having all
>>>> deletes available so a split can be immediately emitted, and also allows
>>>> parallelizing both phases without coordination across threads.
>>>>
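The two-phase plan described above could be sketched roughly as follows. All names here (`FileContent`, `TrackedFile`, `plan_scan`) are hypothetical, manifests are modeled as plain lists of files, and the matching rule (same partition, delete sequence number greater than or equal to the data file's) is a simplifying assumption used only to make the example concrete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Iterator, List, Tuple


class FileContent(Enum):
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass(frozen=True)
class TrackedFile:
    """Hypothetical, simplified file record."""
    path: str
    content: FileContent
    partition: str
    sequence_number: int


def plan_scan(
    delete_manifests: Iterable[Iterable[TrackedFile]],
    data_manifests: Iterable[Iterable[TrackedFile]],
) -> Iterator[Tuple[TrackedFile, List[TrackedFile]]]:
    """Yield (data file, applicable delete files) pairs as a stream."""
    # Phase 1: read every delete manifest and hold the delete files in memory.
    deletes_by_partition: Dict[str, List[TrackedFile]] = {}
    for manifest in delete_manifests:
        for delete_file in manifest:
            deletes_by_partition.setdefault(delete_file.partition, []).append(delete_file)

    # Phase 2: stream data files; each split can be emitted immediately because
    # every delete file is already known.
    for manifest in data_manifests:
        for data_file in manifest:
            candidates = deletes_by_partition.get(data_file.partition, [])
            # Assumed rule: only deletes at or after the data file's sequence apply.
            applicable = [
                d for d in candidates
                if d.sequence_number >= data_file.sequence_number
            ]
            yield data_file, applicable
```

Because phase 2 only reads from the in-memory delete index, the data manifests in this sketch could be processed by independent workers without coordination, and only the delete files (not the splits) need to be held in memory.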
>>>>
>>>>
>>>> For the two-phase approach, mixing delete files and data files in a
>>>> manifest would require reading that manifest twice, once in each phase. I
>>>> think it makes the most sense to keep delete files and data files in
>>>> separate manifests. But the trade-off is that Iceberg will need to track
>>>> the content of a manifest (deletes or data) and perform actions on separate
>>>> manifest groups.
>>>>
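One way to picture the trade-off above is a manifest that records which kind of file it holds and refuses to mix the two. This is only a sketch under assumed names: `ManifestContent`, `ManifestSketch`, and `group_by_content` are hypothetical and do not reflect Iceberg's classes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List


class ManifestContent(Enum):
    DATA = "data"
    DELETES = "deletes"


@dataclass
class ManifestSketch:
    """Hypothetical manifest that tracks a single kind of content."""
    path: str
    content: ManifestContent
    files: List[str] = field(default_factory=list)

    def add(self, file_path: str, is_delete_file: bool) -> None:
        """Append a file, rejecting any file of the wrong kind."""
        expects_deletes = self.content is ManifestContent.DELETES
        if is_delete_file != expects_deletes:
            raise ValueError(
                f"manifest {self.path} holds {self.content.value} only"
            )
        self.files.append(file_path)


def group_by_content(
    manifests: Iterable[ManifestSketch],
) -> Dict[ManifestContent, List[ManifestSketch]]:
    """Split manifests into the two groups that planning reads separately."""
    groups: Dict[ManifestContent, List[ManifestSketch]] = {
        content: [] for content in ManifestContent
    }
    for manifest in manifests:
        groups[manifest.content].append(manifest)
    return groups
```

Grouping this way means each planning phase reads its own set of manifests exactly once, at the cost of tracking one extra piece of state per manifest.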
>>>>
>>>>
>>>> Also, because with separate delete and data manifests we _could_ use
>>>> separate manifest schemas, I went through and wrote out a schema for a
>>>> delete file manifest. That schema was so similar to the current data file
>>>> schema that I think it's simpler to use the same one for both.
>>>>
>>>>
>>>>
>>>> In summary, here are the things that we need to decide and what I think
>>>> we should do:
>>>>
>>>>
>>>>
>>>> * Merge manifest_entry and data_file? I think we should not, to
>>>> avoid additional complexity.
>>>>
>>>> * How should planning with delete files work? The two-phase approach is
>>>> the only one I think is viable.
>>>>
>>>> * Mix delete files and data files in manifests? I think we should not,
>>>> to support the two-phase planning approach.
>>>>
>>>> * If delete files and data files are separate, should manifests use the
>>>> same schema? Yes, because it is simpler.
>>>>
>>>>
>>>>
>>>> Let's plan on talking about these questions in tomorrow's sync. And if
>>>> you have other topics, please send them to me!
>>>>
>>>>
>>>>
>>>> rb
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ryan Blue
>>>>
>>>> Software Engineer
>>>>
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>>
>>>
>

-- 
Ryan Blue
Software Engineer
Netflix
