We won’t have to rewrite V1 metadata when migrating to V2. The format is 
backward compatible, and V2 can read V1 manifests just fine. For example, 
V1 metadata will not have a sequence number, and a V2 reader interprets that 
as sequence number = 0. The only thing we need to prohibit is V1 writers 
writing to V2 tables. That check is already in place, and such attempts will 
fail. Recent changes ensure that V1 and V2 co-exist in the same codebase. As 
of now, we have a format version in TableMetadata. I think the manual change 
Ryan was referring to would simply mean updating that version flag, not 
rewriting the metadata. That change can be done via TableOperations.
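To illustrate the idea (this is a simplified sketch in Python, not the actual 
Iceberg Java API; the field and function names are made up), a V2 reader fills 
in the sequence number that V1 metadata never wrote, and the "upgrade" touches 
only the version flag:

```python
# Hypothetical sketch: V2 reading V1 metadata, and upgrading a table
# by bumping only the format-version flag (not the actual Iceberg API).

def read_entry_as_v2(raw_entry: dict) -> dict:
    """A V2 reader fills in fields that V1 metadata never wrote."""
    entry = dict(raw_entry)
    # V1 manifests carry no sequence number; V2 interprets that as 0.
    entry.setdefault("sequence_number", 0)
    return entry

def upgrade_to_v2(table_metadata: dict) -> dict:
    """Upgrading rewrites only the version flag, not the manifests."""
    if table_metadata["format_version"] >= 2:
        raise ValueError("already v2")
    return {**table_metadata, "format_version": 2}

v1_entry = {"status": "ADDED", "snapshot_id": 123,
            "data_file": {"path": "a.parquet"}}
assert read_entry_as_v2(v1_entry)["sequence_number"] == 0

meta = {"format_version": 1, "manifests": ["m1.avro"]}
upgraded = upgrade_to_v2(meta)
assert upgraded["format_version"] == 2
assert upgraded["manifests"] == meta["manifests"]  # manifests untouched
```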

> One change that I've been considering is getting rid of manifest_entry. In 
> v1, a manifest stored a manifest_entry that wrapped a data_file. The intent 
> was to separate data that API users needed to supply -- fields in data_file 
> -- from data that was tracked internally by Iceberg -- the snapshot_id and 
> status fields of manifest_entry. If we want to combine these so that a 
> manifest stores one top-level data_file struct, then now is the time to make 
> that change. I've prototyped this in #963 
> <https://github.com/apache/incubator-iceberg/pull/963>. The benefit is that 
> the schema is flatter so we wouldn't need two metadata tables (entries and 
> files). The main drawback is that we aren't going to stop using v1 tables, so 
> we would effectively have two different manifest schemas instead of v2 as an 
> evolution of v1. I'd love to hear more opinions on whether to do this. I'm 
> leaning toward not merging the two.


As mentioned earlier, I’d rather keep ManifestEntry to reduce the number of 
differences between V1 and V2. I feel that will make it easier for other 
people who want to contribute to the core metadata management to follow the 
code. That being said, I do see the intent behind merging the two.
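To make the trade-off concrete, here is a rough sketch of the two manifest-row 
shapes under discussion (field lists abbreviated and simplified; this is not 
the full Iceberg spec):

```python
# Simplified sketch of the two manifest-row shapes (abbreviated fields).

# V1: a manifest row is a manifest_entry wrapping a data_file. Internal
# bookkeeping (status, snapshot_id) stays outside the struct that API
# users construct.
v1_row = {
    "status": "ADDED",          # tracked internally by Iceberg
    "snapshot_id": 123,         # tracked internally by Iceberg
    "data_file": {              # supplied by API users
        "file_path": "s3://bucket/a.parquet",
        "record_count": 100,
    },
}

# The #963 alternative: one flat, top-level data_file struct.
flat_row = {
    "status": "ADDED",
    "snapshot_id": 123,
    "file_path": "s3://bucket/a.parquet",
    "record_count": 100,
}

# Flat rows need only one metadata table; nested rows need two (entries
# and files), but keep v2 a strict evolution of v1's schema.
assert v1_row["data_file"]["file_path"] == flat_row["file_path"]
```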

> Another change is to start adding tracking fields for delete files and 
> updating the APIs. The metadata for this is fairly simple: an enum that 
> stores whether the file is data, position deletes, or equality deletes. The 
> main decision point is whether to allow mixing data files and delete files 
> together in manifests. I don't think that we should allow manifests with both 
> delete files and data files. The reason is job planning: we want to start 
> emitting splits immediately so that we can stream them, instead of holding 
> them all in memory. That means we need some way to guarantee that we know all 
> of the delete files to apply to a data file before we encounter the data file.

I don’t see a good reason to mix delete and data files in a single manifest 
right now. In our original design, we wanted to keep deletes separate because 
it felt that would make it easier to come up with an efficient job planning 
approach later on. I think once we know the approach we want to take for 
planning input splits and doing compaction, we can revisit this point.
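As a rough illustration of the two-phase planning idea (names and the 
sequence-number matching rule below are made up for the sketch, not Iceberg 
APIs): phase one collects delete files into memory, phase two streams 
data-file splits, attaching the already-known deletes as each split is 
emitted. Keeping manifests homogeneous means each manifest is read exactly 
once:

```python
# Illustrative sketch of two-phase scan planning (made-up names).
# Each manifest holds either delete files or data files, never both.
from typing import Iterator

def plan_scan(manifests: list[dict]) -> Iterator[dict]:
    # Phase 1: read delete manifests only; deletes are held in memory.
    deletes = [
        f for m in manifests if m["content"] == "deletes"
        for f in m["files"]
    ]
    # Phase 2: stream data manifests; every delete is already known,
    # so each split can be emitted immediately instead of buffered.
    for m in manifests:
        if m["content"] != "data":
            continue
        for data_file in m["files"]:
            # Assumed rule for the sketch: a delete applies to data
            # written at or before the delete's sequence number.
            applicable = [
                d for d in deletes
                if d["sequence_number"] >= data_file["sequence_number"]
            ]
            yield {"data_file": data_file, "deletes": applicable}

manifests = [
    {"content": "deletes",
     "files": [{"path": "d1.parquet", "sequence_number": 5}]},
    {"content": "data",
     "files": [{"path": "a.parquet", "sequence_number": 3},
               {"path": "b.parquet", "sequence_number": 7}]},
]
splits = list(plan_scan(manifests))
assert splits[0]["deletes"] == [{"path": "d1.parquet", "sequence_number": 5}]
assert splits[1]["deletes"] == []  # d1 (seq 5) predates b (seq 7)
```

A mixed manifest would have to be read in both phases; separate manifests 
avoid that at the cost of tracking each manifest's content type.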

- Anton

> On 6 May 2020, at 09:04, Junjie Chen <chenjunjied...@gmail.com> wrote:
> 
> Hi Ryan
> 
> Besides the reading and merging of delete files, can we talk a bit about 
> write side of delete files? For example, generate delete files in a spark 
> action, the metadata column support, the service to transfer equality delete 
> files to position delete files etc..
> 
> On Wed, May 6, 2020 at 1:34 PM Miao Wang <miw...@adobe.com.invalid> wrote:
> Hi Ryan,
> 
>  
> 
> “Tables must be manually upgraded to version 2 in order to use any of the 
> metadata changes we are making” If I understand correctly, for an existing 
> Iceberg table in v1, we have to run some CLI/script to rewrite the metadata.
> 
>  
> 
> “Next, we've added sequence numbers and the proposed inheritance scheme to 
> v2, along with tests to ensure that v1 is written without sequence numbers 
> and that when reading v1 metadata, the sequence numbers are all 0.” To me, 
> this means V2 reader should be able to read V1 table metadata. Therefore, the 
> step above is not required, which only requires us to use a V2 reader on a V1 
> table.
> 
>  
> 
> However, if a table has been written in V1 and we want to save it as V2, I 
> expect only the metadata will be rewritten into V2, and the V1 metadata will 
> be vacuumed once the V2 write succeeds.
> 
>  
> 
> Is my understanding correct?
> 
>  
> 
> Thanks!
> 
>  
> 
> Miao
> 
> From: Ryan Blue <rb...@netflix.com.INVALID>
> Reply-To: "dev@iceberg.apache.org <mailto:dev@iceberg.apache.org>" 
> <dev@iceberg.apache.org <mailto:dev@iceberg.apache.org>>, "rb...@netflix.com 
> <mailto:rb...@netflix.com>" <rb...@netflix.com <mailto:rb...@netflix.com>>
> Date: Tuesday, May 5, 2020 at 5:03 PM
> To: Iceberg Dev List <dev@iceberg.apache.org <mailto:dev@iceberg.apache.org>>
> Subject: [DISCUSS] Changes for row-level deletes
> 
>  
> 
> Hi, everyone,
> 
>  
> 
> I know several people that are planning to attend the sync tomorrow are 
> interested in the row-level delete work, so I wanted to share some of the 
> progress and my current thinking ahead of time.
> 
>  
> 
> The codebase now supports a new version number, 2. Tables must be manually 
> upgraded to version 2 in order to use any of the metadata changes we are 
> making; v1 readers cannot read v2 tables. When a write takes place, the 
> version number is now passed to the manifest writer, manifest list writer, 
> etc. and the right schema for the table's current version is used. We've also 
> frozen the v1 schemas and added wrappers to ensure that even as the internal 
> classes, like DataFile, evolve, the exact same data is written to v1.
> 
>  
> 
> Next, we've added sequence numbers and the proposed inheritance scheme to v2, 
> along with tests to ensure that v1 is written without sequence numbers and 
> that when reading v1 metadata, the sequence numbers are all 0. This gives us 
> the ability to track "when" a row-level delete occurred in a v2 table.
> 
>  
> 
> The next steps are to start making larger changes to metadata files.
> 
>  
> 
> One change that I've been considering is getting rid of manifest_entry. In 
> v1, a manifest stored a manifest_entry that wrapped a data_file. The intent 
> was to separate data that API users needed to supply -- fields in data_file 
> -- from data that was tracked internally by Iceberg -- the snapshot_id and 
> status fields of manifest_entry. If we want to combine these so that a 
> manifest stores one top-level data_file struct, then now is the time to make 
> that change. I've prototyped this in #963 
> <https://github.com/apache/incubator-iceberg/pull/963>.
>  The benefit is that the schema is flatter so we wouldn't need two metadata 
> tables (entries and files). The main drawback is that we aren't going to stop 
> using v1 tables, so we would effectively have two different manifest schemas 
> instead of v2 as an evolution of v1. I'd love to hear more opinions on 
> whether to do this. I'm leaning toward not merging the two.
> 
>  
> 
> Another change is to start adding tracking fields for delete files and 
> updating the APIs. The metadata for this is fairly simple: an enum that 
> stores whether the file is data, position deletes, or equality deletes. The 
> main decision point is whether to allow mixing data files and delete files 
> together in manifests. I don't think that we should allow manifests with both 
> delete files and data files. The reason is job planning: we want to start 
> emitting splits immediately so that we can stream them, instead of holding 
> them all in memory. That means we need some way to guarantee that we know all 
> of the delete files to apply to a data file before we encounter the data file.
> 
>  
> 
> OpenInx suggested sorting by sequence number to see delete files before data 
> files, but it still requires holding all splits in memory in the worst case 
> due to overlapping sequence number ranges. I think Iceberg should plan a scan 
> in two phases: one to find matching delete files (held in memory) and one to 
> find matching data files. That solves the problem of having all deletes 
> available so a split can be immediately emitted, and also allows 
> parallelizing both phases without coordination across threads.
> 
>  
> 
> For the two-phase approach, mixing delete files and data files in a manifest 
> would require reading that manifest twice, once in each phase. I think it 
> makes the most sense to keep delete files and data files in separate 
> manifests. But the trade-off is that Iceberg will need to track the content 
> of a manifest (deletes or data) and perform actions on separate manifest 
> groups.
> 
>  
> 
> Also, because with separate delete and data manifests we _could_ use separate 
> manifest schemas, I went through and wrote out a schema for a delete file 
> manifest. That schema was so similar to the current data file schema that I 
> think it's simpler to use the same one for both.
> 
>  
> 
> In summary, here are the things that we need to decide and what I think we 
> should do:
> 
>  
> 
> * Merge manifest_entry and data_file? I think we should not, to avoid 
> additional complexity.
> 
> * How should planning with delete files work? The two-phase approach is the 
> only one I think is viable.
> 
> * Mix delete files and data files in manifests? I think we should not, to 
> support the two-phase planning approach.
> 
> * If delete files and data files are separate, should manifests use the same 
> schema? Yes, because it is simpler.
> 
>  
> 
> Let's plan on talking about these questions in tomorrow's sync. And if you 
> have other topics, please send them to me!
> 
>  
> 
> rb
> 
>  
> 
> --
> 
> Ryan Blue
> 
> Software Engineer
> 
> Netflix
> 
> 
> 
> -- 
> Best Regards
