I am going to write a short doc with the current idea on how to do job planning, based on what we discussed yesterday. In addition, I would like to cover what minor/major compaction can look like. I want this to be quite detailed and focused on the implementation, so that we can review it properly and catch any issues early. I think we reached consensus on the conceptual approach yesterday.
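As a concrete reference point for the planning discussion in the thread below, here is a minimal Python sketch of the two-phase scan planning idea (phase 1 collects delete files in memory, phase 2 streams data files and emits splits immediately). All structures, field names, and the applicability rule `delete.sequence_number >= data.sequence_number` are illustrative assumptions, not Iceberg's actual API:

```python
from collections import defaultdict

def plan_scan(delete_manifests, data_manifests, partition_filter):
    """Two-phase planning sketch: find matching delete files first, then
    stream matching data files, emitting each split as soon as it is seen."""
    # Phase 1: read delete manifests and hold matching delete files in
    # memory, grouped by partition so lookups in phase 2 are cheap.
    deletes_by_partition = defaultdict(list)
    for manifest in delete_manifests:
        for delete_file in manifest:
            if partition_filter(delete_file["partition"]):
                deletes_by_partition[delete_file["partition"]].append(delete_file)

    # Phase 2: read data manifests; every delete file that may apply is
    # already known, so splits can be yielded without buffering the rest.
    for manifest in data_manifests:
        for data_file in manifest:
            if partition_filter(data_file["partition"]):
                applicable = [
                    d for d in deletes_by_partition[data_file["partition"]]
                    # assumed rule: a delete only affects rows written at or
                    # before the delete's own sequence number
                    if d["sequence_number"] >= data_file["sequence_number"]
                ]
                yield {"data_file": data_file["path"],
                       "deletes": [d["path"] for d in applicable]}

# Toy example: one delete file at sequence number 2 and two data files.
delete_manifests = [[{"partition": "p1", "path": "d1.parquet", "sequence_number": 2}]]
data_manifests = [[{"partition": "p1", "path": "f1.parquet", "sequence_number": 1},
                   {"partition": "p1", "path": "f2.parquet", "sequence_number": 3}]]
splits = list(plan_scan(delete_manifests, data_manifests, lambda p: p == "p1"))
```

In this toy run, the delete applies to `f1.parquet` (written before the delete) but not to `f2.parquet` (written after it).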
It would be great to update the milestone for row-level deletes and add more granular tasks so that we can parallelize the work among the community: https://github.com/apache/incubator-iceberg/milestone/4

- Anton

> On 7 May 2020, at 07:12, Ryan Murray <rym...@dremio.com> wrote:
>
> FWIW, I agree with Gautam on the changes. Keeping complexity down and easing the transition to V2 should be a goal for this work.
>
> Is there a list of items that need to be finished for the V2 schema / row-level deletes to be ready? I would love to help but am not sure what is missing/in progress.
>
> Best,
> Ryan
>
> On Thu, May 7, 2020 at 2:52 AM Gautam <gautamkows...@gmail.com> wrote:
> My 2 cents:
>
> > * Merge manifest_entry and data_file?
> -1. Keeping the difference between v1 and v2 metadata to a minimum would be my preference: keep manifest_entries the same in both v1 and v2. People using either flow will want to modify and contribute, and shouldn't have to worry about porting things over every time.
>
> > * How should planning with delete files work?
> +1 on keeping these independent and in two phases, as you mentioned. That allows processing in parallel. Could this become a SparkAction too at some point?
>
> > * Mix delete files and data files in manifests? I think we should not, to support the two-phase planning approach.
> -1. We should not, for the reason you mention.
>
> > * If delete files and data files are separate, should manifests use the same schema?
> +1.
>
> On Wed, May 6, 2020 at 10:39 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
> We won't have to rewrite V1 metadata when migrating to V2. The format is backward compatible, and we can read V1 manifests just fine in V2. For example, V1 metadata will not have a sequence number, and V2 will interpret that as sequence number = 0.
> The only thing we need to prohibit is V1 writers writing to V2 tables. That check is already in place, and such attempts will fail. Recent changes that went in ensure that V1 and V2 co-exist in the same codebase. As of now, we have a format version in TableMetadata. I think the manual change Ryan was referring to would simply mean updating that version flag, not rewriting the metadata. That change can be done via TableOperations.
>
>> One change that I've been considering is getting rid of manifest_entry. In v1, a manifest stored a manifest_entry that wrapped a data_file. The intent was to separate data that API users needed to supply -- fields in data_file -- from data that was tracked internally by Iceberg -- the snapshot_id and status fields of manifest_entry. If we want to combine these so that a manifest stores one top-level data_file struct, then now is the time to make that change. I've prototyped this in #963 <https://github.com/apache/incubator-iceberg/pull/963>. The benefit is that the schema is flatter, so we wouldn't need two metadata tables (entries and files). The main drawback is that we aren't going to stop using v1 tables, so we would effectively have two different manifest schemas instead of v2 as an evolution of v1. I'd love to hear more opinions on whether to do this. I'm leaning toward not merging the two.
>
> As mentioned earlier, I'd rather keep ManifestEntry to reduce the number of changes between V1 and V2. I feel it will be easier for other people who want to contribute to the core metadata management to follow. That being said, I do get the intention behind merging the two.
>
>> Another change is to start adding tracking fields for delete files and updating the APIs. The metadata for this is fairly simple: an enum that stores whether the file is data, position deletes, or equality deletes.
>> The main decision point is whether to allow mixing data files and delete files together in manifests. I don't think that we should allow manifests with both delete files and data files. The reason is job planning: we want to start emitting splits immediately so that we can stream them, instead of holding them all in memory. That means we need some way to guarantee that we know all of the delete files to apply to a data file before we encounter the data file.
>
> I don't see a good reason to mix delete and data files in a single manifest now. In our original idea, we wanted to keep deletes separate, as it felt it would be easier to come up with an efficient job planning approach later on. I think once we know the approach we want to take for planning input splits and doing compaction, we can revisit this point.
>
> - Anton
>
>> On 6 May 2020, at 09:04, Junjie Chen <chenjunjied...@gmail.com> wrote:
>>
>> Hi Ryan,
>>
>> Besides the reading and merging of delete files, can we talk a bit about the write side of delete files? For example, generating delete files in a Spark action, the metadata column support, the service to convert equality delete files to position delete files, etc.
>>
>> On Wed, May 6, 2020 at 1:34 PM Miao Wang <miw...@adobe.com.invalid> wrote:
>> Hi Ryan,
>>
>> "Tables must be manually upgraded to version 2 in order to use any of the metadata changes we are making" -- If I understand correctly, for an existing Iceberg table in v1, we have to run some CLI/script to rewrite the metadata.
>>
>> "Next, we've added sequence numbers and the proposed inheritance scheme to v2, along with tests to ensure that v1 is written without sequence numbers and that when reading v1 metadata, the sequence numbers are all 0." -- To me, this means a V2 reader should be able to read V1 table metadata.
>> Therefore, the step above is not required; we only need to use a V2 reader on the V1 table.
>>
>> However, if a table has been written in V1 and we want to save it as V2, I expect only the metadata will be rewritten into V2, and the V1 metadata will be vacuumed once the V2 rewrite succeeds.
>>
>> Is my understanding correct?
>>
>> Thanks!
>>
>> Miao
>>
>> From: Ryan Blue <rb...@netflix.com.INVALID>
>> Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, "rb...@netflix.com" <rb...@netflix.com>
>> Date: Tuesday, May 5, 2020 at 5:03 PM
>> To: Iceberg Dev List <dev@iceberg.apache.org>
>> Subject: [DISCUSS] Changes for row-level deletes
>>
>> Hi, everyone,
>>
>> I know several people that are planning to attend the sync tomorrow are interested in the row-level delete work, so I wanted to share some of the progress and my current thinking ahead of time.
>>
>> The codebase now supports a new version number, 2. Tables must be manually upgraded to version 2 in order to use any of the metadata changes we are making; v1 readers cannot read v2 tables. When a write takes place, the version number is now passed to the manifest writer, manifest list writer, etc., and the right schema for the table's current version is used. We've also frozen the v1 schemas and added wrappers to ensure that even as the internal classes, like DataFile, evolve, the exact same data is written to v1.
>>
>> Next, we've added sequence numbers and the proposed inheritance scheme to v2, along with tests to ensure that v1 is written without sequence numbers and that when reading v1 metadata, the sequence numbers are all 0. This gives us the ability to track "when" a row-level delete occurred in a v2 table.
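The inheritance scheme described above can be sketched in a few lines. This is an illustrative Python sketch of the resolution rule, not Iceberg's actual (Java) implementation; the function and field names are assumptions:

```python
def entry_sequence_number(format_version, entry, manifest_sequence_number):
    """Resolve the effective sequence number of a manifest entry.

    - v1 metadata has no sequence numbers, so every entry reads as 0;
    - in v2, an entry written without an explicit sequence number inherits
      the sequence number of the manifest that committed it.
    """
    if format_version == 1:
        return 0
    if entry.get("sequence_number") is None:
        return manifest_sequence_number
    return entry["sequence_number"]
```

For example, a v1 entry always resolves to 0, while a v2 entry with no explicit sequence number inherits the manifest's number.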
>> The next steps are to start making larger changes to metadata files.
>>
>> One change that I've been considering is getting rid of manifest_entry. In v1, a manifest stored a manifest_entry that wrapped a data_file. The intent was to separate data that API users needed to supply -- fields in data_file -- from data that was tracked internally by Iceberg -- the snapshot_id and status fields of manifest_entry. If we want to combine these so that a manifest stores one top-level data_file struct, then now is the time to make that change. I've prototyped this in #963 <https://github.com/apache/incubator-iceberg/pull/963>. The benefit is that the schema is flatter, so we wouldn't need two metadata tables (entries and files). The main drawback is that we aren't going to stop using v1 tables, so we would effectively have two different manifest schemas instead of v2 as an evolution of v1. I'd love to hear more opinions on whether to do this. I'm leaning toward not merging the two.
>>
>> Another change is to start adding tracking fields for delete files and updating the APIs. The metadata for this is fairly simple: an enum that stores whether the file is data, position deletes, or equality deletes. The main decision point is whether to allow mixing data files and delete files together in manifests. I don't think that we should allow manifests with both delete files and data files. The reason is job planning: we want to start emitting splits immediately so that we can stream them, instead of holding them all in memory.
>> That means we need some way to guarantee that we know all of the delete files to apply to a data file before we encounter the data file.
>>
>> OpenInx suggested sorting by sequence number to see delete files before data files, but that still requires holding all splits in memory in the worst case, due to overlapping sequence number ranges. I think Iceberg should plan a scan in two phases: one to find matching delete files (held in memory) and one to find matching data files. That solves the problem of having all deletes available so a split can be immediately emitted, and it also allows parallelizing both phases without coordination across threads.
>>
>> For the two-phase approach, mixing delete files and data files in a manifest would require reading that manifest twice, once in each phase. I think it makes the most sense to keep delete files and data files in separate manifests. The trade-off is that Iceberg will need to track the content of a manifest (deletes or data) and perform actions on separate manifest groups.
>>
>> Also, because with separate delete and data manifests we _could_ use separate manifest schemas, I went through and wrote out a schema for a delete file manifest. That schema was so similar to the current data file schema that I think it's simpler to use the same one for both.
>>
>> In summary, here are the things that we need to decide, and what I think we should do:
>>
>> * Merge manifest_entry and data_file? I think we should not, to avoid additional complexity.
>> * How should planning with delete files work? The two-phase approach is the only one I think is viable.
>> * Mix delete files and data files in manifests? I think we should not, to support the two-phase planning approach.
>> * If delete files and data files are separate, should manifests use the same schema? Yes, because it is simpler.
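The content tracking discussed above, a per-file enum plus per-manifest grouping so that each manifest is read exactly once in the two-phase plan, could look roughly like this. The enum values and field names here are illustrative assumptions, not the committed spec:

```python
from enum import Enum

# Per-file content field, as described above: records whether a file holds
# data, position deletes, or equality deletes.
class FileContent(Enum):
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2

# Per-manifest content, so planning can pick the right manifest group for
# each phase without opening the manifest itself.
class ManifestContent(Enum):
    DATA = 0
    DELETES = 1

def manifests_for_phase(manifests, phase):
    """Phase 1 reads only delete manifests, phase 2 only data manifests,
    so no manifest is ever read twice."""
    wanted = ManifestContent.DELETES if phase == 1 else ManifestContent.DATA
    return [m for m in manifests if m["content"] == wanted]

# Toy manifest list with mixed content.
manifests = [
    {"path": "m1.avro", "content": ManifestContent.DATA},
    {"path": "m2.avro", "content": ManifestContent.DELETES},
    {"path": "m3.avro", "content": ManifestContent.DATA},
]
phase1 = manifests_for_phase(manifests, 1)
phase2 = manifests_for_phase(manifests, 2)
```

Because the groups are disjoint, keeping delete files out of data manifests is what lets each phase skip the other group entirely.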
>> Let's plan on talking about these questions in tomorrow's sync. And if you have other topics, please send them to me!
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>> --
>> Best Regards
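As a footnote on Junjie's question earlier in the thread about a service that converts equality delete files to position delete files: conceptually, the conversion scans the data files the equality delete applies to and records the positions of matching rows. A toy Python sketch, with a made-up row model and file layout (nothing here is Iceberg's actual API):

```python
def equality_to_position_deletes(data_files, equality_predicate):
    """Rewrite an equality delete (a predicate on column values) into
    position deletes (explicit (file_path, row_position) pairs) by scanning
    the affected data files."""
    position_deletes = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if equality_predicate(row):
                position_deletes.append((path, pos))
    return position_deletes

# Toy example: delete all rows with id == 2 across two data files.
data_files = {
    "f1.parquet": [{"id": 1}, {"id": 2}],
    "f2.parquet": [{"id": 2}],
}
position_deletes = equality_to_position_deletes(data_files, lambda row: row["id"] == 2)
```

The trade-off this illustrates: equality deletes are cheap to write (no scan needed at delete time) but push the matching work to readers, while position deletes cost a scan up front and are cheap to apply.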