Hi Wing Yew Poon,

Here is my understanding, though it is not necessarily how every engine implements it. An engine should consider only the columns listed in equality_ids when applying equality deletes, and it should ignore the other columns in the delete file. So in the case you described, the row with id 3 is still deleted even though the name doesn't match:

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar
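As a rough illustration of the rule described above, here is a minimal Python sketch. It is not Iceberg code: the function names (`matches`, `apply_equality_deletes`) and the dict-based row layout (field id to value) are assumptions made up for this example. It only shows the idea that columns outside equality_ids do not take part in the comparison.

```python
from typing import Any, Dict, List

# Hypothetical row representation for this sketch: field id -> value.
Row = Dict[int, Any]


def matches(delete_row: Row, data_row: Row, equality_ids: List[int]) -> bool:
    """Return True if the rows agree on every field listed in equality_ids.

    Fields that are not in equality_ids are never inspected, so extra
    columns stored in the delete file cannot affect the result.
    None == None counts as a match in this sketch.
    """
    return all(delete_row.get(fid) == data_row.get(fid) for fid in equality_ids)


def apply_equality_deletes(
    data_rows: List[Row], delete_rows: List[Row], equality_ids: List[int]
) -> List[Row]:
    """Keep only the data rows that no delete row matches."""
    return [
        row
        for row in data_rows
        if not any(matches(d, row, equality_ids) for d in delete_rows)
    ]


# Table from the spec example: 1 = id, 2 = category, 3 = name.
data = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]

# Delete file that carries all columns, with a name that does not match id 3.
deletes = [{1: 3, 2: None, 3: "Polar"}]

remaining = apply_equality_deletes(data, deletes, equality_ids=[1])
print([row[1] for row in remaining])  # [1, 2, 4]
```

If equality_ids were [1, 3] instead, the same delete row would remove nothing in this sketch, because no data row has both id 3 and name Polar.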
To verify the behavior, we can check a test case such as TestSparkReaderDeletes::testReadEqualityDeleteRows.

Yufei

On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Hi,
>
> I have some questions on the current Iceberg spec regarding equality deletes:
> https://iceberg.apache.org/spec/#equality-delete-files
> The spec says that for "a table with the following data:
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  1     | marsupial   | Koala
>  2     | toy         | Teddy
>  3     | NULL        | Grizzly
>  4     | NULL        | Polar
>
> The delete id = 3 could be written as either of the following equality delete files:
>
> equality_ids=[1]
>
>  1: id
> -------
>  3
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Grizzly
> "
>
> 1. Are the options either (a) write only the column(s) listed in equality_ids or (b) write all the columns? i.e., no in between.
> 2. If we write all the columns, are only columns listed in equality_ids considered? What happens if a non-equality_id column does not match? e.g.,
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
> Is that (a) invalid, or does that (b) still result in deleting id = 3, or (c) result in deleting no rows?
>
> The spec says "Each row of the delete file produces one equality predicate that matches any row where the delete columns are equal. Multiple columns can be thought of as an AND of equality predicates." That could be interpreted to mean (c).
>
> Thanks,
> Wing Yew