Hi Yufei,

Thank you for your response. It sounds like on 2, your thinking is that (b) is the correct behavior. Indeed, I have tried it out with Spark and afaict, it does (b). However, that does not mean that it is the correct behavior. The spec should clearly define it.

- Wing Yew

On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi Wing Yew Poon,
>
> Here is my understanding, but not necessarily how an engine implements it.
> It should only consider the columns in equality_ids when we apply eq
> deletes. Also the engine should ignore the unrelated columns.
> It will still delete the row with id 3 in the following case you described
> even if the name doesn't match.
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
> To verify the behavior, we can check the test case
> like TestSparkReaderDeletes::testReadEqualityDeleteRows.
>
> Yufei
>
>
> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> Hi,
>>
>> I have some questions on the current Iceberg spec regarding equality
>> deletes:
>> https://iceberg.apache.org/spec/#equality-delete-files
>> The spec says that for "a table with the following data:
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  1     | marsupial   | Koala
>>  2     | toy         | Teddy
>>  3     | NULL        | Grizzly
>>  4     | NULL        | Polar
>>
>> The delete id = 3 could be written as either of the following equality
>> delete files:
>>
>> equality_ids=[1]
>>
>>  1: id
>> -------
>>  3
>>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Grizzly
>>
>> "
>>
>> 1. Are the options either (a) write only the column(s) listed in
>> equality_ids or (b) write all the columns? i.e., no in between.
>> 2. If we write all the columns, are only columns listed in equality_ids
>> considered? What happens if a non-equality_id column does not match? e.g.,
>>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Polar
>>
>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>> or (c) result in deleting no rows?
>>
>> The spec says "Each row of the delete file produces one equality
>> predicate that matches any row where the delete columns are equal. Multiple
>> columns can be thought of as an AND of equality predicates." That could
>> be interpreted to mean (c).
>>
>> Thanks,
>> Wing Yew
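
To make the difference between interpretations (b) and (c) in question 2 concrete, here is a small, self-contained Python sketch. It is only a toy model of the semantics being discussed, not Iceberg or Spark code: the names FIELDS, DATA, and apply_equality_deletes are hypothetical, and treating NULL as equal to NULL is an assumption of the sketch.

```python
# Toy model of equality deletes; not Iceberg's implementation.
# Field id -> column name, taken from the spec's example table.
FIELDS = {1: "id", 2: "category", 3: "name"}

DATA = [
    {"id": 1, "category": "marsupial", "name": "Koala"},
    {"id": 2, "category": "toy", "name": "Teddy"},
    {"id": 3, "category": None, "name": "Grizzly"},
    {"id": 4, "category": None, "name": "Polar"},
]


def apply_equality_deletes(rows, delete_rows, compare_ids):
    """Return the rows that survive the deletes.

    A data row is deleted if, for some delete row, every column whose
    field id is in compare_ids is equal (an AND of equality predicates).
    Assumption of this sketch: None (NULL) compares equal to None.
    """
    cols = [FIELDS[field_id] for field_id in compare_ids]

    def is_deleted(row):
        return any(all(row[c] == d[c] for c in cols) for d in delete_rows)

    return [row for row in rows if not is_deleted(row)]


# The delete file from question 2: all columns are written, but
# equality_ids=[1], and the name column does not match the row with id 3.
delete_row = {"id": 3, "category": None, "name": "Polar"}

# Interpretation (b): compare only the columns listed in equality_ids.
survivors_b = apply_equality_deletes(DATA, [delete_row], [1])

# Interpretation (c): compare every column present in the delete file.
survivors_c = apply_equality_deletes(DATA, [delete_row], [1, 2, 3])

print([row["id"] for row in survivors_b])
print([row["id"] for row in survivors_c])
```

Under (b) the row with id 3 is removed even though the name differs; under (c) no data row matches on all three columns, so nothing is removed.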