To me, (b) is the right behavior; we may just need to be clearer in the spec doc. I'm open to suggestions in case I missed something.
Yufei

On Mon, Apr 15, 2024 at 11:02 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi, Wing:
>
> I totally agree that we should clearly define the expected behavior in
> spec. I lean towards a), e.g. the row should be completely ignored or
> completely the same as the original row; an intermediate state should be
> defined as invalid.
>
> On Tue, Apr 16, 2024 at 8:40 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> Hi Yufei,
>> Thank you for your response.
>> It sounds like on 2, your thinking is that (b) is the correct behavior.
>> Indeed, I have tried it out with Spark and afaict, it does (b). However,
>> that does not mean that it is the correct behavior. The spec should clearly
>> define it.
>> - Wing Yew
>>
>>
>> On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi Wing Yew Poon,
>>>
>>> Here is my understanding, but not necessarily how an engine implements
>>> it.
>>> It should only consider the columns in equality_ids when we apply eq
>>> deletes. Also the engine should ignore the unrelated columns.
>>> It will still delete the row with id 3 in the following case you
>>> described even if the name doesn't match.
>>>
>>> equality_ids=[1]
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  3     | NULL        | Polar
>>>
>>> To verify the behavior, we can check the test case
>>> like TestSparkReaderDeletes::testReadEqualityDeleteRows.
>>>
>>> Yufei
>>>
>>>
>>> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have some questions on the current Iceberg spec regarding equality
>>>> deletes:
>>>> https://iceberg.apache.org/spec/#equality-delete-files
>>>> The spec says that for "a table with the following data:
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  1     | marsupial   | Koala
>>>>  2     | toy         | Teddy
>>>>  3     | NULL        | Grizzly
>>>>  4     | NULL        | Polar
>>>>
>>>> The delete id = 3 could be written as either of the following equality
>>>> delete files:
>>>>
>>>> equality_ids=[1]
>>>>
>>>>  1: id
>>>> -------
>>>>  3
>>>>
>>>> equality_ids=[1]
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  3     | NULL        | Grizzly
>>>>
>>>> "
>>>>
>>>> 1. Are the options either (a) write only the column(s) listed in
>>>> equality_ids or (b) write all the columns? i.e., no in between.
>>>> 2. If we write all the columns, are only columns listed in equality_ids
>>>> considered? What happens if a non-equality_id column does not match?
>>>> e.g.,
>>>>
>>>> equality_ids=[1]
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  3     | NULL        | Polar
>>>>
>>>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>>>> or (c) result in deleting no rows?
>>>>
>>>> The spec says "Each row of the delete file produces one equality
>>>> predicate that matches any row where the delete columns are equal. Multiple
>>>> columns can be thought of as an AND of equality predicates." That
>>>> could be interpreted to mean (c).
>>>>
>>>> Thanks,
>>>> Wing Yew
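
To make readings (b) and (c) of question 2 concrete, here is a minimal Python sketch. It is not Iceberg code and says nothing about how any engine implements equality deletes: the table rows and the delete row are the ones quoted in the thread, while `FIELD_NAMES`, `surviving_rows`, and the `compare_all_columns` flag are hypothetical names introduced only for illustration. Reading (a), rejecting the delete file as invalid, is not modeled.

```python
# Field id -> column name, as in the spec example quoted in the thread.
FIELD_NAMES = {1: "id", 2: "category", 3: "name"}

# Table data from the spec example (NULL is modeled as None).
TABLE = [
    {"id": 1, "category": "marsupial", "name": "Koala"},
    {"id": 2, "category": "toy", "name": "Teddy"},
    {"id": 3, "category": None, "name": "Grizzly"},
    {"id": 4, "category": None, "name": "Polar"},
]

# Delete-file row from question 2: the name does not match the row with id 3.
DELETE_ROW = {"id": 3, "category": None, "name": "Polar"}
EQUALITY_IDS = [1]


def surviving_rows(rows, delete_row, equality_ids, compare_all_columns=False):
    """Return the rows left after applying one equality-delete row.

    compare_all_columns=False: reading (b), only the equality_ids columns
    take part in the predicate and the other columns are ignored.
    compare_all_columns=True: reading (c), every column written in the
    delete file must match (AND of all equality predicates).
    """
    if compare_all_columns:
        columns = list(delete_row)
    else:
        columns = [FIELD_NAMES[field_id] for field_id in equality_ids]
    # In this sketch None == None is True, so a NULL in the delete row
    # matches a NULL in the data row.
    return [
        row for row in rows
        if not all(row[col] == delete_row[col] for col in columns)
    ]


reading_b = [r["id"] for r in surviving_rows(TABLE, DELETE_ROW, EQUALITY_IDS)]
reading_c = [
    r["id"]
    for r in surviving_rows(TABLE, DELETE_ROW, EQUALITY_IDS, compare_all_columns=True)
]
print(reading_b)  # [1, 2, 4]
print(reading_c)  # [1, 2, 3, 4]
```

Under reading (b) the row with id 3 is deleted even though the name in the delete file is "Polar"; under reading (c) no table row matches on all three columns, so nothing is deleted.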