Hi Wing, I totally agree that we should clearly define the expected behavior in the spec. I lean towards (a), i.e., the row should either be completely ignored or be completely the same as the original row; any intermediate state should be defined as invalid.
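To make the three readings from question 2 concrete, here is a minimal sketch in plain Python. It is not Iceberg or Spark code: SCHEMA, matches, apply_b, apply_c and is_valid_under_a are hypothetical names made up for illustration, and the check for (a) is only one possible way to read "completely the same as the original row".

```python
# Illustration only: plain Python, not Iceberg code. All names are hypothetical.

# Field id -> column name, as in the spec's example table.
SCHEMA = {1: "id", 2: "category", 3: "name"}

DATA = [
    {"id": 1, "category": "marsupial", "name": "Koala"},
    {"id": 2, "category": "toy", "name": "Teddy"},
    {"id": 3, "category": None, "name": "Grizzly"},
    {"id": 4, "category": None, "name": "Polar"},
]


def matches(row, delete_row, columns):
    """True if row equals delete_row on every listed column (None equals None)."""
    return all(row[c] == delete_row[c] for c in columns)


def apply_b(data, delete_rows, equality_ids):
    """Reading (b): compare only the columns named by equality_ids."""
    cols = [SCHEMA[i] for i in equality_ids]
    return [r for r in data if not any(matches(r, d, cols) for d in delete_rows)]


def apply_c(data, delete_rows):
    """Reading (c): compare every column that was written to the delete file."""
    return [r for r in data if not any(matches(r, d, list(d)) for d in delete_rows)]


def is_valid_under_a(data, delete_row, equality_ids):
    """Reading (a), as assumed here: a delete row is valid if it carries only the
    equality columns, or if it equals some data row on every column it carries."""
    cols = {SCHEMA[i] for i in equality_ids}
    if set(delete_row) == cols:
        return True
    return any(matches(r, delete_row, list(delete_row)) for r in data)


# The delete row from question 2: id matches row 3, but name does not.
mismatched = {"id": 3, "category": None, "name": "Polar"}

remaining_b = [r["id"] for r in apply_b(DATA, [mismatched], [1])]
remaining_c = [r["id"] for r in apply_c(DATA, [mismatched])]
valid_a = is_valid_under_a(DATA, mismatched, [1])
```

With the mismatched delete row, reading (b) removes id 3, reading (c) removes nothing, and the check assumed for (a) flags the delete row as invalid.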
On Tue, Apr 16, 2024 at 8:40 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Hi Yufei,
> Thank you for your response.
> It sounds like on 2, your thinking is that (b) is the correct behavior.
> Indeed, I have tried it out with Spark and afaict, it does (b). However,
> that does not mean that it is the correct behavior. The spec should clearly
> define it.
> - Wing Yew
>
> On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Hi Wing Yew Poon,
>>
>> Here is my understanding, but not necessarily how an engine implements it.
>> It should only consider the columns in equality_ids when we apply eq
>> deletes. Also the engine should ignore the unrelated columns.
>> It will still delete the row with id 3 in the following case you
>> described even if the name doesn't match.
>>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Polar
>>
>> To verify the behavior, we can check the test case
>> like TestSparkReaderDeletes::testReadEqualityDeleteRows.
>>
>> Yufei
>>
>> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon
>> <wyp...@cloudera.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I have some questions on the current Iceberg spec regarding equality
>>> deletes:
>>> https://iceberg.apache.org/spec/#equality-delete-files
>>> The spec says that for "a table with the following data:
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  1     | marsupial   | Koala
>>>  2     | toy         | Teddy
>>>  3     | NULL        | Grizzly
>>>  4     | NULL        | Polar
>>>
>>> The delete id = 3 could be written as either of the following equality
>>> delete files:
>>>
>>> equality_ids=[1]
>>>
>>>  1: id
>>> -------
>>>  3
>>>
>>> equality_ids=[1]
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  3     | NULL        | Grizzly
>>>
>>> "
>>>
>>> 1. Are the options either (a) write only the column(s) listed in
>>> equality_ids or (b) write all the columns? i.e, no in between.
>>> 2. If we write all the columns, are only columns listed in equality_ids
>>> considered? What happens if a non-equality_id column does not match?
>>> e.g.,
>>>
>>> equality_ids=[1]
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  3     | NULL        | Polar
>>>
>>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>>> or (c) result in deleting no rows?
>>>
>>> The spec says "Each row of the delete file produces one equality
>>> predicate that matches any row where the delete columns are equal. Multiple
>>> columns can be thought of as an AND of equality predicates." That could
>>> be interpreted to mean (c).
>>>
>>> Thanks,
>>> Wing Yew