For me, (b) is the right behavior; we may just need to be clearer in the spec
doc. I'm open to suggestions in case I missed something.
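
To make the difference between (b) and (c) concrete, here is a minimal Python
sketch of the example quoted below. It is only an illustration, not how
Iceberg or any engine implements equality deletes: the names matches_b,
matches_c and surviving_ids, and the dict-based row model, are all made up
for this sketch.

```python
from typing import Any, Callable, Dict, List

# A row maps field id -> value. None stands in for NULL; Python's
# None == None mirrors the spec rule that a NULL in a delete column
# matches a row whose value is NULL.
Row = Dict[int, Any]

# The table from the spec example: 1: id | 2: category | 3: name
TABLE: List[Row] = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]

# The delete row under discussion: the name does not match the row with id 3.
DELETE_ROW: Row = {1: 3, 2: None, 3: "Polar"}
EQUALITY_IDS: List[int] = [1]


def matches_b(data_row: Row, delete_row: Row, equality_ids: List[int]) -> bool:
    """Interpretation (b): compare only the columns listed in equality_ids."""
    return all(data_row.get(fid) == delete_row.get(fid) for fid in equality_ids)


def matches_c(data_row: Row, delete_row: Row, equality_ids: List[int]) -> bool:
    """Interpretation (c): every column written in the delete row must match."""
    return all(data_row.get(fid) == value for fid, value in delete_row.items())


def surviving_ids(match: Callable[[Row, Row, List[int]], bool]) -> List[int]:
    """Return the ids of the rows that remain after applying DELETE_ROW."""
    return [row[1] for row in TABLE if not match(row, DELETE_ROW, EQUALITY_IDS)]


if __name__ == "__main__":
    print("(b):", surviving_ids(matches_b))
    print("(c):", surviving_ids(matches_c))
```

Under (b) the row with id 3 is deleted even though the name differs; under
(c) nothing is deleted, because no row has id 3 together with the name Polar.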

Yufei


On Mon, Apr 15, 2024 at 11:02 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi, Wing:
>
> I totally agree that we should clearly define the expected behavior in the
> spec. I lean towards (a), i.e. the row should be either completely ignored
> or completely the same as the original row; any intermediate state should
> be defined as invalid.
>
> On Tue, Apr 16, 2024 at 8:40 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> Hi Yufei,
>> Thank you for your response.
>> It sounds like on 2, your thinking is that (b) is the correct behavior.
>> Indeed, I have tried it out with Spark and afaict, it does (b). However,
>> that does not mean that it is the correct behavior. The spec should clearly
>> define it.
>> - Wing Yew
>>
>>
>> On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi Wing Yew Poon,
>>>
>>> Here is my understanding, which is not necessarily how an engine
>>> implements it.
>>> Only the columns in equality_ids should be considered when we apply
>>> equality deletes; the engine should ignore the unrelated columns.
>>> So it will still delete the row with id 3 in the following case you
>>> described, even though the name doesn't match.
>>> equality_ids=[1]
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  3     | NULL        | Polar
>>>
>>> To verify the behavior, we can check a test case such as
>>> TestSparkReaderDeletes::testReadEqualityDeleteRows.
>>>
>>> Yufei
>>>
>>>
>>> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have some questions on the current Iceberg spec regarding equality
>>>> deletes:
>>>> https://iceberg.apache.org/spec/#equality-delete-files
>>>> The spec says that for "a table with the following data:
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  1     | marsupial   | Koala
>>>>  2     | toy         | Teddy
>>>>  3     | NULL        | Grizzly
>>>>  4     | NULL        | Polar
>>>>
>>>> The delete id = 3 could be written as either of the following equality
>>>> delete files:
>>>>
>>>> equality_ids=[1]
>>>>
>>>>  1: id
>>>> -------
>>>>  3
>>>>
>>>> equality_ids=[1]
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  3     | NULL        | Grizzly
>>>>
>>>> "
>>>>
>>>> 1. Are the options either (a) write only the column(s) listed in
>>>> equality_ids or (b) write all the columns, i.e., nothing in between?
>>>> 2. If we write all the columns, are only columns listed in equality_ids
>>>> considered? What happens if a non-equality_id column does not match? e.g.,
>>>>
>>>> equality_ids=[1]
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  3     | NULL        | Polar
>>>>
>>>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>>>> or (c) result in deleting no rows?
>>>>
>>>> The spec says "Each row of the delete file produces one equality
>>>> predicate that matches any row where the delete columns are equal. Multiple
>>>> columns can be thought of as an AND of equality predicates." That
>>>> could be interpreted to mean (c).
>>>>
>>>> Thanks,
>>>> Wing Yew
>>>>
>>>>
