Re: spec question on equality deletes

Renjie Liu Mon, 15 Apr 2024 23:04:37 -0700

Hi, Wing:

I totally agree that we should clearly define the expected behavior in
the spec. I lean towards (a), i.e. the row should be either completely
ignored or completely the same as the original row; any intermediate state
should be defined as invalid.


On Tue, Apr 16, 2024 at 8:40 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
wrote:

> Hi Yufei,
> Thank you for your response.
> It sounds like on 2, your thinking is that (b) is the correct behavior.
> Indeed, I have tried it out with Spark and afaict, it does (b). However,
> that does not mean that it is the correct behavior. The spec should clearly
> define it.
> - Wing Yew
>
>
> On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Hi Wing Yew Poon,
>>
>> Here is my understanding, but not necessarily how an engine implements it.
>> It should only consider the columns in equality_ids when we apply eq
>> deletes. Also the engine should ignore the unrelated columns.
>> It will still delete the row with id 3 in the following case you
>> described even if the name doesn't match.
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Polar
>>
>> To verify the behavior, we can check a test case such as
>> TestSparkReaderDeletes::testReadEqualityDeleteRows.
>>
>> Yufei
>>
>>
>> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon
>> <wyp...@cloudera.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I have some questions on the current Iceberg spec regarding equality
>>> deletes:
>>> https://iceberg.apache.org/spec/#equality-delete-files
>>> The spec says that for "a table with the following data:
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  1     | marsupial   | Koala
>>>  2     | toy         | Teddy
>>>  3     | NULL        | Grizzly
>>>  4     | NULL        | Polar
>>>
>>> The delete id = 3 could be written as either of the following equality
>>> delete files:
>>>
>>> equality_ids=[1]
>>>
>>>  1: id
>>> -------
>>>  3
>>>
>>> equality_ids=[1]
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  3     | NULL        | Grizzly
>>>
>>> "
>>>
>>> 1. Are the options either (a) write only the column(s) listed in
>>> equality_ids or (b) write all the columns, i.e., with nothing in between?
>>> 2. If we write all the columns, are only columns listed in equality_ids
>>> considered? What happens if a non-equality_id column does not match? e.g.,
>>>
>>> equality_ids=[1]
>>>
>>>  1: id | 2: category | 3: name
>>> -------|-------------|---------
>>>  3     | NULL        | Polar
>>>
>>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>>> or (c) result in deleting no rows?
>>>
>>> The spec says "Each row of the delete file produces one equality
>>> predicate that matches any row where the delete columns are equal. Multiple
>>> columns can be thought of as an AND of equality predicates." That could
>>> be interpreted to mean (c).
>>>
>>> Thanks,
>>> Wing Yew
>>>
>>>
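
To make the three readings in question 2 concrete, here is a minimal Python sketch that applies the example delete row (id = 3, category = NULL, name = Polar) to the example table under each interpretation. It is only an in-memory model, not Iceberg's API: FIELD_NAMES, matches, and apply_equality_delete are hypothetical names, and the way (a) detects an invalid delete row (an extra column disagreeing with a row selected by the equality_ids columns) is an assumption, since the thread does not define how validity would be checked.

```python
# Hypothetical in-memory model of the example in this thread; not Iceberg code.
FIELD_NAMES = {1: "id", 2: "category", 3: "name"}

TABLE = [
    {"id": 1, "category": "marsupial", "name": "Koala"},
    {"id": 2, "category": "toy", "name": "Teddy"},
    {"id": 3, "category": None, "name": "Grizzly"},
    {"id": 4, "category": None, "name": "Polar"},
]


def matches(data_row, delete_row, columns):
    # None == None is True in Python, so a NULL in the delete row
    # matches a NULL in the data row.
    return all(data_row[c] == delete_row[c] for c in columns)


def apply_equality_delete(table, delete_row, equality_ids, interpretation):
    """Return the rows that survive the delete under reading 'a', 'b' or 'c'."""
    key_cols = [FIELD_NAMES[i] for i in equality_ids]
    extra_cols = [c for c in delete_row if c not in key_cols]

    if interpretation == "a":
        # (a) Assumption: the delete row is invalid if an extra column
        # disagrees with a data row selected by the key columns.
        for row in table:
            if matches(row, delete_row, key_cols) and not matches(
                row, delete_row, extra_cols
            ):
                raise ValueError("delete row disagrees with data row on extra columns")
        cols = key_cols
    elif interpretation == "b":
        # (b) Only the columns named in equality_ids are compared.
        cols = key_cols
    elif interpretation == "c":
        # (c) Every column present in the delete row is compared (AND of all).
        cols = list(delete_row)
    else:
        raise ValueError(f"unknown interpretation: {interpretation!r}")

    return [row for row in table if not matches(row, delete_row, cols)]


mismatched = {"id": 3, "category": None, "name": "Polar"}
ids_b = [r["id"] for r in apply_equality_delete(TABLE, mismatched, [1], "b")]
ids_c = [r["id"] for r in apply_equality_delete(TABLE, mismatched, [1], "c")]
print(ids_b)  # [1, 2, 4]: id = 3 is deleted even though name differs
print(ids_c)  # [1, 2, 3, 4]: no row matches on all three columns
```

Under (a) the same call raises ValueError in this model. With the spec's own delete row (name = Grizzly) all three readings agree and remove id = 3, which is why the mismatched row is the case the spec would need to settle.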
