Re: spec question on equality deletes

Wing Yew Poon Mon, 15 Apr 2024 17:40:22 -0700

Hi Yufei,
Thank you for your response.
It sounds like, on question 2, you think (b) is the correct behavior.
Indeed, I have tried it out with Spark and, as far as I can tell, it does (b). However,
that does not mean that it is the correct behavior. The spec should clearly
define it.
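
To make the difference between readings (b) and (c) concrete, here is a minimal
sketch in plain Python. It is not Iceberg or Spark code: the function
apply_eq_deletes and the dict-based row layout are hypothetical, made up only
for illustration. It assumes that NULL matches NULL and uses the spec's example
table with the mismatched delete row from question 2.

```python
# Illustrative only: a plain-Python model of the two readings, not Iceberg code.
# Keys are field IDs from the spec example: 1 = id, 2 = category, 3 = name.
data_rows = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]

# Delete file with equality_ids=[1] whose extra columns do not match row 3.
equality_ids = [1]
delete_rows = [{1: 3, 2: None, 3: "Polar"}]


def apply_eq_deletes(rows, deletes, compare_ids):
    """Return the rows that survive the deletes.

    A row is deleted if it equals some delete row on every field ID in
    compare_ids (an AND of equality predicates). None is treated as equal
    to None, which is an assumption of this sketch.
    """
    def matches(row, delete):
        return all(row[fid] == delete[fid] for fid in compare_ids)

    return [r for r in rows if not any(matches(r, d) for d in deletes)]


# Reading (b): compare only the columns listed in equality_ids.
surviving_b = apply_eq_deletes(data_rows, delete_rows, equality_ids)

# Reading (c): compare every column written in the delete file.
surviving_c = apply_eq_deletes(data_rows, delete_rows, [1, 2, 3])

print([r[1] for r in surviving_b])  # [1, 2, 4]
print([r[1] for r in surviving_c])  # [1, 2, 3, 4]
```

Under (b) the row with id = 3 is deleted; under (c) nothing is deleted, because
the name column differs.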
- Wing Yew



On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <[email protected]> wrote:

> Hi Wing Yew Poon,
>
> Here is my understanding, which is not necessarily how an engine implements it.
> When applying equality deletes, an engine should consider only the columns
> listed in equality_ids and ignore all other columns.
> So the delete file below will still delete the row with id 3 in the case you
> described, even though the name doesn't match.
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
> To verify the behavior, we can check a test case such as
> TestSparkReaderDeletes::testReadEqualityDeleteRows.
>
> Yufei
>
>
> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have some questions on the current Iceberg spec regarding equality
>> deletes:
>> https://iceberg.apache.org/spec/#equality-delete-files
>> The spec says that for "a table with the following data:
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  1     | marsupial   | Koala
>>  2     | toy         | Teddy
>>  3     | NULL        | Grizzly
>>  4     | NULL        | Polar
>>
>> The delete id = 3 could be written as either of the following equality
>> delete files:
>>
>> equality_ids=[1]
>>
>>  1: id
>> -------
>>  3
>>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Grizzly
>>
>> "
>>
>> 1. Are the only options either (a) write only the column(s) listed in
>> equality_ids or (b) write all the columns? i.e., nothing in between.
>> 2. If we write all the columns, are only columns listed in equality_ids
>> considered? What happens if a non-equality_id column does not match? e.g.,
>>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Polar
>>
>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>> or (c) result in deleting no rows?
>>
>> The spec says "Each row of the delete file produces one equality
>> predicate that matches any row where the delete columns are equal. Multiple
>> columns can be thought of as an AND of equality predicates." That could
>> be interpreted to mean (c).
>>
>> Thanks,
>> Wing Yew
>>
>>
