Re: spec question on equality deletes

Yufei Gu Mon, 15 Apr 2024 17:25:46 -0700

Hi Wing Yew Poon,

Here is my understanding, though it is not necessarily how every engine
implements it. When applying equality deletes, only the columns listed in
equality_ids should be considered; the engine should ignore any other
columns present in the delete file. So in the following case you described,
the row with id 3 is still deleted even though the name doesn't match.
equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar
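
As a minimal sketch of this reading (plain Python, not Iceberg's actual
implementation), rows can be modeled as dicts keyed by field ID. The function
name and the row model are hypothetical, and sequence numbers and partition
scoping are left out for brevity. Only the fields in equality_ids take part in
the comparison:

```python
def apply_equality_deletes(data_rows, delete_rows, equality_ids):
    """Return the data rows that no delete row matches.

    Hypothetical helper: rows are dicts keyed by field ID. Only the fields
    listed in equality_ids are compared; any other column present in a
    delete row is ignored.
    """
    # Build the set of key tuples from the delete rows, using only equality_ids.
    delete_keys = {tuple(d[fid] for fid in equality_ids) for d in delete_rows}
    return [
        row for row in data_rows
        if tuple(row[fid] for fid in equality_ids) not in delete_keys
    ]


# Table from the spec example: 1 = id, 2 = category, 3 = name.
data = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]

# Delete row whose name ("Polar") does not match the data row with id 3.
deletes = [{1: 3, 2: None, 3: "Polar"}]

remaining = apply_equality_deletes(data, deletes, equality_ids=[1])
print([row[1] for row in remaining])  # [1, 2, 4]
```

With equality_ids=[1] the mismatched name is never compared, so id 3 is
removed. If equality_ids were [1, 3] instead, the same delete row would match
nothing, because no data row has both id 3 and name Polar.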


To verify the behavior, we can check a test case such as
TestSparkReaderDeletes::testReadEqualityDeleteRows.

Yufei


On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
wrote:

> Hi,
>
> I have some questions on the current Iceberg spec regarding equality
> deletes:
> https://iceberg.apache.org/spec/#equality-delete-files
> The spec says that for "a table with the following data:
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  1     | marsupial   | Koala
>  2     | toy         | Teddy
>  3     | NULL        | Grizzly
>  4     | NULL        | Polar
>
> The delete id = 3 could be written as either of the following equality
> delete files:
>
> equality_ids=[1]
>
>  1: id
> -------
>  3
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Grizzly
>
> "
>
> 1. Are the options either (a) write only the column(s) listed in
> equality_ids or (b) write all the columns? i.e., nothing in between.
> 2. If we write all the columns, are only columns listed in equality_ids
> considered? What happens if a non-equality_id column does not match? e.g.,
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
> Is that (a) invalid, or does that (b) still result in deleting id = 3, or
> (c) result in deleting no rows?
>
> The spec says "Each row of the delete file produces one equality
> predicate that matches any row where the delete columns are equal. Multiple
> columns can be thought of as an AND of equality predicates." That could
> be interpreted to mean (c).
>
> Thanks,
> Wing Yew
>
>
