+1 on defining it clearly in the spec. Note that the “spec doc” is the spec
itself, which requires a more precise description than ordinary
documentation. We may also need spec tests to check whether a compute engine
conforms to the spec, not the other way around.
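
To make the difference concrete, here is a minimal Python sketch of what such
a spec test could assert. It is not Iceberg code: the helper names
(surviving_ids_b, surviving_ids_c) and the dict-per-row layout are made up
for illustration, and it only models interpretations (b) and (c) from Wing
Yew's question 2 on the spec's example table.

```python
# Rows keyed by field id, as in the spec example: 1 = id, 2 = category, 3 = name.
TABLE = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]


def surviving_ids_b(table, delete_rows, equality_ids):
    """Interpretation (b): only the columns in equality_ids form the predicate."""
    def deleted(row):
        return any(all(row[f] == d[f] for f in equality_ids) for d in delete_rows)
    return [row[1] for row in table if not deleted(row)]


def surviving_ids_c(table, delete_rows):
    """Interpretation (c): every column written in the delete file is ANDed."""
    def deleted(row):
        return any(all(row[f] == v for f, v in d.items()) for d in delete_rows)
    return [row[1] for row in table if not deleted(row)]


# Delete row from question 2: the id matches row 3, but the name does not.
MISMATCHED = [{1: 3, 2: None, 3: "Polar"}]

print(surviving_ids_b(TABLE, MISMATCHED, equality_ids=[1]))  # [1, 2, 4]
print(surviving_ids_c(TABLE, MISMATCHED))                    # [1, 2, 3, 4]
```

Under (b) the row with id 3 is deleted; under (c) nothing is deleted. A real
spec test would run the same delete file through each engine and compare
against whichever result the spec ends up mandating.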

On Wed, Apr 17, 2024 at 1:08 AM Yufei Gu <flyrain...@gmail.com> wrote:

> For me, (b) is the right behavior. We may just need to be clearer in the
> spec doc, but I'm open to suggestions in case I missed something.
>
> Yufei
>
>
> On Mon, Apr 15, 2024 at 11:02 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
>
>> Hi, Wing:
>>
>> I totally agree that we should clearly define the expected behavior in
>> the spec. I lean towards (a), i.e. the row should either be completely
>> ignored or be completely the same as the original row; any intermediate
>> state should be defined as invalid.
>>
>> On Tue, Apr 16, 2024 at 8:40 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi Yufei,
>>> Thank you for your response.
>>> It sounds like on 2, your thinking is that (b) is the correct behavior.
>>> Indeed, I have tried it out with Spark and afaict, it does (b). However,
>>> that does not mean that it is the correct behavior. The spec should clearly
>>> define it.
>>> - Wing Yew
>>>
>>>
>>> On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>>> Hi Wing Yew Poon,
>>>>
>>>> Here is my understanding, which is not necessarily how an engine
>>>> implements it.
>>>> Only the columns in equality_ids should be considered when applying eq
>>>> deletes; the engine should ignore the unrelated columns.
>>>> The row with id 3 would still be deleted in the following case you
>>>> described, even though the name doesn't match.
>>>> equality_ids=[1]
>>>>
>>>>  1: id | 2: category | 3: name
>>>> -------|-------------|---------
>>>>  3     | NULL        | Polar
>>>>
>>>> To verify the behavior, we can check a test case such as
>>>> TestSparkReaderDeletes::testReadEqualityDeleteRows.
>>>>
>>>> Yufei
>>>>
>>>>
>>>> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon
>>>> <wyp...@cloudera.com.invalid> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have some questions on the current Iceberg spec regarding equality
>>>>> deletes:
>>>>> https://iceberg.apache.org/spec/#equality-delete-files
>>>>> The spec says that for "a table with the following data:
>>>>>
>>>>>  1: id | 2: category | 3: name
>>>>> -------|-------------|---------
>>>>>  1     | marsupial   | Koala
>>>>>  2     | toy         | Teddy
>>>>>  3     | NULL        | Grizzly
>>>>>  4     | NULL        | Polar
>>>>>
>>>>> The delete id = 3 could be written as either of the following
>>>>> equality delete files:
>>>>>
>>>>> equality_ids=[1]
>>>>>
>>>>>  1: id
>>>>> -------
>>>>>  3
>>>>>
>>>>> equality_ids=[1]
>>>>>
>>>>>  1: id | 2: category | 3: name
>>>>> -------|-------------|---------
>>>>>  3     | NULL        | Grizzly
>>>>>
>>>>> "
>>>>>
>>>>> 1. Are the only options (a) write only the column(s) listed in
>>>>> equality_ids or (b) write all the columns, i.e., nothing in between?
>>>>> 2. If we write all the columns, are only the columns listed in
>>>>> equality_ids considered? What happens if a non-equality_id column does
>>>>> not match? e.g.,
>>>>>
>>>>> equality_ids=[1]
>>>>>
>>>>>  1: id | 2: category | 3: name
>>>>> -------|-------------|---------
>>>>>  3     | NULL        | Polar
>>>>>
>>>>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>>>>> or (c) result in deleting no rows?
>>>>>
>>>>> The spec says "Each row of the delete file produces one equality
>>>>> predicate that matches any row where the delete columns are equal.
>>>>> Multiple columns can be thought of as an AND of equality predicates."
>>>>> That could be interpreted to mean (c).
>>>>>
>>>>> Thanks,
>>>>> Wing Yew
>>>>>
>>>>>
