Hi, Wing:

1. Are the options either (a) write only the column(s) listed in equality_ids 
or (b) write all the columns? i.e., nothing in between.


Yes, I think so.

2. If we write all the columns, are only columns listed in equality_ids 
considered? What happens if a non-equality_id column does not match? e.g.,

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar

Is that (a) invalid, or does that (b) still result in deleting id = 3, or (c) 
result in deleting no rows?

Which columns are considered depends on the stage:

  *   Only columns listed in equality_ids are considered when applying deletions.
  *   If other columns are filled in, they are considered during planning, e.g. 
they help prune the set of equality delete files that need to be applied to a data file.
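
To make the first point concrete, here is a minimal Python sketch of how I read "only columns listed in equality_ids are considered when applying deletions". It is not Iceberg code: the function name and the dict-based row layout (field id mapped to value) are made up for illustration.

```python
from typing import Any

# A row is modelled as {field_id: value}; this layout is an assumption
# for the sketch, not how any engine actually stores rows.
Row = dict[int, Any]


def apply_equality_deletes(
    data_rows: list[Row], delete_rows: list[Row], equality_ids: list[int]
) -> list[Row]:
    """Drop data rows whose equality_ids columns match any delete row.

    Columns of the delete row that are not in equality_ids are ignored.
    """
    delete_keys = {tuple(d[i] for i in equality_ids) for d in delete_rows}
    return [
        r for r in data_rows
        if tuple(r[i] for i in equality_ids) not in delete_keys
    ]


data = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]
# Delete row from the question: id = 3, but name does not match the data.
deletes = [{1: 3, 2: None, 3: "Polar"}]

remaining = apply_equality_deletes(data, deletes, equality_ids=[1])
```

Under this reading the row with id = 3 is removed even though its name is Grizzly rather than Polar, i.e. interpretation (b) at apply time.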

I think it should be considered invalid, since it may produce wrong results, 
e.g. by pruning a delete file that should have been applied.

The spec says "Each row of the delete file produces one equality predicate that 
matches any row where the delete columns are equal. Multiple columns can be 
thought of as an AND of equality predicates." That could be interpreted to mean 
(c).

Whether the result is incorrect depends on how the compute engine works. If the 
compute engine doesn’t try to prune delete files, then inconsistent column data 
may not affect the result. But in general it should be considered incorrect data.
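
As an illustration of how pruning could go wrong, here is a small Python sketch of a hypothetical planner that compares lower/upper bounds between a data file and a delete file. The `FileStats` class and both helper functions are invented for this example; I am not claiming any particular engine prunes this way.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-column bounds for one file, keyed by field id (hypothetical)."""
    lower: dict
    upper: dict


def ranges_overlap(a: FileStats, b: FileStats, field_id: int) -> bool:
    # Two closed ranges overlap when each one starts before the other ends.
    return (a.lower[field_id] <= b.upper[field_id]
            and b.lower[field_id] <= a.upper[field_id])


def may_apply(data: FileStats, delete: FileStats, field_ids: list[int]) -> bool:
    # The delete file is kept only if the bounds overlap on every checked column.
    return all(ranges_overlap(data, delete, f) for f in field_ids)


# Data file holding the row (3, NULL, Grizzly); delete file holding (3, NULL, Polar).
data_file = FileStats(lower={1: 3, 3: "Grizzly"}, upper={1: 3, 3: "Grizzly"})
delete_file = FileStats(lower={1: 3, 3: "Polar"}, upper={1: 3, 3: "Polar"})

# Checking only the equality column keeps the delete file.
checks_equality_ids_only = may_apply(data_file, delete_file, [1])
# Checking the extra name column as well prunes it.
checks_all_written_columns = may_apply(data_file, delete_file, [1, 3])
```

With only field 1 checked the delete file is kept and id = 3 is deleted; with field 3 also checked the delete file is pruned and id = 3 survives. That is the kind of engine-dependent result I mean.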

From: Wing Yew Poon <wyp...@cloudera.com.INVALID>
Date: Saturday, April 13, 2024 at 02:16
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: spec question on equality deletes
Hi,

I have some questions on the current Iceberg spec regarding equality deletes:
https://iceberg.apache.org/spec/#equality-delete-files
The spec says that for "a table with the following data:

 1: id | 2: category | 3: name
-------|-------------|---------
 1     | marsupial   | Koala
 2     | toy         | Teddy
 3     | NULL        | Grizzly
 4     | NULL        | Polar

The delete id = 3 could be written as either of the following equality delete 
files:

equality_ids=[1]

 1: id
-------
 3

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Grizzly
"

1. Are the options either (a) write only the column(s) listed in equality_ids 
or (b) write all the columns? i.e., nothing in between.
2. If we write all the columns, are only columns listed in equality_ids 
considered? What happens if a non-equality_id column does not match? e.g.,

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar

Is that (a) invalid, or does that (b) still result in deleting id = 3, or (c) 
result in deleting no rows?

The spec says "Each row of the delete file produces one equality predicate that 
matches any row where the delete columns are equal. Multiple columns can be 
thought of as an AND of equality predicates." That could be interpreted to mean 
(c).

Thanks,
Wing Yew
