Hi Renjie,
Thank you for your perspective.
On 1, I am inclined to the same view as you.
On 2, I feel that the spec should clearly define the expected behavior; it
should not be left to engines. At worst, the spec can say, e.g., that the
correct behavior is (b) but it is acceptable for an engine to throw an
error (a); or that the correct behavior is (c). We cannot have some engines
doing (b) and some doing (c), as (b) and (c) are basically opposite.
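
To make the difference concrete, here is a minimal sketch in Python. It assumes a plain in-memory table, not any Iceberg API; the names DATA, FIELD_NAMES and apply_equality_delete are hypothetical, and NULL is modelled as None, with NULL treated as equal to NULL. It applies the delete row from question 2 under readings (b) and (c):

```python
# Hypothetical in-memory model of the example table; not Iceberg's API.
DATA = [
    {"id": 1, "category": "marsupial", "name": "Koala"},
    {"id": 2, "category": "toy", "name": "Teddy"},
    {"id": 3, "category": None, "name": "Grizzly"},
    {"id": 4, "category": None, "name": "Polar"},
]

# Field id -> column name, as in the table header "1: id | 2: category | 3: name".
FIELD_NAMES = {1: "id", 2: "category", 3: "name"}


def apply_equality_delete(rows, delete_row, compare_ids):
    """Keep only the rows that differ from delete_row on at least one compared column.

    Assumption: NULL (None) compares equal to NULL.
    """
    cols = [FIELD_NAMES[i] for i in compare_ids]
    return [r for r in rows if not all(r[c] == delete_row[c] for c in cols)]


# The delete row from question 2: equality_ids=[1], but "name" does not match id = 3.
delete_row = {"id": 3, "category": None, "name": "Polar"}

# Reading (b): only the columns listed in equality_ids are compared.
remaining_b = apply_equality_delete(DATA, delete_row, [1])

# Reading (c): every column present in the delete file is compared (AND of all).
remaining_c = apply_equality_delete(DATA, delete_row, [1, 2, 3])

print([r["id"] for r in remaining_b])
print([r["id"] for r in remaining_c])
```

Under (b) the row with id = 3 is removed; under (c) no row matches all three columns, so nothing is removed.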
I'm interested in other perspectives.
- Wing Yew

Btw, I go by Wing Yew, not Wing.


On Sat, Apr 13, 2024 at 6:12 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi, Wing:
>
>
>
> 1. Are the options either (a) write only the column(s) listed in
> equality_ids or (b) write all the columns? i.e., no in between.
>
>
>
> Yes, I think so.
>
>
>
> 2. If we write all the columns, are only columns listed in equality_ids
> considered? What happens if a non-equality_id column does not match? e.g.,
>
>
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
>
>
> Is that (a) invalid, or does that (b) still result in deleting id = 3, or
> (c) result in deleting no rows?
>
>
>
> Which columns are considered depends on the stage:
>
>    - Only columns listed in equality_ids are considered when applying
>    deletions.
>    - If other columns are filled, they are considered during planning,
>    e.g. to help prune the equality delete files that should be applied
>    to a data file.
>
>
>
> I think it should be considered invalid, since it may produce wrong
> results, e.g. by pruning a deletion file that should still be applied.
>
>
>
> The spec says "Each row of the delete file produces one equality
> predicate that matches any row where the delete columns are equal. Multiple
> columns can be thought of as an AND of equality predicates." That could
> be interpreted to mean (c).
>
>
>
> Whether the result is incorrect depends on how the compute engine works.
> If the compute engine doesn’t try to prune deletion files, then
> inconsistent column data may not affect the result. But in general it
> should be considered incorrect data.
>
>
>
> *From: *Wing Yew Poon <wyp...@cloudera.com.INVALID>
> *Date: *Saturday, April 13, 2024 at 02:16
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *spec question on equality deletes
>
> Hi,
>
>
>
> I have some questions on the current Iceberg spec regarding equality
> deletes:
>
> https://iceberg.apache.org/spec/#equality-delete-files
>
> The spec says that for "a table with the following data:
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  1     | marsupial   | Koala
>  2     | toy         | Teddy
>  3     | NULL        | Grizzly
>  4     | NULL        | Polar
>
> The delete id = 3 could be written as either of the following equality
> delete files:
>
> equality_ids=[1]
>
>  1: id
> -------
>  3
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Grizzly
>
> "
>
>
>
> 1. Are the options either (a) write only the column(s) listed in
> equality_ids or (b) write all the columns? i.e., no in between.
>
> 2. If we write all the columns, are only columns listed in equality_ids
> considered? What happens if a non-equality_id column does not match? e.g.,
>
>
>
> equality_ids=[1]
>
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
>
>
> Is that (a) invalid, or does that (b) still result in deleting id = 3, or
> (c) result in deleting no rows?
>
>
>
> The spec says "Each row of the delete file produces one equality
> predicate that matches any row where the delete columns are equal. Multiple
> columns can be thought of as an AND of equality predicates." That could
> be interpreted to mean (c).
>
>
>
> Thanks,
>
> Wing Yew
>
>
>
