Hi, Wing:

1. Are the options either (a) write only the column(s) listed in equality_ids or (b) write all the columns? i.e., no in between.

Yes, I think so.

2. If we write all the columns, are only columns listed in equality_ids considered? What happens if a non-equality_id column does not match? e.g.,

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar

Is that (a) invalid, or does that (b) still result in deleting id = 3, or (c) result in deleting no rows?

Which columns are considered depends on the stage:

* Only columns listed in equality_ids are considered when applying deletions.
* If other columns are filled, they are considered during planning, e.g. to help prune the equality deletion files that should be applied to a data file.

I think it’s considered invalid, since it may produce wrong results, e.g. by pruning an extra deletion file.

The spec says "Each row of the delete file produces one equality predicate that matches any row where the delete columns are equal. Multiple columns can be thought of as an AND of equality predicates." That could be interpreted to mean (c).

Whether it’s incorrect depends on how the compute engine works. If the compute engine doesn’t try to prune deletion files, then inconsistent column data may not affect the result. But in general it should be considered incorrect data.

From: Wing Yew Poon <wyp...@cloudera.com.INVALID>
Date: Saturday, April 13, 2024 at 02:16
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: spec question on equality deletes

Hi,

I have some questions on the current Iceberg spec regarding equality deletes:
https://iceberg.apache.org/spec/#equality-delete-files

The spec says that for "a table with the following data:

 1: id | 2: category | 3: name
-------|-------------|---------
 1     | marsupial   | Koala
 2     | toy         | Teddy
 3     | NULL        | Grizzly
 4     | NULL        | Polar

The delete id = 3 could be written as either of the following equality delete files:

equality_ids=[1]

 1: id
-------
 3

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Grizzly
"

1. Are the options either (a) write only the column(s) listed in equality_ids or (b) write all the columns? i.e., no in between.

2. If we write all the columns, are only columns listed in equality_ids considered? What happens if a non-equality_id column does not match? e.g.,

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar

Is that (a) invalid, or does that (b) still result in deleting id = 3, or (c) result in deleting no rows?

The spec says "Each row of the delete file produces one equality predicate that matches any row where the delete columns are equal. Multiple columns can be thought of as an AND of equality predicates." That could be interpreted to mean (c).

Thanks,
Wing Yew
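
To make the difference between outcomes (b) and (c) concrete, here is a minimal sketch in plain Python. It is not Iceberg code and does not model planning or file pruning; the function name apply_equality_deletes and its compare_ids parameter are made up for illustration. Rows are dicts keyed by field ID (1 = id, 2 = category, 3 = name), and NULL is modeled as None and treated as equal to NULL. The only thing that changes between the two calls is which columns take part in the AND of equality predicates.

```python
# Illustrative only: plain Python, not the Iceberg API.
# Field IDs follow the spec example: 1 = id, 2 = category, 3 = name.
DATA = [
    {1: 1, 2: "marsupial", 3: "Koala"},
    {1: 2, 2: "toy", 3: "Teddy"},
    {1: 3, 2: None, 3: "Grizzly"},
    {1: 4, 2: None, 3: "Polar"},
]


def apply_equality_deletes(rows, delete_rows, compare_ids):
    """Return the rows that survive the deletes (hypothetical helper).

    Each delete row acts as one predicate: an AND of equality checks
    over the field IDs in compare_ids. None compares equal to None here.
    """
    def matches(row, delete_row):
        return all(row[i] == delete_row[i] for i in compare_ids)

    return [r for r in rows if not any(matches(r, d) for d in delete_rows)]


# The delete row from question 2: id matches row 3, but name does not.
mismatched = [{1: 3, 2: None, 3: "Polar"}]

# Outcome (b): only the column in equality_ids=[1] is compared.
only_ids = apply_equality_deletes(DATA, mismatched, compare_ids=[1])
surviving_ids_b = [r[1] for r in only_ids]

# Outcome (c): every written column is compared, so nothing matches.
all_cols = apply_equality_deletes(DATA, mismatched, compare_ids=[1, 2, 3])
surviving_ids_c = [r[1] for r in all_cols]

print(surviving_ids_b)  # [1, 2, 4]
print(surviving_ids_c)  # [1, 2, 3, 4]
```

Under the reading in the reply above (only equality_ids columns are considered when applying deletions), the first call is the relevant one; the second shows what a reader who takes the quoted spec sentence to cover all written columns would expect.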