RE: Duplicates are getting inserted into Iceberg tables even after de-duplication

Lewis, William Thu, 07 Nov 2024 14:12:10 -0800

On 2024/03/13 22:38:06 Shwetha Dharmarajan wrote:
> We are using Apache Iceberg with AWS Glue. We are seeing an issue where 
> duplicates are getting inserted into the table, even after making sure there 
> are no duplicates in the data being upserted into the table. We use MERGE sql 
> to upsert data into the table.
> 
> We also see an issue where duplicates appear in the SELECT sql query, when 
> queried using spark SQL. But when we query the same table using Athena, we 
> don’t see any duplicates in the table.


Did you ever find a solution to this? We’re experiencing what seems to be a 
very similar issue:

- Problem occurs only in some tables, and (as far as we can tell) only in 
Glue/Spark, not Athena/Trino
- Iceberg 1.0.0 as found in Glue 4.0; newer Iceberg as used by Athena
- Table writes are via MERGE INTO sql
- Not (explicitly) using any branching or tagging features

Additionally, we're using Iceberg format version 2, with 
write.merge.mode=merge-on-read (our writes are mostly inserts). One of our jobs 
occasionally sprays the table with a largish number (~20k-50k) of tiny Parquet 
files, which eventually get coalesced by Iceberg's rewrite_data_files() 
procedure. That's the only thing we can think of that is different about the 
problem tables.

Because merge-on-read seems to be a less commonly used mode, or at least less 
common in the 1.0.0 era, I wonder if there is a bug in the merging of updates 
during read.
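
To make the suspicion concrete, here is a toy model of merge-on-read in plain 
Python. It is only a sketch of the general idea (an update writes a delete 
marker for the old row plus a new data file, and the reader must apply the 
markers), not Iceberg's actual code. The names Snapshot, merge_upsert and scan 
are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    # Hypothetical stand-in for table state: file name -> list of (key, value) rows.
    data_files: dict = field(default_factory=dict)
    # Hypothetical stand-in for position deletes: (file name, row position) pairs.
    position_deletes: set = field(default_factory=set)


def merge_upsert(snap, new_file, rows):
    """Toy merge-on-read upsert: mark old versions deleted, then append new rows."""
    incoming = {key for key, _ in rows}
    for name, existing in snap.data_files.items():
        for pos, (key, _) in enumerate(existing):
            if key in incoming:
                # The old row is not rewritten, only marked as deleted.
                snap.position_deletes.add((name, pos))
    snap.data_files[new_file] = list(rows)


def scan(snap, apply_deletes=True):
    """Read all rows; a correct reader skips positions marked deleted."""
    out = []
    for name, rows in snap.data_files.items():
        for pos, row in enumerate(rows):
            if apply_deletes and (name, pos) in snap.position_deletes:
                continue
            out.append(row)
    return out


snap = Snapshot()
merge_upsert(snap, "a.parquet", [(1, "v1"), (2, "v1")])
merge_upsert(snap, "b.parquet", [(2, "v2"), (3, "v1")])

correct = scan(snap, apply_deletes=True)   # key 2 appears once, as "v2"
buggy = scan(snap, apply_deletes=False)    # key 2 appears twice
```

In this model the stored data is identical in both scans; only the reader 
differs. That would at least be consistent with one engine showing duplicates 
while another does not, though it does not prove that this is what is 
happening in our tables.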

If this reminds anyone of a known issue in older versions of Iceberg, I would 
very much appreciate any pointers to more info (issue tracker, commits, vague 
anecdotes, etc.).
