On 2024/03/13 22:38:06 Shwetha Dharmarajan wrote:
> We are using Apache Iceberg with AWS Glue. We are seeing an issue where
> duplicates are getting inserted into the table, even after making sure there
> are no duplicates in the data being upserted into the table. We use MERGE sql
> to upsert data into the table.
>
> We also see an issue where duplicates appear in the SELECT sql query, when
> queried using spark SQL. But when we query the same table using Athena, we
> don’t see any duplicates in the table.

Did you ever find a solution to this? We’re experiencing what seems to be a very similar issue:

- The problem occurs only in some tables and, as far as we can tell, only in Glue/Spark, not Athena/Trino
- Iceberg 1.0.0 as found in Glue 4.0; a newer Iceberg as used by Athena
- Table writes are via MERGE INTO SQL
- We are not (explicitly) using any branching or tagging features

Additionally, we’re using Iceberg format version 2 with write.merge.mode=merge-on-read (our writes are mostly inserts). One of our jobs occasionally sprays the table with a largish number (~20k-50k) of tiny Parquet files, which eventually get coalesced by Iceberg's rewrite_data_files() procedure. That is the only difference we can think of between the problem tables and the others.

Because merge-on-read seems to be a less commonly used mode, or at least was less common in the 1.0.0 era, I wonder if there is a bug in how updates are merged at read time.

If this reminds anyone of a known issue in older versions of Iceberg, I would very much appreciate any pointers to more info (issue tracker, commits, vague anecdotes, etc.).
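
In case it helps anyone compare the two engines, here is a minimal sketch of a duplicate check whose query shape should work in both Spark SQL and Athena/Trino. The helper name duplicate_check_sql, the table name glue_catalog.db.events, and the key column event_id are hypothetical placeholders for illustration, not our actual schema, and the table identifier will likely need adjusting per engine.

```python
def duplicate_check_sql(table: str, key_columns: list[str]) -> str:
    """Build a query listing key values that occur more than once in `table`."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    keys = ", ".join(key_columns)
    return (
        f"SELECT {keys}, COUNT(*) AS row_count "
        f"FROM {table} "
        f"GROUP BY {keys} "
        f"HAVING COUNT(*) > 1"
    )


# Hypothetical table and key names, for illustration only.
query = duplicate_check_sql("glue_catalog.db.events", ["event_id"])
print(query)
```

If the same keys come back with row_count > 1 in Spark but not in Athena, that would point at a read-side difference rather than duplicate rows physically written by the MERGE.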