I do recall an issue where duplicate data/delete files were possible, but
I'm not sure if that's the underlying cause in your case.
The issue was fixed by #10007
<https://github.com/apache/iceberg/pull/10007> and shipped with Iceberg 1.6.0.
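
If you want to check whether your table might be affected, one rough approach
is to look for the same file path listed more than once in the table's `files`
metadata table. Below is a minimal sketch, not a definitive diagnostic: the
helper name `find_duplicate_paths` and the catalog/table name in the comment
are hypothetical placeholders, and I'm assuming a duplicated file would show
up as a repeated `file_path`.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_paths(file_paths: Iterable[str]) -> Dict[str, int]:
    """Return every path that occurs more than once, mapped to its count."""
    counts = Counter(file_paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical usage from a Spark session (catalog and table names are
# placeholders, adjust them to your setup):
#   rows = spark.sql("SELECT file_path FROM my_catalog.db.tbl.files").collect()
#   print(find_duplicate_paths(row.file_path for row in rows))
```

An empty result would suggest that duplicate file entries are not the cause
in your case.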

On Thu, Nov 7, 2024 at 11:12 PM Lewis, William <wimle...@amazon.com.invalid>
wrote:

> On 2024/03/13 22:38:06 Shwetha Dharmarajan wrote:
> > We are using Apache Iceberg with AWS Glue. We are seeing an issue where
> > duplicates are getting inserted into the table, even after making sure
> > there are no duplicates in the data being upserted into the table. We use
> > MERGE sql to upsert data into the table.
> >
> > We also see an issue where duplicates appear in the SELECT sql query,
> > when queried using spark SQL. But when we query the same table using
> > Athena, we don’t see any duplicates in the table.
>
> Did you ever find a solution to this? We’re experiencing what seems to be
> a very similar issue:
>
> - Problem occurs only in some tables, and (as far as we can tell) only in
> Glue/Spark, not Athena/Trino
> - Iceberg 1.0.0 as found in Glue 4.0; newer Iceberg as used by Athena
> - Table writes are via MERGE INTO sql
> - Not (explicitly) using any branching or tagging features
>
> Additionally, we're using Iceberg format 2, with
> write.merge.mode=merge-on-read (our writes are mostly inserts). One of our
> jobs occasionally sprays the table with a largish number (~20k-50k) of tiny
> parquet files, which eventually get coalesced by iceberg's
> rewrite_data_files() procedure - that's the only thing that we can think of
> that is different about the problem tables.
>
> Because merge-on-read seems to be a less commonly used mode, or at least
> less common in the 1.0.0 era, I wonder if there is a bug in the merging of
> updates during read.
>
> If this reminds anyone of a known issue in older versions of Iceberg, I
> would very much appreciate any pointers to more info (issue tracker,
> commits, vague anecdotes, etc.).
