Re: Seeking Input on Handling Ambiguity in Generating Changelogs

Jack Ye Sun, 23 Apr 2023 00:00:23 -0700

When the row identifier is defined in a table, I think the table can be
viewed in one of the 3 modes:
1. *unenforced mode*: the table operates as if it has no primary key.
2. *semi-enforced mode*: some/most writers try to enforce the primary key,
but it is not guaranteed that there are completely no duplicates. It is
expected that duplicated rows will be removed as a part of table maintenance
2. *enforced mode*: the table's primary key must be enforced by the
writers. Having duplicated rows is unexpected and is considered an illegal
state.

In your example, it seems like:
1. under unenforced mode, the changelog produced is technically correct
although it is ambiguous.
2. under semi-enforced mode, users could accept this ambiguous result
temporarily and expect the inconsistency to be resolved soon.
3. under enforced mode, it makes sense to throw an exception to notify the
table owner that the table has entered an illegal state.

To illustrate the semi-enforced case, a table maintenance process can
continuously do an aggregation and delete the duplicated rows based on some
merge-key definition. Suppose in your example the 'name' column is the
merge key and larger value in comparison wins (just for the sake of
discussion, probably not a very intuitive example, usually it's something
like a timestamp column), after the dedupe process runs, the changelog
including the new transactions would always be:

(1, 'a', DELETE)
(1, 'b', DELETE)
(1, 'c', INSERT)
(1, 'd', INSERT)
(1, 'c', DELETE)

a.k.a.

(1, 'a', DELETE)
(1, 'b', DELETE)
(1, 'd', INSERT)

and there is only a single record of (1, 'd') in the end regardless of
which SQL in the original change was actually run.

So going back to the original question, when the user does not expect the
table primary key to always be strictly enforced, I think it still has
value for users to have a solution, even if it might be ambiguous and might
not be the unique and correct solution. That solution might already be good
enough, or might eventually correct itself. If we follow this logic,
throwing an exception could be based on a config, just like in CDC we have
upsert mode as a specific mode to turn on. Otherwise people developing a
change data feed based on this might have to be blocked by such error until
the table is repaired and the duplicate rows are removed.

Any thoughts?

Best,
Jack Ye

On Thu, Apr 20, 2023, 11:59 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi folks,
>
> I am reaching out to request your insights on addressing the ambiguous
> behavior of generating changelogs in Iceberg.
>
> To provide some context, Iceberg does not enforce row uniqueness even when
> configured with identifier fields (a.k.a primary key in the other database
> system) during write operations. That means that it is possible to have
> multiple rows with the same identifier fields values. For example, let's
> consider a table "customer" with columns "id" (int) and "name" (string),
> and the identifier field set as "id." It is still possible to write
> multiple rows with the same "id" values, as shown below:
>
> (1, 'A')
> (1, 'B')
> (2, 'X')
> (2, 'Y')
>
> The CreateChangelogViewProcedure
> <https://github.com/apache/iceberg/blob/master/docs/spark-procedures.md#change-data-capture>
> can reconstruct updates based on identifier fields. It works effectively
> when there is only one row per identifier value. However, handling multiple
> rows with the same identifier values can be challenging. For example, a
> `Merge into` or `Update` command can result the following changes:
>
> (1, 'a', DELETE)
> (1, 'b', DELETE)
> (1, 'c', INSERT)
> (1, 'd', INSERT)
>
> Unfortunately, it is impossible to determine whether "c" or "d" updated
> "a". For example, both of the following commands are valid even though
> there is an identifier column id.
>
> UPDATE table SET data = 'c' WHERE data = 'a';
> UPDATE table SET data = 'd' WHERE data = 'b';
>
> Or
>
> UPDATE table SET data = 'd' WHERE data = 'a';
> UPDATE table SET data = 'c' WHERE data = 'b';
>
> Due to this uncertainty, we have allowed the procedure to throw an
> exception in such cases. A relevant pull request can be found [here
> <https://github.com/apache/iceberg/pull/7388>]. I would appreciate any
> thoughts or suggestions.
>
> Best,
>
> Yufei
>
> `This is not a contribution`
>

Re: Seeking Input on Handling Ambiguity in Generating Changelogs

Reply via email to