Thank you for the proposal, Xiaoxuan! I agree with Zheng and Steven's
point that it'll probably be more helpful to start out with a more
specific "what" and "why" (known areas of improvement for Iceberg, driven
by concrete use cases) before we get too deep into the "how".

In my mind, the specific known area of improvement for Iceberg related to
this proposal is improving streaming upsert behavior. One area this
improvement is beneficial for is being able to provide better data
freshness for Iceberg CDC mirror tables without the heavy read and
maintenance costs that currently exist with Flink upserts.

As you mentioned, equality deletes have the benefit of being very cheap to
write but can come at a high and unpredictable cost at read time.
Challenges with equality deletes have been discussed in the past [1].
I'll also add that if one of the goals is to improve streaming upserts
(e.g. for applying CDC change streams into Iceberg mirror tables), then
there are alternatives that I think we should compare against to make
the tradeoffs clear. These alternatives include leveraging the known
changelog view or merge patterns [2] or improving the existing maintenance
procedures.

I think the potential for using an inverted index in upsert cases to
identify positions in a file and write DVs directly is very exciting, but
before getting too far into the weeds, I think it'd first be helpful to
make sure we agree on the specific problem we're trying to solve when we
talk about performance improvements along with any use cases, followed by
comparison with known alternatives (ideally we can get numbers that
demonstrate the read/write/storage/cost tradeoffs for the proposed inverted
index).

[1] https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
[2] https://www.tabular.io/blog/hello-world-of-cdc/

Thanks,
Amogh Jahagirdar
