Hi everyone,

As the AWS Glue data catalog now offers support for auto-compaction of
Apache Iceberg tables
<https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/>,
we have discovered some room for improving the user experience for Iceberg
users across different vendors.

Specifically, many users today use the registerTable Spark/Trino procedure
to register Iceberg tables managed by other platforms like Tabular,
Snowflake, Dremio, Databricks Delta Uniform, etc. with the Glue data
catalog, in order to use governance and AWS analytic engine integration
features. There are also automated processes like Tabular-Glue mirroring
<https://docs.tabular.io/glue#enable-glue-mirroring>, Glue Crawler
<https://aws.amazon.com/about-aws/whats-new/2023/07/aws-glue-crawlers-apache-iceberg-tables/>,
etc. that do this for all tables in a catalog or S3 bucket.
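
To make that concrete, here is a minimal sketch of such a registration
using the Iceberg register_table Spark procedure (the catalog, table, and
metadata file locations below are placeholders):

    CALL glue_catalog.system.register_table(
      table => 'db.orders',
      metadata_file => 's3://my-bucket/warehouse/db/orders/metadata/v3.metadata.json'
    )

After this call, the table shows up in Glue and is readable there, while
its commits continue to be produced by the platform that originally manages
it.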

Those tables appear almost the same as any normal Iceberg table in Glue.
Usually there are some table properties indicating they are registered, but
those are not officially enforced in the spec and thus cannot be used as
reliable indicators. Enabling Glue compaction on these tables is
meaningless, because whatever is compacted will be overwritten by the next
mirroring. In general, these tables should just be read-only in Glue, since
Glue does not own the transaction system of the catalog that is managing
the table.

This is technically a solved problem in the traditional database system
world, with EXTERNAL/MANAGED semantics. If a table is created with
CREATE EXTERNAL TABLE, then the table should be a read-only metadata
definition: it cannot be written to, and data will not be deleted by any
form of DROP TABLE. If a table is created with CREATE TABLE, then the table
should be managed by the underlying catalog: all writes work, and DROP
TABLE removes all data.
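
For reference, a minimal sketch of those semantics in Hive-style SQL (names
and locations are placeholders):

    -- EXTERNAL: the catalog only tracks metadata; dropping the table keeps the data
    CREATE EXTERNAL TABLE sales (id BIGINT, amount DOUBLE)
    LOCATION 's3://my-bucket/warehouse/sales/';
    DROP TABLE sales;            -- removes only the catalog entry

    -- MANAGED: the catalog owns the data; dropping the table deletes it as well
    CREATE TABLE sales_managed (id BIGINT, amount DOUBLE);
    DROP TABLE sales_managed;    -- removes the catalog entry and the data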

As an increasingly diverse list of vendors supports Iceberg, I think this
is a great opportunity for Iceberg to properly respect these semantics, so
that Iceberg tables owned by one vendor do not get accidentally updated by
another and cause unintentional side effects.

From a code perspective, this means introducing the concept of external (or
an equivalent concept like "managed by") in the table and catalog spec, and
updating the engines to respect it: do not perform any writes or data
deletion against a table if the engine is not working with the catalog that
is managing the table.
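
As a purely hypothetical illustration of the intended engine behavior (the
property name and error text below are made up, not part of any spec
today):

    -- the table's metadata identifies its managing catalog, e.g. a field like
    -- managed-by = 'vendor1_catalog'

    -- reads through Glue keep working
    SELECT * FROM glue_catalog.db.mirrored_table;

    -- writes through a non-managing catalog are rejected
    INSERT INTO glue_catalog.db.mirrored_table VALUES (1, 'x');
    -- error (illustrative): table is managed by 'vendor1_catalog' and is read-only here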

With such a construct, we can also enable other interesting features. For
example, users of one catalog could freely choose another catalog to manage
a table, by updating the spec field that records the managing catalog. You
could imagine this mapping to SQL like *ALTER TABLE vendor1_table AS
MANAGED BY vendor2*.
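
A rough sketch of what such a handoff could look like, with entirely
hypothetical syntax and catalog names:

    -- vendor1's catalog currently manages the table; vendor2 has a read-only registration
    ALTER TABLE vendor1_table AS MANAGED BY vendor2;
    -- afterwards vendor2's catalog owns commits, and vendor1's entry becomes the
    -- read-only, externally managed reference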

Do people think this is valuable to add natively in Iceberg? If so, I can
work on a more detailed proposal and code changes for further discussion.

P.S. Sorry for the links to vendor features, and apologies if this sounds
like an ad, but I think this topic is better understood when referencing
concrete Iceberg vendor interactions. I will ensure the technical
discussion remains vendor-neutral.

Best,
Jack Ye
