Hi everyone,

As the AWS Glue Data Catalog now supports automatic compaction of Apache Iceberg tables <https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/>, we have discovered some room for improving the user experience for Iceberg users across different vendors.
Specifically, today there are many users who use the registerTable Spark/Trino procedure to register Iceberg tables managed by other platforms like Tabular, Snowflake, Dremio, Databricks Delta Uniform, etc. into the Glue Data Catalog, in order to use other governance and AWS analytics engine integration features. There are also automated processes like Tabular-Glue mirroring <https://docs.tabular.io/glue#enable-glue-mirroring>, Glue Crawler <https://aws.amazon.com/about-aws/whats-new/2023/07/aws-glue-crawlers-apache-iceberg-tables/>, etc. that do this job for all tables in a catalog or S3 bucket. Those tables appear in Glue almost the same way as any normal Iceberg table. Usually there are some table properties indicating they are registered, but those are not officially enforced in the spec and thus cannot be used as reliable indicators. Enabling Glue compaction on these tables is meaningless, because whatever is compacted will be overwritten by the next mirroring run. In general, these tables should just be read-only in Glue, since Glue does not own the transaction system of the catalog that is managing the table.

This is technically a solved problem in the traditional database world, with the EXTERNAL/MANAGED semantics. If a table is created with CREATE EXTERNAL TABLE, then the table should be a read-only metadata definition, which means it cannot be written to and its data will not be deleted by any form of DROP TABLE. If a table is created with CREATE TABLE, then the table should be managed by the underlying catalog: writes all work, and DROP TABLE removes all data.

As the list of vendors supporting Iceberg grows increasingly diverse, I think this is a great opportunity for Iceberg to properly respect these semantics, so that Iceberg tables owned by one vendor do not get accidentally updated by another and cause unintentional side effects.

From a code perspective, this means introducing the concept of external (or maybe an equivalent concept like "managed by") in the table and catalog specs, and updating the engines to respect it: do not perform any writes or data deletion against the table if the engine is not working with the catalog that is managing the table.

With such a construct, we can also enable other interesting features. For example, users of one catalog can freely choose another catalog to manage the table, by updating the relevant spec field to change the managing catalog of the table. You could imagine that mapping to some SQL like *ALTER TABLE vendor1_table AS MANAGED BY vendor2* (see the rough sketch in the p.p.s. below).

Do people think this is valuable to add natively in Iceberg? If so, I can work on a more detailed proposal and code changes for further discussion.

p.s. Sorry for the links to vendor features and if this sounds like an ad, but I think this topic is better understood when referencing concrete Iceberg vendor interactions. I will ensure the technical discussion remains vendor-neutral.

Best,
Jack Ye
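p.p.s. To make the idea a bit more concrete, here is a rough sketch of how the semantics could surface in SQL. Only the register_table procedure exists today (in the Spark integration, with a Trino equivalent); the EXTERNAL behavior and the ALTER TABLE ... MANAGED BY syntax are placeholder syntax for discussion, and all catalog, table, and path names below are made-up examples:

    -- today: register a table whose metadata is produced and owned by another
    -- platform's catalog into Glue (Spark procedure, roughly along these lines)
    CALL glue_catalog.system.register_table(
      table => 'db.orders',
      metadata_file => 's3://example-bucket/db/orders/metadata/v3.metadata.json');

    -- proposed: the registered table is marked as externally managed, so it acts
    -- as a read-only metadata pointer for engines going through Glue
    INSERT INTO glue_catalog.db.orders VALUES (1, 'a');  -- would be rejected
    DROP TABLE glue_catalog.db.orders;                   -- removes only the catalog
                                                          -- entry, never the data

    -- proposed: explicitly hand ownership of a table from one catalog to another
    ALTER TABLE vendor1_table AS MANAGED BY vendor2;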