Thanks Jack for initiating this valuable discussion. I'm also seeing issues with the migrate procedure <https://github.com/apache/iceberg/issues/8425>, where external/managed semantics need to be defined clearly.
Nonetheless, is the behavior of *external* tables consistent across engines and catalogs? For example, in the Hive/Spark world, external tables can be written to, but their data won't be deleted by any form of DROP TABLE (including PURGE). For the Hadoop catalog, it looks like there are only managed tables, but metadata can be left over if it's not stored at the table path.

How about using another keyword as a reserved table property, like "format-version", so that different engines and catalogs can align on it?

Happy Thanksgiving!
Manu

On Tue, Nov 21, 2023 at 5:48 AM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> As AWS Glue data catalog is now offering support for auto-compaction of
> Apache Iceberg tables
> <https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/>,
> we have discovered some room in improving user experience for Iceberg users
> across different vendors.
>
> Specifically, today there are many users that use the registerTable
> Spark/Trino procedure to register Iceberg tables managed by other platforms
> like Tabular, Snowflake, Dremio, Databricks Delta Uniform, etc. to Glue
> data catalog in order to use other governance and AWS analytic engine
> integration features. There are also automated processes like Tabular-Glue
> mirroring <https://docs.tabular.io/glue#enable-glue-mirroring>, Glue
> Crawler,
> <https://aws.amazon.com/about-aws/whats-new/2023/07/aws-glue-crawlers-apache-iceberg-tables/>
> etc. that do this job for all tables in a catalog or S3 bucket.
>
> Those tables appear in almost the same way as any normal Iceberg table in
> Glue. Usually there are some table properties indicating they are
> registered, but those are not officially enforced in the spec and thus
> cannot be used as reliable indicators. Enabling Glue compaction on these
> tables is meaningless, because whatever that is compacted will be
> overwritten in the next mirroring. In general, these tables should just be
> read-only in Glue, since Glue does not own the transaction system of the
> catalog that is managing the table.
>
> This is technically a solved problem in the traditional database system
> world, with the EXTERNAL/MANAGED semantics. If a table is created with
> CREATE EXTERNAL TABLE, then the table should be a read-only metadata
> definition, which means it cannot be written to and data will not be
> deleted in any form of DROP TABLE. If a table is created with CREATE TABLE,
> then the table should be managed by the underlying catalog. Writes will all
> work and DROP TABLE removes all data.
>
> As we are having an increasingly diverse list of vendors supporting
> Iceberg, I think it would be a great opportunity for Iceberg to properly
> respect this semantics, so Iceberg tables owned by one vendor do not get
> accidentally updated by another and cause any unintentional side effects.
>
> From code perspective, this means to introduce the concept of external (or
> maybe an equivalent concept like "managed by") in table and catalog spec,
> and update the engines to respect it: do not perform any writes or data
> deletion against the table if the engine is not working with the catalog
> that is managing the table.
>
> With such a construct, we can also enable other interesting features. For
> example, users of one catalog can freely choose another catalog to manage
> the table, by providing an update to the specific spec field to update the
> managing catalog of the table. You could imagine that it can map to some
> SQL like *ALTER TABLE vendor1_table AS MANAGED BY vendor2*.
>
> Do people think this is valuable to add natively in Iceberg? If so I can
> work on a more detailed proposal and code changes for further discussion.
>
> p.s. sorry for the links to vendor features and if this sounds like an ad,
> but I think this topic would be better understood when referencing concrete
> Iceberg vendor interactions. I will ensure the technical discussion remains
> vendor-neutral.
>
> Best,
> Jack Ye
>
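
To make the reserved-property idea from this thread a bit more concrete, here is a minimal Python sketch of how an engine might gate writes and data deletion on a table property. Everything in it is an assumption for illustration: the property name "managed-by", the catalog identifiers, and the helper functions are hypothetical and are not part of the Iceberg spec or any engine's API. It also assumes that a table without the property is treated as managed, which is only one possible default.

```python
from typing import Mapping

# Hypothetical reserved property name; NOT defined by the Iceberg spec today.
MANAGED_BY = "managed-by"


def is_managed_here(properties: Mapping[str, str], catalog_id: str) -> bool:
    """Return True if the given catalog may treat the table as managed.

    Assumption: a table that lacks the property is treated as managed,
    so existing tables keep their current behavior.
    """
    owner = properties.get(MANAGED_BY)
    return owner is None or owner == catalog_id


def check_write(properties: Mapping[str, str], catalog_id: str) -> None:
    """Reject a write against a table that another catalog manages."""
    if not is_managed_here(properties, catalog_id):
        raise PermissionError(
            f"table is managed by {properties[MANAGED_BY]!r}, "
            f"not {catalog_id!r}; treating it as read-only"
        )


def purge_on_drop(properties: Mapping[str, str], catalog_id: str) -> bool:
    """Only the managing catalog should remove data files on DROP TABLE."""
    return is_managed_here(properties, catalog_id)
```

With this shape, a table registered into a second catalog would carry the original owner's identifier in its properties, so `check_write` raises for the second catalog and `purge_on_drop` returns False for it, while the managing catalog is unaffected.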