Thanks for including a clear rationale for the proposal. I share Manu's concern about whether EXTERNAL is actually a well-defined concept. There are products that support external tables as read-only, but Spark is a good example of inconsistency. I think we've also had a suggestion in the past to add support for external tables with the Spark behavior, where tables are mutable but data files don't get cleaned up when tables are dropped. I'm skeptical that external is the right solution.
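As a rough illustration of that inconsistency, the behaviors described in this thread can be tabulated side by side. This is only a sketch: `TableKind`, `Behavior`, `BEHAVIOR`, and `external_is_ambiguous` are hypothetical names made up for the example and are not Iceberg or Spark APIs.

```python
from dataclasses import dataclass
from enum import Enum


class TableKind(Enum):
    """Hypothetical labels for the table flavors discussed in this thread."""
    MANAGED = "managed"
    READ_ONLY_EXTERNAL = "external (read-only)"
    SPARK_EXTERNAL = "external (Spark behavior)"


@dataclass(frozen=True)
class Behavior:
    writable: bool           # can the table be written to?
    drop_deletes_data: bool  # does DROP TABLE remove data files?


# Behaviors as described in the thread, not taken from any spec.
BEHAVIOR = {
    TableKind.MANAGED: Behavior(writable=True, drop_deletes_data=True),
    TableKind.READ_ONLY_EXTERNAL: Behavior(writable=False, drop_deletes_data=False),
    TableKind.SPARK_EXTERNAL: Behavior(writable=True, drop_deletes_data=False),
}


def external_is_ambiguous() -> bool:
    """True if the two 'external' flavors disagree on any behavior."""
    return BEHAVIOR[TableKind.READ_ONLY_EXTERNAL] != BEHAVIOR[TableKind.SPARK_EXTERNAL]


if __name__ == "__main__":
    for kind, behavior in BEHAVIOR.items():
        print(f"{kind.value}: {behavior}")
    print("EXTERNAL ambiguous:", external_is_ambiguous())
```

The two external flavors agree on DROP TABLE but differ on writes, so the keyword alone does not tell an engine whether it may commit to the table.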
In addition, it is unclear how to support external (meaning read-only) tables. The main problem is that the root metadata JSON file is shared. If we keep a read-only flag in that metadata file then it would be read-only in every catalog. The mechanism for making a table read-only must be handled by catalogs rather than covered by the Iceberg table spec. That's quite a bit harder because not all catalogs can support it. For instance, the Hadoop catalog has no metadata other than locations and root metadata JSON (maybe that's an argument for removing it in v3...). We may be able to add this idea to the REST catalog spec.

When I think about the problem you're trying to solve, I think we can make progress in other ways. There are two parts. First, there's the idea of ownership for the data underneath a table. I still think that we can make a lot of progress on this by introducing table locations (owned and unowned) in the v3 spec. That will allow you to handle use cases like tables that share data files much more easily.

The second part of the problem is how to handle secondary references, like syncing a Tabular warehouse to Glue. I don't think it makes sense right now to invest in this because I consider this a temporary feature. We should not be mirroring catalogs to increase connectivity in the future. A goal of the REST protocol was to make this problem go away: everything should use the same protocol. You should be able to connect directly to the source of truth.

Ryan

On Wed, Nov 22, 2023 at 5:54 PM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Thanks Jack for initiating this valuable discussion. I'm also seeing issues
> with migrate procedure <https://github.com/apache/iceberg/issues/8425>,
> where external / managed semantics need to be defined clearly.
>
> Nonetheless, is behavior consistent for *external* tables across engines
> and catalogs?
> For example, in the Hive/Spark world, external tables can be
> written to but won't be deleted with any form of DROP TABLE (like purge).
> For Hadoop catalog, it looks there's only managed table but metadata can be
> left over if it's not stored at the table path.
>
> How about using another keyword as reserved table properties like
> "format-version" such that different engines and catalogs can align on?
>
> Happy Thanksgiving!
> Manu
>
> On Tue, Nov 21, 2023 at 5:48 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> As AWS Glue data catalog is now offering support for auto-compaction of
>> Apache Iceberg tables
>> <https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/>,
>> we have discovered some room in improving user experience for Iceberg users
>> across different vendors.
>>
>> Specifically, today there are many users that use the registerTable
>> Spark/Trino procedure to register Iceberg tables managed by other platforms
>> like Tabular, Snowflake, Dremio, Databricks Delta Uniform, etc. to Glue
>> data catalog in order to use other governance and AWS analytic engine
>> integration features. There are also automated processes like Tabular-Glue
>> mirroring <https://docs.tabular.io/glue#enable-glue-mirroring>, Glue
>> Crawler,
>> <https://aws.amazon.com/about-aws/whats-new/2023/07/aws-glue-crawlers-apache-iceberg-tables/>
>> etc. that do this job for all tables in a catalog or S3 bucket.
>>
>> Those tables appear in almost the same way as any normal Iceberg table in
>> Glue. Usually there are some table properties indicating they are
>> registered, but those are not officially enforced in the spec and thus
>> cannot be used as reliable indicators. Enabling Glue compaction on these
>> tables is meaningless, because whatever that is compacted will be
>> overwritten in the next mirroring.
>> In general, these tables should just be
>> read-only in Glue, since Glue does not own the transaction system of the
>> catalog that is managing the table.
>>
>> This is technically a solved problem in the traditional database system
>> world, with the EXTERNAL/MANAGED semantics. If a table is created with
>> CREATE EXTERNAL TABLE, then the table should be a read-only metadata
>> definition, which means it cannot be written to and data will not be
>> deleted in any form of DROP TABLE. If a table is created with CREATE TABLE,
>> then the table should be managed by the underlying catalog. Writes will all
>> work and DROP TABLE removes all data.
>>
>> As we are having an increasingly diverse list of vendors supporting
>> Iceberg, I think it would be a great opportunity for Iceberg to properly
>> respect this semantics, so Iceberg tables owned by one vendor do not get
>> accidentally updated by another and cause any unintentional side effects.
>>
>> From code perspective, this means to introduce the concept of external
>> (or maybe an equivalent concept like "managed by") in table and catalog
>> spec, and update the engines to respect it: do not perform any writes or
>> data deletion against the table if the engine is not working with the
>> catalog that is managing the table.
>>
>> With such a construct, we can also enable other interesting features. For
>> example, users of one catalog can freely choose another catalog to manage
>> the table, by providing an update to the specific spec field to update the
>> managing catalog of the table. You could imagine that it can map to some
>> SQL like *ALTER TABLE vendor1_table AS MANAGED BY vendor2*.
>>
>> Do people think this is valuable to add natively in Iceberg? If so I can
>> work on a more detailed proposal and code changes for further discussion.
>>
>> p.s.
>> sorry for the links to vendor features and if this sounds like an
>> ad, but I think this topic would be better understood when referencing
>> concrete Iceberg vendor interactions. I will ensure the technical
>> discussion remains vendor-neutral.
>>
>> Best,
>> Jack Ye
>>
>

--
Ryan Blue
Tabular