Yes, I agree that ownership is the key concept here. But I think there is some value in connecting it to the external/managed table concept.
*Re-define the managed table concept for the multi-engine data lake world*

I think the concept of EXTERNAL is not "not well-defined"; it is just out of date. This is a great chance to provide an up-to-date definition of what exactly it means to be a managed table in the multi-engine data lake world. In my opinion, the definition should be that *the catalog that maintains the table's ACID transactions manages the table*. An engine can interact with the system, and *participate* in the table management and writing process with the catalog, if the catalog exposes a way to interact with transactional commits (e.g. through the REST catalog, or Glue optimistic update).

*Make Iceberg more rigorous as a database/warehouse system*

This might just be my personal opinion, but I think many of the inconsistencies are due to prioritizing convenience. It is convenient to append files to an external Hive table and map that to an INSERT INTO statement. It is convenient to keep data files during DROP so users do not accidentally delete data. I think many of these features were designed specifically for manually working with non-transactional Hive tables that people stitch together in cloud storage. But now that we are getting closer and closer to becoming an actual database system with automated table management, Iceberg would benefit from a rigorous definition that provides a consistent user experience for data engineers. Maybe this would entail a backwards-incompatible format v3 (by the way, I also like the managed location idea, with some small concern regarding S3 integration that we can discuss separately), but I think it is worth it to make the definition clear as early as possible.

*Make Iceberg a universal format behind the REST catalog*

Another benefit, coincidentally brought up in the OneTable discussion, is that Iceberg could be the universal format if we introduce this concept of an external Iceberg table in the REST catalog.
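To make the ownership definition above concrete, here is a minimal engine-side sketch of the guard I have in mind, written as plain Python stand-ins rather than the actual Iceberg API. The `managed-by` property name and the `validate_commit` function are hypothetical, purely for illustration; nothing like this exists in the spec today.

```python
def validate_commit(table_properties: dict, committing_catalog: str) -> None:
    """Refuse a commit routed through a catalog that does not manage the table.

    "managed-by" is a hypothetical reserved table property naming the catalog
    that owns the table's ACID transactions; it is not part of the Iceberg spec.
    """
    managed_by = table_properties.get("managed-by")
    if managed_by is not None and managed_by != committing_catalog:
        raise PermissionError(
            f"table is managed by {managed_by!r}; "
            f"refusing to commit through {committing_catalog!r}"
        )
    # Tables without the property keep today's behavior: any catalog may commit.
```

An engine that *participates* in management (e.g. via the REST commit path) would identify itself as the managing catalog and pass this check; a secondary registration (e.g. a mirrored Glue entry) would not.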
I totally agree with you that the REST catalog is what we are trying to standardize on, and there could be some pretty cool use cases for exposing other non-Iceberg tables as external Iceberg tables in the REST catalog. I have actually seen an internal Amazon customer with a REST catalog that was able to query Hive, Delta, and their internal table formats directly through the REST catalog using all the supported engines, just by adding some additional features to the Iceberg core library (which we are planning to contribute in the coming days). I would imagine this can make many companies' migration experience much easier: instead of going through a table format conversion, you just stand up the Iceberg REST catalog endpoint, and you can then use the Iceberg connector to read all the data in any common format. This is essentially the traditional concept of external tables, which are used to connect to external systems for read purposes. I think this would be quite a powerful feature to add to the community. Please let me know what you think!

Best,
Jack Ye

On Tue, Dec 5, 2023 at 4:02 PM Ryan Blue <b...@tabular.io> wrote:

> Thanks for including a clear rationale for the proposal. I share Manu's concern about whether EXTERNAL is actually a well-defined concept. There are products that support external tables as read-only, but Spark is a good example of inconsistency. I think we've also had a suggestion in the past to add support for external tables with the Spark behavior, where tables are mutable but data files don't get cleaned up when tables are dropped. I'm skeptical that external is the right solution.
>
> In addition, it is unclear how to support external (meaning read-only) tables. The main problem is that the root metadata JSON file is shared. If we keep a read-only flag in that metadata file then it would be read-only in every catalog.
> The mechanism for making a table read-only must be handled by catalogs rather than covered by the Iceberg table spec. That's quite a bit harder because not all catalogs can support it. For instance, the Hadoop catalog has no metadata other than locations and root metadata JSON (maybe that's an argument for removing it in v3...). We may be able to add this idea to the REST catalog spec.
>
> When I think about the problem you're trying to solve, I think we can make progress in other ways. There are two parts. First, there's the idea of ownership for the data underneath a table. I still think that we can make a lot of progress on this by introducing table locations (owned and unowned) in the v3 spec. That will allow you to handle use cases like tables that share data files much more easily.
>
> The second part of the problem is how to handle secondary references, like syncing a Tabular warehouse to Glue. I don't think it makes sense right now to invest in this because I consider this a temporary feature. We should not be mirroring catalogs to increase connectivity in the future. A goal of the REST protocol was to make this problem go away: everything should use the same protocol. You should be able to connect directly to the source of truth.
>
> Ryan
>
> On Wed, Nov 22, 2023 at 5:54 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
>> Thanks Jack for initiating this valuable discussion. I'm also seeing issues with the migrate procedure <https://github.com/apache/iceberg/issues/8425>, where external/managed semantics need to be defined clearly.
>>
>> Nonetheless, is behavior consistent for *external* tables across engines and catalogs? For example, in the Hive/Spark world, external tables can be written to, but won't be deleted by any form of DROP TABLE (like purge). For the Hadoop catalog, it looks like there are only managed tables, but metadata can be left over if it's not stored at the table path.
>> How about reserving another table property, like "format-version" is reserved, that different engines and catalogs can align on?
>>
>> Happy Thanksgiving!
>> Manu
>>
>> On Tue, Nov 21, 2023 at 5:48 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> As the AWS Glue data catalog now offers support for auto-compaction of Apache Iceberg tables <https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/>, we have discovered some room for improving the user experience for Iceberg users across different vendors.
>>>
>>> Specifically, today there are many users that use the registerTable Spark/Trino procedure to register Iceberg tables managed by other platforms like Tabular, Snowflake, Dremio, Databricks Delta Uniform, etc. to the Glue data catalog in order to use other governance and AWS analytics engine integration features. There are also automated processes like Tabular-Glue mirroring <https://docs.tabular.io/glue#enable-glue-mirroring>, Glue Crawler <https://aws.amazon.com/about-aws/whats-new/2023/07/aws-glue-crawlers-apache-iceberg-tables/>, etc. that do this job for all tables in a catalog or S3 bucket.
>>>
>>> Those tables appear in almost the same way as any normal Iceberg table in Glue. Usually there are some table properties indicating they are registered, but those are not officially enforced in the spec and thus cannot be used as reliable indicators. Enabling Glue compaction on these tables is meaningless, because whatever is compacted will be overwritten by the next mirroring. In general, these tables should just be read-only in Glue, since Glue does not own the transaction system of the catalog that is managing the table.
>>>
>>> This is technically a solved problem in the traditional database system world, with the EXTERNAL/MANAGED semantics.
>>> If a table is created with CREATE EXTERNAL TABLE, then the table should be a read-only metadata definition, which means it cannot be written to, and data will not be deleted by any form of DROP TABLE. If a table is created with CREATE TABLE, then the table should be managed by the underlying catalog: writes all work, and DROP TABLE removes all data.
>>>
>>> As we are seeing an increasingly diverse list of vendors supporting Iceberg, I think it would be a great opportunity for Iceberg to properly respect these semantics, so that Iceberg tables owned by one vendor do not get accidentally updated by another and cause unintentional side effects.
>>>
>>> From a code perspective, this means introducing the concept of external (or maybe an equivalent concept like "managed by") in the table and catalog spec, and updating the engines to respect it: do not perform any writes or data deletion against the table if the engine is not working with the catalog that is managing the table.
>>>
>>> With such a construct, we can also enable other interesting features. For example, users of one catalog can freely choose another catalog to manage the table, by providing an update to the specific spec field to change the managing catalog of the table. You could imagine that it could map to some SQL like *ALTER TABLE vendor1_table AS MANAGED BY vendor2*.
>>>
>>> Do people think this is valuable to add natively in Iceberg? If so, I can work on a more detailed proposal and code changes for further discussion.
>>>
>>> P.S. Sorry for the links to vendor features, and if this sounds like an ad, but I think this topic is better understood when referencing concrete Iceberg vendor interactions. I will ensure the technical discussion remains vendor-neutral.
>>>
>>> Best,
>>> Jack Ye
>
> --
> Ryan Blue
> Tabular
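[Editor's note: the EXTERNAL/MANAGED semantics described in the original proposal quoted above can be sketched with plain Python stand-ins. These functions and fields are hypothetical illustrations, not the Iceberg or any engine's actual API.]

```python
def write(table: dict, rows: list) -> None:
    """Writes succeed only for managed tables; external tables are
    read-only metadata definitions under the proposed semantics."""
    if table["external"]:
        raise PermissionError("external tables are read-only")
    table["rows"].extend(rows)


def drop_table(catalog: dict, name: str, delete_files) -> None:
    """Any form of DROP TABLE removes the catalog entry, but the
    underlying data files are purged only for managed tables."""
    table = catalog.pop(name)
    if not table["external"]:
        delete_files(table["location"])  # managed: catalog owns the data
    # external: the data belongs to the managing catalog; leave it untouched
```

The key asymmetry this captures is that dropping an external table can never destroy data owned by another catalog, while a managed table's lifecycle (including data deletion) is fully controlled by its managing catalog.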