Hey Jack,

It's a proposal in another thread (a community effort on a Catalog RFC).

Regards
JB

On Thu, Feb 29, 2024 at 9:19 PM Jack Ye <yezhao...@gmail.com> wrote:
>
> > The implementation of the spec can either be compliant or not.
>
> This is exactly the problem we are talking about, right? Just to give an
> example, we cannot technically say that tables/views in the Tabular catalog
> are Iceberg tables/views, because a REST spec-compliant catalog does not need
> to follow the Iceberg spec to commit or store metadata. Even if you say it
> is, there is no way to really prove that, because the REST spec does not
> enforce it.
>
> JB, what do you mean by participating in the Catalog RFC? Is there already an
> ongoing RFC?
>
> -Jack
>
>
> On Thu, Feb 29, 2024 at 12:08 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>>
>> Hi Dan,
>>
>> I agree with your statement that the REST Spec is not an implementation, but
>> I strongly disagree with your statement "impl of the spec can either be
>> compliant or not".
>>
>> The REST Catalog spec impl should be consistent with the REST Spec. That's
>> why a reference implementation in Iceberg would be a must, with a TCK.
>>
>> The REST Spec should bridge/give access to Table/View metadata. I think it
>> would make sense to have a resource to GET the Table/View metadata, also
>> supporting PUT to update.
>> JSON Schema and eventually JSON RPC could help in some areas here (compliant
>> with OpenAPI).
>>
>> In another thread, I propose to work on a Catalog RFC, exactly to target
>> this. I think it would make sense to have the REST/Catalog RFC as the main
>> catalog API, so it has to be both consistent (giving access to table/view
>> metadata) and extensible (via OpenAPI Extensions for instance).
>>
>> So, I agree with Jack: the minimum would be to have JSON metadata exposed by
>> the REST Spec.
>>
>> @Jack, short term I'm in favor of your proposal; long term, I propose to
>> participate in the Catalog RFC (REST Spec). WDYT?
>>
>> Thanks!
>> Regards
>> JB
>>
>>
>> On Thu, Feb 29, 2024 at 20:47, Daniel Weeks <daniel.c.we...@gmail.com>
>> wrote:
>>>
>>> Hey Jack,
>>>
>>> I'm not sure I agree with the framing of this argument. The REST Spec
>>> defines a protocol, not an implementation.
>>>
>>> The implementation of the spec can either be compliant or not. So a REST
>>> implementation that adheres to all the requirements (atomic location swap,
>>> JSON representation, etc.) would be compliant. There's no requirement
>>> around who performs these operations, and with REST, that is delegated to
>>> the server. The optional metadata location doesn't mean that there isn't a
>>> metadata location, just that it may not be exposed directly in the response.
>>>
>>> Therefore, an implementation where you just store the table metadata in a
>>> Glue Table object would not be compliant, currently.
>>>
>>> We've periodically discussed removing the storage requirement, and I think
>>> there's a path forward to do that and would agree with standardizing on
>>> REST, but I wouldn't say the justification for making this push is that
>>> REST is not compliant so we can just ignore the table spec requirements.
>>>
>>> There are a few more things to consider: not everything can use REST
>>> currently, and making a hard cut away from file-based metadata could
>>> bifurcate access to Iceberg data. There are also aspects of the spec that
>>> reference the metadata paths (like the metadata log, though it's optional)
>>> that would likely need to be addressed.
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Thu, Feb 29, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Just want to pull this specific topic out of the materialized view
>>>> discussion thread. I noticed this during the MV discussion, and I think it
>>>> is important to clarify this not just for the MV topic, but also for the
>>>> ongoing discussion to consolidate all the different catalogs.
>>>>
>>>> How the table/view spec defines Iceberg table/view
>>>>
>>>> If we look into the table/view spec, the optimistic concurrency section
>>>> requires the existence of a metadata file, and the atomic swap of the
>>>> metadata file ensures serializable isolation. This implies 2 things:
>>>> 1. there is a metadata file in storage that holds the information
>>>> described in the rest of the spec.
>>>> 2. there is an object in a catalog that holds the pointer to the metadata
>>>> file. What object and what catalog is implementation dependent, but these
>>>> generalized concepts are always intact.
>>>>
>>>> The JSON serialization parts of the spec plus the reader requirements also
>>>> imply that the metadata file is in JSON format.
>>>>
>>>> So when we talk about an Iceberg table/view that is compliant with the
>>>> spec, it is the combination of all these 5 requirements:
>>>> 1. there is an object in the catalog representing this table/view
>>>> 2. there is a pointer to a JSON metadata file in the object
>>>> 3. the JSON metadata file exists in storage and contains the table/view
>>>> metadata content
>>>> 4. the metadata content is compliant with the standard described in the
>>>> spec
>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>
>>>> How non-REST catalogs are compliant with the table/view spec
>>>>
>>>> An implementation of the Iceberg table/view is essentially specifying:
>>>> 1. what is the exact implementation of the catalog, e.g. JDBC, Hive
>>>> metastore (HMS), Glue, etc.
>>>> 2. what is the object that represents a table, e.g. a row in the
>>>> "iceberg_tables" table in JDBC, a Table object in HMS/Glue, etc.
>>>> 3. how is the JSON metadata file pointer stored, e.g. a column in the
>>>> table's row in JDBC, metadata_location key in the Table's parameter map in
>>>> HMS/Glue, etc.
>>>> 4. how the atomic swap is implemented, e.g. SQL atomic update in JDBC,
>>>> conditional parameter update in HMS, conditional version update in Glue,
>>>> etc.
>>>>
>>>> How the REST spec is NOT compliant with the table/view spec
>>>>
>>>> The REST spec technically does not match the following table/view spec
>>>> requirements:
>>>> 2. there is a pointer to a JSON metadata file in the object
>>>> 3. the JSON metadata file exists in storage and contains the table/view
>>>> metadata content
>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>
>>>> The key parts in REST spec that are not compliant are:
>>>> 1. metadata-location field is optional in LoadTableResponse
>>>> 2. pointer swap is not enforced in the UpdateTable operation
>>>>
>>>> Therefore, it opens the door for a REST service to be completely
>>>> independent of a JSON metadata file, store the Iceberg table/view metadata
>>>> not as a file, and achieve much better performance characteristics than
>>>> other catalogs. This technically gives a unique advantage for REST catalog
>>>> adopters that is not there for non-REST catalogs like HMS and Glue.
>>>>
>>>> How can we fix this?
>>>>
>>>> I suggest the following:
>>>>
>>>> Firstly, I think it is good that we try to remove the requirements of JSON
>>>> metadata file pointer and atomic pointer swap. We know these requirements
>>>> have perf limitations based on production usage, especially when the
>>>> metadata file is large. If that is the direction, we should make it
>>>> official by changing the table/view spec to say that those requirements
>>>> are catalog level implementation details that are no longer required.
>>>>
>>>> Secondly, once we do that, we should declare REST spec as the official
>>>> catalog spec to interact with Iceberg tables. Otherwise, I at least will be
>>>> very tempted to just break the atomic pointer swap pattern and store the
>>>> entire metadata using the Glue Table object to achieve much better
>>>> performance and also Glue native feature integrations, and I think other
>>>> players will be equally motivated to do something similar. That will lead
>>>> to even more chaos in the Iceberg catalog space.
>>>>
>>>> With REST spec as the official catalog spec, we can actually support
>>>> non-REST catalogs by using the HTTP execution chain handler. Dan has
>>>> already done a prototype here that is based on this discussion in the past
>>>> about using AWS Lambda as an alternative HTTP client for REST catalog. The
>>>> same approach can be used to talk to HMS/Glue/JDBC/... while users will
>>>> only interact with the RESTCatalog as the entry point.
>>>>
>>>> I think this can provide a good path forward overall for the catalog
>>>> consolidation story. I'm interested to know what others think.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
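
For readers following the thread, the "atomic swap of the object pointer" (requirement 5 in Jack's list, and the "SQL atomic update in JDBC" example) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the code of any Iceberg catalog: it uses Python's built-in sqlite3 in place of a JDBC database, the two-column schema is a simplification, the helper names (create_catalog, register_table, load_pointer, commit_pointer) are hypothetical, and the s3://example-bucket/... paths are made up.

```python
import sqlite3


def create_catalog():
    # Simplified stand-in for a JDBC catalog: one row per table, holding the
    # pointer to the current JSON metadata file. The schema is an assumption.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE iceberg_tables ("
        " table_name TEXT PRIMARY KEY,"
        " metadata_location TEXT NOT NULL)"
    )
    return conn


def register_table(conn, name, location):
    # Create the catalog object and store its initial metadata pointer.
    conn.execute(
        "INSERT INTO iceberg_tables (table_name, metadata_location) VALUES (?, ?)",
        (name, location),
    )
    conn.commit()


def load_pointer(conn, name):
    # Return the current metadata file location, or None if the table is unknown.
    row = conn.execute(
        "SELECT metadata_location FROM iceberg_tables WHERE table_name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else None


def commit_pointer(conn, name, expected, new):
    # Compare-and-swap: the pointer moves only if it still equals `expected`,
    # i.e. nobody else committed since this writer loaded the table.
    cur = conn.execute(
        "UPDATE iceberg_tables SET metadata_location = ? "
        "WHERE table_name = ? AND metadata_location = ?",
        (new, name, expected),
    )
    conn.commit()
    return cur.rowcount == 1


conn = create_catalog()
register_table(conn, "db.events", "s3://example-bucket/events/metadata/v1.metadata.json")

# Two writers load the same base pointer, then race to commit.
base = load_pointer(conn, "db.events")
first = commit_pointer(
    conn, "db.events", base, "s3://example-bucket/events/metadata/v2-a.metadata.json"
)
second = commit_pointer(
    conn, "db.events", base, "s3://example-bucket/events/metadata/v2-b.metadata.json"
)
print(first, second, load_pointer(conn, "db.events"))
```

In this sketch the second writer's update matches no row because the pointer has already moved, so it would have to reload the table and retry. That conditional update is the property the thread says the REST spec does not enforce on the server side.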