Hey Dan,

IMHO, the REST Spec should provide access to the Iceberg spec layer. I'm not saying both should be in sync, but the REST Spec should expose the resources of the Iceberg Spec.
Otherwise, I would consider it incomplete and limited in terms of features.

Regards
JB

On Thu, Feb 29, 2024 at 9:28 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>
> > REST spec-compliant catalog does not need to follow the Iceberg spec to
> > commit or store metadata
>
> If the REST implementation doesn't follow the Iceberg spec for commit
> requirements, it's not compliant with the spec. There's no exemption that
> says if you're using REST you don't need to follow the spec. Why do you
> think that's the case?
>
> I don't believe there's a reason to say that the REST spec needs to enforce
> the commit requirements either, that's a requirement of the Iceberg spec and
> still needs to be complied with.
>
> -Dan
>
> On Thu, Feb 29, 2024 at 12:19 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>> > The implementation of the spec can either be compliant or not.
>>
>> This is exactly the problem we are talking about right? Just to give an
>> example, we cannot technically say that tables/views in the Tabular catalog
>> are Iceberg tables/views, because a REST spec-compliant catalog does not
>> need to follow the Iceberg spec to commit or store metadata. Even if you say
>> it is, there is no way to really prove that, because the REST spec does not
>> enforce it.
>>
>> JB, what do you mean by participating on the Catalog RFC? Is there already
>> an ongoing RFC?
>>
>> -Jack
>>
>>
>> On Thu, Feb 29, 2024 at 12:08 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> Hi Dan,
>>>
>>> I agree with your statement about REST Spec is not an implement but I
>>> strongly disagree with your statement "impl of the spec can either be
>>> compliant or not".
>>>
>>> The REST Catalog spec impl should be consistent with the REST Spec. That's
>>> why a reference implementation in Iceberg would be a must, with a TCK.
>>>
>>> The REST Spec should bridge/give access to Table/View metadata. I think it
>>> would make sense to have a resource to GET the Table/View metadata, also
>>> supporting PUT to update.
>>> JSON Schema and eventually JSON RPC could help on some area here (compliant
>>> with OpenAPI).
>>>
>>> In another thread, I propose to work on a Catalog RFC, exactly to target
>>> this. I think it would make sense to have the REST/Catalog RFC as the main
>>> catalog API, so it has to be both consistent (giving access to table/view
>>> metadata) and extensible (via OpenAPI Extensions for instance).
>>>
>>> So, I agree with Jack: the minimum would be to have JSON metadata exposed
>>> by the REST Spec.
>>>
>>> @Jack, short term I'm in favor of your proposal, long term, I propose to
>>> participate on the Catalog RFC (REST Spec). WDYT ?
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>>
>>> On Thu, Feb 29, 2024 at 20:47, Daniel Weeks <daniel.c.we...@gmail.com>
>>> wrote:
>>>>
>>>> Hey Jack,
>>>>
>>>> I'm not sure I agree with the framing of this argument. The REST Spec
>>>> defines a protocol, not an implementation.
>>>>
>>>> The implementation of the spec can either be compliant or not. So a REST
>>>> Implementation that adheres to all the requirements (atomic location swap,
>>>> json representation, etc.), would be compliant. There's no requirement
>>>> around who performs these operations and with REST, that is delegated to
>>>> the server. The optional metadata location doesn't mean that there isn't
>>>> a metadata location, just that it may not be exposed directly in the
>>>> response.
>>>>
>>>> Therefore, an implementation where you just store the table metadata in a
>>>> Glue Table object, would not be compliant, currently.
>>>>
>>>> We've periodically discussed removing the storage requirement and I think
>>>> there's a path forward to do that and would agree that standardizing on
>>>> REST, but I wouldn't say the justification for making this push is that
>>>> REST is not compliant so we can just ignore the table spec requirements.
>>>>
>>>> There are a few more things to consider, which is that not everything can
>>>> use REST currently and making a hard cut away from file based metadata
>>>> could bifurcate access to Iceberg data. There are also aspects to the
>>>> spec that reference the metadata paths (like metadata log, though it's
>>>> optional), but would likely need to be addressed.
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Thu, Feb 29, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Just want to pull this specific topic out of the materialized view
>>>>> discussion thread. I noticed this during the MV discussion, and I think
>>>>> it is important to clarify this not just for the MV topic, but also for
>>>>> the ongoing discussion to consolidate all the different catalogs.
>>>>>
>>>>> How the table/view spec defines Iceberg table/view
>>>>>
>>>>> If we look into the table/view spec, the optimistic concurrency section
>>>>> requires the existence of a metadata file, and the atomic swap of the
>>>>> metadata file ensures serializable isolation. This implies 2 things:
>>>>> 1. the metadata file in a storage that holds the information described in
>>>>> the rest of the spec.
>>>>> 2. there is an object in a catalog that holds the pointer of the metadata
>>>>> file. What object and what catalog is implementation dependent, but these
>>>>> generalized concepts are always intact.
>>>>>
>>>>> The JSON serialization parts of the spec plus the reader requirements
>>>>> also implies that the metadata file is in JSON format.
>>>>>
>>>>> So when we talk about an Iceberg table/view that is compliant with the
>>>>> spec, it is the combination of all these 5 requirements:
>>>>> 1. there is an object in the catalog representing this table/view
>>>>> 2. there is a pointer to a JSON metadata file in the object
>>>>> 3. the JSON metadata file exists in storage and contains the table/view
>>>>> metadata content
>>>>> 4. the metadata content is compliant with the standard described in the
>>>>> spec
>>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>>
>>>>> How non-REST catalogs are compliant with the table/view spec
>>>>>
>>>>> An implementation of the Iceberg table/view is essentially specifying:
>>>>> 1. what is the exact implementation of the catalog, e.g. JDBC, Hive
>>>>> metastore (HMS), Glue, etc.
>>>>> 2. what is the object that represents a table, e.g. a row in the
>>>>> "iceberg_tables" table in JDBC, a Table object in HMS/Glue, etc.
>>>>> 3. how is the JSON metadata file pointer stored, e.g. a column in the
>>>>> table's row in JDBC, metadata_location key in the Table's parameter map
>>>>> in HMS/Glue, etc.
>>>>> 4. how the atomic swap is implemented, e.g. SQL atomic update in JDBC,
>>>>> conditional parameter update in HMS, conditional version update in Glue,
>>>>> etc.
>>>>>
>>>>> How the REST spec is NOT compliant with the table/view spec
>>>>>
>>>>> The REST spec technically does not match the following table/view spec
>>>>> requirements:
>>>>> 2. there is a pointer to a JSON metadata file in the object
>>>>> 3. the JSON metadata file exists in storage and contains the table/view
>>>>> metadata content
>>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>>
>>>>> The key parts in REST spec that are not compliant are:
>>>>> 1. metadata-location field is optional in LoadTableResponse
>>>>> 2. pointer swap is not enforced in the UpdateTable operation
>>>>>
>>>>> Therefore, it opens the door for a REST service to be completely not
>>>>> dependent on a JSON metadata file, store the Iceberg table/view metadata
>>>>> not as a file, and achieve much better performance characteristics than
>>>>> other catalogs. This technically gives a unique advantage for REST
>>>>> catalog adopters that is not there for non-REST catalogs like HMS and
>>>>> Glue.
>>>>>
>>>>> How can we fix this?
>>>>>
>>>>> I suggest the following:
>>>>>
>>>>> Firstly, I think it is good that we try to remove the requirements of
>>>>> JSON metadata file pointer and atomic pointer swap. We know these
>>>>> requirements have perf limitations based on production usage, especially
>>>>> when the metadata file is large. If that is the direction, we should make
>>>>> it official by changing the table/view spec to say that those
>>>>> requirements are catalog level implementation details that are no longer
>>>>> required.
>>>>>
>>>>> Secondly, once we do that, we should declare REST spec as the official
>>>>> catalog spec to interact with Iceberg tables. Otherwise at least I will
>>>>> be very tempted to just break the atomic pointer swap pattern and store
>>>>> the entire metadata using the Glue Table object to achieve much better
>>>>> performance and also Glue native feature integrations, and I think other
>>>>> players will be equally motivated to do something similar. That will lead
>>>>> to even more chaos in the Iceberg catalog space.
>>>>>
>>>>> With REST spec as the official catalog spec, we can actually support
>>>>> non-REST catalogs by using the HTTP execution chain handler. Dan has
>>>>> already done a prototype here that is based on this discussion in the
>>>>> past about using AWS Lambda as an alternative HTTP client for REST
>>>>> catalog. The same approach can be used to talk to HMS/Glue/JDBC/... while
>>>>> users will only interact with the RESTCatalog as the entry point.
>>>>>
>>>>> I think this can provide a good path forward overall for the catalog
>>>>> consolidation story, interested to know what others think.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
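
For readers following the thread, the pointer-swap commit pattern behind Jack's five requirements can be sketched roughly as below. This is only an illustration under stated assumptions: the `InMemoryCatalog` class, its methods, the `CommitConflict` exception, and the file locations are all hypothetical and do not come from the Iceberg table spec, the REST spec, or any real catalog. Requirement 4 (spec-compliant metadata content) is not checked here.

```python
import json
import threading


class CommitConflict(Exception):
    """Raised when the pointer moved since the writer last read it."""


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog such as JDBC, HMS, or Glue."""

    def __init__(self):
        self._pointers = {}  # table name -> metadata file location
        self._storage = {}   # location -> JSON text (stand-in for file storage)
        self._lock = threading.Lock()

    def write_metadata(self, location, metadata):
        # Requirement 3: the JSON metadata file exists in storage.
        self._storage[location] = json.dumps(metadata)

    def load(self, table):
        # Requirements 1 and 2: a catalog object that holds a pointer
        # to the JSON metadata file.
        location = self._pointers.get(table)
        metadata = json.loads(self._storage[location]) if location else None
        return location, metadata

    def swap(self, table, expected, new):
        # Requirement 5: atomic compare-and-swap of the pointer.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(table)
            self._pointers[table] = new


catalog = InMemoryCatalog()
v1 = "s3://bucket/db/t/metadata/v1.metadata.json"   # hypothetical location
v2a = "s3://bucket/db/t/metadata/v2-a.metadata.json"
v2b = "s3://bucket/db/t/metadata/v2-b.metadata.json"

# Create the table: write the file first, then publish the pointer.
catalog.write_metadata(v1, {"format-version": 2, "snapshots": []})
catalog.swap("db.t", None, v1)

# Two writers read the same base pointer and race to commit.
base, _ = catalog.load("db.t")
catalog.write_metadata(v2a, {"format-version": 2, "snapshots": ["a"]})
catalog.write_metadata(v2b, {"format-version": 2, "snapshots": ["b"]})

catalog.swap("db.t", base, v2a)  # first writer wins
try:
    catalog.swap("db.t", base, v2b)  # second writer sees a stale base
    second_writer_committed = True
except CommitConflict:
    second_writer_committed = False

current_location, current_metadata = catalog.load("db.t")
```

In this sketch the losing writer would have to reload the table, reapply its changes on top of the new metadata, and retry. A REST server that keeps metadata somewhere other than a JSON file would replace `write_metadata` and `swap` with its own mechanism, which is the gap the thread is debating.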