Hey Dan

IMHO, the REST Spec should provide access to the Iceberg spec layer. I'm
not saying both should be in sync, but the REST Spec should expose the
resources of the Iceberg Spec.

Otherwise, I would consider it incomplete and limited in terms of features.


On Thu, Feb 29, 2024 at 9:28 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
> > REST spec-compliant catalog does not need to follow the Iceberg spec to 
> > commit or store metadata
> If the REST implementation doesn't follow the Iceberg spec for commit 
> requirements, it's not compliant with the spec.  There's no exemption that 
> says if you're using REST you don't need to follow the spec.  Why do you 
> think that's the case?
> I don't believe there's a reason to say that the REST spec needs to enforce 
> the commit requirements either, that's a requirement of the Iceberg spec and 
> still needs to be complied with.
> -Dan
> On Thu, Feb 29, 2024 at 12:19 PM Jack Ye <yezhao...@gmail.com> wrote:
>> > The implementation of the spec can either be compliant or not.
>> This is exactly the problem we are talking about, right? Just to give an 
>> example, we cannot technically say that tables/views in the Tabular catalog 
>> are Iceberg tables/views, because a REST spec-compliant catalog does not 
>> need to follow the Iceberg spec to commit or store metadata. Even if you say 
>> it is, there is no way to really prove that, because the REST spec does not 
>> enforce it.
>> JB, what do you mean by participating on the Catalog RFC? Is there already 
>> an ongoing RFC?
>> -Jack
>> On Thu, Feb 29, 2024 at 12:08 PM Jean-Baptiste Onofré <j...@nanthrax.net> 
>> wrote:
>>> Hi Dan,
>>> I agree with your statement that the REST Spec is not an implementation, 
>>> but I strongly disagree with your statement "impl of the spec can either 
>>> be compliant or not".
>>> The REST Catalog spec impl should be consistent with the REST Spec. That's 
>>> why a reference implementation in Iceberg would be a must, with a TCK.
>>> The REST Spec should bridge/give access to Table/View metadata. I think it 
>>> would make sense to have a resource to GET the Table/View metadata, also 
>>> supporting PUT to update.
>>> JSON Schema and eventually JSON RPC could help in some areas here 
>>> (compliant with OpenAPI).
>>> In another thread, I proposed working on a Catalog RFC, exactly to target 
>>> this. I think it would make sense to have the REST/Catalog RFC as the main 
>>> catalog API, so it has to be both consistent (giving access to table/view 
>>> metadata) and extensible (via OpenAPI Extensions for instance).
>>> So, I agree with Jack: the minimum would be to have JSON metadata exposed 
>>> by the REST Spec.
>>> @Jack, short term I'm in favor of your proposal; long term, I propose to 
>>> participate in the Catalog RFC (REST Spec). WDYT?
>>> Thanks !
>>> Regards
>>> JB
>>> On Thu, Feb 29, 2024 at 8:47 PM Daniel Weeks <daniel.c.we...@gmail.com> 
>>> wrote:
>>>> Hey Jack,
>>>> I'm not sure I agree with the framing of this argument.  The REST Spec 
>>>> defines a protocol, not an implementation.
>>>> The implementation of the spec can either be compliant or not.  So a REST 
>>>> Implementation that adheres to all the requirements (atomic location swap, 
>>>> json representation, etc.), would be compliant.  There's no requirement 
>>>> around who performs these operations and with REST, that is delegated to 
>>>> the server.  The optional metadata location doesn't mean that there isn't 
>>>> a metadata location, just that it may not be exposed directly in the 
>>>> response.
>>>> Therefore, an implementation where you just store the table metadata in a 
>>>> Glue Table object, would not be compliant, currently.
>>>> We've periodically discussed removing the storage requirement. I think 
>>>> there's a path forward to do that, and I would agree with standardizing 
>>>> on REST, but I wouldn't say the justification for making this push is that 
>>>> REST is not compliant so we can just ignore the table spec requirements.
>>>> There are a few more things to consider, which is that not everything can 
>>>> use REST currently and making a hard cut away from file based metadata 
>>>> could bifurcate access to Iceberg data.  There are also aspects of the 
>>>> spec that reference the metadata paths (like the metadata log, though it's 
>>>> optional), which would likely need to be addressed.
>>>> -Dan
>>>> On Thu, Feb 29, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>> Hi everyone,
>>>>> Just want to pull this specific topic out of the materialized view 
>>>>> discussion thread. I noticed this during the MV discussion, and I think 
>>>>> it is important to clarify this not just for the MV topic, but also for 
>>>>> the ongoing discussion to consolidate all the different catalogs.
>>>>> How the table/view spec defines Iceberg table/view
>>>>> If we look into the table/view spec, the optimistic concurrency section 
>>>>> requires the existence of a metadata file, and the atomic swap of the 
>>>>> metadata file ensures serializable isolation. This implies 2 things:
>>>>> 1. there is a metadata file in storage that holds the information 
>>>>> described in the rest of the spec.
>>>>> 2. there is an object in a catalog that holds the pointer of the metadata 
>>>>> file. What object and what catalog is implementation dependent, but these 
>>>>> generalized concepts are always intact.
>>>>> The JSON serialization parts of the spec plus the reader requirements 
>>>>> also imply that the metadata file is in JSON format.
>>>>> So when we talk about an Iceberg table/view that is compliant with the 
>>>>> spec, it is the combination of all these 5 requirements:
>>>>> 1. there is an object in the catalog representing this table/view
>>>>> 2. there is a pointer to a JSON metadata file in the object
>>>>> 3. the JSON metadata file exists in storage and contains the table/view 
>>>>> metadata content
>>>>> 4. the metadata content is compliant with the standard described in the 
>>>>> spec
>>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>> How non-REST catalogs are compliant with the table/view spec
>>>>> An implementation of the Iceberg table/view is essentially specifying:
>>>>> 1. what is the exact implementation of the catalog, e.g. JDBC, Hive 
>>>>> metastore (HMS), Glue, etc.
>>>>> 2. what is the object that represents a table, e.g. a row in the 
>>>>> "iceberg_tables" table in JDBC, a Table object in HMS/Glue, etc.
>>>>> 3. how is the JSON metadata file pointer stored, e.g. a column in the 
>>>>> table's row in JDBC, metadata_location key in the Table's parameter map 
>>>>> in HMS/Glue, etc.
>>>>> 4. how the atomic swap is implemented, e.g. SQL atomic update in JDBC, 
>>>>> conditional parameter update in HMS, conditional version update in Glue, 
>>>>> etc.
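A minimal sketch of the atomic swap in item 4 above, written as a compare-and-swap on the metadata file pointer. It is only an illustration: `InMemoryCatalog`, `CommitConflict`, the table name and the `s3://` paths are hypothetical and are not taken from any of the catalog implementations mentioned.

```python
import threading


class CommitConflict(Exception):
    """Raised when the pointer changed since the writer last read it."""


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to a metadata file location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def load(self, table):
        with self._lock:
            return self._pointers.get(table)

    def swap(self, table, expected, new):
        # Compare-and-swap: replace the pointer only if it still equals the
        # location the writer based its new metadata file on.
        with self._lock:
            current = self._pointers.get(table)
            if current != expected:
                raise CommitConflict(
                    f"{table}: expected {expected!r}, found {current!r}"
                )
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.swap("db.t", None, "s3://bucket/t/metadata/v1.json")  # create

# Two writers both start from the same base pointer.
base = catalog.load("db.t")
catalog.swap("db.t", base, "s3://bucket/t/metadata/v2.json")  # first one wins
try:
    catalog.swap("db.t", base, "s3://bucket/t/metadata/v2-other.json")
    conflict = False
except CommitConflict:
    conflict = True  # the second writer must reload and retry
```

In a real catalog the lock would be replaced by whatever conditional update the backing store offers; the point is only that a commit based on a stale pointer is rejected.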
>>>>> How the REST spec is NOT compliant with the table/view spec
>>>>> The REST spec technically does not match the following table/view spec 
>>>>> requirements:
>>>>> 2. there is a pointer to a JSON metadata file in the object
>>>>> 3. the JSON metadata file exists in storage and contains the table/view 
>>>>> metadata content
>>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>> The key parts in REST spec that are not compliant are:
>>>>> 1. metadata-location field is optional in LoadTableResponse
>>>>> 2. pointer swap is not enforced in the UpdateTable operation
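To illustrate point 1, here is a rough sketch of what a client has to do when `metadata-location` may be missing from the load response. The function name `metadata_source` and the sample payloads are hypothetical; the sketch also assumes the response embeds the table metadata under a `metadata` key, which should be checked against the REST OpenAPI definition.

```python
def metadata_source(load_table_response):
    """Return ("file", location) if a metadata file pointer is exposed,
    otherwise ("inline", metadata) using the metadata in the response."""
    location = load_table_response.get("metadata-location")
    if location:
        return ("file", location)
    # No pointer exposed: the client can only rely on what the server sent.
    return ("inline", load_table_response["metadata"])


# Hypothetical payloads, trimmed to the fields used above.
with_pointer = {
    "metadata-location": "s3://bucket/t/metadata/v2.json",
    "metadata": {"format-version": 2},
}
without_pointer = {"metadata": {"format-version": 2}}

kind_a, value_a = metadata_source(with_pointer)
kind_b, value_b = metadata_source(without_pointer)
```

In the second case nothing in the response tells the client whether a JSON metadata file exists in storage at all.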
>>>>> Therefore, it opens the door for a REST service to be completely 
>>>>> independent of a JSON metadata file, store the Iceberg table/view metadata 
>>>>> not as a file, and achieve much better performance characteristics than 
>>>>> other catalogs. This technically gives a unique advantage for REST 
>>>>> catalog adopters that is not there for non-REST catalogs like HMS and 
>>>>> Glue.
>>>>> How can we fix this?
>>>>> I suggest the following:
>>>>> Firstly, I think it is good that we try to remove the requirements of 
>>>>> JSON metadata file pointer and atomic pointer swap. We know these 
>>>>> requirements have perf limitations based on production usage, especially 
>>>>> when the metadata file is large. If that is the direction, we should make 
>>>>> it official by changing the table/view spec to say that those 
>>>>> requirements are catalog level implementation details that are no longer 
>>>>> required.
>>>>> Secondly, once we do that, we should declare REST spec as the official 
>>>>> catalog spec to interact with Iceberg tables. Otherwise at least I will 
>>>>> be very tempted to just break the atomic pointer swap pattern and store 
>>>>> the entire metadata using the Glue Table object to achieve much better 
>>>>> performance and also Glue native feature integrations, and I think other 
>>>>> players will be equally motivated to do something similar. That will lead 
>>>>> to even more chaos in the Iceberg catalog space.
>>>>> With REST spec as the official catalog spec, we can actually support 
>>>>> non-REST catalogs by using the HTTP execution chain handler. Dan has 
>>>>> already done a prototype here that is based on this discussion in the 
>>>>> past about using AWS Lambda as an alternative HTTP client for REST 
>>>>> catalog. The same approach can be used to talk to HMS/Glue/JDBC/... while 
>>>>> users will only interact with the RESTCatalog as the entry point.
>>>>> I think this can provide a good path forward overall for the catalog 
>>>>> consolidation story. I'm interested to know what others think.
>>>>> Best,
>>>>> Jack Ye
