Hey Jack

It's a proposal in another thread (community effort on Catalog RFC).

Regards
JB

On Thu, Feb 29, 2024 at 9:19 PM Jack Ye <yezhao...@gmail.com> wrote:
>
> > The implementation of the spec can either be compliant or not.
>
> This is exactly the problem we are talking about right? Just to give an 
> example, we cannot technically say that tables/views in the Tabular catalog 
> are Iceberg tables/views, because a REST spec-compliant catalog does not need 
> to follow the Iceberg spec to commit or store metadata. Even if you say it 
> is, there is no way to really prove that, because the REST spec does not 
> enforce it.
>
> JB, what do you mean by participating on the Catalog RFC? Is there already an 
> ongoing RFC?
>
> -Jack
>
>
> On Thu, Feb 29, 2024 at 12:08 PM Jean-Baptiste Onofré <j...@nanthrax.net> 
> wrote:
>>
>> Hi Dan,
>>
>> I agree with your statement that the REST Spec is not an implementation, but 
>> I strongly disagree with your statement "impl of the spec can either be 
>> compliant or not".
>>
>> The REST Catalog spec impl should be consistent with the REST Spec. That's 
>> why a reference implementation in Iceberg would be a must, with a TCK.
>>
>> The REST Spec should bridge/give access to Table/View metadata. I think it 
>> would make sense to have a resource to GET the Table/View metadata, also 
>> supporting PUT to update.
>> JSON Schema and eventually JSON RPC could help in some areas here (compliant 
>> with OpenAPI).
>>
>> In another thread, I propose to work on a Catalog RFC, exactly to target 
>> this. I think it would make sense to have the REST/Catalog RFC as the main 
>> catalog API, so it has to be both consistent (giving access to table/view 
>> metadata) and extensible (via OpenAPI Extensions for instance).
>>
>> So, I agree with Jack: the minimum would be to have JSON metadata exposed by 
>> the REST Spec.
>>
>> @Jack, short term I'm in favor of your proposal; long term, I propose to 
>> participate in the Catalog RFC (REST Spec). WDYT?
>>
>> Thanks !
>> Regards
>> JB
>>
>>
>> On Thu, Feb 29, 2024 at 8:47 PM Daniel Weeks <daniel.c.we...@gmail.com> 
>> wrote:
>>>
>>> Hey Jack,
>>>
>>> I'm not sure I agree with the framing of this argument.  The REST Spec 
>>> defines a protocol, not an implementation.
>>>
>>> The implementation of the spec can either be compliant or not.  So a REST 
>>> Implementation that adheres to all the requirements (atomic location swap, 
>>> json representation, etc.), would be compliant.  There's no requirement 
>>> around who performs these operations and with REST, that is delegated to 
>>> the server.  The optional metadata location doesn't mean that there isn't a 
>>> metadata location, just that it may not be exposed directly in the response.
>>>
>>> Therefore, an implementation where you just store the table metadata in a 
>>> Glue Table object, would not be compliant, currently.
>>>
>>> We've periodically discussed removing the storage requirement, and I think 
>>> there's a path forward to do that. I would also agree with standardizing on 
>>> REST, but I wouldn't say the justification for making this push is that 
>>> REST is not compliant so we can just ignore the table spec requirements.
>>>
>>> There are a few more things to consider. Not everything can use REST 
>>> currently, and making a hard cut away from file-based metadata could 
>>> bifurcate access to Iceberg data.  There are also aspects of the spec that 
>>> reference the metadata paths (like the metadata log, though it's optional), 
>>> which would likely need to be addressed.
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Thu, Feb 29, 2024 at 11:13 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Just want to pull this specific topic out of the materialized view 
>>>> discussion thread. I noticed this during the MV discussion, and I think it 
>>>> is important to clarify this not just for the MV topic, but also for the 
>>>> ongoing discussion to consolidate all the different catalogs.
>>>>
>>>> How the table/view spec defines Iceberg table/view
>>>>
>>>> If we look into the table/view spec, the optimistic concurrency section 
>>>> requires the existence of a metadata file, and the atomic swap of the 
>>>> metadata file ensures serializable isolation. This implies 2 things:
>>>> 1. there is a metadata file in storage that holds the information 
>>>> described in the rest of the spec.
>>>> 2. there is an object in a catalog that holds the pointer to the metadata 
>>>> file. What object and what catalog is implementation dependent, but these 
>>>> generalized concepts are always intact.
>>>>
>>>> The JSON serialization parts of the spec plus the reader requirements also 
>>>> imply that the metadata file is in JSON format.
>>>>
>>>> So when we talk about an Iceberg table/view that is compliant with the 
>>>> spec, it is the combination of all these 5 requirements:
>>>> 1. there is an object in the catalog representing this table/view
>>>> 2. there is a pointer to a JSON metadata file in the object
>>>> 3. the JSON metadata file exists in storage and contains the table/view 
>>>> metadata content
>>>> 4. the metadata content is compliant with the standard described in the 
>>>> spec
>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>
>>>> How non-REST catalogs are compliant with the table/view spec
>>>>
>>>> An implementation of the Iceberg table/view is essentially specifying:
>>>> 1. what is the exact implementation of the catalog, e.g. JDBC, Hive 
>>>> metastore (HMS), Glue, etc.
>>>> 2. what is the object that represents a table, e.g. a row in the 
>>>> "iceberg_tables" table in JDBC, a Table object in HMS/Glue, etc.
>>>> 3. how is the JSON metadata file pointer stored, e.g. a column in the 
>>>> table's row in JDBC, metadata_location key in the Table's parameter map in 
>>>> HMS/Glue, etc.
>>>> 4. how the atomic swap is implemented, e.g. SQL atomic update in JDBC, 
>>>> conditional parameter update in HMS, conditional version update in Glue, 
>>>> etc.
>>>>
>>>> How the REST spec is NOT compliant with the table/view spec
>>>>
>>>> The REST spec technically does not match the following table/view spec 
>>>> requirements:
>>>> 2. there is a pointer to a JSON metadata file in the object
>>>> 3. the JSON metadata file exists in storage and contains the table/view 
>>>> metadata content
>>>> 5. serializable isolation is achieved by atomic swap of the object pointer
>>>>
>>>> The key parts in REST spec that are not compliant are:
>>>> 1. metadata-location field is optional in LoadTableResponse
>>>> 2. pointer swap is not enforced in the UpdateTable operation
>>>>
>>>> Therefore, it opens the door for a REST service to be completely 
>>>> independent of a JSON metadata file, store the Iceberg table/view metadata 
>>>> not as a file, and achieve much better performance characteristics than 
>>>> other catalogs. This technically gives a unique advantage to REST catalog 
>>>> adopters that is not there for non-REST catalogs like HMS and Glue.
>>>>
>>>> How can we fix this?
>>>>
>>>> I suggest the following:
>>>>
>>>> Firstly, I think it is good that we try to remove the requirements of JSON 
>>>> metadata file pointer and atomic pointer swap. We know these requirements 
>>>> have perf limitations based on production usage, especially when the 
>>>> metadata file is large. If that is the direction, we should make it 
>>>> official by changing the table/view spec to say that those requirements 
>>>> are catalog level implementation details that are no longer required.
>>>>
>>>> Secondly, once we do that, we should declare REST spec as the official 
>>>> catalog spec to interact with Iceberg tables. Otherwise at least I will be 
>>>> very tempted to just break the atomic pointer swap pattern and store the 
>>>> entire metadata using the Glue Table object to achieve much better 
>>>> performance and also Glue native feature integrations, and I think other 
>>>> players will be equally motivated to do something similar. That will lead 
>>>> to even more chaos in the Iceberg catalog space.
>>>>
>>>> With REST spec as the official catalog spec, we can actually support 
>>>> non-REST catalogs by using the HTTP execution chain handler. Dan has 
>>>> already done a prototype here that is based on this discussion in the past 
>>>> about using AWS Lambda as an alternative HTTP client for REST catalog. The 
>>>> same approach can be used to talk to HMS/Glue/JDBC/... while users will 
>>>> only interact with the RESTCatalog as the entry point.
>>>>
>>>> I think this can provide a good path forward overall for the catalog 
>>>> consolidation story. I'm interested to know what others think.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>

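For readers following the thread, the atomic pointer swap that Jack's message describes for a JDBC-style catalog (an "SQL atomic update" of the metadata file pointer) can be sketched roughly as below. This is only an illustration under stated assumptions, not the Iceberg JDBC catalog's actual code: it uses SQLite in place of a real JDBC database, the two-column schema is simplified, and the function name, table identifier and file paths are all hypothetical.

```python
import sqlite3


def commit_metadata_pointer(conn, table_name, expected_location, new_location):
    """Swap the metadata pointer only if it still equals expected_location.

    Hypothetical helper: the WHERE clause makes the update conditional, so a
    writer holding a stale pointer updates zero rows and its commit fails.
    """
    cur = conn.execute(
        "UPDATE iceberg_tables SET metadata_location = ? "
        "WHERE table_name = ? AND metadata_location = ?",
        (new_location, table_name, expected_location),
    )
    conn.commit()
    return cur.rowcount == 1


# Simplified, assumed schema: one row per table, one column for the pointer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iceberg_tables ("
    "table_name TEXT PRIMARY KEY, metadata_location TEXT)"
)

# Made-up locations, for illustration only.
v1 = "s3://bucket/db/events/metadata/v1.metadata.json"
v2 = "s3://bucket/db/events/metadata/v2.metadata.json"
v3 = "s3://bucket/db/events/metadata/v3.metadata.json"

conn.execute("INSERT INTO iceberg_tables VALUES (?, ?)", ("db.events", v1))
conn.commit()

# Two writers both read v1; only the first conditional update can succeed.
first_commit = commit_metadata_pointer(conn, "db.events", v1, v2)
stale_commit = commit_metadata_pointer(conn, "db.events", v1, v3)

current_location = conn.execute(
    "SELECT metadata_location FROM iceberg_tables WHERE table_name = ?",
    ("db.events",),
).fetchone()[0]
```

In this sketch the second writer would have to reload the table metadata and retry, which is the optimistic concurrency behaviour the table/view spec relies on. A catalog that stores metadata without a file pointer would need some equivalent conditional update to keep the same guarantee.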