Re: REST catalog proposal

Kyle Bendickson Tue, 14 Dec 2021 22:38:01 -0800

Hi Ryan,

Sorry for the late response.


I feel Jack and Ryan have summed up things very well.

I will also answer the questions from my perspective, as you did ask and I
do have a few thoughts outside of what was shared.

For starters, this is an additional catalog. The other catalogs, as well as
the ability to write your own catalog (which I hear about quite a lot), are
in no way affected by this. If catalogs choose to evolve to work similarly,
that’s ok but by no means expected. This is just another catalog.

As for the word specification: I think there’s been some confusion around
this. The current PR is labeled a specification because it’s an OpenAPI
spec. Neither the present PR nor the work on the REST catalog more broadly
is intended as a catalog specification, especially not in the way that
table spec v1 and table spec v2 are.

I think in the long term, it would be great to be able to point to
something as a specification for Catalogs. However, I’m hesitant to say
that it would be the REST catalog. I’m not personally uncomfortable with
the interfaces as the source of truth, as they have been for some time.
Plus, catalogs change from engine to engine a good bit.

As for the individual questions, they’ve mostly been answered I feel. I’ll
address the ones that seem unanswered in more detail and answer them fully
for the record.

* Is this a spec at the level that the table spec exists or is this an
informative PR to agree on the REST api of _a_ catalog?


Informative PR to agree on the REST api of _a_ catalog.


* Is it meant to enshrine the `Catalog` interface into a spec? This came up
on a python sync also


Answered above, but no. Someday that might be good, and it might be
beneficial for non-Java devs to be able to see curl examples with
representative JSON as a form of documentation via the REST catalogs docs,
but no.


* Will there be both server and client modules in the iceberg codebase? I
would expect that at least a reference implementation of a server would be
a good thing but this would be the first part of the codebase that runs as
a server instead of as client code in an engine. On the other side an open
api spec and a client impl w/o a server sounds like it's missing something.


Answered by others, but I don’t personally believe in just mocking for
tests. So there will likely be some minimum implementation in that regard.
I also think with time we’ll see more intricate things open sourced (even
like we’ve seen with the Aliyun catalog and their tests).


But I don’t think that a client impl and an OpenAPI spec are missing
something in this case, because the idea is very much to decouple the
server logic from the catalog. And to allow users to use their own set of
tools, just like many people have rather complicated HMS shims. But with
time I like to think we’ll see something open sourced, and at least for
testing we will need a minimum implementation.
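
To make the decoupling concrete, here is a minimal sketch in Python of what
a test-only, in-memory server could look like. Everything in it is
hypothetical: the class name `InMemoryCatalogServer`, the `commit` method,
and the `set-property` / `remove-property` change actions are made up for
illustration and are not taken from the PR or from Iceberg's API.

```python
import copy


class InMemoryCatalogServer:
    """Hypothetical stand-in for a minimal server used only in tests."""

    def __init__(self):
        # table name -> metadata dict; the server owns this state
        self._tables = {}

    def create_table(self, name, metadata):
        self._tables[name] = copy.deepcopy(metadata)

    def load_table(self, name):
        return copy.deepcopy(self._tables[name])

    def commit(self, name, changes):
        """Apply a list of changes sent by a client.

        The client never rewrites the whole metadata; it only describes
        what should change. The action names are assumptions.
        """
        metadata = self._tables[name]
        for change in changes:
            action = change["action"]
            if action == "set-property":
                metadata["properties"][change["key"]] = change["value"]
            elif action == "remove-property":
                metadata["properties"].pop(change["key"], None)
            else:
                raise ValueError(f"unsupported change: {action}")
        return copy.deepcopy(metadata)


server = InMemoryCatalogServer()
server.create_table(
    "db.events",
    {"properties": {"owner": "test"}, "refs": {"v1": 1}},
)
updated = server.commit(
    "db.events",
    [{"action": "set-property", "key": "format", "value": "parquet"}],
)
```

The only point of the sketch is that the client sends changes and the
server owns the stored metadata, so fields the client did not touch (here
`refs`) are left as they were.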


* It may be early to say for sure but does a server implementation imply
authn/z, database backends, deployment artifacts and all the other fun
things that go into a server side component?


I think it’s too early to say. For now, we’re just trying to get the very
fundamentals for a REST catalog in order. I don’t know about publishing
artifacts ourselves, especially given that ASF rules could come into play
in unexpected ways (similar to how we haven’t officially had a Python
client). Outside of that, I think it’s too early to say. We’ll see as time
goes on.


- Kyle (GitHub @kbendick)


On Mon, Dec 13, 2021 at 7:34 PM Ryan Blue <b...@tabular.io> wrote:

> I think Jack does a great job of pointing out a lot of the advantages. I
> agree with him, but I’ll add my perspective as well. I suggested the REST
> catalog a couple (few?) months ago when we were talking about the DynamoDB
> catalog and it stuck with me as the solution to quite a few problems.
>
> First, although you can plug a catalog implementation into the classpath
> for clients, that’s not always a good idea. JDBC is a good example, where
> you probably don’t want a ton of connections going directly to a database.
> An intermediate service is a great way to scale such a metastore. As Jack
> noted, it’s also nice to implement catalogs like JDBC with a service so
> that you get the exact same behavior across languages without implementing
> it twice with different DB APIs.
>
> Along the same lines, many hosted processing engines aren’t going to
> support customers plugging arbitrary code into processing engines. When I
> was at Netflix, we used a custom metastore to track tables. That worked
> great, but it meant that our platform was incompatible with things like AWS
> Athena because we’d either have to plug in a Jar or get them to implement a
> bespoke REST protocol just for us. Right now, catalog customization is only
> available if you use the Hive thrift API, which is not a fun way to go if
> you just want to try out a hosted processing engine. By building a common
> protocol and client, we can hopefully get engines to support the client so
> we can point them to any metastore.
>
> A third problem that the REST catalog helps address is version upgrades.
> Because the metadata JSON file is written by clients, upgrading to support
> new features is difficult. For example, the snapshot reference map that
> we’re using to implement tagging isn’t supported by any existing writer.
> The change is backward-compatible, so older Iceberg versions can read
> tables with refs, but if those versions write to a table, the refs get
> dropped. That means we have to upgrade versions at the same time across all
> writers. On the other hand, with a change-based protocol we can update the
> Iceberg version of just the service that writes the metadata JSON files.
> Then with only one version writing metadata, no metadata is unsupported and
> dropped.
>
> At Tabular, we’re interested in all 3 of these. We’re building a
> REST-based catalog and we could add it in a vendor module just like the
> Glue catalog. But I think our time is better spent doing this with the
> community so that we all can use a common client, no matter what metastore
> service you use or build.
>
> I’ll also reply to the specific questions:
>
> Is this a spec at the level that the table spec exists or is this an
> informative PR to agree on the REST api of *a* catalog?
>
> I think this is *a* catalog. We want to document the protocol so it is
> generally useful, but we’re not aiming to get rid of the existing catalogs.
> I think they are complementary. Jack noted some great ideas for what you
> can do with the change-based API.
>
> Is it meant to enshrine the Catalog interface into a spec?
>
> This is meant to be able to do everything that Catalog,
> SupportsNamespaces, and TableOperations currently do, since those are what
> you customize when you plug in a catalog implementation. Setting up things
> like the location provider implementation and FileIO settings are included.
>
> Will there be both server and client modules in the iceberg codebase?
>
> Iceberg doesn’t provide services, it provides a library. I wouldn’t want
> to change that, to avoid scope creep. That said, I think a very basic
> implementation that translates back to the Catalog API is probably the best
> way to test it, so I could see having something like that in tests.
>
> It may be early to say for sure but does a server implementation imply
> authn/z, database backends, deployment artifacts and all the other fun
> things that go into a server side component?
>
> This is exactly why Iceberg has always provided a library and not
> services. There are so many concerns here that I think it would be a
> separate project. I don’t think that Iceberg should do this, just like I
> think it’s healthy that Nessie is a separate project.
>
> Ryan
>
> On Mon, Dec 13, 2021 at 1:17 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> Thanks for starting the thread! Just want to share some of my thoughts
>> related to this topic.
>>
>> I think the AWS Glue, DynamoDB and JDBC catalogs will continue to live, I
>> don't see a unification through REST as we are not going to build a REST
>> server between Iceberg and the related AWS services, and I think anyone can
>> continue to add more implementations in this route if they want (although
>> not recommended). In my opinion, everything in Iceberg will be client side,
>> there is not going to be a server module because that is what the extension
>> point is. REST catalog is just a parallel implementation to
>> BaseMetastoreCatalog, but it does not enforce writing a JSON metadata file.
>> Instead, it only tells the server side the set of changes and lets the
>> server handle those changes; the server can choose to write the JSON file
>> in whatever way it wants, or even not write it at all. Just like Glue,
>> DynamoDB and JDBC all extend BaseMetastoreCatalog, other catalogs can
>> choose to "extend" the REST catalog, but the extension is through the
>> OpenAPI REST client rather than just Java inheritance.
>>
>> The biggest benefit I see out of this development is that catalog
>> providers can focus on the server side implementation to build really great
>> catalog services with all sorts of nice features, and no new integration
>> and maintenance is needed when Iceberg rolls out new catalog features or
>> support for new languages because everything goes through the base REST
>> implementation. Because of such simplification in open source
>> compatibility, I think most new catalog providers will prefer integration
>> through REST. In addition, systems that only have exposure to a
>> non-Java/Python language can also be used as a catalog provider using a
>> client generated from the OpenAPI spec. It does not need to have any Java
>> compatibility. Just like there are people who prefer DynamoDB catalog over
>> Glue catalog, we also have use cases in AWS for catalog implementations
>> that would only be achievable through a REST catalog, which I will
>> contribute in the future after the REST catalog is finalized.
>>
>> The fact that the REST catalog server receives table changes instead of
>> rewriting the entire table metadata also means the catalog service can
>> optimize a lot of performance aspects. We have seen issues in streaming
>> where the table metadata JSON file gets too big and impacts reads, and we
>> also generally agree that making a small table metadata update by rewriting
>> the entire metadata file is very inefficient. All these issues could be fixed
>> by moving to a client-server model for a scalable service to handle and
>> store these changes.
>>
>> Best,
>> Jack Ye
>>
>> On Mon, Dec 13, 2021 at 12:28 PM Ryan Murray <rym...@dremio.com> wrote:
>>
>>> Hi all,
>>>
>>>
>>> For those of you who haven't been following, there has been some
>>> interesting discussion around the proposal for a REST based catalog[1].
>>>
>>>
>>> One of the primary questions I had while reading it was 'what is the
>>> overall goal of the API?'. Given the size of this question I thought it
>>> might be better to pose it on the mailing list than to clutter the PR.
>>>
>>>
>>> So I guess primarily for Kyle: what is the long term goal/vision for the
>>> REST catalog? Eg what are the use cases and who are the users? Do you see
>>> this unifying the other existing catalogs or do you see it as another
>>> catalog to complement existing choices?
>>>
>>>
>>> Additionally,
>>>
>>> * Is this a spec at the level that the table spec exists or is this an
>>> informative PR to agree on the REST api of _a_ catalog?
>>>
>>> * Is it meant to enshrine the `Catalog` interface into a spec? This came
>>> up on a python sync also
>>>
>>> * Will there be both server and client modules in the iceberg codebase?
>>> I would expect that at least a reference implementation of a server would
>>> be a good thing but this would be the first part of the codebase that runs
>>> as a server instead of as client code in an engine. On the other side an
>>> open api spec and a client impl w/o a server sounds like it's missing
>>> something.
>>>
>>> * It may be early to say for sure but does a server implementation imply
>>> authn/z, database backends, deployment artifacts and all the other fun
>>> things that go into a server side component?
>>>
>>>
>>> That's just a few things I have been thinking about. Curious to see if
>>> anyone else has been thinking similarly and very excited to hear your
>>> thoughts Kyle. Also very excited to see this catalog develop. The activity
>>> on the PR speaks to how excited people are about it landing.
>>>
>>>
>>> Best,
>>>
>>> Ryan
>>>
>>>
>>> [1] https://github.com/apache/iceberg/pull/3561
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
