Re: [DISCUSS] Table updates in a REST catalog

Ryan Blue Fri, 07 Jan 2022 16:59:17 -0800

Hi Yufei, I’ll reply to your questions inline:

On Tue, Jan 4, 2022 at 3:12 PM Yufei Gu <flyrain...@gmail.com> wrote:


> 1. Are we going to open for other ways to store the table metadata instead
> of the metadata.json files? For example, a relational database or a
> key-value database. This will be a big change if that’s so, since it
> changes the assumption that the table version is preserved as a file.
>
We can relax the spec here in the future. My original intent with the
TableOperations API was to allow flexibility to use a database, but the
metadata JSON files have been an important part of compatibility. For now,
this is required by the spec, but there is nothing stopping a catalog
implementation from not writing the metadata JSON file.

> 2. Assume that we still keep metadata.json files, does that mean the
> catalog server needs to write a new metadata.json for each commit, which
> needs the permission to access the file system where the table is located?
> It is tricky to do that for some users whose catalog doesn’t have the
> access of the file system.
>
That’s right. To follow the spec, you’d need the service to write a
metadata JSON file, although I don’t think it is likely that a client
would actually attempt to read it. As long as the service returns the
metadata, reading the file would just be extra work. So you could choose
whether to make sure your service can write the file, or privately
violate the spec. My
recommendation here is to make sure the service can write the file. It
provides an important backup of table state in case anything happens to the
DB, and is the most portable representation of table metadata.
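
As a rough illustration of that recommendation, here is a minimal Python
sketch of a service that writes a metadata JSON file on every commit and
keeps its own database pointed at it. Everything in it is an assumption
made for illustration: the SketchCatalogService class, the dict standing
in for the catalog DB, and the file naming are hypothetical and are not
the Iceberg implementation or the REST spec.

```python
import json
import tempfile
from pathlib import Path


class SketchCatalogService:
    """Hypothetical catalog service; a dict stands in for the catalog DB."""

    def __init__(self, warehouse):
        self.warehouse = Path(warehouse)
        self.db = {}  # table name -> {"version", "metadata", "location"}

    def commit(self, table, metadata):
        version = self.db[table]["version"] + 1 if table in self.db else 0
        # Illustrative file naming only, not the naming Iceberg uses.
        path = self.warehouse / table / "metadata" / f"{version:05d}.metadata.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write the file before updating the DB, so the DB never points
        # at a metadata file that does not exist.
        path.write_text(json.dumps(metadata))
        self.db[table] = {"version": version, "metadata": metadata,
                          "location": str(path)}
        return str(path)

    def load(self, table):
        # Clients receive metadata from the service, not by reading the file.
        return self.db[table]["metadata"]

    def restore_from_file(self, table, location):
        # Backup path: rebuild the DB entry from the last metadata file.
        metadata = json.loads(Path(location).read_text())
        version = int(Path(location).name.split(".")[0])
        self.db[table] = {"version": version, "metadata": metadata,
                          "location": location}
        return metadata


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        service = SketchCatalogService(tmp)
        location = service.commit("events", {"properties": {"owner": "etl"}})
        service.db.clear()  # simulate losing the catalog DB
        print(service.restore_from_file("events", location))
```

The point of the sketch is only the ordering: the file is written first
and the DB is updated second, so the file can serve as a backup.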

> 3. What’s the major benefits of the REST catalog APIs? Considering that the
> current user has to make a lot of changes to adopt it, either from client
> side and server side. The client side should be fine since Iceberg may
> provide it, but the server side is a big task for existing users.
>
There are quite a few benefits:

   - It is a standard client interface that can be used by processing
   engines. It’s hard to get a catalog client into products like Trino.
   Organizations that use a custom metastore can’t expect to get a Jar into
   hosted processing services, so it makes sense to use one common client and
   customize a service endpoint.
   - Customizing the service with a standard client also allows writing
   just one catalog implementation that can be used across languages, rather
   than trying to implement the catalog the same way across Java, Python, etc.
   - A catalog service can do a better job with resources like JDBC
   connections. You probably don’t want hundreds or thousands of connections
   directly to the catalog database.
   - A change-based API has several advantages as well:
      - Conflict detection can be more granular: commits to different
      branches don’t need to conflict with one another, nor should metadata
      commits
      - Only one writer version is producing metadata JSON files so you
      don’t have to update every writer in your infrastructure to use new
      metadata structures
      - Reads and writes send data directly to/from S3 or another object
      store because metadata JSON writes are handled by the service
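
To make the granular conflict detection point concrete, here is a
minimal Python sketch of a change-based commit. The names (TableState,
SetBranchHead, SetProperty, apply_changes) are hypothetical and are not
taken from the REST catalog spec; the sketch only assumes that each
change carries the state the writer based its work on, so the service
can check just the parts of the table that the commit touches.

```python
from dataclasses import dataclass, field
from typing import Optional


class CommitConflict(Exception):
    """Raised when a change was based on stale table state."""


@dataclass
class SetBranchHead:
    branch: str
    expected_head: Optional[int]  # snapshot the writer started from
    new_head: int


@dataclass
class SetProperty:
    key: str
    value: str


@dataclass
class TableState:
    branches: dict = field(default_factory=dict)    # branch -> snapshot id
    properties: dict = field(default_factory=dict)  # table properties


def apply_changes(state, changes):
    # Validate every change first so a commit is all-or-nothing.
    for change in changes:
        if isinstance(change, SetBranchHead):
            current = state.branches.get(change.branch)
            if current != change.expected_head:
                raise CommitConflict(
                    f"branch {change.branch!r} is at {current}, "
                    f"expected {change.expected_head}")
    # Only the branches named in the commit were checked, so commits to
    # other branches and property-only commits cannot conflict with it.
    for change in changes:
        if isinstance(change, SetBranchHead):
            state.branches[change.branch] = change.new_head
        elif isinstance(change, SetProperty):
            state.properties[change.key] = change.value
    return state


if __name__ == "__main__":
    table = TableState(branches={"main": 1, "audit": 1})
    apply_changes(table, [SetBranchHead("main", 1, 2)])
    apply_changes(table, [SetBranchHead("audit", 1, 3)])  # no conflict
    apply_changes(table, [SetProperty("owner", "etl")])   # no conflict
    try:
        apply_changes(table, [SetBranchHead("main", 1, 4)])  # stale base
    except CommitConflict as err:
        print("rejected:", err)
```

With a whole-file swap of the metadata JSON, all four commits above
would contend on the same version; here only the last one is rejected.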

> 4. How does an existing catalog support the new APIs? For example, HMS may
> need to add an extra layer or a plugin to support the server side of the
> APIs.
>
The existing catalogs will continue to work as they are. There’s no need
for HMS to change since it already has a running service and working API.
But this would benefit other catalogs like JDBC (as I noted above) and
would help with cross-language compatibility. I think the advantages are
significant enough that we will probably see most people choosing to use
the REST catalog, but I could be wrong.

Ryan
-- 
Ryan Blue
Tabular
