Hi Yufei, I’ll reply to your questions inline:

On Tue, Jan 4, 2022 at 3:12 PM Yufei Gu <flyrain...@gmail.com> wrote:

> 1. Are we going to open for other ways to store the table metadata instead
> of the metadata.json files? For example, a relational database or a
> key-value database. This will be a big change if that’s so, since it
> changes the assumption that the table version is preserved as a file.

We can relax the spec here in the future. My original intent with the TableOperations API was to allow flexibility to use a database, but the metadata JSON files have been an important part of compatibility. For now, this is required by the spec, but there is nothing stopping a catalog implementation from not writing the metadata JSON file.

> 2. Assume that we still keep metadata.json files, does that mean the
> catalog server needs to write a new metadata.json for each commit, which
> needs the permission to access the file system where the table is located?
> It is tricky to do that for some users whose catalog doesn’t have the
> access of the file system.

That’s right. To follow the spec, you’d need the service to write a metadata JSON file, although I don’t think it is likely that the client would actually attempt to read it. As long as the service returns metadata, it would just be extra work. So you could choose whether to make sure your service can write the file, or privately violate the spec.

My recommendation here is to make sure the service can write the file. It provides an important backup of table state in case anything happens to the DB, and is the most portable representation of table metadata.

> 3. What’s the major benefits of the REST catalog APIs? Considering that the
> current user has to make a lot of changes to adopt it, either from client
> side and server side. The client side should be fine since Iceberg may
> provide it, but the server side is a big task for existing users.

There are quite a few benefits:

- It is a standard client interface that can be used by processing engines. It’s hard to get a catalog client into products like Trino. Organizations that use a custom metastore can’t expect to get a Jar into hosted processing services, so it makes sense to use one common client and customize a service endpoint.
- Customizing the service with a standard client also allows writing just one catalog implementation that can be used across languages, rather than trying to implement the catalog the same way across Java, Python, etc.
- A catalog service can do a better job with resources like JDBC connections. You probably don’t want hundreds or thousands of connections directly to the catalog database.
- A change-based API has several advantages as well:
  - Conflict detection can be more granular: commits to different branches don’t need to conflict with one another, nor should metadata commits
  - Only one writer version is producing metadata JSON files, so you don’t have to update every writer in your infrastructure to use new metadata structures
  - Reads and writes send data directly to/from S3 or another object store because metadata JSON writes are handled by the service

> 4. How does an existing catalog support the new APIs? For example, HMS may
> need to add an extra layer or a plugin to support the server side of the
> APIs.

The existing catalogs will continue to work as they are. There’s no need for HMS to change since it already has a running service and working API. But this would benefit other catalogs like JDBC (as I noted above) and would help with cross-language compatibility.

I think the advantages are significant enough that we will probably see most people choosing to use the REST catalog, but I could be wrong.

Ryan

--
Ryan Blue
Tabular
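
To make the point about granular conflict detection concrete, here is a rough sketch of how a change-based commit might be checked on the service side. It is only an illustration under stated assumptions: `TableState`, `commit_branch`, `commit_properties`, and `CommitConflict` are hypothetical names invented for this example, not part of the Iceberg spec or the REST catalog API, and the in-memory dict stands in for whatever store the service actually uses.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the state a change depends on has moved since the writer read it."""


@dataclass
class TableState:
    # Hypothetical in-memory stand-in for the catalog's record of one table.
    refs: dict = field(default_factory=dict)        # branch name -> snapshot id
    properties: dict = field(default_factory=dict)  # table properties


def commit_branch(table, branch, expected_snapshot, new_snapshot):
    """Advance one branch only if it still points at the snapshot the writer read."""
    if table.refs.get(branch) != expected_snapshot:
        raise CommitConflict(f"branch {branch!r} moved")
    table.refs[branch] = new_snapshot


def commit_properties(table, updates):
    """Apply a property change; it places no requirement on any branch."""
    table.properties.update(updates)


table = TableState(refs={"main": 1, "audit": 1})

# Two writers that both read the table at snapshot 1 commit to different branches.
commit_branch(table, "main", expected_snapshot=1, new_snapshot=2)
commit_branch(table, "audit", expected_snapshot=1, new_snapshot=3)  # no conflict
commit_properties(table, {"owner": "analytics"})                    # no conflict

# A writer with a stale view of the same branch is rejected.
try:
    commit_branch(table, "main", expected_snapshot=1, new_snapshot=4)
    stale_rejected = False
except CommitConflict:
    stale_rejected = True
```

Because each change carries only the requirement it actually depends on, the service can accept the `audit` and property commits even though `main` has moved. With a single whole-file metadata swap, all three would have had to retry.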