In the DSv2 sync-up, we tried to discuss the Table metadata proposal but were side-tracked by its use of TableIdentifier. There were good points about how Spark should identify tables, views, functions, etc., and I want to start a discussion here.
Identifiers are orthogonal to the TableCatalog proposal, which can be updated to use whatever identifier class we choose. That proposal is concerned with what information should be passed to define a table, and how to pass that information.

The main question for *this* discussion is: *how should Spark identify tables, views, and functions when it supports multiple catalogs?*

There are two main approaches:
1. Use a 3-part identifier, catalog.database.table
2. Use an identifier with an arbitrary number of parts

*Option 1: use 3-part identifiers*

The argument for option #1 is that it is simple. If an external data store has additional logical hierarchy layers, then that hierarchy would be mapped to multiple catalogs in Spark. Spark can support SHOW TABLES and SHOW DATABASES without much trouble. This is the approach used by Presto, so there is some precedent for it.

The drawback is that mapping a more complex hierarchy into Spark requires more configuration. If an external DB has a 3-level hierarchy (say, schema.database.table), then option #1 requires users to configure a catalog for each top-level structure, i.e. for each schema (the first sketch appended at the end of this message shows what that per-schema configuration might look like). When a new schema is added, it is not automatically accessible.

Catalog implementations could use session options to provide a rough work-around by changing the plugin’s “current” schema. I think this is an anti-pattern, so another strike against this option is that it encourages bad practices.

*Option 2: use n-part identifiers*

That drawback of option #1 is the main argument for option #2: Spark should allow users to easily interact with the native structure of an external store. For option #2, a full identifier would be an arbitrary-length list of name parts. For the example above, catalog.schema.database.table would be allowed. An identifier would be something like this:

    case class CatalogIdentifier(parts: Seq[String])

The problem with option #2 is how to implement a listing and discovery API for operations like SHOW TABLES. If the catalog API requires a list(ident: CatalogIdentifier), what does it return? There is no guarantee that the listed objects are tables and not nested namespaces. How would Spark handle arbitrary nesting that differs across catalogs? (The second sketch at the end of this message illustrates the ambiguity.)

Hopefully, I’ve captured the design question well enough for a productive discussion. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix
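
Below are the two rough sketches referenced above. First, what the per-schema configuration burden of option #1 might look like. This is a minimal sketch only: the spark.sql.catalog.<name> property scheme and the com.example.ExampleCatalog plugin class are assumptions made for illustration, not part of any proposal.

    import org.apache.spark.sql.SparkSession

    // Option #1 sketch: the external store has schemas "prod" and "test", and
    // each schema is mapped to its own Spark catalog. The spark.sql.catalog.<name>
    // properties and the com.example.ExampleCatalog class are illustrative only.
    val spark = SparkSession.builder()
      .appName("option-1-per-schema-catalogs")
      .config("spark.sql.catalog.prod", "com.example.ExampleCatalog")
      .config("spark.sql.catalog.prod.schema", "prod")
      .config("spark.sql.catalog.test", "com.example.ExampleCatalog")
      .config("spark.sql.catalog.test.schema", "test")
      .getOrCreate()

    // Tables are then addressed with 3-part identifiers:
    //   SELECT * FROM prod.db1.events
    //   SELECT * FROM test.db1.events
    // A schema added to the external store later (say, "staging") is not
    // visible until someone also adds a spark.sql.catalog.staging entry.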
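
Second, a minimal sketch of option #2 and the listing question. The fromString helper, the ListedObject types, and the SupportsListing trait are illustrative assumptions rather than a proposed API; they are only meant to make the discovery ambiguity concrete.

    // Option #2 sketch: an identifier is just an ordered list of name parts.
    case class CatalogIdentifier(parts: Seq[String]) {
      require(parts.nonEmpty, "identifier must have at least one part")
    }

    object CatalogIdentifier {
      // Illustrative helper: split a dotted name into parts.
      // (Real parsing would also need to handle quoting/escaping of '.' in names.)
      def fromString(name: String): CatalogIdentifier =
        CatalogIdentifier(name.split('.').toSeq)
    }

    // The discovery problem: if SHOW TABLES is backed by a call like
    // list(ident), the result cannot be a plain list of names, because an
    // entry under an arbitrary-depth namespace might itself be a namespace.
    // One hypothetical shape for the result:
    sealed trait ListedObject
    case class ListedTable(ident: CatalogIdentifier) extends ListedObject
    case class ListedNamespace(ident: CatalogIdentifier) extends ListedObject

    trait SupportsListing {
      // Hypothetical listing call for this discussion, not a proposed interface.
      def list(namespace: CatalogIdentifier): Seq[ListedObject]
    }

    // For the catalog.schema.database.table example, listing the "schema"
    // namespace could return ListedNamespace entries (databases), while
    // listing "schema.database" could return ListedTable entries. Spark would
    // have to handle both cases, and the nesting depth can differ from one
    // catalog to the next.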