Thanks for writing this up. Just to show why option 1 is not sufficient: MySQL and Postgres are the two most popular open source database systems, and both support three-part database → schema → table identification, so Spark passing only a two-part name to the data source (option 1) isn't sufficient.
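As a concrete illustration of the mismatch (the identifier values below are made up), a fully qualified Postgres reference carries three meaningful parts, while Spark's existing TableIdentifier only has room for two, so one level has to be dropped or pushed into per-catalog configuration:

    import org.apache.spark.sql.catalyst.TableIdentifier

    // A Postgres-style reference: database.schema.table (hypothetical values).
    val pgParts = Seq("analytics", "public", "events")

    // Spark's current 2-part identifier has no slot for the "analytics" level,
    // so it must be encoded in catalog/connection configuration instead.
    val sparkIdent = TableIdentifier("events", Some("public"))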
For the issues you brought up w.r.t. nesting - what's the challenge in supporting it? I can also see us not supporting it for now (no nesting allowed; the level directly above the leaves can only contain leaf tables), and adding support for nesting in the future. (A rough sketch of what a nesting-aware listing API could look like is at the bottom of this mail.)

On Sun, Jan 13, 2019 at 1:38 PM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> In the DSv2 sync-up, we tried to discuss the Table metadata proposal but
> were side-tracked on its use of TableIdentifier. There were good points
> about how Spark should identify tables, views, functions, etc., and I want
> to start a discussion here.
>
> Identifiers are orthogonal to the TableCatalog proposal, which can be
> updated to use whatever identifier class we choose. That proposal is
> concerned with what information should be passed to define a table, and
> how to pass that information.
>
> The main question for this discussion is: *how should Spark identify
> tables, views, and functions when it supports multiple catalogs?*
>
> There are two main approaches:
>
> * Use a 3-part identifier, catalog.database.table
> * Use an identifier with an arbitrary number of parts
>
> *Option 1: use 3-part identifiers*
>
> The argument for option #1 is that it is simple. If an external data store
> has additional logical hierarchy layers, then that hierarchy would be
> mapped to multiple catalogs in Spark. Spark can support SHOW TABLES and
> SHOW DATABASES without much trouble. This is the approach used by Presto,
> so there is some precedent for it.
>
> The drawback is that mapping a more complex hierarchy into Spark requires
> more configuration. If an external DB has a 3-level hierarchy — say,
> schema.database.table — then option #1 requires users to configure a
> catalog for each top-level structure, i.e. for each schema. When a new
> schema is added, it is not automatically accessible.
>
> Catalog implementations could use session options to provide a rough
> work-around by changing the plugin's “current” schema. I think this is an
> anti-pattern, so another strike against this option is that it encourages
> bad practices.
>
> *Option 2: use n-part identifiers*
>
> That drawback of option #1 is the main argument for option #2: Spark
> should allow users to easily interact with the native structure of an
> external store. For option #2, a full identifier would be an
> arbitrary-length list of identifiers. For the example above,
> catalog.schema.database.table would be allowed. An identifier would be
> something like this:
>
>   case class CatalogIdentifier(parts: Seq[String])
>
> The problem with option #2 is how to implement a listing and discovery
> API for operations like SHOW TABLES. If the catalog API requires a
> list(ident: CatalogIdentifier), what does it return? There is no guarantee
> that the listed objects are tables and not nested namespaces. How would
> Spark handle arbitrary nesting that differs across catalogs?
>
> Hopefully, I've captured the design question well enough for a productive
> discussion. Thanks!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
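Re: the sketch mentioned above. Purely as an illustration of the design space (CatalogPlugin, listNamespaces, and listTables are names I made up here, not part of the proposal), the listing API could separate namespaces from tables, so SHOW TABLES only ever returns leaf tables and nested namespaces are discovered explicitly:

    case class CatalogIdentifier(parts: Seq[String])

    // Hypothetical catalog interface, not the proposed API: listing is split so
    // callers never have to guess whether a returned name is a table or a
    // nested namespace.
    trait CatalogPlugin {
      // Namespaces (schemas, databases, ...) nested directly under the prefix.
      def listNamespaces(prefix: Seq[String]): Seq[Seq[String]]

      // Tables directly under the namespace; nested namespaces are excluded.
      def listTables(namespace: Seq[String]): Seq[CatalogIdentifier]
    }

Under that kind of split, SHOW TABLES IN catalog.schema.database would map to listTables(Seq("schema", "database")), and discovery of deeper levels would go through listNamespaces, regardless of how many levels the external store has.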