Thanks for writing this up. To show why option 1 is not sufficient: MySQL and 
Postgres are the two most popular open source database systems, and both 
support three-part database → schema → table identification, so having Spark 
pass only a two-part name to the data source (option 1) isn't sufficient.

For the issues you brought up w.r.t. nesting: what is the challenge in 
supporting it? I could also see us not supporting it for now (no nesting 
allowed; the level just above the leaf can only contain leaf tables) and 
adding support for nesting in the future.
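
To make the identifier shapes concrete, here is a minimal Scala sketch. It 
assumes the CatalogIdentifier case class proposed further down in Ryan's 
email; the catalog, database, schema, and table names are made up for 
illustration.

    object IdentifierSketch {
      // As proposed below: an identifier is an ordered list of name parts.
      case class CatalogIdentifier(parts: Seq[String])

      // Postgres and MySQL expose database -> schema -> table, so a full
      // reference needs three parts after the catalog name (option 2).
      val threeLevel =
        CatalogIdentifier(Seq("my_pg_catalog", "sales_db", "public", "orders"))

      // Under option 1 (catalog.database.table), only two parts reach the
      // source, so the database level has to be baked into the catalog itself.
      val twoLevel =
        CatalogIdentifier(Seq("my_pg_catalog", "public", "orders"))
    }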

On Sun, Jan 13, 2019 at 1:38 PM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> In the DSv2 sync up, we tried to discuss the Table metadata proposal but
> were side-tracked on its use of TableIdentifier. There were good points
> about how Spark should identify tables, views, functions, etc., and I want
> to start a discussion here.
> 
> Identifiers are orthogonal to the TableCatalog proposal, which can be
> updated to use whatever identifier class we choose. That proposal is
> concerned with what information should be passed to define a table, and
> how to pass that information.
> 
> The main question for this discussion is: *how should Spark identify
> tables, views, and functions when it supports multiple catalogs?*
> 
> There are two main approaches:
> 
> * Use a 3-part identifier, catalog.database.table
> * Use an identifier with an arbitrary number of parts
> 
> *Option 1: use 3-part identifiers*
> 
> The argument for option #1 is that it is simple. If an external data store
> has additional logical hierarchy layers, then that hierarchy would be
> mapped to multiple catalogs in Spark. Spark can support show tables and
> show databases without much trouble. This is the approach used by Presto,
> so there is some precedent for it.
> 
> The drawback is that mapping a more complex hierarchy into Spark requires
> more configuration. If an external DB has a 3-level hierarchy — say, for
> example, schema.database.table — then option #1 requires users to configure
> a catalog for each top-level structure, each schema. When a new schema is
> added, it is not automatically accessible.
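
To illustrate that configuration burden, here is a hypothetical sketch of 
mapping each schema to its own Spark catalog. The spark.sql.catalog.* property 
prefix and the com.example.PostgresCatalog class are assumptions made for the 
example, not an existing API.

    import org.apache.spark.sql.SparkSession

    object Option1ConfigSketch {
      def main(args: Array[String]): Unit = {
        // One catalog entry per schema of the external store; a newly created
        // schema stays invisible until someone adds another pair of entries.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("option1-config-sketch")
          .config("spark.sql.catalog.pg_sales", "com.example.PostgresCatalog")
          .config("spark.sql.catalog.pg_sales.schema", "sales")
          .config("spark.sql.catalog.pg_marketing", "com.example.PostgresCatalog")
          .config("spark.sql.catalog.pg_marketing.schema", "marketing")
          .getOrCreate()
        spark.stop()
      }
    }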
> 
> Catalog implementations could use session options to provide a rough
> work-around by changing the plugin’s “current” schema. I think this is an
> anti-pattern, so another strike against this option is that it encourages
> bad practices.
> 
> *Option 2: use n-part identifiers*
> 
> That drawback of option #1 is the main argument for option #2: Spark
> should allow users to easily interact with the native structure of an
> external store. For option #2, a full identifier would be an
> arbitrary-length list of identifiers. For the example above,
> catalog.schema.database.table would be allowed. An identifier would be
> something like this:
> 
> case class CatalogIdentifier(parts: Seq[String])
> 
> The problem with option #2 is how to implement a listing and discovery
> API, for operations like SHOW TABLES. If the catalog API requires a
> list(ident: CatalogIdentifier), what does it return? There is no guarantee
> that the listed objects are tables and not nested namespaces. How would
> Spark handle arbitrary nesting that differs across catalogs?
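
One possible shape for that listing API, sketched here with hypothetical names 
rather than anything from the proposal: have list() return entries that are 
explicitly tagged as namespaces or tables, so SHOW TABLES can filter to tables 
and discovery can recurse into namespaces when a catalog supports nesting.

    object ListingSketch {
      // Hypothetical result type: each child of a namespace is either a
      // nested namespace or a table.
      sealed trait CatalogEntry
      case class NamespaceEntry(parts: Seq[String]) extends CatalogEntry
      case class TableEntry(parts: Seq[String]) extends CatalogEntry

      // Hypothetical listing hook: returns the direct children of the
      // namespace identified by `parts`.
      trait ListingSupport {
        def list(parts: Seq[String]): Seq[CatalogEntry]
      }
    }

A catalog that does not support nesting would return NamespaceEntry values 
only at its top level and only TableEntry values below that, which matches the 
interim "no nesting" rule suggested at the top of this thread.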
> 
> Hopefully, I’ve captured the design question well enough for a productive
> discussion. Thanks!
> 
> rb
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>
