In the DSv2 sync up, we tried to discuss the Table metadata proposal but
were side-tracked on its use of TableIdentifier. There were good points
about how Spark should identify tables, views, functions, etc., and I want
to start a discussion here.

Identifiers are orthogonal to the TableCatalog proposal, which can be
updated to use whatever identifier class we choose. That proposal is
concerned with what information should be passed to define a table, and how
to pass that information.

The main question for *this* discussion is: *how should Spark identify
tables, views, and functions when it supports multiple catalogs?*

There are two main approaches:

   1. Use a 3-part identifier, catalog.database.table
   2. Use an identifier with an arbitrary number of parts

*Option 1: use 3-part identifiers*

The argument for option #1 is that it is simple. If an external data store
has additional logical hierarchy layers, then that hierarchy would be
mapped to multiple catalogs in Spark. Spark can support SHOW TABLES and
SHOW DATABASES without much trouble. This is the approach used by Presto,
so there is some precedent for it.
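
To make the shape of option #1 concrete, an identifier would be a fixed
3-part name, something like the sketch below (the class name is only
illustrative, not part of the proposal):

case class ThreePartIdentifier(catalog: String, database: String, table: String)

// e.g. prod.sales.events
val ident = ThreePartIdentifier("prod", "sales", "events")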

The drawback is that mapping a more complex hierarchy into Spark requires
more configuration. If an external DB has a 3-level hierarchy, say
schema.database.table, then option #1 requires users to configure a
catalog for each top-level structure (each schema). When a new schema is
added, it is not automatically accessible.
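
To illustrate, assuming a hypothetical per-catalog configuration scheme and
a SparkSession named spark, exposing two schemas from the same store would
need a separate catalog entry for each (the property names and plugin class
below are made up):

// one catalog entry per schema in the external store
spark.conf.set("spark.sql.catalog.schema1", "com.example.ExternalCatalog")
spark.conf.set("spark.sql.catalog.schema1.schema", "schema1")
spark.conf.set("spark.sql.catalog.schema2", "com.example.ExternalCatalog")
spark.conf.set("spark.sql.catalog.schema2.schema", "schema2")
// a new schema3 stays invisible until someone adds two more entries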

Catalog implementations could use session options to provide a rough
work-around by changing the plugin’s “current” schema. I think this is an
anti-pattern, so another strike against this option is that it encourages
bad practices.
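
As a sketch of that work-around (the option name is made up), the plugin’s
context would have to be switched before queries that target a different
schema:

// point the catalog at a different schema for this session
spark.conf.set("spark.sql.catalog.prod.current-schema", "schema2")
// prod.db.table now resolves against schema2 until the option is switched back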

*Option 2: use n-part identifiers*

That drawback of option #1 is the main argument for option #2: Spark
should allow users to easily interact with the native structure of an
external store. For option #2, a full identifier would be an
arbitrary-length list of name parts. For the example above,
catalog.schema.database.table would be allowed. An identifier would be
something like this:

case class CatalogIdentifier(parts: Seq[String])
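
With that class, the example identifier above is just the parsed parts in
order, e.g. (ignoring quoting and escaping, which a real parser would have
to handle):

val ident = CatalogIdentifier(Seq("catalog", "schema", "database", "table"))

// or, naively, from a dotted name:
val parsed = CatalogIdentifier("catalog.schema.database.table".split('.').toSeq)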

The problem with option #2 is how to implement a listing and discovery API
for operations like SHOW TABLES. If the catalog API requires a method like
list(ident: CatalogIdentifier), what does it return? There is no guarantee
that the listed objects are tables and not nested namespaces. How would
Spark handle arbitrary nesting that differs across catalogs?
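
To illustrate the problem, here is a rough sketch of what a listing API
might have to look like (the trait and result types are hypothetical, not
part of the proposal):

trait CatalogPlugin {
  // Given an arbitrary prefix, the results could be tables, nested
  // namespaces, or a mix, and Spark cannot know the depth ahead of time.
  def list(ident: CatalogIdentifier): Seq[ListedObject]
}

sealed trait ListedObject
case class NamespaceObject(ident: CatalogIdentifier) extends ListedObject
case class TableObject(ident: CatalogIdentifier) extends ListedObject

Even with a result type that distinguishes namespaces from tables, SHOW
TABLES and SHOW DATABASES no longer have an obvious meaning when the
nesting depth differs from one catalog to the next.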

Hopefully, I’ve captured the design question well enough for a productive
discussion. Thanks!

rb
-- 
Ryan Blue
Software Engineer
Netflix
