Wenchen, what I'm suggesting is a bit of both of your proposals.

I think that USING should be optional, as in your first option. USING (or
format(...) on the DataFrame side) should configure the source or
implementation, while the catalog should be part of the table identifier.
They serve two different purposes: configuring storage within a catalog, and
choosing which catalog receives create and other calls. That's pretty much
what you suggest in #1. The USING syntax would continue to configure storage
within a catalog.
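To make that split concrete, here's a minimal sketch of identifier resolution, assuming a three-part catalog.db.table form and a default catalog named "spark_catalog"; all names here are illustrative, not the actual parser or API:

```scala
// Illustrative sketch only: the first identifier part names the catalog, while
// USING (the provider) only configures storage within whichever catalog is chosen.
case class ResolvedTable(catalog: String, db: String, table: String, provider: Option[String])

def resolve(parts: Seq[String],
            using: Option[String],
            defaultCatalog: String = "spark_catalog"): ResolvedTable =
  parts match {
    case Seq(c, d, t) => ResolvedTable(c, d, t, using)              // explicit catalog
    case Seq(d, t)    => ResolvedTable(defaultCatalog, d, t, using) // default catalog
    case Seq(t)       => ResolvedTable(defaultCatalog, "default", t, using)
  }
```

Note that USING only sets the provider; it never selects the catalog, and a catalog is always chosen even when USING is omitted.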

(Side note: I don't think this needs to be tied to a particular
implementation. We currently use 'parquet' to tell the Spark catalog to use
the Parquet source, but another catalog could also use 'parquet' to store
data in Parquet format without using the Spark built-in source.)

Your second option suggests separating the catalog API from the data source
API. In #21306 <https://github.com/apache/spark/pull/21306>, I added the
proposed catalog API and a reflection-based loader like the one v1 sources
use (and v2 sources have used so far). I think it makes much more sense to
start with a catalog and then get the data source for operations like CTAS.
This is compatible with the behavior from your point #1: the catalog chooses
the source implementation, and USING is optional.
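Here's a hedged sketch of that catalog-first flow; these names and signatures are illustrative, not the PR's actual API. The point is that CTAS asks the catalog to create the table, and the catalog decides which source backs it, treating USING as an optional hint:

```scala
// Hypothetical table metadata: the catalog, not the parser, fixes the provider.
case class Table(name: String, provider: String)

trait TableCatalog {
  def createTable(name: String, provider: Option[String]): Table
}

// A generic catalog honors USING, falling back to a default source.
class SessionCatalog extends TableCatalog {
  def createTable(name: String, provider: Option[String]): Table =
    Table(name, provider.getOrElse("parquet"))
}

// A catalog tied to a specific source ignores USING, as JDBC or Iceberg would.
class IcebergCatalog extends TableCatalog {
  def createTable(name: String, provider: Option[String]): Table =
    Table(name, "iceberg")
}
```

Starting from the catalog means the source lookup happens after table creation is routed, which is what makes USING safely optional.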

The reason we considered an API to get a catalog from the source is that we
defined the source API first, but it doesn't make sense to get a catalog
from a data source. Catalogs can share data sources (e.g. prod and test
environments). Plus, it makes more sense to determine the catalog and then
have it return the source implementation, because the catalog may require a
specific one, as JDBC or Iceberg would. With standard logical plans we
always know the catalog when creating the plan: either the table identifier
includes an explicit one, or the default catalog is used.

In the PR I mentioned above, the catalog implementation's class is
determined by Spark config properties, so there's no need to use
ServiceLoader and we can use the same implementation class for multiple
catalogs with different configs (e.g. prod and test environments).
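A rough sketch of what such a config-driven loader could look like; the property naming convention and the initialize method are assumptions for illustration, not necessarily what the PR does:

```scala
// Assumed plugin interface: implementations get their per-catalog options
// after a no-arg reflective construction.
trait CatalogPlugin {
  def initialize(options: Map[String, String]): Unit
}

// Example implementation; the same class can back several named catalogs.
class ExampleCatalog extends CatalogPlugin {
  var options: Map[String, String] = Map.empty
  def initialize(opts: Map[String, String]): Unit = { options = opts }
}

// Assumed convention: spark.sql.catalog.<name> = <impl class>, and
// spark.sql.catalog.<name>.<key> = <value> for that catalog's options.
def loadCatalog(name: String, conf: Map[String, String]): CatalogPlugin = {
  val className = conf(s"spark.sql.catalog.$name")
  val catalog = Class.forName(className)
    .getDeclaredConstructor().newInstance().asInstanceOf[CatalogPlugin]
  val prefix = s"spark.sql.catalog.$name."
  catalog.initialize(conf.collect {
    case (k, v) if k.startsWith(prefix) => k.stripPrefix(prefix) -> v
  })
  catalog
}
```

With this shape, "prod" and "test" can both point at the same class but initialize with different options, and no ServiceLoader registration is needed.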

Your last point about path-based tables deserves some attention, but we also
need to define their behavior. Part of what we want to preserve is
flexibility, like how you don't need to alter the schema of a JSON table;
you just write different data. For the path-based syntax, I suggest looking
up the source first and using it if there is one. If not, then look up the
catalog. That way existing tables keep working, but we can migrate to
catalogs with names that don't conflict.
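The lookup order I'm suggesting can be sketched like this (the registries here are stand-ins, not real Spark structures):

```scala
// Path-based name resolution: prefer a registered source so existing
// `source`.`path` tables keep working; fall back to a catalog only when
// no source matches the name.
def resolvePathBased(name: String,
                     sources: Set[String],
                     catalogs: Set[String]): Option[String] =
  if (sources.contains(name)) Some(s"source:$name")
  else if (catalogs.contains(name)) Some(s"catalog:$name")
  else None
```

So a name like "json" keeps resolving to the built-in source, while a new catalog is only reachable under a name that no source already claims.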

rb
