Wenchen, I think the misunderstanding is around how the v2 API should work
with multiple catalogs.

Data sources are read/write implementations that resolve to a single JVM
class. When we consider how these implementations should work with multiple
table catalogs, I think it is clear that the catalog needs to be able to
choose the implementation and should be able to share implementations
across catalogs. Those requirements are incompatible with the idea that
Spark should get a catalog from the data source.

An easy way to think about this is the Parquet example from my earlier
email. *Why would using format("parquet") determine the catalog where a
table is created?*

The conclusion I came to is that to support CTAS and other operations that
require a catalog, Spark should determine that catalog first, not the
storage implementation (data source) first. The catalog should return a
Table that implements ReadSupport and WriteSupport. The actual
implementation class doesn’t need to be chosen by users.
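
Roughly, the shape I have in mind looks something like this. This is only a
sketch; the names and signatures are placeholders, not a final proposal:

interface TableCatalog {
  // the catalog, not the user-supplied format, returns the table and
  // therefore chooses the read/write implementation
  public Table loadTable(Identifier ident);
  public Table createTable(Identifier ident, StructType schema,
      Map<String, String> properties);
}

// a returned Table would implement ReadSupport and WriteSupport, so Spark
// gets reads and writes from the table rather than from format(...)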

That leaves a few open questions.

First open question: *How can we support reading tables without metadata?*
This is your load example: df.read.format("xyz").option(...).load.

I think we should continue to use the DataSource v1 loader to load a
DataSourceV2, then define a way for that to return a Table with ReadSupport
and WriteSupport, like this:

interface DataSourceV2 {
  public Table anonymousTable(Map<String, String> tableOptions);
}
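
The load path would then resolve the source the same way it does today and
ask it for a table. As a rough sketch, where lookupDataSource stands in for
the existing v1 class resolution:

// resolve the implementation class with the existing v1 loader, then ask
// the source for an anonymous table instead of a reader directly
DataSourceV2 source = (DataSourceV2) lookupDataSource("xyz").newInstance();
Table table = source.anonymousTable(options);
// planning then uses the table's ReadSupport, just like a catalog table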

While I agree that these tables without metadata should be supported, many
of the current uses are actually working around missing multi-catalog
support. JDBC is a good example. You have to point directly to a JDBC table
using the source and options because we don’t have a way to connect to JDBC
as a catalog. If we make catalog definition easy, then we can support CTAS
to JDBC, make it simpler to load several tables in the same remote
database, etc. This would also improve working with persistent JDBC tables,
because Spark would connect to the source of truth for table metadata
instead of copying it into the session’s ExternalCatalog.
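
To illustrate, with a catalog API in place, pointing at a database could
look something like this. The property names and the JdbcTableCatalog class
are made up for the example:

// hypothetical: a named catalog backed by a JDBC implementation, configured
// once instead of passing connection options for every table
SparkSession spark = SparkSession.builder()
    .config("spark.sql.catalog.testdb", "com.example.JdbcTableCatalog")
    .config("spark.sql.catalog.testdb.url", "jdbc:postgresql://db-host/test")
    .getOrCreate();

// CTAS then goes through the catalog, which creates the table in the database
spark.sql("CREATE TABLE testdb.part_summary AS " +
    "SELECT part, count(*) AS cnt FROM parts GROUP BY part");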

In other words, the case we should be primarily targeting is catalog-based
tables, not tables without metadata.

Second open question: *How should the format method and USING clause work?*

I think these should be passed to the catalog, and the catalog can decide
what to do with them. Formats like “parquet” and “json” are currently
resolved to a concrete Java class, so there is already precedent for
treating these names as information for the catalog rather than as concrete
implementations. They should be optional and should be passed to any
catalog.

The implementation of TableCatalog backed by the current ExternalCatalog
can continue to use format / USING to choose the data source directly, but
there’s no requirement for other catalogs to do that because there are no
other catalogs right now. Passing this to an Iceberg catalog could
determine whether Iceberg’s underlying storage is “avro” or “parquet”, even
though Iceberg uses a different data source implementation.
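
In API terms, the USING or format value would just travel to the catalog as
one more table property. As a sketch, reusing the createTable call from the
catalog sketch above (the "provider" key is only an example):

// the USING / format(...) value becomes a table property that the catalog
// is free to interpret or ignore
Map<String, String> properties = new HashMap<String, String>();
properties.put("provider", "parquet");
Table table = catalog.createTable(identifier, schema, properties);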

Third open question: *How should path-based tables work?*

First, path-based tables need clearly defined behavior. That’s missing
today. I’ve heard people cite the “feature” that you can write a different
schema to a path-based JSON table without needing to run an “alter table”
on it to update the schema. If this is behavior we want to preserve (and I
think it is), then we need to clearly state what that behavior is.

Second, I think that we can build a TableCatalog-like interface to handle
path tables.
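
Something shaped like this could work, again just as a sketch with
placeholder names:

// the same Table abstraction, but tables are identified by a path and
// options instead of by a catalog/database/table name
interface PathTableCatalog {
  public Table loadTable(String path, Map<String, String> options);
  public boolean dropTable(String path);
}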

rb
On Tue, Jul 31, 2018 at 7:58 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Here is my interpretation of your proposal, please correct me if something
> is wrong.
>
> End users can read/write a data source with its name and some options.
> e.g. `df.read.format("xyz").option(...).load`. This is currently the only
> end-user API for data source v2, and is widely used by Spark users to
> read/write data source v1 and file sources, we should still support it. We
> will add more end-user APIs in the future, once we standardize the DDL
> logical plans.
>
> If a data source wants to be used with tables, then it must implement some
> catalog functionalities. At least it needs to support
> create/lookup/alter/drop table, and optionally more features like managing
> functions/views and supporting the USING syntax. This means, to use file
> sources with tables, we need another data source that has full catalog
> functionalities. We can implement a Hive data source with all catalog
> functionalities backed by HMS, or a Glue data source backed by AWS Glue.
> They should both support the USING syntax and thus support file sources. If
> USING is not specified, the default storage (Hive tables) should be used.
>
> For path-based tables, we can create a special API for it and define the
> rule to resolve ambiguity when looking up tables.
>
> If we go with this direction, one problem is that "data source" may not be
> a good name anymore, since a data source can provide catalog
> functionalities.
>
> Under the hood, I feel this proposal is very similar to my second
> proposal, except that a catalog implementation must provide a default data
> source/storage, and a different rule for looking up tables.
>
>
> On Sun, Jul 29, 2018 at 11:43 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Wenchen, what I'm suggesting is a bit of both of your proposals.
>>
>> I think that USING should be optional like your first option. USING (or
>> format(...) in the DF side) should configure the source or implementation,
>> while the catalog should be part of the table identifier. They serve two
>> different purposes: configuring the storage within the catalog, and
>> choosing which catalog to pass create or other calls to. I think that's
>> pretty much what you suggest in #1. The USING syntax would continue to be
>> used to configure storage within a catalog.
>>
>> (Side note: I don't think this needs to be tied to a particular
>> implementation. We currently use 'parquet' to tell the Spark catalog to use
>> the Parquet source, but another catalog could also use 'parquet' to store
>> data in Parquet format without using the Spark built-in source.)
>>
>> The second option suggests separating the catalog API from the data source.
>> In #21306 <https://github.com/apache/spark/pull/21306>, I add the proposed
>> catalog API and a reflection-based loader like the v1 sources use (and v2
>> sources have used so far). I think that it makes much more sense to
>> start with a catalog and then get the data source for operations like CTAS.
>> This is compatible with the behavior from your point #1: the catalog
>> chooses the source implementation and USING is optional.
>>
>> The reason why we considered an API to get a catalog from the source is
>> because we defined the source API first, but it doesn't make sense to get a
>> catalog from the data source. Catalogs can share data sources (e.g. prod
>> and test environments). Plus, it makes more sense to determine the catalog
>> and then have it return the source implementation because it may require a
>> specific one, like JDBC or Iceberg would. With standard logical plans we
>> always know the catalog when creating the plan: either the table identifier
>> includes an explicit one, or the default catalog is used.
>>
>> In the PR I mentioned above, the catalog implementation's class is
>> determined by Spark config properties, so there's no need to use
>> ServiceLoader and we can use the same implementation class for multiple
>> catalogs with different configs (e.g. prod and test environments).
>>
>> Your last point about path-based tables deserves some attention. But, we
>> also need to define the behavior of path-based tables. Part of what we want
>> to preserve is flexibility, like how you don't need to alter the schema in
>> JSON tables, you just write different data. For the path-based syntax, I
>> suggest looking up source first and using the source if there is one. If
>> not, then look up the catalog. That way existing tables work, but we can
>> migrate to catalogs with names that don't conflict.
>>
>> rb
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
