Re: DataSourceV2 community sync #3

Thakrar, Jayesh Mon, 03 Dec 2018 10:12:38 -0800

Thank you Ryan and Xiao – sharing all this info really gives a very good 
insight!

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Monday, December 3, 2018 at 12:05 PM
To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Cc: Xiao Li <gatorsm...@gmail.com>, Spark Dev List <dev@spark.apache.org>
Subject: Re: DataSourceV2 community sync #3

Jayesh,

I don’t think this need is very narrow.

To have reliable behavior for CTAS, you need to:
1.       Check whether a table exists and fail. Right now, it is up to the 
source whether to continue with the write if the table already exists or to 
throw an exception, which is unreliable across sources.
2.       Create a table if it doesn’t exist.
3.       Drop the table if writing failed. In the current implementation, this 
can’t be done reliably because #1 is unreliable. So a failed CTAS has a 
side-effect that the table is created in some cases and a subsequent retry can 
fail because the table exists.

Leaving these operations up to the read/write API is why behavior isn’t 
consistent today. It also increases the amount of work that a source needs to 
do and mixes concerns (what to do in a write when the table doesn’t exist). 
Spark is going to be a lot more predictable if we decompose the behavior of 
these operations into create, drop, write, etc.

And in addition to CTAS, we want these operations to be exposed for sources. If 
Spark can create a table, why wouldn’t you be able to run DROP TABLE to remove 
it?

Last, Spark must be able to interact with the source of truth for tables. If 
Spark can’t create a table in Cassandra, it should reject a CTAS operation.

On Mon, Dec 3, 2018 at 9:52 AM Thakrar, Jayesh 
<jthak...@conversantmedia.com<mailto:jthak...@conversantmedia.com>> wrote:
Thank you Xiao – I was wondering what was the motivation for the catalog.
If CTAS is the only candidate, would it suffice to have that as part of the 
data source interface only?

If we look at BI, ETL and reporting tools which interface with many tables from 
different data sources at the same time, it makes sense to have a metadata 
catalog as the catalog is used to “design” the work for that tool (e.g. ETL 
processing unit, etc). Furthermore, the catalog serves as a data mapping to map 
external data types to the tool’s data types.

Is the vision to move in that direction for Spark with the catalog 
support/feature?
Also, is the vision to also incorporate the “options” specified for the data 
source into the catalog too?
That may be helpful in some situations (e.g. the JDBC connect string being 
available from the catalog).
From: Xiao Li <gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>>
Date: Monday, December 3, 2018 at 10:44 AM
To: "Thakrar, Jayesh" 
<jthak...@conversantmedia.com<mailto:jthak...@conversantmedia.com>>
Cc: Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>>, 
"u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>
Subject: Re: DataSourceV2 community sync #3

Hi, Jayesh,

This is a good question. Spark is a unified analytics engine for various data 
sources. We are able to get the table schema from the underlying data sources 
via our data source APIs. Thus, it resolves most of the user requirements. 
Spark does not need the other info (like database, function, and views) that 
are stored in the local catalog. Note, Spark is not a query engine for a 
specific data source. Thus, we did not accept any public API that does not have 
an implementation in the past. I believe this still holds.

The catalog is part of the Spark SQL in the initial design and implementation. 
For the data sources that do not have catalog, they can use our catalog as a 
single source of truth. If they already have their own catalog, normally, they 
use the underlying data sources as the single source of truth. The table 
metadata in the Spark catalog is kind of a view of their physical schema that 
are stored in their local catalog. To support an atomic CREATE TABLE AS SELECT 
that requires modifying the catalog and data, we can add an interface for data 
sources but that is not part of catalog interface. The CTAS will not bypass our 
catalog. We will still register it in our catalog and the schema may or may not 
be stored in our catalog.

Will we define a super-feature catalog that can support all the data sources?

Based on my understanding, it is very hard. The priority is low based on our 
current scope of Spark SQL. If you want to do it, your design needs to consider 
how it works between global and local catalogs. This also requires a SPIP and 
voting. If you want to develop it incrementally without a design, I would 
suggest you to do it in your own fork. In the past, Spark on K8S was developed 
in a separate fork and then merged to the upstream of Apache Spark.

Welcome your contributions and let us make Spark great!

Cheers,

Xiao

Thakrar, Jayesh 
<jthak...@conversantmedia.com<mailto:jthak...@conversantmedia.com>> 
于2018年12月1日周六 下午9:10写道：
Just curious on the need for a catalog within Spark.

So Spark interface with other systems – many of which have a catalog of their 
own – e.g. RDBMSes, HBase, Cassandra, etc. and some don’t (e.g. HDFS, 
filesyststem, etc).
So what is the purpose of having this catalog within Spark for tables defined 
in Spark (which could be a front for other “catalogs”)?
Is it trying to fulfill some void/need…..
Also, would the Spark catalog be the common denominator of the other catalogs 
(least featured) or a super-feature catalog?

From: Xiao Li <gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>>
Date: Saturday, December 1, 2018 at 10:49 PM
To: Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>>
Cc: "u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>
Subject: Re: DataSourceV2 community sync #3

Hi, Ryan,

Let us first focus on answering the most fundamental problem before discussing 
various related topics. What is a catalog in Spark SQL?

My definition of catalog is based on the database catalog. Basically, the 
catalog provides a service that manage the metadata/definitions of database 
objects (e.g., database, views, tables, functions, user roles, and so on).

In Spark SQL, all the external objects accessed through our data source APIs 
are called "tables". I do not think we will expand the support in the near 
future. That means, the metadata we need from the external data sources are for 
table only.

These data sources should not use the Catalog identifier to identify. That 
means, in "catalog.database.table", catalog is only used to identify the actual 
catalog instead of data sources.

For a Spark cluster, we could mount multiple catalogs (e.g., hive_metastore_1, 
hive_metastore_2 and glue_1) at the same time. We could get the metadata of the 
tables, database, functions by accessing different catalog: 
"hive_metastore_1.db1.tab1", "hive_metastore_2.db2.tab2", "glue.db3.tab2". In 
the future, if Spark has its own catalog implementation, we might have 
something like, "spark_catalog1.db3.tab2". The catalog will be used for 
registering all the external data sources, various Spark UDFs and so on.

At the same time, we should NOT mix the table-level data sources with catalog 
support. That means, "Cassandra1.db1.tab1", "Kafka2.db2.tab1", 
"Hbase3.db1.tab2" will not appear.

Do you agree on my definition of catalog in Spark SQL?

Xiao

Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>> 于2018年12月1日周六 下午7:25写道：

I try to avoid discussing each specific topic about the catalog federation 
before we deciding the framework of multi-catalog supports.

I’ve tried to open discussions on this for the last 6+ months because we need 
it. I understand that you’d like a comprehensive plan for supporting more than 
one catalog before moving forward, but I think most of us are okay with the 
incremental approach. It’s better to make progress toward the goal.

In general, data source API V2 and catalog API should be orthogonal
I agree with you, and they are. The API that Wenchen is working on for reading 
and writing data and the TableCatalog API are orthogonal efforts. As I said, 
they only overlap with the Table interface, and clearly tables loaded from a 
catalog need to be able to plug into the read/write API.

The reason these two efforts are related is that the community voted to 
standardize logical 
plans<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
 Those standard plans have well-defined behavior for operations like CTAS, 
instead of relying on the data source plugin to do … something undefined. To 
implement this, we need a way for Spark to create tables, drop tables, etc. 
That’s why we need a way for sources to plug in Table-related catalog 
operations. (Sorry if this was already clear. I know I talked about it at 
length in the first v2 sync up.)

While the two APIs are orthogonal and serve different purposes, implementing 
common operations requires that we have both.

I would not call it a table catalog. I do not expect the data source 
should/need to implement a catalog. Since you might want an atomic CTAS, we can 
improve the table metadata resolution logic to support it with different 
resolution priorities. For example, try to get the metadata from the external 
data source, if the table metadata is not available in the catalog.

It sounds like your definition of a “catalog” is different. I think you’re 
referring to a global catalog? Could you explain what you’re talking about here?

I’m talking about an API to interface with an external data source, which I 
think we need for the reasons I outlined above. I don’t care what we call it, 
but your comment seems to hint that there would be an API to look up tables in 
external sources. That’s the thing I’m talking about.

CatalogTableIdentifier: The PR is doing nothing but adding an interface.

Yes. I opened this PR to discuss how Spark should track tables from different 
catalogs and avoid passing those references to code paths that don’t support 
them. The use of table identifiers with a catalog part was discussed in the 
“Multiple catalog support” thread. I’ve also brought it up and pointed out how 
I think it should be used in syncs a couple of times.

Sorry if this discussion isn’t how you would have done it, but it’s a fairly 
simple idea that I don’t think requires its own doc.

On Sat, Dec 1, 2018 at 5:12 PM Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
Hi, Ryan,

I try to avoid discussing each specific topic about the catalog federation 
before we deciding the framework of multi-catalog supports.

-  CatalogTableIdentifier: The PR https://github.com/apache/spark/pull/21978 is 
doing nothing but adding an interface. In the PR, we did not discuss how to 
resolve it, any restriction on the naming and what is a catalog.This requires 
more doc for explaining it. For example, 
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-2017
 Normally, we do not merge a PR without showing how to use it.

- TableCatalog: First, I would not call it a table catalog. I do not expect the 
data source should/need to implement a catalog. Since you might want an atomic 
CTAS, we can improve the table metadata resolution logic to support it with 
different resolution priorities. For example, try to get the metadata from the 
external data source, if the table metadata is not available in the catalog. 
However, the catalog should do what the catalog is expected to do. If we follow 
what our data source API V2 is doing, basically, the data source is just a 
table. It is not related to database, view, or function. Mixing catalog with 
data source API V2 just makes the whole things more complex.

In general, data source API V2 and catalog API should be orthogonal. I believe 
the data source API V2 and catalog APIs are two separate projects. Hopefully, 
you understand my concern. If we really want to mix them together, I want to 
read the design of your multi-catalog support and understand more details.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>> 于2018年12月1日周六 下午3:22写道：
Xiao,

I do have opinions about how multi-catalog support should work, but I don't 
think we are at a point where there is consensus. That's why I've started 
discussion threads and added the CatalogTableIdentifier PR instead of a 
comprehensive design doc. You have opinions about how users should interact 
with catalogs as well (your "federated catalog") and we should discuss our 
options here.

But the crucial point is that the user interaction doesn't need to be 
completely decided in order to move forward. A design for multi-catalog support 
isn't what we need right now; we need an API that plugins can implement to 
expose table operations.

I've proposed that API, TableCatalog, and a way to manage catalog plugins. I've 
made an argument for why I think that API is flexible enough for the task and 
still fairly simple.

I think that we can add TableCatalog now and work on multi-catalog support 
incrementally, and I have yet to hear your argument for why that is not the 
case.

rb

On Sat, Dec 1, 2018 at 12:36 PM Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
Hi, Ryan,

Catalog is a really important component for Spark SQL or any analytics 
platform, I have to emphasize. Thus, a careful design is needed to ensure it 
works as expected. Based on my previous discussion with many community members, 
Spark SQL needs a catalog interface so that we can mount multiple external 
physical catalogs and they can be presented as a single logical catalog [which 
is a so-called global federated catalog]. In the future, we can use this 
interface to develop our own catalog (instead of Hive metastore) for more 
efficient metadata management. We can also plug in ACL management if needed.

Based on your previous answers, it sounds like you have many ideas in your mind 
about building a Catalog interface for Spark SQL, but it is not shown in the 
design doc. Could you write them down in a single doc? We can try to leave 
comments in the design doc, instead of discussing various issues in PRs, emails 
and meetings. It can also help the whole community understand your proposal and 
post their comments.

Thanks,

Xiao

Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>> 于2018年11月29日周四 下午7:06写道：

Xiao,

For the questions in this last email about how catalogs interact and how 
functions and other future features work: we discussed those last night. As I 
said then, I think that the right approach is incremental. We don’t want to 
design all of that in one gigantic proposal up front. To do that is to put 
ourselves into analysis paralysis.

We don’t have a design for how catalogs interact with one another, but I think 
we made a strong case for two points: first, that the proposed structure 
doesn’t preclude any of those future decisions (hence we should proceed 
incrementally). Second, that those situations aren’t that hard to think through 
if you’re concerned about them: functions that can run in Spark can be run on 
any data, functions that run in external sources cannot be run on any data.

You’re right that I haven’t completely covered your new questions. But to the 
questions in your first email:
•         You asked how, for example, Glue may be plugged in. That is well 
covered in the PR that adds catalogs as a 
plugin<https://github.com/apache/spark/pull/21306#issue-187572913>, the 
response I sent to Wenchen’s questions, and the earlier discussion thread I 
posted to this list with the subject “[DISCUSS] Multiple catalog support”. The 
short answer is that implementations are configured with Spark config 
properties and loaded with reflection.
•         You asked how users implement an external catalog without adding new 
data sources. That’s also covered in the “Multiple catalog support” proposal, 
the table catalog PR, and ongoing discussions on the v2 redesign. The answer is 
that a catalog returns a table instance that implements the various interfaces 
from Wenchen’s work. A table may implement them directly or return other 
existing implementations. Here’s how it worked in the old 
API<https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>.

I hope that you don’t think I expect you to go “without seeing the design”!

rb

On Thu, Nov 29, 2018 at 3:17 PM Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
Ryan,

All the proposal I read is only related to Table metadata. Catalog contains the 
metadata of database, functions, columns, views, and so on. When we have 
multiple catalogs, how these catalogs interact with each other? How the global 
catalog works? How a view, table, function, database and column is resolved? Do 
we have nickname, mapping, wrapper?

Or I might miss the design docs you send? Could you post the doc?

Thanks,

Xiao

Ryan Blue <rb...@netflix.com<mailto:rb...@netflix.com>> 于2018年11月29日周四 下午3:06写道：
Xiao,

Please have a look at the pull requests and documents I've posted over the last 
few months.

If you still have questions about how you might plug in Glue, let me know and I 
can clarify.

rb

On Thu, Nov 29, 2018 at 2:56 PM Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
Ryan,

Thanks for leading the discussion and sending out the memo!

Xiao suggested that there are restrictions for how tables and functions 
interact. Because of this, he doesn’t think that separate TableCatalog and 
FunctionCatalog APIs are feasible.

Anything is possible. It depends on how we design the two interfaces. Now, most 
parts are unknown to me without seeing the design.

I think we need to see the user stories, and high-level design before working 
on a small portion of Catalog federation. We do not need an exhaustive design 
in the current stage, but we need to know how the new proposal works. For 
example, how to plug in a new Hive metastore? How to plug in a Glue? How do 
users implement a new external catalog without adding any new data sources? 
Without knowing more details, it is hard to say whether this TableCatalog can 
satisfy all the requirements.

Cheers,

Xiao

Ryan Blue <rb...@netflix.com.invalid> 于2018年11月29日周四 下午2:32写道：

Hi everyone,

Here are my notes from last night’s sync. Some attendees that joined during 
discussion may be missing, since I made the list while we were waiting for 
people to join.

If you have topic suggestions for the next sync, please start sending them to 
me. Thank you!

Attendees:

Ryan Blue
John Zhuge
Jamison Bennett
Yuanjian Li
Xiao Li
stczwd
Matt Cheah
Wenchen Fan
Genglian Wang
Kevin Yu
Maryann Xue
Cody Koeninger
Bruce Robbins
Rohit Karlupia

Agenda:
•         Follow-up issues or discussion on Wenchen’s PR #23086
•         TableCatalog proposal
•         CatalogTableIdentifier

Notes:
•         Discussion about PR #23086
o    Where should the catalog API live since it needs to be accessible to 
catalyst rules, but the catalyst module is private?
o    Wenchen suggested creating a sql-api module for v2 API interfaces, making 
catalyst depend on it
o    Consensus was to use Wenchen’s suggestion
•         In discussion about #23086, Xiao asked how adding catalog to a table 
identifier will work
o    Background from Ryan: existing code paths use TableIdentifier and don’t 
expect a catalog portion. If an identifier with a catalog were passed to 
existing code, that code may use the default catalog not knowing that a 
different one was requested, which would be incorrect behavior.
o    Ryan: The proposal for CatalogTableIdentifier addresses this problem. 
TableIdentifier is used for identifiers that have no catalog set. By enforcing 
that requirement, passing a TableIdentifier to old code ensures that no 
catalogs leak into that code. This is also used when the catalog is set from 
context. For example, the TableCatalog API accepts only TableIdentifier because 
the catalog is already determined.
•         Xiao asked whether FunctionIdentifier needs to be updated in the same 
way as CatalogTableIdentifier.
o    Ryan: Yes, when a FunctionCatalog API is added
•         The remaining time was spent discussing whether the plan to 
incrementally replace the current catalog API will work. [Not great notes here, 
feel free to add your take in a reply]
o    Xiao suggested that there are restrictions for how tables and functions 
interact. Because of this, he doesn’t think that separate TableCatalog and 
FunctionCatalog APIs are feasible.
o    Wenchen and Ryan think that functions should be orthogonal to data sources
o    Matt and Ryan think that catalog design can be done incrementally as new 
interfaces (i.e. FunctionCatalog) are added and that the proposed TableCatalog 
does not preclude designing for Xiao’s concerns later
o    [I forget who] pointed out that there are restrictions in some databases 
for views from different sources
o    There was some discussion about when functions or views cannot be 
orthogonal. For example, where the code runs is important. Functions pushed to 
sources cannot necessarily be run on other sources and Spark functions cannot 
necessarily be pushed down to sources.
o    Xiao would like a full catalog replacement design, including views, 
databases, and functions and how they interact, before moving forward with the 
proposed TableCatalog API
o    Ryan [and Matt, I think] think that TableCatalog is compatible with future 
decisions and the best path forward is to build incrementally. An exhaustive 
design process blocks progress on v2.

On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue 
<rb...@netflix.com<mailto:rb...@netflix.com>> wrote:

Hi everyone,

I just sent out an invite for the next DSv2 community sync for Wednesday, 28 
Nov at 5PM PST.

We have a few topics left over from last time to cover. A few people wanted to 
cover catalog APIs, so I put two items on the agenda:
•         The TableCatalog proposal (and other catalog APIs)
•         Using CatalogTableIdentifier to separate v1 and v2 code paths and 
avoid unintended behavior changes

As I noted in the summary last time, please send topics ahead of time so we can 
get started more quickly.

If you would like to be added to the google hangout invite, please let me know 
and I’ll add you. Thanks!

rb
--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 community sync #3

Reply via email to