Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Ryan Blue
I agree with Reynold's sentiment here. We don't want to create too many capabilities because it makes everything more complicated for both sources and Spark. Let's just go with the capability to read missing columns for now, and we can add support for default values if and when Spark DDL begins to s…

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Ryan Blue
Alessandro, yes. This was one of the use cases that motivated the capability API I proposed. After this discussion, I think we probably need a couple of capabilities. First, the capability that indicates reads will fill in some default value for missing columns. That way, Spark allows writes to co…
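Ryan's read-side capability could be modeled roughly as follows. This is a minimal, self-contained sketch with all names hypothetical (it is not the actual DataSourceV2 API): a table advertises a capability set, and Spark would check for the fill-on-read flag before letting writes omit a column.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical capability flags -- illustrative names, not the real Spark API.
enum Capability { FILL_MISSING_COLUMNS_ON_READ, ACCEPTS_DEFAULT_VALUE_DDL }

// A table advertises what it supports; the engine queries this before planning.
interface CapabilityTable {
    Set<Capability> capabilities();
}

public class CapabilityDemo {
    public static void main(String[] args) {
        CapabilityTable table = () -> EnumSet.of(Capability.FILL_MISSING_COLUMNS_ON_READ);
        // Engine-side check: only allow writes to omit a column if reads fill it in.
        boolean ok = table.capabilities().contains(Capability.FILL_MISSING_COLUMNS_ON_READ);
        System.out.println(ok); // prints "true"
    }
}
```

The point of the flag is that Spark never has to know *which* default the source uses, only that reads are guaranteed to produce something for the missing column.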

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Reynold Xin
I'd only do any of the schema evolution things as add-ons on top. This is an extremely complicated area, and we could risk never shipping anything because there would be a lot of different requirements. On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer <russell.spit...@gmail.com> wrote: …

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Russell Spitzer
I definitely would like to have a "column can be missing" capability, allowing the underlying datasource to fill in a default if it wants to (or not). On Fri, Dec 21, 2018 at 1:40 AM Alessandro Solimando <alessandro.solima...@gmail.com> wrote: …

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Alessandro Solimando
Hello, I agree that Spark should check whether the underlying datasource supports default values or not, and adjust its behavior accordingly. If we follow this direction, do you see the default-values capability in scope of the "DataSourceV2 capability API"? Best regards, Alessandro. On Fri, 21 De…

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Wenchen Fan
Hi Ryan, That's a good point. Since in this case Spark is just a channel to pass the user's action to the data source, we should think of what actions the data source supports. Following this direction, it makes more sense to delegate everything to data sources. As the first step, maybe we should no…

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Ryan Blue
I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source and have a way for sources to signal that they accept default values in DDL (a capability), and assume that they do not in most cases. On Thu, Dec 20, 2018 …

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Russell Spitzer
I guess my question is why is this a Spark-level behavior? Say the user has an underlying source where they have a different behavior at the source level. In Spark they set a new default behavior and it's added to the catalogue; is the Source expected to propagate this? Or does the user have to be…

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior? Regarding "pass the default through to the underlying data source": this is one way to implement the behavior. On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote: …

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception. On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan wrote: …
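The pass-through behavior Ryan describes for option 1 can be sketched like this, with all names hypothetical: Spark hands the DDL default straight to the source, and a source without default support rejects it rather than silently dropping it.

```java
// Hypothetical sketch -- illustrative interface, not the actual Spark API.
public class PassThroughDefault {
    interface Source {
        void addColumnWithDefault(String name, Object defaultValue);
    }

    // A source that cannot store defaults throws instead of ignoring the clause.
    static class NoDefaultsSource implements Source {
        public void addColumnWithDefault(String name, Object defaultValue) {
            if (defaultValue != null) {
                throw new UnsupportedOperationException("defaults not supported");
            }
            // ... add the nullable column without a default ...
        }
    }

    public static void main(String[] args) {
        try {
            new NoDefaultsSource().addColumnWithDefault("c", 0);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage()); // prints "defaults not supported"
        }
    }
}
```

Failing loudly here is the key design choice: the user who wrote `DEFAULT 0` finds out immediately that the source cannot honor it, instead of discovering different read-time behavior later.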

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value]; If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we sh…
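The null-when-unspecified rule Wenchen cites can be shown with a small sketch (illustrative names, not Spark code): the optional DEFAULT clause is modeled as an `Optional`, and an absent clause resolves to null, matching the SQL behavior.

```java
import java.util.Optional;

public class DefaultResolution {
    // 'declaredDefault' models the optional [DEFAULT value] clause in ADD COLUMN.
    static Object resolveDefault(Optional<Object> declaredDefault) {
        // SQL behavior: a missing DEFAULT clause means the default is null.
        return declaredDefault.orElse(null);
    }

    public static void main(String[] args) {
        System.out.println(resolveDefault(Optional.of(0)));    // declared DEFAULT 0 -> prints "0"
        System.out.println(resolveDefault(Optional.empty()));  // no DEFAULT clause -> prints "null"
    }
}
```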

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me. On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan wrote: …

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
Note that the design we make here will affect both data source developers and end-users. It's better to provide reliable behaviors to end-users, instead of asking them to read the spec of the data source and know which value will be used for missing columns when they write data. If we do want to…

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Russell Spitzer
I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2) is for a unified way of dealing with missing columns? I feel like that probably should be left up to the underlying datasource implementation. For example, if you have missing columns with a database, the Datasource can choose a…

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard. 1. It should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN …

[DISCUSS] Default values and data sources

2018-12-18 Thread Ryan Blue
Hi everyone, This thread is a follow-up to a discussion that we started in the DSv2 community sync last week. The problem I’m trying to solve is that the format I’m integrating with DSv2 supports schema evolution. Specifically, adding a new optional column so that rows without that column get a…
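The schema-evolution behavior Ryan describes (rows written before a new optional column existed get some default on read) can be sketched minimally, with all names hypothetical and a row modeled as a simple map:

```java
import java.util.HashMap;
import java.util.Map;

public class EvolvedRead {
    // A reader fills a newly added optional column with a default for rows
    // written before the column existed. Names here are illustrative only.
    static Object readColumn(Map<String, Object> row, String column, Object defaultValue) {
        return row.getOrDefault(column, defaultValue); // old rows simply lack the key
    }

    public static void main(String[] args) {
        Map<String, Object> oldRow = new HashMap<>();
        oldRow.put("id", 1); // written before the "comment" column was added

        System.out.println(readColumn(oldRow, "id", null));       // prints "1"
        System.out.println(readColumn(oldRow, "comment", "n/a")); // prints "n/a"
    }
}
```

This is the read-side fill the thread keeps coming back to: the old files are never rewritten; the reader supplies the value for the missing column.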