I definitely would like to have a "column can be missing" capability, allowing the underlying datasource to fill in a default if it wants to (or not).
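For a rough idea of what such a capability could look like, here is a minimal sketch in DSv2-flavored Scala. Every name in it (AcceptsMissingColumns, canOmit, validateWriteSchema) is invented for illustration; none of this is an existing Spark API:

    object MissingColumnCheck {
      // Hypothetical mixin: a table advertises that a write may omit `column`,
      // and the source will fill in its own default (or simply leave it empty).
      trait AcceptsMissingColumns {
        def canOmit(column: String): Boolean
      }

      // How Spark's append validation might consult the capability: a missing
      // column is an error unless the source explicitly accepts its absence.
      def validateWriteSchema(table: Any,
                              tableColumns: Seq[String],
                              inputColumns: Seq[String]): Unit =
        tableColumns.filterNot(inputColumns.contains).foreach { col =>
          table match {
            case t: AcceptsMissingColumns if t.canOmit(col) => // ok, source handles it
            case _ => throw new IllegalArgumentException(s"missing required column: $col")
          }
        }
    }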
On Fri, Dec 21, 2018 at 1:40 AM Alessandro Solimando <alessandro.solima...@gmail.com> wrote:

> Hello,
> I agree that Spark should check whether the underlying datasource supports default values or not, and adjust its behavior accordingly.
>
> If we follow this direction, do you see the default-values capability as in scope for the "DataSourceV2 capability API"?
>
> Best regards,
> Alessandro
>
> On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <cloud0...@gmail.com> wrote:
> >
> > Hi Ryan,
> >
> > That's a good point. Since in this case Spark is just a channel that passes the user's actions on to the data source, we should think about which actions the data source supports.
> >
> > Following this direction, it makes more sense to delegate everything to the data sources.
> >
> > As a first step, maybe we should not add DDL commands that change the schema of a data source, but just use the capability API to let the data source decide what to do when the input schema doesn't match the table schema during a write. Users can use the native client of the data source to change the schema.
> >
> > On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <rb...@netflix.com> wrote:
> >>
> >> I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source, have a way for sources to signal that they accept default values in DDL (a capability), and assume that they do not in most cases.
> >>
> >> On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> >>>
> >>> I guess my question is why this is a Spark-level behavior. Say the user has an underlying source with a different behavior at the source level. In Spark they set a new default behavior and it's added to the catalogue; is the source expected to propagate this? Or does the user have to be aware that their own source settings may differ depending on whether a client connects via Spark or via a native driver?
> >>>
> >>> For example, say I'm using C* (sorry, but obviously I'm always thinking about C*), and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. Now suppose someone in Spark sets the column as having a default value (there is no such feature in C*): depending on how I connect to the source, I get two different behaviors. If I insert from the native app I get empty cells; if I insert from Spark I get a default value inserted. That sounds more confusing to an end-user than having a consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think it makes sense once you consider multiple interaction points for a source.
> >>>
> >>> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0...@gmail.com> wrote:
> >>>>
> >>>> So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?
> >>>>
> >>>> > pass the default through to the underlying data source
> >>>>
> >>>> This is one way to implement the behavior.
> >>>>
> >>>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:
> >>>>>
> >>>>> I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
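To make "pass the default through and let the source throw" concrete, a sketch of that delegation might look like the following. AddColumn, SupportsSchemaChanges, and CassandraLikeTable are made-up names used only to show the shape of the idea; they are not Spark or connector APIs:

    object SchemaChangeDelegation {
      // A parsed "ALTER TABLE t ADD COLUMN name type [DEFAULT value]" change.
      case class AddColumn(name: String, dataType: String, default: Option[String])

      // Hypothetical hook a source implements to receive DDL from Spark.
      trait SupportsSchemaChanges {
        def applyChange(change: AddColumn): Unit
      }

      // A Cassandra-like source has no default-value concept, so it rejects a
      // DEFAULT clause outright instead of Spark silently emulating it.
      class CassandraLikeTable extends SupportsSchemaChanges {
        override def applyChange(change: AddColumn): Unit = change.default match {
          case Some(_) =>
            throw new UnsupportedOperationException(
              s"this source does not support DEFAULT (column ${change.name})")
          case None =>
            () // add the column natively; existing rows read back as empty cells
        }
      }
    }

Rejecting the DDL at the source keeps behavior consistent between Spark and native clients, which is exactly Russell's concern above.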
> >>>>>
> >>>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
> >>>>>>
> >>>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];
> >>>>>>
> >>>>>> If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind), instead of reusing the existing syntax, in order to stay SQL-compatible.
> >>>>>>
> >>>>>> Personally, I don't like reinventing the wheel. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end-users.
> >>>>>>
> >>>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
> >>>>>>>
> >>>>>>> Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.
> >>>>>>>
> >>>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Note that the design we make here will affect both data source developers and end-users. It's better to provide reliable behaviors to end-users, instead of asking them to read the spec of the data source to know which value will be used for missing columns when they write data.
> >>>>>>>>
> >>>>>>>> If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior differs from the SQL standard ADD COLUMN command.
> >>>>>>>>
> >>>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2) is a unified way of dealing with missing columns? I feel like that should probably be left up to the underlying datasource implementation. For example, if you have missing columns with a database, the datasource can choose a value based on the database's metadata if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.
> >>>>>>>>>
> >>>>>>>>> So my tl;dr is: let a datasource report whether or not missing columns are OK, and let the datasource deal with the missing data based on its underlying storage.
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior match the RDBMS/SQL standard.
> >>>>>>>>>>
> >>>>>>>>>> 1. It should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
> >>>>>>>>>> 2. When adding a new column, the default value should be effective for all the existing data, and newly written data.
> >>>>>>>>>> 3. When altering an existing column to change the default value, it should be effective for newly written data only.
> >>>>>>>>>>
> >>>>>>>>>> A possible implementation:
> >>>>>>>>>> 1. A column has 2 default values: the initial one and the latest one.
> >>>>>>>>>> 2. When adding a column with a default value, set both the initial one and the latest one to this value, but do not update existing data.
> >>>>>>>>>> 3. When reading data, fill the missing column with the initial default value.
> >>>>>>>>>> 4. When writing data, fill the missing column with the latest default value.
> >>>>>>>>>> 5. When altering a column to change its default value, only update the latest default value.
> >>>>>>>>>>
> >>>>>>>>>> This works because:
> >>>>>>>>>> 1. New files will be written with the latest default value; nothing we need to worry about at read time.
> >>>>>>>>>> 2. Old files will be read with the initial default value, which returns the expected result.
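A compact sketch of the two-default bookkeeping described above (illustrative names only, not Spark APIs):

    object TwoDefaults {
      // A column carries two defaults: the initial one patches old files at
      // read time, the latest one fills missing input columns at write time.
      case class Column(name: String,
                        initialDefault: Option[Any],
                        latestDefault: Option[Any])

      // ADD COLUMN c DEFAULT v: both defaults start as v; existing files untouched.
      def addColumn(name: String, default: Option[Any]): Column =
        Column(name, initialDefault = default, latestDefault = default)

      // ALTER COLUMN c SET DEFAULT v: only the latest default moves, so data
      // already on disk keeps reading back the value it was written under.
      def alterDefault(col: Column, newDefault: Option[Any]): Column =
        col.copy(latestDefault = newDefault)

      // Read path: a value missing from an old file gets the initial default.
      def onRead(col: Column, stored: Option[Any]): Option[Any] =
        stored.orElse(col.initialDefault)

      // Write path: a value missing from the input row gets the latest default.
      def onWrite(col: Column, input: Option[Any]): Option[Any] =
        input.orElse(col.latestDefault)
    }

For example, adding column c with DEFAULT 0 and later altering it to DEFAULT 1 means files that predate c still read back 0, while new writes that omit c physically store 1.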
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.
> >>>>>>>>>>>
> >>>>>>>>>>> The problem I’m trying to solve is that the format I’m using DSv2 to integrate supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently-scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.
> >>>>>>>>>>>
> >>>>>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark’s schema should have defaults and Spark should handle filling in defaults the same way across all sources, like other databases.
> >>>>>>>>>>>
> >>>>>>>>>>> I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.
> >>>>>>>>>>>
> >>>>>>>>>>> A source might manage data stored in millions of immutable Parquet files, so adding a default value to those files isn’t possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn’t be combined into a task with a file that has the new column. This adds quite a bit of complexity.
> >>>>>>>>>>>
> >>>>>>>>>>> Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example) then behavior could still be inconsistent.
> >>>>>>>>>>>
> >>>>>>>>>>> I think that this complexity probably isn’t worth consistency in default values across sources, if that is even achievable.
> >>>>>>>>>>>
> >>>>>>>>>>> In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!
> >>>>>>>>>>>
> >>>>>>>>>>> rb
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Ryan Blue
> >>>>>>>>>>> Software Engineer
> >>>>>>>>>>> Netflix
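As a footnote to the read-time option discussed in Ryan's first message above, here is a small sketch (illustrative names only, not Spark internals) of the per-file patching a scan would need: only files that predate a newly added column are filled with the initial default, and every task must still emit rows under the full table schema.

    object ReadTimeFill {
      // The schema one immutable data file actually contains, plus its rows.
      case class DataFile(columns: Set[String], rows: Seq[Map[String, Any]])

      // Reconcile one file against the current table schema. The work is
      // inherently per-file: an old file needs patching, a new one does not,
      // yet both must come out under the same (full) column set.
      def readWithDefaults(file: DataFile,
                           tableColumns: Seq[String],
                           initialDefaults: Map[String, Any]): Seq[Map[String, Any]] = {
        val missing = tableColumns.filterNot(file.columns.contains)
        file.rows.map(row => row ++ missing.map(c => c -> initialDefaults.getOrElse(c, null)))
      }
    }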