So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:

> I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
>
> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];
>>
>> If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind), instead of reusing the existing syntax, to be SQL compatible.
>>
>> Personally I don't like reinventing the wheel. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end-users.
>>
>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.
>>>
>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Note that the design we make here will affect both data source developers and end-users. It's better to provide reliable behavior to end-users, instead of asking them to read the spec of the data source and know which value will be used for missing columns when they write data.
>>>>
>>>> If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.
>>>>
>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is for a unified way of dealing with missing columns? I feel like that should probably be left up to the underlying data source implementation. For example, if you have missing columns with a database, the data source can choose a value based on the database's metadata, if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.
>>>>>
>>>>> So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.
>>>>>
>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard.
>>>>>>
>>>>>> 1. It should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>> 2. When adding a new column, the default value should be effective for all existing data and newly written data.
>>>>>> 3. When altering an existing column and changing the default value, it should be effective for newly written data only.
>>>>>>
>>>>>> A possible implementation:
>>>>>> 1. A column has two default values: the initial one and the latest one.
>>>>>> 2. When adding a column with a default value, set both the initial one and the latest one to this value, but do not update existing data.
>>>>>> 3. When reading data, fill the missing column with the initial default value.
>>>>>> 4. When writing data, fill the missing column with the latest default value.
>>>>>> 5. When altering a column to change its default value, only update the latest default value.
>>>>>>
>>>>>> This works because:
>>>>>> 1. New files will be written with the latest default value, so there is nothing to worry about at read time.
>>>>>> 2. Old files will be read with the initial default value, which returns the expected result.
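
A minimal Scala sketch of the "initial vs. latest default" idea described above. This is not Spark code: the names (ColumnDefault, TableDefaults, fillOnRead, fillOnWrite) are hypothetical, and rows are modeled as plain Maps only to make the read/write asymmetry concrete.

// A column carries two defaults: the one set when it was added (used to
// back-fill old files at read time) and the most recently set one (used
// to fill missing values at write time).
case class ColumnDefault(initial: Any, latest: Any)

case class TableDefaults(defaults: Map[String, ColumnDefault]) {

  // ADD COLUMN ... DEFAULT v: both defaults start as v; existing files
  // are left untouched.
  def addColumn(name: String, value: Any): TableDefaults =
    copy(defaults = defaults + (name -> ColumnDefault(value, value)))

  // ALTER COLUMN ... SET DEFAULT v: only the latest default changes, so
  // files written before this point still read the initial default.
  def alterDefault(name: String, value: Any): TableDefaults =
    copy(defaults = defaults.map {
      case (`name`, d) => name -> d.copy(latest = value)
      case other       => other
    })

  // Read path: a row from an old file that is missing the column gets
  // the initial default.
  def fillOnRead(row: Map[String, Any]): Map[String, Any] =
    defaults.foldLeft(row) { case (r, (col, d)) =>
      if (r.contains(col)) r else r + (col -> d.initial)
    }

  // Write path: a row that does not provide the column gets the latest
  // default.
  def fillOnWrite(row: Map[String, Any]): Map[String, Any] =
    defaults.foldLeft(row) { case (r, (col, d)) =>
      if (r.contains(col)) r else r + (col -> d.latest)
    }
}

// Example: column "c" added with default 0, later changed to -1.
val t = TableDefaults(Map.empty).addColumn("c", 0).alterDefault("c", -1)
t.fillOnRead(Map("a" -> 1))   // Map(a -> 1, c -> 0): old file sees the initial default
t.fillOnWrite(Map("a" -> 1))  // Map(a -> 1, c -> -1): new write gets the latest default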
>>>>>>
>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.
>>>>>>>
>>>>>>> The problem I’m trying to solve is that the format I’m using DSv2 to integrate supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently-scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.
>>>>>>>
>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark’s schema should have defaults and Spark should handle filling in defaults the same way across all sources, like other databases.
>>>>>>>
>>>>>>> I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.
>>>>>>>
>>>>>>> A source might manage data stored in millions of immutable Parquet files, so adding a default value isn’t possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn’t be combined into a task with a file that has the new column. This adds quite a bit of complexity.
>>>>>>>
>>>>>>> Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example) then behavior could still be inconsistent.
>>>>>>>
>>>>>>> I think that this complexity probably isn’t worth consistency in default values across sources, if that is even achievable.
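
To make option 1 a bit more concrete, here is a rough Scala sketch of the kind of check being discussed: the source reports whether it can fill in a missing optional column, and append validation fails only when it cannot. None of these names are the actual DSv2 API; the trait, the canFillMissingColumn method, and the simplified Column/Schema types are hypothetical.

// Hypothetical capability a source could expose (e.g. a format that fills
// missing optional columns with null); not an actual DSv2 interface.
case class Column(name: String, nullable: Boolean)
case class Schema(columns: Seq[Column]) {
  def names: Set[String] = columns.map(_.name).toSet
}

trait AcceptsMissingOptionalColumns {
  def canFillMissingColumn(column: Column): Boolean
}

object AppendValidation {
  // Fail the append only for columns that are required, or that the
  // source says it cannot fill in itself.
  def validate(table: AnyRef, tableSchema: Schema, writeSchema: Schema): Unit = {
    val missing = tableSchema.columns.filterNot(c => writeSchema.names.contains(c.name))
    val notFillable = missing.filterNot { c =>
      c.nullable && (table match {
        case t: AcceptsMissingOptionalColumns => t.canFillMissingColumn(c)
        case _                                => false
      })
    }
    if (notFillable.nonEmpty) {
      throw new IllegalArgumentException(
        s"Cannot append: missing columns ${notFillable.map(_.name).mkString(", ")}")
    }
  }
}

Under this shape, adding an optional column to a table whose source opts in would no longer break already-scheduled append jobs, while writes missing required columns would still fail as they do today.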
>>>>>>>
>>>>>>> In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>