I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source and have a way for sources to signal that they accept default values in DDL (a capability) and assume that they do not in most cases.
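As a rough illustration of what "a capability" could look like here, a minimal sketch with made-up names (nothing below is part of the DSv2 API): a source mixes in a trait to say it accepts defaults in DDL, and Spark's DDL path rejects a DEFAULT clause otherwise.

    import org.apache.spark.sql.types.DataType

    // Hypothetical capability mix-in: a source opts in to DEFAULT clauses in DDL.
    trait SupportsDefaultValues {
      // Invoked for ALTER TABLE ... ADD COLUMN ... DEFAULT <value>.
      def addColumnWithDefault(name: String, dataType: DataType, default: Any): Unit
    }

    // How Spark's DDL handling could use the capability: assume no support
    // unless the source signals otherwise.
    def addColumn(table: AnyRef, name: String, dataType: DataType, default: Option[Any]): Unit =
      (table, default) match {
        case (t: SupportsDefaultValues, Some(value)) =>
          t.addColumnWithDefault(name, dataType, value)
        case (_, Some(_)) =>
          throw new UnsupportedOperationException("This source does not support DEFAULT values in DDL")
        case _ =>
          () // no DEFAULT clause: plain ADD COLUMN path, unchanged
      }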
On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I guess my question is: why is this a Spark-level behavior? Say the user has an underlying source with different behavior at the source level. In Spark they set a new default behavior and it's added to the catalogue; is the source expected to propagate this? Or does the user have to be aware that their own source settings may differ between a client connecting via Spark and one using a native driver?
>
> For example, say I'm using C* (sorry, but obviously I'm always thinking about C*) and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. In Spark someone sets the column as having a default value (there is no such feature in C*), so now, depending on how I connect to the source, I get two different behaviors: if I insert from the native app I get empty cells, and if I insert from Spark I get a default value inserted. That sounds more confusing to an end user than having consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think it makes sense if you consider multiple interaction points for a source.
>
> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?
>>
>> > pass the default through to the underlying data source
>>
>> This is one way to implement the behavior.
>>
>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
>>>
>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];
>>>>
>>>> If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say that the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind), instead of reusing the existing syntax, to stay SQL compatible.
>>>>
>>>> Personally, I don't like reinventing wheels. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end users.
>>>>
>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.
>>>>>
>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Note that the design we make here will affect both data source developers and end users. It's better to provide reliable behavior to end users, instead of asking them to read the spec of the data source to know which value will be used for missing columns when they write data.
>>>>>>
>>>>>> If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.
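For concreteness, the standard syntax from the exchange above written as Spark SQL statements against a made-up table (db.events and its columns are invented); whether a given Spark version or source accepts the DEFAULT clause is exactly what this thread is debating, so treat it as an illustration of the standard behavior rather than something guaranteed to run:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("add-column-defaults").getOrCreate()

    // Standard form: ALTER TABLE t ADD COLUMN c type [DEFAULT value].
    // With an explicit DEFAULT, rows missing the new column read back 0.
    spark.sql("ALTER TABLE db.events ADD COLUMN retries INT DEFAULT 0")

    // With no DEFAULT clause, the standard says the default value is null,
    // so rows written before the change read back null for the new column.
    spark.sql("ALTER TABLE db.events ADD COLUMN note STRING")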
>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is a unified way of dealing with missing columns? I feel like that probably should be left up to the underlying data source implementation. For example, if you have missing columns with a database, the data source can choose a value based on the database's metadata, if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.
>>>>>>>
>>>>>>> So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.
>>>>>>>
>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard:
>>>>>>>>
>>>>>>>> 1. It should be the user who decides the default value of a column, by CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>>>> 2. When adding a new column, the default value should be effective for all existing data and newly written data.
>>>>>>>> 3. When altering an existing column to change its default value, it should be effective for newly written data only.
>>>>>>>>
>>>>>>>> A possible implementation:
>>>>>>>>
>>>>>>>> 1. A column has 2 default values: the initial one and the latest one.
>>>>>>>> 2. When adding a column with a default value, set both the initial one and the latest one to this value, but do not update existing data.
>>>>>>>> 3. When reading data, fill the missing column with the initial default value.
>>>>>>>> 4. When writing data, fill the missing column with the latest default value.
>>>>>>>> 5. When altering a column to change its default value, only update the latest default value.
>>>>>>>>
>>>>>>>> This works because:
>>>>>>>>
>>>>>>>> 1. New files will be written with the latest default value, so there is nothing to worry about at read time.
>>>>>>>> 2. Old files will be read with the initial default value, which returns the expected result.
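A compact sketch of the two-default-value bookkeeping Wenchen describes above, using made-up names (nothing here is an existing Spark API); the point is just that reads of old data use the default that was in force when the column was added, while new writes use the current one:

    // Hypothetical column metadata carrying the two defaults.
    case class ColumnDefaults(initial: Any, latest: Any)

    // Per-table bookkeeping for columns added after some files were written.
    class DefaultTracker {
      private var defaults = Map.empty[String, ColumnDefaults]

      // ADD COLUMN c DEFAULT v: both defaults start as v; existing files are untouched.
      def addColumn(name: String, value: Any): Unit =
        defaults += name -> ColumnDefaults(initial = value, latest = value)

      // ALTER COLUMN c SET DEFAULT v: only the latest default changes.
      def alterDefault(name: String, value: Any): Unit =
        defaults.get(name).foreach(d => defaults += name -> d.copy(latest = value))

      // Reads of old files fill the missing column with the *initial* default.
      def fillOnRead(row: Map[String, Any], column: String): Map[String, Any] =
        if (row.contains(column)) row else row + (column -> defaults(column).initial)

      // Writes that omit the column fill it with the *latest* default.
      def fillOnWrite(row: Map[String, Any], column: String): Map[String, Any] =
        if (row.contains(column)) row else row + (column -> defaults(column).latest)
    }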
>>>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.
>>>>>>>>>
>>>>>>>>> The problem I'm trying to solve is that the format I'm using DSv2 to integrate with supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently-scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.
>>>>>>>>>
>>>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark's schema should have defaults and Spark should handle filling in defaults the same way across all sources, like other databases.
>>>>>>>>>
>>>>>>>>> I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.
>>>>>>>>>
>>>>>>>>> A source might manage data stored in millions of immutable Parquet files, so adding a default value to those files isn't possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn't be combined into a task with a file that has the new column. This adds quite a bit of complexity.
>>>>>>>>>
>>>>>>>>> Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example), then behavior could still be inconsistent.
>>>>>>>>>
>>>>>>>>> I think that this complexity probably isn't worth consistency in default values across sources, if that is even achievable.
>>>>>>>>>
>>>>>>>>> In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
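To make the read-time half of Ryan's problem statement concrete: per-file default filling is roughly a projection of a literal over files that predate the column. A sketch with made-up paths and column names is below; the per-task schema reconciliation Ryan describes is exactly what this glosses over.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("read-time-defaults").getOrCreate()

    // Hypothetical helper: files written before the column existed get the default
    // filled in at read time so they union cleanly with files that already have it.
    def withDefaultIfMissing(df: DataFrame, column: String, default: Any): DataFrame =
      if (df.columns.contains(column)) df else df.withColumn(column, lit(default))

    // "old" was written before ADD COLUMN retries, "new" after; both paths are invented.
    val oldFiles = withDefaultIfMissing(spark.read.parquet("/data/t/old"), "retries", 0)
    val newFiles = spark.read.parquet("/data/t/new")
    val table = oldFiles.unionByName(newFiles)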