I'd only do any of the schema evolution things as an add-on on top. This is an extremely complicated area, and we could risk never shipping anything because there would be a lot of different requirements.
On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer <russell.spit...@gmail.com> wrote:

I definitely would like to have a "column can be missing" capability, allowing the underlying data source to fill in a default if it wants to (or not).

On Fri, Dec 21, 2018 at 1:40 AM Alessandro Solimando <alessandro.solimando@gmail.com> wrote:

Hello,
I agree that Spark should check whether the underlying data source supports default values or not, and adjust its behavior accordingly.

If we follow this direction, do you see the default-values capability in scope of the "DataSourceV2 capability API"?

Best regards,
Alessandro

On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <cloud0fan@gmail.com> wrote:

Hi Ryan,

That's a good point. Since in this case Spark is just a channel to pass the user's action to the data source, we should think about what actions the data source supports.

Following this direction, it makes more sense to delegate everything to data sources.

As a first step, maybe we should not add DDL commands to change the schema of a data source, but just use the capability API to let the data source decide what to do when the input schema doesn't match the table schema during writing. Users can use the native client of the data source to change the schema.

On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <rblue@netflix.com> wrote:

I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source, have a way for sources to signal that they accept default values in DDL (a capability), and assume that they do not in most cases.

On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spitzer@gmail.com> wrote:

I guess my question is why this is a Spark-level behavior. Say the user has an underlying source with different behavior at the source level. In Spark they set a new default behavior and it's added to the catalog; is the source expected to propagate this? Or does the user have to be aware that their own source settings may differ depending on whether a client connects via Spark or via a native driver?

For example, say I'm using C* (sorry, but obviously I'm always thinking about C*), and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. In Spark, someone sets the column as having a default value (there is no such feature in C*). Now, depending on how I connect to the source, I get two different behaviors: if I insert from the native app I get empty cells, and if I insert from Spark I get a default value inserted. That sounds more confusing to an end user than having consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think it makes sense if you consider multiple interaction points for a source.
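To make the capability idea discussed above concrete, here is a minimal sketch of what a "column can be missing" check could look like. All of the trait, object, and method names are hypothetical illustrations, not Spark's actual DataSourceV2 API; the point is only that the source, not Spark, declares how missing columns are handled.

```scala
// Hypothetical sketch of a "column can be missing" capability check.
// None of these names are real Spark APIs; they only illustrate the idea.
import org.apache.spark.sql.types.StructType

trait WriteCapabilities {
  // The source declares whether it can accept input that omits optional columns.
  def acceptsMissingColumns: Boolean
}

object WriteValidation {
  // Spark-side validation: fail only if the source cannot handle missing columns.
  def validate(tableSchema: StructType, inputSchema: StructType, caps: WriteCapabilities): Unit = {
    val missing = tableSchema.fieldNames.toSet -- inputSchema.fieldNames.toSet
    if (missing.nonEmpty && !caps.acceptsMissingColumns) {
      throw new IllegalArgumentException(
        s"Cannot write: input is missing columns ${missing.mkString(", ")} " +
          "and the data source does not accept missing columns")
    }
    // Otherwise the write proceeds and the source fills in defaults (or nulls) as it sees fit.
  }
}
```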
On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rblue@netflix.com> wrote:

I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
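A rough sketch of the "pass the default through to the underlying data source" idea just above, where a source that cannot store defaults simply refuses the change. The case class and trait are made-up names assumed for illustration, not Spark's real catalog API.

```scala
// Hypothetical illustration of option 1: Spark forwards the ADD COLUMN request,
// including any DEFAULT value, and the source decides whether it can honor it.
import org.apache.spark.sql.types.DataType

// A made-up representation of "ALTER TABLE ... ADD COLUMN ... [DEFAULT value]".
case class AddColumn(name: String, dataType: DataType, default: Option[Any])

trait AlterableTable {
  // Sources that cannot store a default would throw, as suggested above.
  def addColumn(change: AddColumn): Unit
}

class NoDefaultSupportTable extends AlterableTable {
  override def addColumn(change: AddColumn): Unit = {
    if (change.default.isDefined) {
      throw new UnsupportedOperationException(
        s"This source cannot store a default value for column ${change.name}")
    }
    // ... otherwise record the new optional column in the source's own metadata ...
  }
}
```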
On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];

If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say that the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind) instead of reusing the existing syntax, to stay SQL compatible.

Personally I don't like reinventing wheels. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end users.

On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rblue@netflix.com> wrote:

Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.

On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

Note that the design we make here will affect both data source developers and end users. It's better to provide reliable behavior to end users, instead of asking them to read the spec of the data source and learn which value will be used for missing columns when they write data.

If we do want to go with the "data source decides the default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.

On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spitzer@gmail.com> wrote:

I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is for a unified way of dealing with missing columns? I feel like that should probably be left up to the underlying data source implementation. For example, if columns are missing when writing to a database, the data source can choose a value based on the database's metadata if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.

So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.

On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard.

1. It should be the user who decides the default value of a column, via CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
2. When adding a new column, the default value should be effective for all existing data and newly written data.
3. When altering an existing column to change its default value, the new default should be effective for newly written data only.

A possible implementation:
1. A column has 2 default values: the initial one and the latest one.
2. When adding a column with a default value, set both the initial one and the latest one to this value, but do not update existing data.
3. When reading data, fill the missing column with the initial default value.
4. When writing data, fill the missing column with the latest default value.
5. When altering a column to change its default value, only update the latest default value.

This works because:
1. New files will be written with the latest default value, so there is nothing to worry about at read time.
2. Old files will be read with the initial default value, which returns the expected result.
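The two-default-value bookkeeping Wenchen describes above could be modeled roughly like the sketch below. The class names and the in-memory map are assumptions made for illustration only, not a proposed implementation; real metadata would live in the table catalog.

```scala
// Sketch of the "initial default" vs "latest default" bookkeeping described above.
// Reads of old files use initialDefault; new writes use latestDefault.
case class ColumnDefaults(initialDefault: Any, latestDefault: Any)

class DefaultRegistry {
  private var defaults = Map.empty[String, ColumnDefaults]

  // ALTER TABLE ADD COLUMN c DEFAULT v: set both defaults, do not rewrite existing data.
  def addColumn(name: String, value: Any): Unit =
    defaults += name -> ColumnDefaults(value, value)

  // ALTER TABLE ALTER COLUMN c SET DEFAULT v: only the latest default changes.
  def alterDefault(name: String, value: Any): Unit =
    defaults.get(name).foreach(d => defaults += name -> d.copy(latestDefault = value))

  // Read path: fill a column missing from an old file with the initial default.
  def readFill(name: String): Option[Any] = defaults.get(name).map(_.initialDefault)

  // Write path: fill a column missing from the input with the latest default.
  def writeFill(name: String): Option[Any] = defaults.get(name).map(_.latestDefault)
}
```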
On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rblue@netflix.com.invalid> wrote:

Hi everyone,

This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.

The problem I'm trying to solve is that the format I'm using DSv2 to integrate supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.

To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark's schema should have defaults, and Spark should handle filling in defaults the same way across all sources, like other databases.

I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.

A source might manage data stored in millions of immutable Parquet files, so adding a default value to existing files isn't possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn't be combined into a task with a file that has the new column. This adds quite a bit of complexity.

Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example), then the behavior could still be inconsistent.

I think that this complexity probably isn't worth consistency in default values across sources, if that is even achievable.

In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!

rb

--
Ryan Blue
Software Engineer
Netflix
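For reference, the read-time fill Ryan describes above could look something like the sketch below: for each old file, project its rows up to the full table schema and pad missing columns with their defaults. The function and parameter names are assumptions for illustration only; it mainly shows why per-file schema handling adds complexity to the scan.

```scala
// Sketch: pad rows from an old file (written before a column was added) so every
// task produces the full table schema. Defaults here come from table metadata.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

object ReadTimeFill {
  def fillMissingColumns(row: Row, fileSchema: StructType, tableSchema: StructType,
                         defaults: Map[String, Any]): Row = {
    val byName = fileSchema.fieldNames.zip(row.toSeq).toMap
    val values = tableSchema.fields.map { field =>
      // Use the file's value when present, otherwise the column's default (null if none).
      byName.getOrElse(field.name, defaults.getOrElse(field.name, null))
    }
    Row.fromSeq(values.toSeq)
  }
}
```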