I'd only do any of the schema evolution things as an add-on on top. This is an extremely complicated area, and we could risk never shipping anything because there would be a lot of different requirements.
On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer <russell.spit...@gmail.com> wrote:

I definitely would like to have a "column can be missing" capability, allowing the underlying data source to fill in a default if it wants to (or not).

On Fri, Dec 21, 2018 at 1:40 AM Alessandro Solimando <alessandro.solimando@gmail.com> wrote:

Hello,
I agree that Spark should check whether the underlying data source supports default values or not, and adjust its behavior accordingly.

If we follow this direction, do you see the default-values capability in scope of the "DataSourceV2 capability API"?

Best regards,
Alessandro

On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <cloud0fan@gmail.com> wrote:

Hi Ryan,

That's a good point. Since in this case Spark is just a channel to pass the user's action to the data source, we should think about what actions the data source supports.

Following this direction, it makes more sense to delegate everything to data sources.

As a first step, maybe we should not add DDL commands to change the schema of a data source, but just use the capability API to let the data source decide what to do when the input schema doesn't match the table schema during writing. Users can use the native client of the data source to change the schema.

On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <rblue@netflix.com> wrote:

I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source, have a way for sources to signal that they accept default values in DDL (a capability), and assume that they do not in most cases.

On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spitzer@gmail.com> wrote:

I guess my question is why this is a Spark-level behavior. Say the user has an underlying source with different behavior at the source level. In Spark they set a new default behavior and it's added to the catalog; is the source expected to propagate this? Or does the user have to be aware that their own source settings may differ depending on whether a client connects via Spark or via a native driver?

For example, say I'm using C* (sorry, but obviously I'm always thinking about C*), and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. In Spark, someone sets the column as having a default value (there is no such feature in C*). Now, depending on how I connect to the source, I get two different behaviors: if I insert from the native app I get empty cells, and if I insert from Spark I get a default value inserted. That sounds more confusing to an end user than having consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think it makes sense if you consider multiple interaction points for a source.
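To make the capability idea discussed above concrete, here is a minimal sketch of what a "column can be missing" check could look like. All of the trait, object, and method names are hypothetical illustrations, not Spark's actual DataSourceV2 API; the point is only that the source, not Spark, declares how missing columns are handled.

```scala
// Hypothetical sketch of a "column can be missing" capability check.
// None of these names are real Spark APIs; they only illustrate the idea.
import org.apache.spark.sql.types.StructType

trait WriteCapabilities {
  // The source declares whether it can accept input that omits optional columns.
  def acceptsMissingColumns: Boolean
}

object WriteValidation {
  // Spark-side validation: fail only if the source cannot handle missing columns.
  def validate(tableSchema: StructType, inputSchema: StructType, caps: WriteCapabilities): Unit = {
    val missing = tableSchema.fieldNames.toSet -- inputSchema.fieldNames.toSet
    if (missing.nonEmpty && !caps.acceptsMissingColumns) {
      throw new IllegalArgumentException(
        s"Cannot write: input is missing columns ${missing.mkString(", ")} " +
          "and the data source does not accept missing columns")
    }
    // Otherwise the write proceeds and the source fills in defaults (or nulls) as it sees fit.
  }
}
```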
On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rblue@netflix.com> wrote:

I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
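A rough sketch of the "pass the default through to the underlying data source" idea just above, where a source that cannot store defaults simply refuses the change. The case class and trait are made-up names assumed for illustration, not Spark's real catalog API.

```scala
// Hypothetical illustration of option 1: Spark forwards the ADD COLUMN request,
// including any DEFAULT value, and the source decides whether it can honor it.
import org.apache.spark.sql.types.DataType

// A made-up representation of "ALTER TABLE ... ADD COLUMN ... [DEFAULT value]".
case class AddColumn(name: String, dataType: DataType, default: Option[Any])

trait AlterableTable {
  // Sources that cannot store a default would throw, as suggested above.
  def addColumn(change: AddColumn): Unit
}

class NoDefaultSupportTable extends AlterableTable {
  override def addColumn(change: AddColumn): Unit = {
    if (change.default.isDefined) {
      throw new UnsupportedOperationException(
        s"This source cannot store a default value for column ${change.name}")
    }
    // ... otherwise record the new optional column in the source's own metadata ...
  }
}
```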
On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];

If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say that the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind) instead of reusing the existing syntax, to stay SQL compatible.

Personally I don't like reinventing wheels. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end users.

On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rblue@netflix.com> wrote:

Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.

On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

Note that the design we make here will affect both data source developers and end users. It's better to provide reliable behavior to end users, instead of asking them to read the spec of the data source and learn which value will be used for missing columns when they write data.

If we do want to go with the "data source decides the default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.

On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spitzer@gmail.com> wrote:

I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is for a unified way of dealing with missing columns? I feel like that should probably be left up to the underlying data source implementation. For example, if columns are missing when writing to a database, the data source can choose a value based on the database's metadata if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.

So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.

On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard.

1. It should be the user who decides the default value of a column, via CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
2. When adding a new column, the default value should be effective for all existing data and newly written data.
3. When altering an existing column to change its default value, the new default should be effective for newly written data only.

A possible implementation:
1. A column has 2 default values: the initial one and the latest one.
2. When adding a column with a default value, set both the initial one and the latest one to this value, but do not update existing data.
3. When reading data, fill the missing column with the initial default value.
4. When writing data, fill the missing column with the latest default value.
5. When altering a column to change its default value, only update the latest default value.

This works because:
1. New files will be written with the latest default value, so there is nothing to worry about at read time.
2. Old files will be read with the initial default value, which returns the expected result.
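The two-default-value bookkeeping Wenchen describes above could be modeled roughly like the sketch below. The class names and the in-memory map are assumptions made for illustration only, not a proposed implementation; real metadata would live in the table catalog.

```scala
// Sketch of the "initial default" vs "latest default" bookkeeping described above.
// Reads of old files use initialDefault; new writes use latestDefault.
case class ColumnDefaults(initialDefault: Any, latestDefault: Any)

class DefaultRegistry {
  private var defaults = Map.empty[String, ColumnDefaults]

  // ALTER TABLE ADD COLUMN c DEFAULT v: set both defaults, do not rewrite existing data.
  def addColumn(name: String, value: Any): Unit =
    defaults += name -> ColumnDefaults(value, value)

  // ALTER TABLE ALTER COLUMN c SET DEFAULT v: only the latest default changes.
  def alterDefault(name: String, value: Any): Unit =
    defaults.get(name).foreach(d => defaults += name -> d.copy(latestDefault = value))

  // Read path: fill a column missing from an old file with the initial default.
  def readFill(name: String): Option[Any] = defaults.get(name).map(_.initialDefault)

  // Write path: fill a column missing from the input with the latest default.
  def writeFill(name: String): Option[Any] = defaults.get(name).map(_.latestDefault)
}
```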
On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rblue@netflix.com.invalid> wrote:

Hi everyone,

This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.

The problem I'm trying to solve is that the format I'm using DSv2 to integrate supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.

To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark's schema should have defaults, and Spark should handle filling in defaults the same way across all sources, like other databases.

I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.

A source might manage data stored in millions of immutable Parquet files, so adding a default value to existing files isn't possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn't be combined into a task with a file that has the new column. This adds quite a bit of complexity.

Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example), then the behavior could still be inconsistent.

I think that this complexity probably isn't worth consistency in default values across sources, if that is even achievable.

In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!

rb

--
Ryan Blue
Software Engineer
Netflix
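For reference, the read-time fill Ryan describes above could look something like the sketch below: for each old file, project its rows up to the full table schema and pad missing columns with their defaults. The function and parameter names are assumptions for illustration only; it mainly shows why per-file schema handling adds complexity to the scan.

```scala
// Sketch: pad rows from an old file (written before a column was added) so every
// task produces the full table schema. Defaults here come from table metadata.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

object ReadTimeFill {
  def fillMissingColumns(row: Row, fileSchema: StructType, tableSchema: StructType,
                         defaults: Map[String, Any]): Row = {
    val byName = fileSchema.fieldNames.zip(row.toSeq).toMap
    val values = tableSchema.fields.map { field =>
      // Use the file's value when present, otherwise the column's default (null if none).
      byName.getOrElse(field.name, defaults.getOrElse(field.name, null))
    }
    Row.fromSeq(values.toSeq)
  }
}
```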