I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source and have a way for sources to signal that they accept default values in DDL (a capability) and assume that they do not in most cases.
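As a rough illustration of what "a capability" could look like here, a minimal sketch with made-up names (nothing below is part of the DSv2 API): a source mixes in a trait to say it accepts defaults in DDL, and Spark's DDL path rejects a DEFAULT clause otherwise.

    import org.apache.spark.sql.types.DataType

    // Hypothetical capability mix-in: a source opts in to DEFAULT clauses in DDL.
    trait SupportsDefaultValues {
      // Invoked for ALTER TABLE ... ADD COLUMN ... DEFAULT <value>.
      def addColumnWithDefault(name: String, dataType: DataType, default: Any): Unit
    }

    // How Spark's DDL handling could use the capability: assume no support
    // unless the source signals otherwise.
    def addColumn(table: AnyRef, name: String, dataType: DataType, default: Option[Any]): Unit =
      (table, default) match {
        case (t: SupportsDefaultValues, Some(value)) =>
          t.addColumnWithDefault(name, dataType, value)
        case (_, Some(_)) =>
          throw new UnsupportedOperationException("This source does not support DEFAULT values in DDL")
        case _ =>
          () // no DEFAULT clause: plain ADD COLUMN path, unchanged
      }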
On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I guess my question is: why is this a Spark-level behavior? Say the user has an underlying source with different behavior at the source level. In Spark they set a new default behavior and it's added to the catalogue; is the source expected to propagate this? Or does the user have to be aware that their own source settings may differ between a client connecting via Spark and one using a native driver?
>
> For example, say I'm using C* (sorry, but obviously I'm always thinking about C*) and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. In Spark someone sets the column as having a default value (there is no such feature in C*), so now, depending on how I connect to the source, I get two different behaviors: if I insert from the native app I get empty cells, and if I insert from Spark I get a default value inserted. That sounds more confusing to an end user than having consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think it makes sense if you consider multiple interaction points for a source.
>
> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?
>>
>> > pass the default through to the underlying data source
>>
>> This is one way to implement the behavior.
>>
>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
>>>
>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];
>>>>
>>>> If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say that the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind), instead of reusing the existing syntax, to stay SQL compatible.
>>>>
>>>> Personally, I don't like reinventing wheels. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end users.
>>>>
>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.
>>>>>
>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Note that the design we make here will affect both data source developers and end users. It's better to provide reliable behavior to end users, instead of asking them to read the spec of the data source to know which value will be used for missing columns when they write data.
>>>>>>
>>>>>> If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.
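For concreteness, the standard syntax from the exchange above written as Spark SQL statements against a made-up table (db.events and its columns are invented); whether a given Spark version or source accepts the DEFAULT clause is exactly what this thread is debating, so treat it as an illustration of the standard behavior rather than something guaranteed to run:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("add-column-defaults").getOrCreate()

    // Standard form: ALTER TABLE t ADD COLUMN c type [DEFAULT value].
    // With an explicit DEFAULT, rows missing the new column read back 0.
    spark.sql("ALTER TABLE db.events ADD COLUMN retries INT DEFAULT 0")

    // With no DEFAULT clause, the standard says the default value is null,
    // so rows written before the change read back null for the new column.
    spark.sql("ALTER TABLE db.events ADD COLUMN note STRING")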
>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is a unified way of dealing with missing columns? I feel like that probably should be left up to the underlying data source implementation. For example, if you have missing columns with a database, the data source can choose a value based on the database's metadata, if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.
>>>>>>>
>>>>>>> So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.
>>>>>>>
>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard:
>>>>>>>>
>>>>>>>> 1. It should be the user who decides the default value of a column, by CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>>>> 2. When adding a new column, the default value should be effective for all existing data and newly written data.
>>>>>>>> 3. When altering an existing column to change its default value, it should be effective for newly written data only.
>>>>>>>>
>>>>>>>> A possible implementation:
>>>>>>>>
>>>>>>>> 1. A column has 2 default values: the initial one and the latest one.
>>>>>>>> 2. When adding a column with a default value, set both the initial one and the latest one to this value, but do not update existing data.
>>>>>>>> 3. When reading data, fill the missing column with the initial default value.
>>>>>>>> 4. When writing data, fill the missing column with the latest default value.
>>>>>>>> 5. When altering a column to change its default value, only update the latest default value.
>>>>>>>>
>>>>>>>> This works because:
>>>>>>>>
>>>>>>>> 1. New files will be written with the latest default value, so there is nothing to worry about at read time.
>>>>>>>> 2. Old files will be read with the initial default value, which returns the expected result.
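A compact sketch of the two-default-value bookkeeping Wenchen describes above, using made-up names (nothing here is an existing Spark API); the point is just that reads of old data use the default that was in force when the column was added, while new writes use the current one:

    // Hypothetical column metadata carrying the two defaults.
    case class ColumnDefaults(initial: Any, latest: Any)

    // Per-table bookkeeping for columns added after some files were written.
    class DefaultTracker {
      private var defaults = Map.empty[String, ColumnDefaults]

      // ADD COLUMN c DEFAULT v: both defaults start as v; existing files are untouched.
      def addColumn(name: String, value: Any): Unit =
        defaults += name -> ColumnDefaults(initial = value, latest = value)

      // ALTER COLUMN c SET DEFAULT v: only the latest default changes.
      def alterDefault(name: String, value: Any): Unit =
        defaults.get(name).foreach(d => defaults += name -> d.copy(latest = value))

      // Reads of old files fill the missing column with the *initial* default.
      def fillOnRead(row: Map[String, Any], column: String): Map[String, Any] =
        if (row.contains(column)) row else row + (column -> defaults(column).initial)

      // Writes that omit the column fill it with the *latest* default.
      def fillOnWrite(row: Map[String, Any], column: String): Map[String, Any] =
        if (row.contains(column)) row else row + (column -> defaults(column).latest)
    }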
>>>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.
>>>>>>>>>
>>>>>>>>> The problem I'm trying to solve is that the format I'm using DSv2 to integrate with supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently-scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.
>>>>>>>>>
>>>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark's schema should have defaults and Spark should handle filling in defaults the same way across all sources, like other databases.
>>>>>>>>>
>>>>>>>>> I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.
>>>>>>>>>
>>>>>>>>> A source might manage data stored in millions of immutable Parquet files, so adding a default value to those files isn't possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn't be combined into a task with a file that has the new column. This adds quite a bit of complexity.
>>>>>>>>>
>>>>>>>>> Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example), then behavior could still be inconsistent.
>>>>>>>>>
>>>>>>>>> I think that this complexity probably isn't worth consistency in default values across sources, if that is even achievable.
>>>>>>>>>
>>>>>>>>> In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
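To make the read-time half of Ryan's problem statement concrete: per-file default filling is roughly a projection of a literal over files that predate the column. A sketch with made-up paths and column names is below; the per-task schema reconciliation Ryan describes is exactly what this glosses over.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("read-time-defaults").getOrCreate()

    // Hypothetical helper: files written before the column existed get the default
    // filled in at read time so they union cleanly with files that already have it.
    def withDefaultIfMissing(df: DataFrame, column: String, default: Any): DataFrame =
      if (df.columns.contains(column)) df else df.withColumn(column, lit(default))

    // "old" was written before ADD COLUMN retries, "new" after; both paths are invented.
    val oldFiles = withDefaultIfMissing(spark.read.parquet("/data/t/old"), "retries", 0)
    val newFiles = spark.read.parquet("/data/t/new")
    val table = oldFiles.unionByName(newFiles)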