So you agree with my proposal that we should follow the RDBMS/SQL standard
for this behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:

> I don't think we have to change the syntax. Isn't the right thing (for
> option 1) to pass the default through to the underlying data source?
> Sources that don't support defaults would throw an exception.
>
> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN
>> column_name datatype [DEFAULT value];
>>
>> If the DEFAULT clause is not specified, then the default value is null. If
>> we are going to change the behavior and say the default value is decided by
>> the underlying data source, we should use a new SQL syntax (I don't have a
>> proposal in mind) instead of reusing the existing syntax, to stay
>> SQL-compatible.
>>
>> Personally, I don't like reinventing the wheel. It's better to just
>> implement the SQL standard ADD COLUMN command, which means the default
>> value is decided by end users.
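>>
>> To make that concrete, here is a minimal sketch of the standard behavior
>> (the table and column names are made up, `spark` is a SparkSession, and it
>> assumes an engine whose parser accepts the standard DEFAULT clause):
>>
>>   // New column with an explicit default: existing rows read back as 'n/a'.
>>   spark.sql("ALTER TABLE events ADD COLUMN source STRING DEFAULT 'n/a'")
>>
>>   // New column without a DEFAULT clause: per the standard, it defaults to NULL.
>>   spark.sql("ALTER TABLE events ADD COLUMN note STRING")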
>>
>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Wenchen, can you give more detail about the different ADD COLUMN syntax?
>>> That sounds like it would be confusing for end users.
>>>
>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Note that the design we choose here will affect both data source
>>>> developers and end users. It's better to provide reliable behavior to end
>>>> users, instead of asking them to read each data source's spec to find out
>>>> which value will be used for missing columns when they write data.
>>>>
>>>> If we do want to go with the "data source decides default value"
>>>> approach, we should create a new SQL syntax for ADD COLUMN, as its behavior
>>>> is different from the SQL standard ADD COLUMN command.
>>>>
>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want
>>>>> 2 is for a unified way of dealing with missing columns? I feel like that
>>>>> probably should be left up to the underlying data source implementation.
>>>>> For example, if you have missing columns with a database, the data source
>>>>> can choose a value based on the database's metadata, if such a thing
>>>>> exists. I don't think Spark should really need this level of detail, but
>>>>> I've also missed out on all of these meetings (sorry, it's family dinner
>>>>> time :) ) so I may be missing something.
>>>>>
>>>>> So my tl;dr is: let a data source report whether or not missing columns
>>>>> are OK, and let the data source deal with the missing data based on its
>>>>> underlying storage.
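>>>>>
>>>>> Roughly what I mean, as a hypothetical capability mix-in (made-up names,
>>>>> not an existing DSv2 interface):
>>>>>
>>>>>   import org.apache.spark.sql.types.StructField
>>>>>
>>>>>   // A source that mixes this in tells Spark it can handle writes that
>>>>>   // omit `column`, filling it however its underlying storage dictates.
>>>>>   trait AcceptsMissingColumns {
>>>>>     def canFillMissingColumn(column: StructField): Boolean
>>>>>   }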
>>>>>
>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I agree that we should not rewrite existing Parquet files when a new
>>>>>> column is added, but we should also try our best to make the behavior
>>>>>> the same as the RDBMS/SQL standard.
>>>>>>
>>>>>> 1. It should be the user who decides the default value of a column,
>>>>>> via CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>> 2. When adding a new column, the default value should be effective for
>>>>>> all existing data and newly written data.
>>>>>> 3. When altering an existing column to change its default value, it
>>>>>> should be effective for newly written data only.
>>>>>>
>>>>>> A possible implementation (sketched in code below):
>>>>>> 1. A column has two default values: the initial one and the latest one.
>>>>>> 2. When adding a column with a default value, set both the initial and
>>>>>> the latest default to this value, but do not update existing data.
>>>>>> 3. When reading data, fill the missing column with the initial default
>>>>>> value.
>>>>>> 4. When writing data, fill the missing column with the latest default
>>>>>> value.
>>>>>> 5. When altering a column to change its default value, only update the
>>>>>> latest default value.
>>>>>>
>>>>>> This works because:
>>>>>> 1. New files will be written with the latest default value, so there is
>>>>>> nothing we need to worry about at read time.
>>>>>> 2. Old files will be read with the initial default value, which returns
>>>>>> the expected result.
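>>>>>>
>>>>>> Sketched in code (illustrative names only, not Spark internals):
>>>>>>
>>>>>>   // Per-column metadata the catalog would track.
>>>>>>   case class ColumnDefaults(initial: Any, latest: Any)
>>>>>>
>>>>>>   // ADD COLUMN c DEFAULT v       => ColumnDefaults(initial = v, latest = v)
>>>>>>   // ALTER COLUMN c SET DEFAULT w => defaults.copy(latest = w)  // initial never changes
>>>>>>   // read path (old files)        => fill missing c with defaults.initial
>>>>>>   // write path (c not provided)  => fill missing c with defaults.latest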
>>>>>>
>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> This thread is a follow-up to a discussion that we started in the
>>>>>>> DSv2 community sync last week.
>>>>>>>
>>>>>>> The problem I’m trying to solve is that the format I’m using DSv2 to
>>>>>>> integrate supports schema evolution. Specifically, adding a new optional
>>>>>>> column so that rows without that column get a default value (null for
>>>>>>> Iceberg). The current validation rule for an append in DSv2 fails a write
>>>>>>> if it is missing a column, so adding a column to an existing table will
>>>>>>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>>>>>>> schema evolution shouldn't break existing jobs that produce valid data.
>>>>>>>
>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark to
>>>>>>> check whether to fail when an optional column is missing. Other
>>>>>>> contributors in the sync thought that Spark should go with option 2:
>>>>>>> Spark’s schema should have defaults and Spark should handle filling in
>>>>>>> defaults the same way across all sources, like other databases.
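>>>>>>>
>>>>>>> For option 1, the change would be roughly the following tweak to append
>>>>>>> validation (hypothetical code, not the actual rule; `allowMissingOptional`
>>>>>>> would come from a capability the source reports):
>>>>>>>
>>>>>>>   import org.apache.spark.sql.types.StructType
>>>>>>>
>>>>>>>   def missingColumns(table: StructType, query: StructType,
>>>>>>>       allowMissingOptional: Boolean): Seq[String] = {
>>>>>>>     val missing = table.fields.filterNot(f => query.fieldNames.contains(f.name))
>>>>>>>     // With the capability, only required (non-nullable) columns fail the write.
>>>>>>>     val fatal = if (allowMissingOptional) missing.filterNot(_.nullable) else missing
>>>>>>>     fatal.map(_.name).toSeq  // non-empty => reject the append
>>>>>>>   }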
>>>>>>>
>>>>>>> I think we agree that option 2 would be ideal. The problem is that
>>>>>>> it is very hard to implement.
>>>>>>>
>>>>>>> A source might manage data stored in millions of immutable Parquet
>>>>>>> files, so adding a default value to existing files isn’t possible. Spark
>>>>>>> would need to fill in defaults at read time for files written before the
>>>>>>> column was added (it could fill in defaults in new files at write time).
>>>>>>> Filling in defaults at read time would require Spark to fill in defaults
>>>>>>> for only some of the files in a scan, so Spark would need different
>>>>>>> handling for each task depending on the schema of that task. Tasks would
>>>>>>> also be required to produce a consistent schema, so a file without the
>>>>>>> new column couldn’t be combined into a task with a file that has the new
>>>>>>> column. This adds quite a bit of complexity.
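>>>>>>>
>>>>>>> To illustrate what "filling in at read time" would mean (placeholder
>>>>>>> code, not how any source actually implements its scan):
>>>>>>>
>>>>>>>   import org.apache.spark.sql.DataFrame
>>>>>>>   import org.apache.spark.sql.functions.lit
>>>>>>>
>>>>>>>   // The scan would have to do this per file: project a literal default
>>>>>>>   // for any column the file predates, so every task yields one schema.
>>>>>>>   def fillMissing(fileDf: DataFrame, name: String, default: Any): DataFrame =
>>>>>>>     if (fileDf.columns.contains(name)) fileDf
>>>>>>>     else fileDf.withColumn(name, lit(default))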
>>>>>>>
>>>>>>> Other sources may not need Spark to fill in the default at all. A JDBC
>>>>>>> source would be capable of filling in the default values itself, so Spark
>>>>>>> would need some way to communicate the default to that source. If the
>>>>>>> source had a different policy for default values (write time instead of
>>>>>>> read time, for example) then behavior could still be inconsistent.
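>>>>>>>
>>>>>>> For a source like that, "communicating the default" could be as simple
>>>>>>> as forwarding it in DDL (illustrative only; the helper and its arguments
>>>>>>> are made up, not part of any existing connector):
>>>>>>>
>>>>>>>   import java.sql.Connection
>>>>>>>
>>>>>>>   // Hypothetical helper: pass the new column and its default through as DDL.
>>>>>>>   def addColumnWithDefault(conn: Connection, table: String, name: String,
>>>>>>>       sqlType: String, defaultLiteral: String): Unit = {
>>>>>>>     val stmt = conn.createStatement()
>>>>>>>     try stmt.executeUpdate(
>>>>>>>       s"ALTER TABLE $table ADD COLUMN $name $sqlType DEFAULT $defaultLiteral")
>>>>>>>     finally stmt.close()
>>>>>>>   }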
>>>>>>>
>>>>>>> I think that this complexity probably isn’t worth consistency in
>>>>>>> default values across sources, if that is even achievable.
>>>>>>>
>>>>>>> In the sync we thought it was a good idea to send this out to the
>>>>>>> larger group to discuss. Please reply with comments!
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
