I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2) is to have a unified way of dealing with missing columns? I feel like that should probably be left up to the underlying datasource implementation. For example, if you have missing columns with a database, the datasource can choose a value based on the database's metadata if such a thing exists. I don't think Spark should really be involved at this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ) so I may be missing something.
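To make that concrete, here is roughly the kind of hook I mean. All of the names below (AcceptsMissingColumns, validateAppendSchema) are made up for illustration; this is not the actual DSv2 API, just a sketch of the idea:

  import org.apache.spark.sql.types.{StructField, StructType}

  // Hypothetical capability trait (not part of the real DSv2 API). A source
  // mixes this in to say "appends may omit this optional column; I'll fill
  // it from my own metadata (a DB-level default, null, etc.)."
  trait AcceptsMissingColumns {
    def canFillMissingColumn(field: StructField): Boolean
  }

  // Sketch of the append-side check: fail only when the column is required,
  // or when the source can't fill it on its own.
  def validateAppendSchema(
      source: AnyRef,
      tableSchema: StructType,
      writeSchema: StructType): Unit = {
    val written = writeSchema.fieldNames.toSet
    tableSchema.filterNot(f => written.contains(f.name)).foreach { field =>
      val sourceCanFill = source match {
        case s: AcceptsMissingColumns => field.nullable && s.canFillMissingColumn(field)
        case _ => false
      }
      if (!sourceCanFill) {
        throw new IllegalArgumentException(s"Cannot write: missing column '${field.name}'")
      }
    }
  }

A file-based source could return false here and force the writer to supply the column, while a JDBC-style source could return true for columns that have a database-level default.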
So my tl;dr is: let a datasource report whether or not missing columns are OK, and let the datasource deal with the missing data based on its underlying storage.

On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I agree that we should not rewrite existing parquet files when a new
> column is added, but we should also try our best to make the behavior
> the same as the RDBMS/SQL standard.
>
> 1. It should be the user who decides the default value of a column, by
> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
> 2. When adding a new column, the default value should be effective for
> all the existing data, and newly written data.
> 3. When altering an existing column to change the default value, it
> should be effective for newly written data only.
>
> A possible implementation:
> 1. A column has 2 default values: the initial one and the latest one.
> 2. When adding a column with a default value, set both the initial one
> and the latest one to this value. But do not update existing data.
> 3. When reading data, fill the missing column with the initial default
> value.
> 4. When writing data, fill the missing column with the latest default
> value.
> 5. When altering a column to change its default value, only update the
> latest default value.
>
> This works because:
> 1. New files will be written with the latest default value, so there is
> nothing we need to worry about at read time.
> 2. Old files will be read with the initial default value, which returns
> the expected result.
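For reference, the initial/latest bookkeeping described above could look roughly like this (the types and method names are hypothetical, not any existing catalog API):

  // Hypothetical per-column metadata: the two defaults a catalog would track.
  case class ColumnDefaults(initial: Option[Any], latest: Option[Any])

  case class TableDefaults(defaults: Map[String, ColumnDefaults]) {

    // ADD COLUMN c DEFAULT v: both defaults start as v; existing files untouched.
    def addColumn(name: String, default: Option[Any]): TableDefaults =
      copy(defaults = defaults + (name -> ColumnDefaults(default, default)))

    // ALTER COLUMN c SET DEFAULT v: only the latest default changes.
    def alterDefault(name: String, default: Option[Any]): TableDefaults =
      defaults.get(name) match {
        case Some(d) => copy(defaults = defaults + (name -> d.copy(latest = default)))
        case None    => this
      }

    // Read path: a column missing from an old file gets the *initial* default.
    def fillOnRead(name: String): Option[Any] = defaults.get(name).flatMap(_.initial)

    // Write path: a column missing from incoming data gets the *latest* default.
    def fillOnWrite(name: String): Option[Any] = defaults.get(name).flatMap(_.latest)
  }

With this, addColumn("c", Some(0)) followed by alterDefault("c", Some(1)) leaves fillOnRead("c") == Some(0) and fillOnWrite("c") == Some(1), which matches the two points under "This works because".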
> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi everyone,
>>
>> This thread is a follow-up to a discussion that we started in the DSv2
>> community sync last week.
>>
>> The problem I’m trying to solve is that the format I’m using DSv2 to
>> integrate supports schema evolution. Specifically, adding a new optional
>> column so that rows without that column get a default value (null for
>> Iceberg). The current validation rule for an append in DSv2 fails a write
>> if it is missing a column, so adding a column to an existing table will
>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>> schema evolution shouldn't break existing jobs that produce valid data.
>>
>> To fix this problem, I suggested option 1: adding a way for Spark to
>> check whether to fail when an optional column is missing. Other
>> contributors in the sync thought that Spark should go with option 2:
>> Spark’s schema should have defaults, and Spark should handle filling in
>> defaults the same way across all sources, like other databases.
>>
>> I think we agree that option 2 would be ideal. The problem is that it is
>> very hard to implement.
>>
>> A source might manage data stored in millions of immutable Parquet
>> files, so adding a default value to existing files isn’t possible. Spark
>> would need to fill in defaults at read time for files written before the
>> column was added (it could fill in defaults in new files at write time).
>> Filling in defaults at read time would require Spark to fill in defaults
>> for only some of the files in a scan, so Spark would need different
>> handling for each task depending on the schema of that task. Tasks would
>> also be required to produce a consistent schema, so a file without the new
>> column couldn’t be combined into a task with a file that has the new
>> column. This adds quite a bit of complexity.
>>
>> Other sources may not need Spark to fill in the default at all. A JDBC
>> source would be capable of filling in the default values itself, so Spark
>> would need some way to communicate the default to that source. If the
>> source had a different policy for default values (write time instead of
>> read time, for example), then behavior could still be inconsistent.
>>
>> I think that this complexity probably isn’t worth consistency in default
>> values across sources, if that is even achievable.
>>
>> In the sync we thought it was a good idea to send this out to the larger
>> group to discuss. Please reply with comments!
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
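For what it's worth, the read-time fill described above boils down to a per-file projection along these lines (a simplified, Row-level sketch with made-up names; a real source would do this at the InternalRow or columnar level inside each scan task):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.StructType

  // Hypothetical read-time fill for one file in a scan: any table column the
  // file doesn't have gets that column's initial default. Each file may need
  // a different fill, which is the per-task handling mentioned above.
  def fillMissingOnRead(
      tableSchema: StructType,
      fileSchema: StructType,
      initialDefaults: Map[String, Any],  // hypothetical per-column defaults
      fileRow: Row): Row = {
    val values = tableSchema.fields.map { field =>
      if (fileSchema.fieldNames.contains(field.name)) {
        fileRow.get(fileSchema.fieldIndex(field.name))
      } else {
        initialDefaults.getOrElse(field.name, null)
      }
    }
    Row.fromSeq(values.toSeq)
  }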