Re: [DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

2024-08-01 Thread Ryan Blue
I think especially with support for default values, it's important for the writer to always produce the column. Otherwise things that would arguably be safe are confusing and potentially return incorrect values. For instance, moving a file from a spec with an identity partition to one that drops th

Re: [DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

2024-07-31 Thread Micah Kornfield
I might have missed it, but in skimming I couldn't find a section in the spec about writing all columns to the data file. I posted https://github.com/apache/iceberg/pull/10835, which says implementations should write the column for redundancy but leaves the option open for others. Thanks, Micah O

Re: [DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

2024-07-25 Thread Ryan Blue
I support clarifying how to handle this when reading. It's definitely a best practice to project the values from metadata because the columns may not exist in data files when the files were converted from Hive. For writes, I _thought_ that the spec requires writing the values into data files so th
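The read-side practice described in this message (taking identity partition values from metadata, because the column may be absent from a data file) can be illustrated with a minimal sketch. It is plain Python and does not use any Iceberg library API; the function name `project_identity_columns`, its parameters, and the sample data are all hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, List


def project_identity_columns(
    file_rows: Iterable[Dict[str, Any]],
    partition_values: Dict[str, Any],
    projected_columns: List[str],
) -> Iterator[Dict[str, Any]]:
    """Yield rows with identity-partition columns taken from metadata.

    file_rows: rows as decoded from one data file (hypothetical shape; the
        file may or may not store the partition source columns).
    partition_values: identity partition values recorded for that file in
        table metadata, keyed by source column name (an assumption here).
    projected_columns: column names the query asked for, in output order.
    """
    for row in file_rows:
        out: Dict[str, Any] = {}
        for name in projected_columns:
            if name in partition_values:
                # The value is constant for the whole file, so the metadata
                # value is used whether or not the file stores the column.
                out[name] = partition_values[name]
            else:
                # Regular column: take it from the file, None if absent.
                out[name] = row.get(name)
        yield out


# Hypothetical file converted from Hive: the "region" column is not stored.
rows_in_file = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
result = list(
    project_identity_columns(
        rows_in_file, {"region": "eu"}, ["id", "region", "amount"]
    )
)
```

In this sketch the reader never depends on the writer having stored `region` in the file, which is the property the read-side clarification is after.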

Re: [DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

2024-07-25 Thread Russell Spitzer
I have no problem with explicitly stating that writing identity source columns is optional on write. We should, of course, mandate surfacing the column on read :)

On Thu, Jul 25, 2024 at 1:30 PM Micah Kornfield wrote:
> The Table specification doesn't mention anything about requirements for
> wh

[DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

2024-07-25 Thread Micah Kornfield
The Table specification doesn't mention anything about requirements for whether writing identity partitioned columns is necessary. Empirically, it appears that implementations always write the column data, at least for Parquet. For columnar formats, this is relatively cheap as it is trivially RLE