I might have missed it, but in skimming I couldn't find a section in the spec about writing all columns to the data file.
I posted https://github.com/apache/iceberg/pull/10835, which says implementations should write the column for redundancy but leaves the option open for others.

Thanks,
Micah

On Thu, Jul 25, 2024 at 3:46 PM Ryan Blue <b...@databricks.com.invalid> wrote:

> I support clarifying how to handle this when reading. It's definitely a
> best practice to project the values from metadata because the columns may
> not exist in data files when the files were converted from Hive.
>
> For writes, I _thought_ that the spec required writing the values into
> data files so that each file is complete and does not depend on the
> Iceberg metadata for correctness. That way we can recover and write data
> correctly if we ever find data that was incorrectly written due to a bug.
> I definitely prefer writing all columns, including those used in identity
> partition fields. I could maybe be convinced otherwise, but the savings
> from skipping those values in Avro don't seem worth it to me.
>
> Whatever we decide on writes, I think we should update the spec to
> clarify this.
>
> Ryan
>
> On Thu, Jul 25, 2024 at 1:18 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I have no problem with explicitly stating that writing identity source
>> columns is optional on write. We should, of course, mandate surfacing
>> the column on read :)
>>
>> On Thu, Jul 25, 2024 at 1:30 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> The table specification doesn't say whether writing identity-partitioned
>>> source columns is required. Empirically, implementations appear to
>>> always write the column data, at least for Parquet. For columnar
>>> formats this is relatively cheap, since a column that is constant per
>>> file is trivially RLE-encodable. For Avro, though, it comes at a
>>> somewhat higher cost. Since the data is fully reproducible from Iceberg
>>> metadata, I think the specification should state that writing these
>>> columns is optional.
>>>
>>> For reading identity-partitioned columns from Iceberg tables, I think
>>> the specification needs to require that identity partition column
>>> values are read from metadata. This is because Iceberg supports
>>> migrating Hive data (and data in other table formats) without rewriting
>>> files, and those formats typically do not write partition information
>>> directly into the data files.
>>>
>>> Thoughts?
>>>
>>> Once we reach consensus, I'll open a PR to clarify these points.
>>>
>>> Thanks,
>>> Micah
>>>
>>
>
> --
> Ryan Blue
> Databricks
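
For anyone following along, here is a minimal sketch of the read-side
projection being discussed. It is illustrative only: the names
(DataFileInfo, readWithIdentityColumns) are hypothetical and not the
Iceberg API. The idea is that the reader treats the manifest's partition
tuple as the source of truth and fills identity-partitioned columns in as
per-file constants, so a file migrated from Hive that never physically
wrote the column still surfaces it.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdentityPartitionProjection {

  // Per-file metadata as recorded in the manifest:
  // identity partition field name -> partition value.
  record DataFileInfo(String path, Map<String, Object> identityPartitionValues) {}

  // Returns the file's rows with identity-partitioned columns filled in as
  // constants from the manifest's partition tuple, covering files (e.g.
  // migrated from Hive) that never wrote those columns.
  static List<Map<String, Object>> readWithIdentityColumns(
      DataFileInfo file, List<Map<String, Object>> rowsFromFile) {
    List<Map<String, Object>> out = new ArrayList<>(rowsFromFile.size());
    for (Map<String, Object> row : rowsFromFile) {
      Map<String, Object> projected = new HashMap<>(row);
      // Metadata is authoritative for identity transforms: apply it whether
      // or not the file also wrote the column.
      projected.putAll(file.identityPartitionValues());
      out.add(projected);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> partition = Map.of("dt", "2024-07-25");
    DataFileInfo file =
        new DataFileInfo("s3://bucket/t/dt=2024-07-25/f1.parquet", partition);
    // A Hive-migrated file that only stored the non-partition column "id".
    List<Map<String, Object>> rows = List.of(Map.of("id", 1L), Map.of("id", 2L));
    // Prints both rows with dt=2024-07-25 filled in (map ordering may vary).
    System.out.println(readWithIdentityColumns(file, rows));
  }
}

Note that the metadata value is applied even when the file does contain the
column; for an identity transform the two must agree, and preferring
metadata keeps reads correct for files written before any spec
clarification.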