I have no problem with explicitly stating in the spec that writing identity source columns to data files is optional. We should, of course, mandate surfacing the column on read :)
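
To make the read-side requirement concrete, here is a minimal sketch in Python. It is not taken from any Iceberg implementation: the helper name `project_identity_columns` and the use of plain dicts for rows and for the per-file partition values are assumptions made purely for illustration. The idea is that a reader fills identity partition columns from the partition values recorded in metadata, so a data file that omits the column still reads correctly.

```python
from typing import Any, Dict, Iterable, Iterator, List


def project_identity_columns(
    rows: Iterable[Dict[str, Any]],
    partition_values: Dict[str, Any],
    requested_columns: List[str],
) -> Iterator[Dict[str, Any]]:
    """Yield rows with identity partition columns filled in from metadata.

    Hypothetical helper: `partition_values` stands in for the partition
    values recorded for one data file, keyed by source column name.
    """
    for row in rows:
        out: Dict[str, Any] = {}
        for name in requested_columns:
            if name in partition_values:
                # Take the value from metadata, whether or not the
                # data file happens to contain the column.
                out[name] = partition_values[name]
            else:
                # Non-partition columns come from the file as usual.
                out[name] = row.get(name)
        yield out


# A file migrated from Hive: the partition column `dt` is absent from the data.
file_rows = [{"id": 1}, {"id": 2}]
result = list(
    project_identity_columns(file_rows, {"dt": "2024-07-25"}, ["id", "dt"])
)
```

With this shape, a writer that skips the identity column and a writer that includes it produce the same result on read, which is what makes the write side safe to declare optional.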

On Thu, Jul 25, 2024 at 1:30 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> The Table specification doesn't mention anything about requirements for
> whether writing identity partitioned columns is necessary. Empirically, it
> appears that implementations always write the column data at least for
> parquet. For columnar formats, this is relatively cheap as it is trivially
> RLE encodable. For Avro though it comes at a little bit of a higher cost.
> Since the data is fully reproducible from Iceberg metadata, I think stating
> this as optional in the specification would be useful.
>
> For reading identity partitioned from Iceberg tables, I think the
> specification needs to require that identity partition column values are
> read from metadata. This is due to the fact that Iceberg supports
> migrating Hive data (and other table formats) without data rewrites that
> don't typically write their partition information directly into files.
>
> Thoughts?
>
> When we get consensus I'll open up a PR to clarify these points.
>
> Thanks,
> Micah