I might have missed it, but in skimming I couldn't find a section in the
spec about writing all columns to the data file.

I posted https://github.com/apache/iceberg/pull/10835, which says
implementations should write the column for redundancy but leaves the
option open for implementations that don't.

Thanks,
Micah

On Thu, Jul 25, 2024 at 3:46 PM Ryan Blue <b...@databricks.com.invalid>
wrote:

> I support clarifying how to handle this when reading. It's definitely a
> best practice to project the values from metadata because the columns may
> not exist in data files that were converted from Hive.
>
> For writes, I _thought_ that the spec requires writing the values into
> data files so that each file is complete and does not depend on the Iceberg
> metadata for correctness. That way we can recover and write data correctly
> if we ever find data that was incorrectly written due to a bug. I
> definitely prefer writing all columns, including those used in identity
> partition fields. I could maybe be convinced otherwise, but skipping those
> values in Avro doesn't seem like it is worth it to me.
>
> Whatever we decide on writes, I think we should update the spec to clarify
> this.
>
> Ryan
>
> On Thu, Jul 25, 2024 at 1:18 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I have no problem with explicitly stating that writing identity source
>> columns is optional on write. We should, of course, mandate surfacing the
>> column on read :)
>>
>> On Thu, Jul 25, 2024 at 1:30 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> The Table specification doesn't say anything about whether writing
>>> identity partition source columns is required. Empirically, it appears
>>> that implementations always write the column data, at least for
>>> Parquet. For columnar formats this is relatively cheap, since a column
>>> that is constant within a file is trivially RLE encodable. For Avro,
>>> though, it comes at a somewhat higher cost. Since the data is fully
>>> reproducible from Iceberg metadata, I think stating in the
>>> specification that writing these columns is optional would be useful.
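>>>
>>> For illustration, here is a rough sketch (using pyarrow purely as an
>>> example; the file names and values are made up) of why the redundant
>>> column is cheap in Parquet:
>>>
>>>     import os
>>>     import pyarrow as pa
>>>     import pyarrow.parquet as pq
>>>
>>>     n = 1_000_000
>>>     ids = pa.array(range(n), type=pa.int64())
>>>     # Identity partition source column: constant within this file.
>>>     event_date = pa.array(["2024-07-25"] * n)
>>>
>>>     pq.write_table(pa.table({"id": ids}), "/tmp/without_col.parquet")
>>>     pq.write_table(pa.table({"id": ids, "event_date": event_date}),
>>>                    "/tmp/with_col.parquet")
>>>
>>>     # The constant column dictionary/RLE encodes to almost nothing,
>>>     # so the two file sizes are nearly identical.
>>>     print(os.path.getsize("/tmp/without_col.parquet"))
>>>     print(os.path.getsize("/tmp/with_col.parquet"))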
>>>
>>> For reading identity partition columns from Iceberg tables, I think
>>> the specification needs to require that identity partition column
>>> values are read from metadata. This is because Iceberg supports
>>> migrating Hive data (and other table formats) without rewriting data
>>> files, and those formats typically don't write partition values into
>>> the files themselves.
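>>>
>>> To make the read side concrete, a hypothetical sketch of the
>>> projection a reader would do (the helper and names are made up for
>>> illustration): the identity partition value comes from the manifest
>>> entry's partition tuple, not from the file.
>>>
>>>     import pyarrow as pa
>>>
>>>     def project_identity_partition(batch: pa.Table,
>>>                                    partition_values: dict) -> pa.Table:
>>>         # partition_values maps identity source column names to the
>>>         # constant value recorded in the manifest entry for this file.
>>>         for name, value in partition_values.items():
>>>             if name not in batch.column_names:
>>>                 # Column is absent from the file (e.g. migrated Hive
>>>                 # data): synthesize it as a constant from metadata.
>>>                 const = pa.array([value] * batch.num_rows)
>>>                 batch = batch.append_column(name, const)
>>>         return batch
>>>
>>>     # e.g. for a file migrated from dt=2024-07-25/ in a Hive table:
>>>     # batch = project_identity_partition(batch, {"dt": "2024-07-25"})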
>>>
>>> Thoughts?
>>>
>>> When we get consensus I'll open up a PR to clarify these points.
>>>
>>> Thanks,
>>> Micah
>>>
>>
>
> --
> Ryan Blue
> Databricks
>
