Re: [DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

Ryan Blue Thu, 25 Jul 2024 15:48:35 -0700

I support clarifying how to handle this when reading. It's definitely a
best practice to project the values from metadata because the columns may
not exist in data files when the files were converted from Hive.
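
As a rough illustration of the read-side behavior described above, here is a minimal sketch in Python. It is not Iceberg's reader code: the function name, the dict-based rows, and the `partition_values` argument are hypothetical stand-ins for a decoded data file and its partition metadata.

```python
def project_identity_columns(rows, partition_values, column_order):
    """Return rows with identity partition columns filled in from metadata.

    rows: iterable of dicts decoded from one data file (may lack partition columns)
    partition_values: dict of column name -> value taken from partition metadata
    column_order: column names of the projected schema
    """
    projected = []
    for row in rows:
        merged = dict(row)
        # Assumption for this sketch: the metadata value always wins,
        # even when the data file happens to contain the column.
        merged.update(partition_values)
        projected.append({name: merged.get(name) for name in column_order})
    return projected


# A file migrated from Hive typically has no "dt" column in the data itself.
hive_rows = [{"id": 1}, {"id": 2}]
result = project_identity_columns(hive_rows, {"dt": "2024-07-25"}, ["id", "dt"])
```

With this approach the reader produces the same rows whether or not the writer stored the partition column in the file.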

For writes, I _thought_ that the spec requires writing the values into data
files so that each file is complete and does not depend on the Iceberg
metadata for correctness. That way we can recover and write data correctly
if we ever find data that was incorrectly written due to a bug. I
definitely prefer writing all columns, including those used in identity
partition fields. I could maybe be convinced, but writing those values in
Avro doesn't seem like it is worth it to me.
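
To make the recovery argument concrete, the following is a hypothetical sketch (not an existing Iceberg utility) of the kind of check that is only possible when data files carry their own copy of the identity partition column. The function name and the dict-based row representation are assumptions made for illustration.

```python
def find_partition_mismatches(rows, partition_values):
    """List rows whose stored values disagree with the partition metadata.

    Returns tuples of (row_index, column, value_in_file, value_in_metadata).
    Columns absent from a row are skipped, since there is nothing to compare.
    """
    mismatches = []
    for index, row in enumerate(rows):
        for column, expected in partition_values.items():
            if column in row and row[column] != expected:
                mismatches.append((index, column, row[column], expected))
    return mismatches


# The second row was written into the wrong partition by a (hypothetical) bug.
file_rows = [
    {"id": 1, "dt": "2024-07-25"},
    {"id": 2, "dt": "2024-07-24"},
]
mismatches = find_partition_mismatches(file_rows, {"dt": "2024-07-25"})
```

If the column were omitted on write, the check would have nothing to compare and the mismatch would go undetected.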

Whatever we decide on writes, I think we should update the spec to clarify
this.

Ryan

On Thu, Jul 25, 2024 at 1:18 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I have no problem with explicitly stating that writing identity source
> columns is optional on write. We should, of course, mandate surfacing the
> column on read :)
>
> On Thu, Jul 25, 2024 at 1:30 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> The table specification doesn't say whether writing identity partition
>> source columns into data files is required.  Empirically, implementations
>> appear to always write the column data, at least for Parquet.  For columnar
>> formats this is relatively cheap, since a constant column is trivially
>> RLE-encodable.  For Avro, though, it comes at a somewhat higher cost.
>> Since the data is fully reproducible from Iceberg metadata, I think stating
>> this as optional in the specification would be useful.
>>
>> For reading identity partitioned columns from Iceberg tables, I think the
>> specification needs to require that identity partition column values are
>> read from metadata.  This is because Iceberg supports migrating Hive data
>> (and data from other table formats) without rewriting data files, and those
>> formats typically don't write partition values into the files themselves.
>>
>> Thoughts?
>>
>> When we get consensus I'll open up a PR to clarify these points.
>>
>> Thanks,
>> Micah
>>
>

-- 
Ryan Blue
Databricks
