Re: pre-proposal: schema_id on DataFile

Devin Smith Fri, 14 Feb 2025 11:42:01 -0800

Thanks for the info, it is very helpful. I can see it when debugging down
through `org.apache.iceberg.ManifestReader#readMetadata`. It wasn't obvious
to me that this sort of data would live in the Avro file metadata rather than
on the `org.apache.iceberg.ManifestFile` object. I may have some questions
later about the writing side of this...


BTW, it looks like either the spec or the Java implementation is incorrect:
I see `schema` being written to the manifest header metadata, but not
`schema-id`.

https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L346-L355

https://github.com/apache/iceberg/blob/apache-iceberg-1.8.0/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L312-L321
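
For anyone who wants to check which keys a given manifest actually carries
without stepping through a debugger, here is a minimal sketch in stdlib-only
Python. It parses just the header of an Avro object container file (magic
bytes, then the file metadata map) and returns the key/value pairs. The
function names are my own and are not part of any Iceberg or Avro API, and it
only looks at the header, not the manifest entries.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of this long
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def _read_bytes(stream):
    """Read a length-prefixed Avro bytes/string value."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_header_metadata(stream):
    """Return the file metadata map from an Avro container header.

    Keys are decoded as UTF-8 strings; values are left as raw bytes.
    """
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    return metadata


def manifest_header_metadata(path):
    """Convenience wrapper: open a manifest file and read its header."""
    with open(path, "rb") as f:
        return read_avro_header_metadata(f)
```

Calling `manifest_header_metadata(...)` on a local copy of a manifest and
checking `"schema" in meta` and `"schema-id" in meta` should show whether a
particular writer emitted both keys.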



On Fri, Feb 14, 2025 at 10:26 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hi Devin,
>
> The schema-id is stored in the Manifest Avro header:
> https://iceberg.apache.org/spec/#manifests Also the schema itself is
> stored there. Would that help your situation? I think this makes adding it
> to the data file redundant.
>
> Kind regards,
> Fokko
>
> On Fri, Feb 14, 2025 at 17:56, Devin Smith
> <devinsm...@deephaven.io.invalid> wrote:
>
>> I want to make sure I'm not missing something that already exists;
>> otherwise, hoping to get a quick thumbs up / thumbs down on a potential
>> proposal before spending more time on it.
>>
>> It would be nice to know what Iceberg schema a writer used (/assumed)
>> when writing a DataFile. Oftentimes, this information is written into the
>> parquet file's metadata, but it would be great if Iceberg provided this
>> directly. A schema_id on DataFile would be nice, I think.
>>
>> Thanks,
>> -Devin
>>
>
