Re: pre-proposal: schema_id on DataFile

2025-02-18 Thread Devin Smith
I'm coming at this from a mental model where each producer writing to a given Table is tightly coupled to a specific Schema. That is, even as the Table's Schema evolves, the producer's logic is unchanged: it keeps producing Parquet files with the same Parquet metadata and columns. (This model may p

Re: pre-proposal: schema_id on DataFile

2025-02-14 Thread rdb...@gmail.com
We've considered this in the past and I'm undecided on it. There is some benefit, such as being able to prune files during planning when a file doesn't contain a column that is used in a non-null filter (e.g. `new_data_column IN ("a", "b")`). On the other hand, we don't want data files that were writt
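
A minimal sketch of the pruning idea mentioned above, assuming each data file carried a `schema_id` (which is exactly what this pre-proposal would add; it is not an existing field). The names `DataFileInfo`, `prune_by_schema`, and `schema_field_ids` are hypothetical and are not Iceberg APIs. The sketch also assumes a column missing from a file reads as all nulls, so it ignores column default values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileInfo:
    path: str
    schema_id: int  # hypothetical: the field this pre-proposal would add


def prune_by_schema(files, schema_field_ids, required_field_id):
    """Keep only files whose write schema contains the filtered column.

    schema_field_ids maps schema id -> set of field ids in that schema.
    A filter such as `col IN ("a", "b")` cannot match null, so a file
    written without the column cannot contain matching rows.
    """
    kept = []
    for f in files:
        field_ids = schema_field_ids.get(f.schema_id)
        # Unknown schema id: stay conservative and keep the file.
        if field_ids is None or required_field_id in field_ids:
            kept.append(f)
    return kept
```

The conservative branch matters: a planner may only skip a file when it is certain the column was absent at write time.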

Re: pre-proposal: schema_id on DataFile

2025-02-14 Thread Devin Smith
Thanks for the info, it is very helpful. I can see it when debugging down through `org.apache.iceberg.ManifestReader#readMetadata`. It wasn't obvious to me that this sort of data would be in the Avro metadata as opposed to the `org.apache.iceberg.ManifestFile` object. I may have some questions later about the

Re: pre-proposal: schema_id on DataFile

2025-02-14 Thread Fokko Driesprong
Hi Devin, The schema-id is stored in the Manifest Avro header: https://iceberg.apache.org/spec/#manifests The schema itself is also stored there. Would that help your situation? I think this makes adding it to the data file redundant. Kind regards, Fokko On Fri 14 Feb 2025 at 17:56, Devin
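
For illustration, the header lookup described above can be sketched without any Iceberg or Avro library by decoding the key/value metadata map at the start of an Avro object container file. The function names here are hypothetical; the `schema-id` and `schema` key names are taken from the manifests section of the spec linked above, and `schema-id` may be absent from older manifests.

```python
AVRO_MAGIC = b"Obj\x01"


def _read_long(stream):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def _read_bytes(stream):
    """Read a length-prefixed Avro bytes/string value."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_header_metadata(stream):
    """Return the file-level metadata map of an Avro container file."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    return metadata
```

Given a manifest opened in binary mode, `read_avro_header_metadata(f).get("schema-id")` would return the schema id as bytes, or `None` when the key is not present.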

pre-proposal: schema_id on DataFile

2025-02-14 Thread Devin Smith
I want to make sure I'm not missing something that already exists; otherwise, I'm hoping to get a quick thumbs up / thumbs down on a potential proposal before spending more time on it. It would be nice to know what Iceberg schema a writer used (or assumed) when writing a DataFile. Oftentimes, this infor