Hello Iceberg devs,

I'm Weston. I've been working on the Arrow project lately, and I'm reviewing how we handle the parquet field_id when reading from parquet (and also adding support for specifying a field_id at write time) [1][2].
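To make the write-time piece concrete, here is a minimal sketch of the direction I'm going in [2], assuming the same "PARQUET:field_id" field-metadata key that the reader already uses (the details may still change in review):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Attach field ids through the "PARQUET:field_id" field-metadata key,
    # the same key the parquet reader uses when surfacing field ids in Arrow.
    schema = pa.schema([
        pa.field("id", pa.int64(), metadata={"PARQUET:field_id": "1"}),
        pa.field("name", pa.string(), metadata={"PARQUET:field_id": "2"}),
    ])
    table = pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema)
    pq.write_table(table, "example.parquet")

    # If the writer honors the field metadata (what [2] adds), the ids
    # round-trip and show up again on the Arrow schema after reading.
    print(pq.read_table("example.parquet").schema.field("id").metadata)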
This has brought up two questions.

1. The original PR adding field_id support [3][4] not only allowed the field_id to pass through from parquet to arrow but also generated ids, in a depth-first fashion, for fields that did not have a field_id (a toy sketch of what I mean by depth first is at the end of this email). In retrospect, this auto-generation of field_id was probably not a good idea. Would it have any impact on Iceberg if we removed it? Just to be clear, we will still support reading (and now writing) the parquet field_id; I am only talking about removing the auto-generation of missing values.

2. For the second question I'm looking for the Iceberg community's opinion as users of Arrow. Arrow is enabling more support for computation on data (e.g. relational operators), and I've been wondering how those transformations should affect metadata like the field_id. Some examples:

* Filtering a table by column (it seems the field_id/metadata should remain unchanged)
* Filtering a table by rows (it seems the field_id/metadata should remain unchanged)
* Filling in null values with a placeholder value (the data has changed, so ???)
* Casting a field to a different data type (the meaning of the data has changed, so ???)
* Combining two fields into a third field (it seems the field_id/metadata should be erased in the third field, though presumably it could also be the merged metadata from the two origin fields)

Thanks for your time,

-Weston Pace

[1] https://issues.apache.org/jira/browse/PARQUET-1798
[2] https://github.com/apache/arrow/pull/10289
[3] https://issues.apache.org/jira/browse/ARROW-7080
[4] https://github.com/apache/arrow/pull/6408
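P.S. Here is the toy sketch mentioned in question 1. It is not the actual Arrow implementation, just an illustration of what assigning ids in a depth-first fashion means for a nested schema; the exact starting index isn't the point, only the visitation order:

    import pyarrow as pa

    # Toy illustration: walk the schema depth first and hand out
    # sequential ids to every field (parents before their children).
    def assign_ids_depth_first(fields, next_id=0):
        for field in fields:
            print(f"{field.name} -> field_id {next_id}")
            next_id += 1
            if pa.types.is_struct(field.type):
                children = [field.type.field(i)
                            for i in range(field.type.num_fields)]
                next_id = assign_ids_depth_first(children, next_id)
        return next_id

    schema = pa.schema([
        pa.field("a", pa.int32()),
        pa.field("b", pa.struct([pa.field("b1", pa.string()),
                                 pa.field("b2", pa.float64())])),
        pa.field("c", pa.bool_()),
    ])
    assign_ids_depth_first(schema)
    # a -> 0, b -> 1, b1 -> 2, b2 -> 3, c -> 4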