Hi community, I want to solicit people's thoughts on the different toolchain behaviors regarding whether hive partition keys should appear as columns in the underlying parquet file.
Say I have data laid out as:

  /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
  /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet

IIRC the default Arrow behavior is that the columns dt and lang will be excluded from the underlying parquet files. That is indeed what Google BigQuery's external table loading expects <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>; if the keys exist in the parquet file, BigQuery will error out.

Iceberg, on the other hand, requires <https://iceberg.apache.org/spec/#partitioning> the partition keys to be present in the parquet file. If the keys do not exist, it will reject the commit...

Can folks shed some light here? I really do not want to duplicate my data because of this discrepancy...

Cheers,
Haocheng
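P.S. In case a concrete repro helps, here is a minimal sketch of the write path I am describing, assuming pyarrow's dataset API; the table contents, column names, and output directory are just placeholders:

  import glob

  import pyarrow as pa
  import pyarrow.dataset as ds
  import pyarrow.parquet as pq

  table = pa.table({
      "dt": ["2019-10-31", "2018-10-31"],
      "lang": ["en", "fr"],
      "value": [0, 1],
  })

  # Write with hive-style partitioning on dt and lang; the partition values
  # are encoded in the directory names (dt=.../lang=...).
  ds.write_dataset(
      table,
      base_dir="myTable",
      format="parquet",
      partitioning=["dt", "lang"],
      partitioning_flavor="hive",
  )

  # Inspect one of the written files: by default dt and lang are not in the
  # file schema, only the remaining columns (here just "value") are.
  path = glob.glob("myTable/**/*.parquet", recursive=True)[0]
  print(pq.read_schema(path))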