Unfortunately, I don't think there is a strong consensus here: different people want different things, and there are legacy systems to consider. As another example, whether to include partition columns in data files is a configuration option in Hudi. If I were creating new data from scratch, I'd recommend using Iceberg as a table format, which somewhat resolves this (more details below).
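To make the trade-off concrete: when the partition columns are left out of the data files, a reader has to recover them from the directory names. Below is a minimal sketch of that recovery for the layout quoted further down. The function name `hive_partition_values` is made up for illustration and is not part of Arrow, BigQuery, or Iceberg. It also assumes plain `key=value` directory names and does not undo any URL-escaping that a real Hive writer may apply to values.

```python
from pathlib import PurePosixPath


def hive_partition_values(path: str) -> dict[str, str]:
    """Return the key=value pairs encoded in a Hive-style path (illustrative helper)."""
    values = {}
    # Skip the last component: it is the data file name, not a partition directory.
    for part in PurePosixPath(path).parts[:-1]:
        key, sep, value = part.partition("=")
        # Only directories of the form key=value count as partition levels.
        if sep and key:
            values[key] = value
    return values


# Values come back as strings; the reader still has to decide on types (e.g. date for dt).
print(hive_partition_values("/<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet"))
```

If the same columns are also stored inside the file, a reader has two sources for them and must pick one or reject the file, which is where the tools discussed here diverge.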

> It's indeed what Google BigQuery external table loading is expecting
> <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
> If the keys exist in the parquet file, BigQuery will error out.

BigQuery is likely overly strict here and might relax this in the future.

> When it comes to Iceberg, it requires
> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> present in the parquet file. If the keys do not exist, it will reject the
> parquet commitment...

I don't think this is well documented in Iceberg. New files written to an Iceberg table will have partition columns present. But IIRC Iceberg has code to support tables migrated from Hive: it stores Identity transforms of the partition values and uses those for scanning/reading rather than trying to retrieve the column directly from the file.

Cheers,
Micah

On Wed, Nov 29, 2023 at 6:18 AM Haocheng Liu <lbtin...@gmail.com> wrote:

> Hi community,
>
> I want to solicit people's thoughts on the different toolchain behaviors of
> whether the hive partition keys should appear as columns in the underlying
> parquet file.
>
> Say I have data layout as:
>
> /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
> /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
>
> IIRC the default Arrow behavior is column *dt* and *lang* will be
> *excluded* from the underlying parquet file.
>
> It's indeed what Google BigQuery external table loading is expecting
> <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
> If the keys exist in the parquet file, BigQuery will error out.
>
> When it comes to Iceberg, it requires
> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> present in the parquet file. If the keys do not exist, it will reject the
> parquet commitment...
>
> Can folks shade some light here? I really do not want to duplicate my data
> for this discrepancy...
>
> Cheers,
> Haocheng