Hi community,

I want to solicit people's thoughts on how different toolchains disagree
about whether hive partition keys should also appear as columns in the
underlying parquet files.

Say I have the following data layout:

/<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
/<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
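To make explicit what "partition keys" means for this layout, here is a minimal sketch that reads the key=value directory segments out of such a path. The helper name hive_partition_keys is hypothetical, and the sketch assumes plain, unescaped values (it does not undo any URL-encoding a writer may apply).

```python
from pathlib import PurePosixPath


def hive_partition_keys(path):
    """Return the hive-style key=value pairs encoded in a file path.

    Hypothetical helper for illustration only: it inspects the directory
    segments and ignores the file name itself.
    """
    keys = {}
    for segment in PurePosixPath(path).parent.parts:
        name, sep, value = segment.partition("=")
        # Only segments of the form "name=value" are partition directories.
        if sep and name:
            keys[name] = value
    return keys


if __name__ == "__main__":
    print(hive_partition_keys("/<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet"))
```

The point is that dt and lang are recoverable from the path alone, which is why some tools leave them out of the file and others expect them inside it.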

IIRC the default Arrow behavior is that the columns *dt* and *lang* will be
*excluded* from the underlying parquet file.

This is indeed what Google BigQuery external table loading expects
<https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
If the keys exist in the parquet file, BigQuery will error out.

Iceberg, on the other hand, requires
<https://iceberg.apache.org/spec/#partitioning> partition keys to be
present in the parquet file. If the keys do not exist, it will reject the
commit of the parquet file...

Can folks shed some light here? I really do not want to duplicate my data
just because of this discrepancy...

Cheers,
Haocheng
