Re: [parquet][Iceberg] Should hive partition keys appear as corresponding columns in the file

Micah Kornfield Wed, 29 Nov 2023 11:49:52 -0800

I don't think there is a strong consensus here, unfortunately. Different
people might want different things, and there is the issue of legacy
systems.  As another example, whether to include partition columns in data
files is a configuration option in Hudi.  If I were creating new data from
scratch, I'd recommend using Iceberg as a table format, which somewhat
resolves this (more details below).




> It's indeed what Google BigQuery external table loading is expecting
> <
> https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts
> >.
> If the keys exist in the parquet file, BigQuery will error out.

BigQuery is likely overly strict here and might relax this in the future.

> When it comes to Iceberg, it requires
> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> present in the parquet file. If the keys do not exist, it will reject the
> parquet commitment...

I don't think this is well documented in Iceberg.  New files written to an
Iceberg table will have partition columns present.  But IIRC Iceberg has
code to support tables migrated from Hive: for identity transforms it
stores the partition values and uses those for scanning/reading rather
than trying to retrieve the column directly from the file.
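
To make the idea concrete, here is a minimal sketch in plain Python. It is
not Iceberg's actual code: the function names, the example path, and the
row-as-dict representation are all hypothetical, chosen only to show a
reader taking identity-partition columns from stored partition values
instead of from the data file.

```python
from urllib.parse import unquote


def parse_hive_partitions(path):
    """Collect key=value directory segments from a Hive-style path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)
    return values


def project_rows(file_rows, partition_values, columns):
    """Yield rows restricted to `columns`.

    Columns that have a stored partition value are filled from
    `partition_values` (assumption: identity transform), so they do not
    need to exist in the file. All other columns come from the file row.
    """
    for row in file_rows:
        yield {
            col: partition_values[col] if col in partition_values else row.get(col)
            for col in columns
        }


# Hypothetical file whose rows do not contain the dt/lang columns.
path = "/warehouse/myTable/dt=2019-10-31/lang=en/0.parquet"
partition_values = parse_hive_partitions(path)
rows = list(project_rows([{"id": 1}], partition_values, ["id", "dt", "lang"]))
```

With this approach the same file can be read whether or not it physically
contains the partition columns, because the stored partition value wins
either way.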

Cheers,
Micah

On Wed, Nov 29, 2023 at 6:18 AM Haocheng Liu <lbtin...@gmail.com> wrote:

> Hi community,
>
> I want to solicit people's thoughts on the different toolchain behaviors of
> whether the hive partition keys should appear as columns in the underlying
> parquet file.
>
> Say I have data layout as:
>
> /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
> /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
>
> IIRC the default Arrow behavior is that columns *dt* and *lang* will be
> *excluded* from the underlying parquet file.
>
> It's indeed what Google BigQuery external table loading is expecting
> <
> https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts
> >.
> If the keys exist in the parquet file, BigQuery will error out.
>
> When it comes to Iceberg, it requires
> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> present in the parquet file. If the keys do not exist, it will reject the
> parquet commitment...
>
> Can folks shed some light here? I really do not want to duplicate my data
> for this discrepancy...
>
> Cheers,
> Haocheng
>
