Hey Haocheng,

Partitioning in Iceberg is logical rather than physical. The directory
structure (/dt=2021-03-01/) is there just for convenience; Iceberg does
not rely on the actual directory structure. The partition information is
stored in the metadata layer (the manifests and the manifest list). More
information on the comparison with Hive can be found here
<https://iceberg.apache.org/docs/latest/partitioning/>.
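
For example, with PyIceberg you can inspect this metadata directly. A
minimal sketch, assuming a configured catalog; the catalog name "default"
and the table name "db.myTable" are hypothetical, and inspect.partitions()
assumes a recent PyIceberg version:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # hypothetical catalog name
table = catalog.load_table("db.myTable")   # hypothetical table name

# The partition spec (source column + transform) lives in the table
# metadata, not in the directory layout.
print(table.spec())

# Per-partition details are derived from the manifests.
print(table.inspect.partitions())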

To add some more context: what we call hidden partitioning in Iceberg is
always derived from an actual column in the data, and the partition value
does not materialize in the Parquet file. This has several advantages,
for example for partition evolution. If you add a new partition column,
or change the granularity (month, day, hour, etc.), this does not affect
the existing data. New data adopts the new partition specification, and
when you rewrite historical data, it will also use the new specification.
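
To make that concrete: a hidden partition value is just a transform
applied to a source column. A conceptual sketch in plain Python (not
Iceberg's actual implementation):

from datetime import datetime

def month_transform(ts: datetime) -> str:
    return ts.strftime("%Y-%m")     # old spec: month granularity

def day_transform(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d")  # new spec: day granularity

ts = datetime(2021, 3, 1, 14, 30)
print(month_transform(ts))  # 2021-03    -> how existing files were grouped
print(day_transform(ts))    # 2021-03-01 -> how new files will be grouped

Since the partition value can always be recomputed from the source
column, files written under the old spec and files written under the new
spec can coexist in the same table.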

Maybe you also want to check out PyIceberg <https://py.iceberg.apache.org/>,
which allows you to read data into Arrow
<https://py.iceberg.apache.org/api/#apache-arrow>.
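
A minimal sketch, reusing the hypothetical catalog and table names from
above; the filter is pushed down to the metadata layer, so you never deal
with dt=.../ paths yourself:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("db.myTable")

# Filter on the source column; Iceberg prunes partitions using the
# metadata before any Parquet file is opened.
arrow_table = table.scan(row_filter="dt >= '2021-03-01'").to_arrow()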

Hope this helps!

Kind regards,
Fokko

On Wed, Nov 29, 2023 at 11:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I don't think there is a strong consensus here, unfortunately; different
> people might want different things, and there is the issue of legacy
> systems.  As another example, whether to include partition columns in data
> files is a configuration option in Hudi.  If I was creating new data from
> scratch, I'd recommend using Iceberg as a table format, which somewhat
> resolves this (more details below).
>
> > It's indeed what Google BigQuery external table loading is expecting
> > <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
> > If the keys exist in the parquet file, BigQuery will error out.
>
> BigQuery is likely overly strict here and might relax this in the future.
>
> > When it comes to Iceberg, it requires
> > <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> > present in the parquet file. If the keys do not exist, it will reject
> > the parquet commit...
>
> I don't think this is well documented in Iceberg.  New files written to an
> Iceberg table will have partition columns present.  But IIRC Iceberg has
> code to support tables migrated from Hive, which stores identity
> transforms of the partition values and uses those for scanning/reading
> rather than trying to retrieve the column directly from the file.
>
> Cheers,
> Micah
>
> On Wed, Nov 29, 2023 at 6:18 AM Haocheng Liu <lbtin...@gmail.com> wrote:
>
> > Hi community,
> >
> > I want to solicit people's thoughts on the different toolchain behaviors
> > of whether the Hive partition keys should appear as columns in the
> > underlying parquet file.
> >
> > Say I have a data layout like:
> >
> > /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
> > /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
> >
> > IIRC the default Arrow behavior is that columns *dt* and *lang* will be
> > *excluded* from the underlying parquet file.
> >
> > It's indeed what Google BigQuery external table loading is expecting
> > <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
> > If the keys exist in the parquet file, BigQuery will error out.
> >
> > When it comes to Iceberg, it requires
> > <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> > present in the parquet file. If the keys do not exist, it will reject
> > the parquet commit...
> >
> > Can folks shed some light here? I really do not want to duplicate my
> > data because of this discrepancy...
> >
> > Cheers,
> > Haocheng
> >
>
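
P.S. The default Arrow behavior described in the quoted thread can be
seen with pyarrow.dataset. A minimal sketch; the root path below is the
placeholder from Haocheng's layout:

import pyarrow.dataset as ds

dataset = ds.dataset(
    "/<my-path>/myTable",   # placeholder root from the layout above
    format="parquet",
    partitioning="hive",    # parse dt=.../lang=.../ from directory names
)
print(dataset.schema)       # dt and lang appear, though the files lack them
table = dataset.to_table(filter=(ds.field("lang") == "en"))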
