Hi Fokko and Micah,

Really appreciate your input here. I will read the materials and discuss with my colleagues.
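For my own notes, this is the default Arrow behavior I was referring to: when writing a dataset with hive-style partitioning, the partition columns end up in the directory names and are dropped from the parquet files themselves, and they are re-derived from the paths on read. A minimal sketch with PyArrow; the paths and column names are just the example from my original mail, and the output directory is a relative ./myTable:

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({
        "dt": ["2019-10-31", "2018-10-31"],
        "lang": ["en", "fr"],
        "value": [1, 2],
    })

    # Write with hive-style partitioning: dt and lang become directory names
    # (dt=2019-10-31/lang=en/...) and are not written into the parquet files.
    ds.write_dataset(
        table,
        "myTable",
        format="parquet",
        partitioning=ds.partitioning(
            pa.schema([("dt", pa.string()), ("lang", pa.string())]),
            flavor="hive",
        ),
    )

    # Reading the directory back with partitioning="hive" reconstructs dt and
    # lang from the paths, so they show up as columns again.
    roundtrip = ds.dataset("myTable", format="parquet", partitioning="hive")
    print(roundtrip.to_table().schema)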
Thanks!

Best,
Haocheng

> On Nov 29, 2023, at 15:10, Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey Haocheng,
>
> The partitioning in Iceberg is logical rather than physical. The directory
> structure (/dt=2021-03-01/) is there just for convenience; Iceberg does not
> rely on the actual directory structure. The partition information is stored
> in the metadata layer (manifests and the manifest list). More information on
> the comparison with Hive can be found here
> <https://iceberg.apache.org/docs/latest/partitioning/>.
>
> Just to add some more context: what we call hidden partitioning in Iceberg
> is always backed by an actual column in the data. The name of the partition
> does not materialize in the Parquet file. This has several advantages, for
> example for partition evolution. If you add a new partition column, or you
> change the granularity (month, day, hour, etc.), this will not affect the
> existing data. New data will adopt the new partition specification, and
> when you rewrite historical data, it will also use the new partition
> specification.
>
> Maybe you also want to check out PyIceberg <https://py.iceberg.apache.org/>,
> which allows you to read data into Arrow
> <https://py.iceberg.apache.org/api/#apache-arrow>.
>
> Hope this helps!
>
> Kind regards,
> Fokko
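Fokko, thanks for the PyIceberg pointer. Just to check that I am reading the Arrow integration docs correctly, something like the sketch below is what I would try; the catalog name and table identifier are made up for illustration:

    from pyiceberg.catalog import load_catalog

    # Hypothetical catalog and table names, just for illustration.
    catalog = load_catalog("default")
    table = catalog.load_table("mydb.myTable")

    # Scan one language partition and materialize it as a PyArrow table;
    # the partition source columns come back as ordinary columns.
    arrow_table = (
        table.scan(
            row_filter="lang = 'en'",
            selected_fields=("dt", "lang"),
        )
        .to_arrow()
    )
    print(arrow_table.schema)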
> On Wed, Nov 29, 2023 at 11:49, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> I don't think there is a strong consensus here unfortunately, different
>> people might want different things, and there is the issue of legacy
>> systems. As another example, whether to include partition columns in data
>> files is a configuration option in Hudi. If I were creating new data from
>> scratch, I'd recommend using Iceberg as a table format, which somewhat
>> resolves this (more details below).
>>
>>> It's indeed what Google BigQuery external table loading is expecting
>>> <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
>>> If the keys exist in the parquet file, BigQuery will error out.
>>
>> BigQuery is likely overly strict here and might relax this in the future.
>>
>>> When it comes to Iceberg, it requires
>>> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
>>> present in the parquet file. If the keys do not exist, it will reject
>>> the parquet commit...
>>
>> I don't think this is well documented in Iceberg. New files written to an
>> Iceberg table will have partition columns present. But IIRC Iceberg has
>> code to support tables migrated from Hive, which will store identity
>> transforms of the partition values and use those for scanning/reading
>> rather than try to retrieve the column directly from the file.
>>
>> Cheers,
>> Micah
>>
>> On Wed, Nov 29, 2023 at 6:18 AM Haocheng Liu <lbtin...@gmail.com> wrote:
>>
>>> Hi community,
>>>
>>> I want to solicit people's thoughts on the different toolchain behaviors
>>> regarding whether the Hive partition keys should appear as columns in
>>> the underlying parquet file.
>>>
>>> Say I have a data layout like:
>>>
>>> /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
>>> /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
>>>
>>> IIRC the default Arrow behavior is that columns *dt* and *lang* will be
>>> *excluded* from the underlying parquet file.
>>>
>>> It's indeed what Google BigQuery external table loading is expecting
>>> <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
>>> If the keys exist in the parquet file, BigQuery will error out.
>>>
>>> When it comes to Iceberg, it requires
>>> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
>>> present in the parquet file. If the keys do not exist, it will reject
>>> the parquet commit...
>>>
>>> Can folks shed some light here? I really do not want to duplicate my
>>> data because of this discrepancy...
>>>
>>> Cheers,
>>> Haocheng
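P.S. To check my understanding of the hidden partitioning Fokko describes: if we were to migrate this layout to Iceberg, I believe the partition spec would be declared against real columns, roughly like the PyIceberg sketch below; the schema, field IDs, and catalog name are all made up for illustration. The dt and lang values would stay as ordinary columns in the parquet files, and the spec only records how they map to partitions.

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform, IdentityTransform
    from pyiceberg.types import DateType, NestedField, StringType

    # dt and lang are real columns in the data files; the partition spec only
    # describes how they are transformed into partition values.
    schema = Schema(
        NestedField(field_id=1, name="dt", field_type=DateType(), required=False),
        NestedField(field_id=2, name="lang", field_type=StringType(), required=False),
        NestedField(field_id=3, name="value", field_type=StringType(), required=False),
    )

    spec = PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="dt_day"),
        PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="lang"),
    )

    # Assumes a catalog named "default" is configured for PyIceberg.
    catalog = load_catalog("default")
    catalog.create_table("mydb.myTable", schema=schema, partition_spec=spec)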