Hi Fokko and Micah,

Really appreciate your input here. I will read the materials and discuss with my colleagues.
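For my own notes, this is the default Arrow behavior I was referring to: when writing a dataset with hive-style partitioning, the partition columns end up in the directory names and are dropped from the parquet files themselves, and they are re-derived from the paths on read. A minimal sketch with PyArrow; the paths and column names are just the example from my original mail, and the output directory is a relative ./myTable:

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({
        "dt": ["2019-10-31", "2018-10-31"],
        "lang": ["en", "fr"],
        "value": [1, 2],
    })

    # Write with hive-style partitioning: dt and lang become directory names
    # (dt=2019-10-31/lang=en/...) and are not written into the parquet files.
    ds.write_dataset(
        table,
        "myTable",
        format="parquet",
        partitioning=ds.partitioning(
            pa.schema([("dt", pa.string()), ("lang", pa.string())]),
            flavor="hive",
        ),
    )

    # Reading the directory back with partitioning="hive" reconstructs dt and
    # lang from the paths, so they show up as columns again.
    roundtrip = ds.dataset("myTable", format="parquet", partitioning="hive")
    print(roundtrip.to_table().schema)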
Thanks!

Best,
Haocheng

> On Nov 29, 2023, at 15:10, Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey Haocheng,
>
> The partitioning in Iceberg is logical rather than physical. The directory
> structure (/dt=2021-03-01/) is there just for convenience; Iceberg does not
> rely on the actual directory structure. The partition information is stored
> in the metadata layer (manifests and the manifest list). More information on
> the comparison with Hive can be found here
> <https://iceberg.apache.org/docs/latest/partitioning/>.
>
> Just to add some more context: what we call hidden partitioning in Iceberg
> is always backed by an actual column in the data. The name of the partition
> does not materialize in the Parquet file. This has several advantages, for
> example for partition evolution. If you add a new partition column, or you
> change the granularity (month, day, hour, etc.), this will not affect the
> existing data. New data will adopt the new partition specification, and
> when you rewrite historical data, it will also use the new partition
> specification.
>
> Maybe you also want to check out PyIceberg <https://py.iceberg.apache.org/>,
> which allows you to read data into Arrow
> <https://py.iceberg.apache.org/api/#apache-arrow>.
>
> Hope this helps!
>
> Kind regards,
> Fokko
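Fokko, thanks for the PyIceberg pointer. Just to check that I am reading the Arrow integration docs correctly, something like the sketch below is what I would try; the catalog name and table identifier are made up for illustration:

    from pyiceberg.catalog import load_catalog

    # Hypothetical catalog and table names, just for illustration.
    catalog = load_catalog("default")
    table = catalog.load_table("mydb.myTable")

    # Scan one language partition and materialize it as a PyArrow table;
    # the partition source columns come back as ordinary columns.
    arrow_table = (
        table.scan(
            row_filter="lang = 'en'",
            selected_fields=("dt", "lang"),
        )
        .to_arrow()
    )
    print(arrow_table.schema)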
> On Wed, Nov 29, 2023 at 11:49, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> I don't think there is a strong consensus here unfortunately, different
>> people might want different things, and there is the issue of legacy
>> systems. As another example, whether to include partition columns in data
>> files is a configuration option in Hudi. If I were creating new data from
>> scratch, I'd recommend using Iceberg as a table format, which somewhat
>> resolves this (more details below).
>>
>>> It's indeed what Google BigQuery external table loading is expecting
>>> <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
>>> If the keys exist in the parquet file, BigQuery will error out.
>>
>> BigQuery is likely overly strict here and might relax this in the future.
>>
>>> When it comes to Iceberg, it requires
>>> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
>>> present in the parquet file. If the keys do not exist, it will reject
>>> the parquet commit...
>>
>> I don't think this is well documented in Iceberg. New files written to an
>> Iceberg table will have partition columns present. But IIRC Iceberg has
>> code to support tables migrated from Hive, which will store identity
>> transforms of the partition values and use those for scanning/reading
>> rather than try to retrieve the column directly from the file.
>>
>> Cheers,
>> Micah
>>
>> On Wed, Nov 29, 2023 at 6:18 AM Haocheng Liu <lbtin...@gmail.com> wrote:
>>
>>> Hi community,
>>>
>>> I want to solicit people's thoughts on the different toolchain behaviors
>>> regarding whether the Hive partition keys should appear as columns in
>>> the underlying parquet file.
>>>
>>> Say I have a data layout like:
>>>
>>> /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
>>> /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
>>>
>>> IIRC the default Arrow behavior is that columns *dt* and *lang* will be
>>> *excluded* from the underlying parquet file.
>>>
>>> It's indeed what Google BigQuery external table loading is expecting
>>> <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
>>> If the keys exist in the parquet file, BigQuery will error out.
>>>
>>> When it comes to Iceberg, it requires
>>> <https://iceberg.apache.org/spec/#partitioning> partition keys to be
>>> present in the parquet file. If the keys do not exist, it will reject
>>> the parquet commit...
>>>
>>> Can folks shed some light here? I really do not want to duplicate my
>>> data because of this discrepancy...
>>>
>>> Cheers,
>>> Haocheng
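P.S. To check my understanding of the hidden partitioning Fokko describes: if we were to migrate this layout to Iceberg, I believe the partition spec would be declared against real columns, roughly like the PyIceberg sketch below; the schema, field IDs, and catalog name are all made up for illustration. The dt and lang values would stay as ordinary columns in the parquet files, and the spec only records how they map to partitions.

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform, IdentityTransform
    from pyiceberg.types import DateType, NestedField, StringType

    # dt and lang are real columns in the data files; the partition spec only
    # describes how they are transformed into partition values.
    schema = Schema(
        NestedField(field_id=1, name="dt", field_type=DateType(), required=False),
        NestedField(field_id=2, name="lang", field_type=StringType(), required=False),
        NestedField(field_id=3, name="value", field_type=StringType(), required=False),
    )

    spec = PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="dt_day"),
        PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="lang"),
    )

    # Assumes a catalog named "default" is configured for PyIceberg.
    catalog = load_catalog("default")
    catalog.create_table("mydb.myTable", schema=schema, partition_spec=spec)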