Hey Haocheng,

The partitioning in Iceberg is logical, not physical. The directory structure (/dt=2021-03-01/) is there just for convenience; Iceberg does not rely on the actual directory layout. The partition information is stored in the metadata layer (the manifests and the manifest list). More information on the comparison with Hive can be found here: <https://iceberg.apache.org/docs/latest/partitioning/>.
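A minimal sketch of how such a partition spec is declared through PyIceberg, assuming a configured catalog; the catalog, table, and column names below are hypothetical. The partition is defined as a transform over a source column, so the derived day value lives in the metadata layer rather than in the data files:

from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, StringType, TimestampType

# Hypothetical schema: an event timestamp plus a language column.
schema = Schema(
    NestedField(field_id=1, name="event_ts", field_type=TimestampType(), required=False),
    NestedField(field_id=2, name="lang", field_type=StringType(), required=False),
)

# The partition is a transform of event_ts; the resulting day value is
# tracked in the manifests, not written as a column into the Parquet files.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="event_ts_day")
)

catalog = load_catalog("default")  # assumes a catalog configured for your environment
catalog.create_table("examples.events", schema=schema, partition_spec=spec)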
Just to add some more context: what we call hidden partitioning in Iceberg is always derived from an actual column in the data. The partition itself does not materialize as a column in the Parquet file. This has several advantages, for example for partition evolution: if you add a new partition column, or change the granularity (month, day, hour, etc.), the existing data is not affected. New data adopts the new partition specification, and when you rewrite historical data, it will use the new partition specification as well.

Maybe you also want to check out PyIceberg <https://py.iceberg.apache.org/>, which allows you to read data into Arrow <https://py.iceberg.apache.org/api/#apache-arrow> (see the sketches after the quoted thread below).

Hope this helps!

Kind regards,
Fokko

On Wed, Nov 29, 2023 at 11:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I don't think there is a strong consensus here, unfortunately; different
> people might want different things, and there is the issue of legacy
> systems. As another example, whether to include partition columns in data
> files is a configuration option in Hudi. If I were creating new data from
> scratch, I'd recommend using Iceberg as a table format, which somewhat
> resolves this (more details below).
>
> > It's indeed what Google BigQuery external table loading is expecting
> > <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
> > If the keys exist in the parquet file, BigQuery will error out.
>
> BigQuery is likely overly strict here and might relax this in the future.
>
> > When it comes to Iceberg, it requires
> > <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> > present in the parquet file. If the keys do not exist, it will reject
> > the parquet commit...
>
> I don't think this is well documented in Iceberg. New files written to an
> Iceberg table will have partition columns present. But IIRC Iceberg has
> code to support tables migrated from Hive, which stores identity
> transforms of the partition values and uses those for scanning/reading
> rather than trying to retrieve the column directly from the file.
>
> Cheers,
> Micah
>
> On Wed, Nov 29, 2023 at 6:18 AM Haocheng Liu <lbtin...@gmail.com> wrote:
>
> > Hi community,
> >
> > I want to solicit people's thoughts on the differing toolchain behaviors
> > around whether Hive partition keys should appear as columns in the
> > underlying parquet file.
> >
> > Say I have data laid out as:
> >
> > /<my-path>/myTable/dt=2019-10-31/lang=en/0.parquet
> > /<my-path>/myTable/dt=2018-10-31/lang=fr/1.parquet
> >
> > IIRC the default Arrow behavior is that the columns *dt* and *lang* are
> > *excluded* from the underlying parquet file.
> >
> > That is indeed what Google BigQuery external table loading expects
> > <https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#supported_data_layouts>.
> > If the keys exist in the parquet file, BigQuery will error out.
> >
> > When it comes to Iceberg, it requires
> > <https://iceberg.apache.org/spec/#partitioning> partition keys to be
> > present in the parquet file. If the keys do not exist, it will reject
> > the parquet commit...
> >
> > Can folks shed some light here? I really do not want to duplicate my
> > data just because of this discrepancy...
> >
> > Cheers,
> > Haocheng
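For the PyIceberg route mentioned above, a short sketch under the same hypothetical catalog and table names as the earlier example: the scan is planned from the Iceberg metadata and materialized as an Arrow table, so the reader never needs to know the directory layout.

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # hypothetical catalog name
table = catalog.load_table("examples.events")

# Plan the scan via the Iceberg metadata (manifests prune partitions)
# and read the matching data files into a single Arrow table.
arrow_table = table.scan(row_filter="lang = 'en'").to_arrow()
print(arrow_table.schema)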
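And to illustrate the Arrow behavior described in the quoted message, a sketch assuming pyarrow and the layout from the thread (the path root is a stand-in): with hive partitioning, dt and lang are reconstructed from the directory names rather than read from the Parquet files.

import pyarrow.dataset as ds

# Layout from the thread, with a stand-in root:
#   /path/to/myTable/dt=2019-10-31/lang=en/0.parquet
#   /path/to/myTable/dt=2018-10-31/lang=fr/1.parquet
dataset = ds.dataset(
    "/path/to/myTable",
    format="parquet",
    partitioning="hive",  # derive dt and lang from the directory names
)

# dt and lang show up in the schema even though the Parquet files
# themselves do not contain those columns.
print(dataset.schema)
en_rows = dataset.to_table(filter=ds.field("lang") == "en")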