Hi guys,

I'd like to start a discussion on the feature proposed in https://github.com/apache/iceberg/issues/5631. Thanks to Ryan Blue for pointing out the concerns!

As mentioned in the issue thread, the ability to describe a NestedField in detail is required by ML and other scenarios. We can support this by adding a metadata field to NestedField. However, supporting this feature raises two concerns:

1. Field metadata is not part of SQL and is therefore not supported across compute engines, which is why Iceberg hasn’t added this feature so far. For engines that are not aware of the metadata, operations like “CREATE TABLE copy AS SELECT * FROM original” will not carry original’s metadata through to copy, which can be confusing to the user.
2. We will need to update the Iceberg table spec and define compatibility rules across spec versions.

Regarding the first concern, I think field metadata can be treated like identifier-field-ids. identifier-field-ids specifies a table’s unique identifier; however, the uniqueness is guaranteed by specific compute engines and is not enforced by Iceberg. Likewise, support for field metadata can be engine-specific. For engines that support field metadata, there won’t be issues. My colleagues and I have developed a draft of Iceberg’s field metadata and added support in Spark for loading and saving it. For engines that do not support it, maybe we can log a warning when loading a table with field metadata.

What do you think? Should we add the feature?

Thanks,
Yanghao