Discussion on adding a metadata field for NestedField

林阳昊 Thu, 17 Nov 2022 04:49:11 -0800

Hi guys,

This is to bring up a discussion on the feature proposed in 
https://github.com/apache/iceberg/issues/5631. Thanks Ryan Blue for pointing 
out the concerns!


As mentioned in the issue thread, ML and other use cases need the ability to 
describe a NestedField in detail. We can support this by adding a metadata 
field to NestedField. However, there are two concerns:
1. Field metadata is not part of SQL and is therefore not supported across 
compute engines, which is why Iceberg has not added this feature so far. In 
engines that are not aware of the metadata, operations like “CREATE TABLE 
copy AS SELECT * FROM original” will not carry original’s metadata through to 
copy, which can be confusing to users.
2. We would need to update the Iceberg table spec and define compatibility 
rules across spec versions.

Regarding the first concern, I think field metadata can be treated like 
identifier-field-ids. identifier-field-ids specifies the fields that identify 
rows in a table. However, uniqueness is guaranteed by specific compute engines 
and is not enforced by Iceberg. Likewise, support for field metadata can be 
engine-specific. For engines that support field metadata, there won’t be 
issues. My colleagues and I have developed a draft of field metadata in 
Iceberg and added support in Spark for loading and saving it. For engines that 
do not support it, maybe we can log a warning when loading a table that has 
field metadata.
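
To make the two behaviours concrete, here is a minimal sketch in plain Python. 
It is not the Iceberg API and not our draft implementation: the NestedField 
class, the copy_schema helper, the engine_supports_metadata flag and the 
"ml.dimension" key are all hypothetical names used only for illustration. It 
models a "CREATE TABLE copy AS SELECT * FROM original" in a metadata-aware 
engine and in an unaware one that logs a warning.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List

logger = logging.getLogger("field_metadata_sketch")


@dataclass(frozen=True)
class NestedField:
    # Hypothetical model of a schema field; NOT the Iceberg class.
    field_id: int
    name: str
    type_name: str
    # Assumed shape of the proposed metadata: free-form string key/value pairs.
    metadata: Dict[str, str] = field(default_factory=dict)


def copy_schema(fields: List[NestedField],
                engine_supports_metadata: bool) -> List[NestedField]:
    """Simulate CREATE TABLE copy AS SELECT * FROM original."""
    copied = []
    for f in fields:
        if f.metadata and not engine_supports_metadata:
            # An unaware engine drops the metadata; warn instead of failing.
            logger.warning("field %r has metadata that this engine ignores",
                           f.name)
            copied.append(NestedField(f.field_id, f.name, f.type_name))
        else:
            # An aware engine (or a field without metadata) copies it as-is.
            copied.append(NestedField(f.field_id, f.name, f.type_name,
                                      dict(f.metadata)))
    return copied


original = [
    NestedField(1, "id", "long"),
    NestedField(2, "embedding", "list<float>", {"ml.dimension": "128"}),
]
aware = copy_schema(original, engine_supports_metadata=True)
unaware = copy_schema(original, engine_supports_metadata=False)
```

In this sketch the aware copy keeps the metadata on "embedding", while the 
unaware copy has the same fields with empty metadata and one warning logged.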


What do you think of it? Should we add the feature?

Thanks,
Yanghao
