Hi Iceberg Developers,

I would like to start a discussion on a potential enhancement to Iceberg
around the implementation of key-value style properties (tags) for
individual columns or fields. I believe this feature could have significant
applications, especially in the domain of data governance.


Here are some examples of how this feature could be used:


* PII Classification: Indicating whether a field contains Personally
Identifiable Information (e.g., PII -> {true, false}).

* Ontology Mapping: Associating fields with specific ontology terms (e.g.,
Type -> {USER_ID, USER_NAME, LOCATION}).

* Sensitivity Level Setting: Defining the sensitivity level of a field
(e.g., Sensitive -> {High, Medium, Low}).
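
To make the examples above concrete, here is a rough sketch of what
per-field tags could look like as data. Everything in it is hypothetical:
the `Field` class, its `tags` attribute, and the `fields_with_tag` helper
are illustrative stand-ins and not part of any existing Iceberg API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; NOT the Iceberg API."""
    field_id: int
    name: str
    type: str
    # Assumed shape of the proposal: free-form string key/value pairs per field.
    tags: Dict[str, str] = field(default_factory=dict)


def fields_with_tag(schema: List[Field], key: str, value: str) -> List[str]:
    """Return the names of fields whose tag `key` equals `value`."""
    return [f.name for f in schema if f.tags.get(key) == value]


# Tag keys and values mirror the three use cases listed above.
schema = [
    Field(1, "user_id", "long",
          {"PII": "true", "Type": "USER_ID", "Sensitive": "High"}),
    Field(2, "user_name", "string",
          {"PII": "true", "Type": "USER_NAME", "Sensitive": "Medium"}),
    Field(3, "city", "string",
          {"PII": "false", "Type": "LOCATION", "Sensitive": "Low"}),
]

pii_fields = fields_with_tag(schema, "PII", "true")
```

Because the tags live on the field itself, a governance tool could answer
"which columns are PII?" without consulting anything outside the schema.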


While workarounds such as table-level properties or column-level
comments/docs exist, they lack the structure these use cases need.
Table-level properties must be continually validated against the table
schema and become error-prone when they fall out of sync with it (for
example, after a column is renamed or dropped). Column-level comments,
while useful, do not enforce a standardized format.
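
As an illustration of the table-level workaround and why it needs ongoing
validation, consider the sketch below. The `tag.<column>.<key>` naming
convention and the `orphaned_tag_properties` helper are assumptions made
up for this example; they are not an Iceberg convention or API.

```python
from typing import Dict, List, Set

# Hypothetical naming convention for this sketch: "tag.<column>.<key>".
PREFIX = "tag."


def orphaned_tag_properties(properties: Dict[str, str],
                            columns: Set[str]) -> List[str]:
    """Return tag property keys that reference a column not in the schema."""
    orphans = []
    for key in properties:
        if not key.startswith(PREFIX):
            continue  # unrelated table property, ignore it
        # Split "<column>.<key>" on the last dot to recover the column name.
        column, _, _tag_key = key[len(PREFIX):].rpartition(".")
        if column not in columns:
            orphans.append(key)
    return sorted(orphans)


properties = {
    "tag.email.PII": "true",
    "tag.city.Sensitive": "Low",
    "write.format.default": "parquet",
}

# Suppose "email" was later renamed to "contact_email" and nobody
# updated the table properties.
columns_after_rename = {"contact_email", "city"}

stale = orphaned_tag_properties(properties, columns_after_rename)
```

Here the PII marker silently stops applying to any real column after the
rename, which is the kind of drift that a check like this would have to
catch on every schema change.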


I am also interested in hearing thoughts or experiences around whether this
problem is addressed at the catalog level in any of the implementations
(e.g., AWS Glue). My impression is that even with catalog-level
implementations, there's still a need for continual validation against the
table schema. Further, catalog-specific implementations will lack a
standardized specification. A spec could be beneficial for areas requiring
consistent and structured metadata management.


I realize that introducing this feature may require engines to develop
APIs for setting these properties or tags, such as SQL extensions in Spark
or Trino. I believe that is a worthwhile discussion to have, but it is
separate from whether Iceberg should support these features in its own
APIs. For the sake of this thread, we can focus on the Iceberg API aspect.


Here are some references to similar concepts in other systems:


* Avro attributes: *Avro 1.10.2 Specification - Schemas*
<https://avro.apache.org/docs/1.10.2/spec.html#schemas> (see "Attributes
not defined in this document are permitted as metadata").

* BigQuery policy tags: *BigQuery Column-level Security*
<https://cloud.google.com/bigquery/docs/column-level-security#set_policy>.

* Snowflake object tagging: *Snowflake Object Tagging Documentation*
<https://docs.snowflake.com/en/user-guide/object-tagging#create-and-assign-tags>
(see references to "MODIFY COLUMN").


Looking forward to your insights on whether addressing this issue at the
Iceberg specification and API level is a reasonable direction.


Thanks,
Walaa.
