Thanks for bringing this topic up! I can provide some perspective on the
related features in AWS Glue.

AWS Glue's table definition also has a column parameters feature (ref
<https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Column>).
It does not serve any governance purpose at the moment, but it is a
convenient feature that lets users attach arbitrary key-value tags to
columns. As you said, it is technically just a fancier, more structured
doc field for a column, and I don't have a strong opinion on whether to
add it to Iceberg.

If we talk specifically about governance features, though, I am not sure a
column property is the best approach. Consider a column that was not PII
but becomes PII because a new law has passed. The operation a user would
perform in this case is something like "ALTER TABLE MODIFY COLUMN col SET
PROPERTIES ('pii'='true')". However, Iceberg schemas are versioned, which
means that if you time travel to a point before the MODIFY COLUMN
operation, the column is still accessible because the older schema version
does not carry the PII property. What you really want is to mark the
column as PII globally, not just in the latest schema version, but that is
somewhat incompatible with Iceberg's versioned schema model.
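
To make the time travel problem concrete, here is a toy Python sketch. It
is not Iceberg or Glue code: the dictionaries, the snapshot ids, and the
readable_columns helper are all hypothetical, and it assumes the engine
resolves a time travel read against the schema that was in effect at that
snapshot.

```python
# Toy model only: plain dicts stand in for versioned schemas.
# None of these names are real Iceberg or Glue APIs.

# Schema history: schema id -> {column name -> column properties}
schemas = {
    0: {"id": {}, "email": {}},               # before the law passed
    1: {"id": {}, "email": {"pii": "true"}},  # after SET PROPERTIES ('pii'='true')
}

# Snapshot log: snapshot id -> schema id in effect when it was written
snapshot_schema = {100: 0, 200: 1}


def readable_columns(snapshot_id):
    """Columns a non-privileged reader may see, judged only by the
    schema version attached to the snapshot being read."""
    schema = schemas[snapshot_schema[snapshot_id]]
    return [name for name, props in schema.items()
            if props.get("pii") != "true"]


current = readable_columns(200)      # read of the latest snapshot
time_travel = readable_columns(100)  # read of the older snapshot

print(current)      # ['id']
print(time_travel)  # ['id', 'email']
```

The older schema version never received the property, so a check based on
column properties alone lets "email" through on the older snapshot.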

In AWS Glue, such governance features are provided at the policy and table
level. Information such as PII status and sensitivity level is essentially
persisted as Lake Formation policies that are attached to the table but
stored separately from it. After users configure column-level or row-level
access to a table through Lake Formation, the table response received by
services like EMR Spark, Athena, and Glue ETL contains additional fields
for authorized columns and cell filters (ref
<https://docs.aws.amazon.com/glue/latest/webapi/API_GetUnfilteredTableMetadata.html#API_GetUnfilteredTableMetadata_ResponseElements>),
which allow these engines to apply the authorization to whichever schema
of the table is used for the query. In this approach, the user's policy
settings are decoupled from the table's schema evolution over time, which
avoids problems like the time travel one above, as well as many other
kinds of unintended configuration mistakes.
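
As a rough illustration of that decoupling, the Python sketch below keeps
the policy outside every schema version and applies it to whichever
version a query uses. The structures and the authorized_projection helper
are hypothetical stand-ins, not the Glue or Lake Formation API; only the
idea of an "authorized columns" list comes from the response described
above.

```python
# Hypothetical policy stored next to the table, not inside any schema version.
policy = {"authorized_columns": {"id", "country"}}

# Schema history: schema id -> ordered column names
schema_versions = {
    0: ["id", "email"],
    1: ["id", "email", "country"],
}


def authorized_projection(schema_id, table_policy):
    """Filter the columns of any schema version through the same policy."""
    allowed = table_policy["authorized_columns"]
    return [col for col in schema_versions[schema_id] if col in allowed]


old_view = authorized_projection(0, policy)  # time travel read
new_view = authorized_projection(1, policy)  # current read

print(old_view)  # ['id']
print(new_view)  # ['id', 'country']
```

Because the policy is evaluated at read time, "email" stays hidden for
both the old and the new schema version without rewriting any schema.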

So I think a full governance story would mean adding something similar to
Iceberg's table model. For example, we could add a "policy" field that
contains sub-fields like the table's basic access permission
(READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure
whether Iceberg needs its own policy spec, though; that might go a bit too
far.
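
Purely as a strawman, such a "policy" field could look something like the
Python sketch below. Nothing here exists in the Iceberg spec: the class
names, the field names, and the string form of the data filter are all
assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    """Basic table access levels mentioned above."""
    READ = "READ"
    WRITE = "WRITE"
    ADMIN = "ADMIN"


@dataclass
class TablePolicy:
    """Hypothetical policy attached to the table, not to a schema version."""
    permission: Permission
    authorized_columns: list = field(default_factory=list)
    data_filters: list = field(default_factory=list)  # row filters as strings


@dataclass
class TableMetadata:
    """Hypothetical table model: versioned schemas plus one policy."""
    schemas: dict        # schema id -> ordered column names
    policy: TablePolicy  # shared by every schema version


table = TableMetadata(
    schemas={0: ["id", "email"], 1: ["id", "email", "country"]},
    policy=TablePolicy(
        permission=Permission.READ,
        authorized_columns=["id", "country"],
        data_filters=["country = 'US'"],
    ),
)

print(table.policy.permission.value)  # READ
```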

Any thoughts?

Best,
Jack Ye


On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Iceberg Developers,
>
>
> I would like to start a discussion on a potential enhancement to Iceberg
> around the implementation of key-value style properties (tags) for
> individual columns or fields. I believe this feature could have significant
> applications, especially in the domain of data governance.
>
>
> Here are some examples of how this feature can be potentially used:
>
>
> * PII Classification: Indicating whether a field contains Personally
> Identifiable Information (e.g., PII -> {true, false}).
>
> * Ontology Mapping: Associating fields with specific ontology terms (e.g.,
> Type -> {USER_ID, USER_NAME, LOCATION}).
>
> * Sensitivity Level Setting: Defining the sensitivity level of a field
> (e.g., Sensitive -> {High, Medium, Low}).
>
>
> While current workarounds like table-level properties or column-level
> comments/docs exist, they lack the structured approach needed for these use
> cases. Table-level properties often require constant schema validation and
> can be error-prone, especially when not in sync with the table schema.
> Additionally, column-level comments, while useful, do not enforce a
> standardized format.
>
>
> I am also interested in hearing thoughts or experiences around whether
> this problem is addressed at the catalog level in any of the
> implementations (e.g., AWS Glue). My impression is that even with
> catalog-level implementations, there's still a need for continual
> validation against the table schema. Further, catalog-specific
> implementations will lack a standardized specification. A spec could be
> beneficial for areas requiring consistent and structured metadata
> management.
>
>
> I realize that introducing this feature may necessitate the development of
> APIs in various engines to set these properties or tags, such as extensions
> in Spark or Trino SQL. However, I believe it’s a worthwhile discussion to
> have, separate from whether Iceberg should include these features in its
> APIs. For the sake of this thread we can focus on the Iceberg APIs aspect.
>
>
> Here are some references to similar concepts in other systems:
>
>
> * Avro attributes: *Avro 1.10.2 Specification - Schemas*
> <https://avro.apache.org/docs/1.10.2/spec.html#schemas> (see "Attributes
> not defined in this document are permitted as metadata").
>
> * BigQuery policy tags: *BigQuery Column-level Security*
> <https://cloud.google.com/bigquery/docs/column-level-security#set_policy>.
>
> * Snowflake object tagging: *Snowflake Object Tagging Documentation*
> <https://docs.snowflake.com/en/user-guide/object-tagging#create-and-assign-tags>
>  (see references to "MODIFY COLUMN").
>
>
> Looking forward to your insights on whether addressing this issue at the
> Iceberg specification and API level is a reasonable direction.
>
>
> Thanks,
> Walaa.
>
>
>
>
