Thanks Jack! I think generic key-value pairs are still valuable, even for data governance.

Regarding schema versions and PII evolution over time, I actually think it is a good feature to keep PII and schema in sync across versions for data reproducibility. Consistency is key in time travel scenarios: the objective should be to replicate data states accurately, regardless of subsequent changes in column tags. On the other hand, organizations typically make special arrangements when it comes to addressing compliance in the context of time travel. For example, in the data deletion use case, special accommodation should be made to address the fact that time travel can facilitate restoring the data. Finally, I am not very concerned about the case where a field evolves to PII=true while it is still set to PII=false within the time travel window. Typically, the time travel window is on the order of days, whereas the regulatory enforcement window is on the order of months. Most often, the data versions with PII=false would have cycled out of the system before the regulatory enforcement takes effect.

I also think that the catalog-level approach in AWS Glue still needs to consistently ensure schema compatibility. How does it ensure that the columns referenced in the policies are in sync with the Iceberg table schema, especially when the Iceberg table schema evolves but the policies and referenced columns do not?

Regarding bringing policy and compliance semantics into Iceberg as a top-level construct, I agree this is taking it a bit too far and might be out of scope. Further, compliance policies can be quite complicated, and a predefined set of permissions/access controls can be too restrictive and not flexible enough to capture various compliance needs, like dynamic data masking.

Thanks,
Walaa.

On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com> wrote:

> Thanks for bringing this topic up! I can provide some perspective about
> AWS Glue's related features.
>
> AWS Glue table definition also has a column parameters feature (ref
> <https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Column>).
> This does not serve any governance purpose at this moment, but it is a
> pretty convenient feature that allows users to add arbitrary tags to
> columns. As you said, it is technically just a fancier and more
> structured doc field for a column, and I don't have a strong opinion
> about whether to add it to Iceberg.
>
> If we talk specifically about governance features, I am not sure if a
> column property is the best way though. Consider the case of a column that
> was not PII, but becomes PII because a certain law has passed. The
> operation a user would perform in this case is something like "ALTER TABLE
> MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, the Iceberg
> schema is versioned, which means that if you time travel to some time
> before the MODIFY COLUMN operation, the PII column is still accessible. So
> what you really want is to globally set the column to be PII, instead of
> just the latest column, but that becomes a bit incompatible with Iceberg's
> versioned schema model.
>
> In AWS Glue, such governance features are provided at the policy and table
> level. Information like PII and sensitivity level is essentially
> persisted as LakeFormation policies that are attached to the table but
> separate from the table. After users configure column/row-level access to
> a table through LakeFormation, the table response received by services
> like EMR Spark, Athena, and Glue ETL will contain additional fields for
> authorized columns and cell filters (ref
> <https://docs.aws.amazon.com/glue/latest/webapi/API_GetUnfilteredTableMetadata.html#API_GetUnfilteredTableMetadata_ResponseElements>),
> which allow these engines to apply the authorization to any schema of the
> table that will be used for the query.
> In this approach, the user's policy
> setting is decoupled from the table's schema evolution over time, which
> avoids problems like the one above in time travel, and many other types of
> unintended user configuration mistakes.
>
> So I think a full governance story would mean adding something similar to
> Iceberg's table model. For example, we could add a "policy" field that
> contains sub-fields like the table's basic access permission
> (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if
> Iceberg needs its own policy spec though; that might go a bit too far.
>
> Any thoughts?
>
> Best,
> Jack Ye
>
>
> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Hi Iceberg Developers,
>>
>>
>> I would like to start a discussion on a potential enhancement to Iceberg
>> around the implementation of key-value style properties (tags) for
>> individual columns or fields. I believe this feature could have significant
>> applications, especially in the domain of data governance.
>>
>>
>> Here are some examples of how this feature could be used:
>>
>>
>> * PII Classification: Indicating whether a field contains Personally
>> Identifiable Information (e.g., PII -> {true, false}).
>>
>> * Ontology Mapping: Associating fields with specific ontology terms
>> (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>>
>> * Sensitivity Level Setting: Defining the sensitivity level of a field
>> (e.g., Sensitive -> {High, Medium, Low}).
>>
>>
>> While current workarounds like table-level properties or column-level
>> comments/docs exist, they lack the structured approach needed for these use
>> cases. Table-level properties often require constant schema validation and
>> can be error-prone, especially when not in sync with the table schema.
>> Additionally, column-level comments, while useful, do not enforce a
>> standardized format.
>>
>>
>> I am also interested in hearing thoughts or experiences around whether
>> this problem is addressed at the catalog level in any of the
>> implementations (e.g., AWS Glue). My impression is that even with
>> catalog-level implementations, there's still a need for continual
>> validation against the table schema. Further, catalog-specific
>> implementations will lack a standardized specification. A spec could be
>> beneficial for areas requiring consistent and structured metadata
>> management.
>>
>>
>> I realize that introducing this feature may necessitate the development
>> of APIs in various engines to set these properties or tags, such as
>> extensions in Spark or Trino SQL. However, I believe it’s a worthwhile
>> discussion to have, separate from whether Iceberg should include these
>> features in its APIs. For the sake of this thread we can focus on the
>> Iceberg APIs aspect.
>>
>>
>> Here are some references to similar concepts in other systems:
>>
>>
>> * Avro attributes: *Avro 1.10.2 Specification - Schemas*
>> <https://avro.apache.org/docs/1.10.2/spec.html#schemas> (see "Attributes
>> not defined in this document are permitted as metadata").
>>
>> * BigQuery policy tags: *BigQuery Column-level Security*
>> <https://cloud.google.com/bigquery/docs/column-level-security#set_policy>.
>>
>> * Snowflake object tagging: *Snowflake Object Tagging Documentation*
>> <https://docs.snowflake.com/en/user-guide/object-tagging#create-and-assign-tags>
>> (see references to "MODIFY COLUMN").
>>
>>
>> Looking forward to your insights on whether addressing this issue at the
>> Iceberg specification and API level is a reasonable direction.
>>
>>
>> Thanks,
>> Walaa.
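
As a rough illustration of the time-travel concern discussed in this thread, here is a minimal Python sketch. It uses a toy in-memory model, not the Iceberg or AWS Glue APIs: the `Column` class, the `schemas` dict, the `external_pii_policy` set, and both helper functions are hypothetical names introduced only for this example. It contrasts a PII tag stored as a column property inside each schema version with a policy kept outside the versioned schema and keyed by field ID.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    field_id: int
    name: str
    # Hypothetical per-column key-value tags, as proposed in the thread.
    properties: dict = field(default_factory=dict)


# Toy model of a versioned schema: schema_id -> columns in that version.
schemas = {
    0: [Column(1, "email")],                   # before the column was tagged
    1: [Column(1, "email", {"pii": "true"})],  # after SET PROPERTIES ('pii'='true')
}

# Assumed alternative: a policy stored outside the schema, keyed by field ID.
external_pii_policy = {1}


def pii_from_schema(schema_id):
    """PII column names according to the tags stored in one schema version."""
    return {c.name for c in schemas[schema_id] if c.properties.get("pii") == "true"}


def pii_from_policy(schema_id):
    """PII column names according to the external policy, for any schema version."""
    return {c.name for c in schemas[schema_id] if c.field_id in external_pii_policy}


if __name__ == "__main__":
    for schema_id in sorted(schemas):
        print(schema_id, pii_from_schema(schema_id), pii_from_policy(schema_id))
```

In this toy model, reading through schema 0 (a time travel to before the tag was set) does not see the tag, while the external policy applies to both versions. The flip side, as raised in the reply, is that the external policy must itself be kept in sync with the schema as columns are added or dropped.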