I'm not opposed to the idea of adding column-level properties, with a few considerations:
1. We shouldn't explicitly tie it to a particular use case like data governance. You may be able to leverage this for those capabilities, but adding anything use-case-specific gets into some really opinionated areas and makes the feature less generalizable.
2. We need to be really explicit about the behaviors around evolution, tags and branches, as they could have implications for how features built around this behave.
3. Iceberg would need to be the source of truth for this information to keep external tags from misrepresenting the underlying schema definition.

I would agree with Jack that there may be other ways to approach policy information, so we should explore those and see if they would render this functionality less useful overall (I'm sure there are ways we can use column-level properties, but if the main driver is policy, this may not be worth the investment at the moment).

-Dan

On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> This proposal sounds good to me.
>
>> If we talk specifically about governance features, I am not sure if column
>> property is the best way though. Consider the case of having a column which
>> was not PII, but becomes PII because a certain law has passed. The operation
>> a user would perform in this case is something like "ALTER TABLE MODIFY
>> COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg schema is
>> versioned, which means that if you time travel to some time before the MODIFY
>> COLUMN operation, the PII column is still accessible.
>
> This sounds like reasonable behavior to me. This is just like when we run a DDL
> like "ALTER TABLE ADD COLUMNS (new_column string)": if we
> time travel to an older version, we should also not see the new_column.
>
> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <jzh...@apache.org> wrote:
>
>> Hi Walaa,
>>
>> Netflix internal Spark and Iceberg have supported column metadata in
>> Iceberg tables since Spark 2.4.
>> The Spark data type is
>> `org.apache.spark.sql.types.Metadata` in StructType. The feature is used by
>> ML teams.
>>
>> It'd be great for the feature to be adopted.
>>
>>
>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thanks Jack!
>>>
>>> I think generic key-value pairs are still valuable, even for data
>>> governance.
>>>
>>> Regarding schema versions and PII evolution over time, I actually think
>>> it is a good feature to keep PII and schema in sync across versions for
>>> data reproducibility. Consistency is key in time travel scenarios - the
>>> objective should be to replicate data states accurately, regardless of
>>> subsequent changes in column tags. On the other hand, organizations
>>> typically make special arrangements when it comes to addressing compliance
>>> in the context of time travel. For example, in the data deletion use case,
>>> special accommodation should take place to address the fact that time travel
>>> can facilitate restoring the data. Finally, I am not very concerned about
>>> the case when a field evolves to PII=true while it is still set to
>>> PII=false in the time travel window. Typically, the time travel window is
>>> on the order of days but the regulation enforcement window is on the order of
>>> months. Most often, the data versions with PII=false would have cycled out
>>> of the system before the regulatory enforcement is in effect.
>>>
>>> I also think that the catalog-level example in AWS Glue still needs to
>>> consistently ensure schema compatibility. How does it ensure that the
>>> columns referenced in the policies are in sync with the Iceberg table
>>> schema, especially when the Iceberg table schema is evolved when the
>>> policies and referenced columns are not?
>>>
>>> Regarding bringing policy and compliance semantics aspects into Iceberg
>>> as a top-level construct, I agree this is taking it a bit too far and might
>>> be out of scope.
>>> Further, compliance policies can be quite complicated, and
>>> a predefined set of permissions/access controls can be too restrictive and
>>> not flexible enough to capture various compliance needs, like dynamic data
>>> masking.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Thanks for bringing this topic up! I can provide some perspective about
>>>> AWS Glue's related features.
>>>>
>>>> AWS Glue table definition also has a column parameters feature (ref
>>>> <https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Column>).
>>>> This does not serve any governance purpose at this moment, but it is a
>>>> pretty convenient feature that allows users to add arbitrary tags to
>>>> columns. As you said, it is technically just a fancier and more
>>>> structured doc field for a column, and I don't have a strong opinion
>>>> about whether to add it to Iceberg.
>>>>
>>>> If we talk specifically about governance features, I am not sure if
>>>> column property is the best way though. Consider the case of having a
>>>> column which was not PII, but becomes PII because a certain law has passed.
>>>> The operation a user would perform in this case is something like "ALTER
>>>> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg
>>>> schema is versioned, which means that if you time travel to some time before the
>>>> MODIFY COLUMN operation, the PII column is still accessible. So what
>>>> you really want is to globally set the column to be PII, instead of just
>>>> the latest version of the column, but that becomes a bit incompatible with
>>>> Iceberg's versioned schema model.
>>>>
>>>> In AWS Glue, such governance features are provided at policy and table
>>>> level. Information like PII and sensitivity level is essentially
>>>> persisted as LakeFormation policies that are attached to the table but
>>>> kept separate from the table.
>>>> After users configure column/row-level access to
>>>> a table through LakeFormation, the table response
>>>> received by services like EMR Spark, Athena, and Glue ETL will contain
>>>> additional fields for authorized columns and cell filters (ref
>>>> <https://docs.aws.amazon.com/glue/latest/webapi/API_GetUnfilteredTableMetadata.html#API_GetUnfilteredTableMetadata_ResponseElements>),
>>>> which allows these engines to apply the authorization to any schema of the
>>>> table that will be used for the query. In this approach, the user's policy
>>>> setting is decoupled from the table's schema evolution over time, which
>>>> avoids problems like the one above in time travel, and many other types of
>>>> unintended user configuration mistakes.
>>>>
>>>> So I think a full governance story would mean adding something similar
>>>> to Iceberg's table model. For example, we could add a "policy" field that
>>>> contains sub-fields like the table's basic access permission
>>>> (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if
>>>> Iceberg needs its own policy spec though; that might go a bit too far.
>>>>
>>>> Any thoughts?
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Hi Iceberg Developers,
>>>>>
>>>>>
>>>>> I would like to start a discussion on a potential enhancement to
>>>>> Iceberg around the implementation of key-value style properties (tags) for
>>>>> individual columns or fields. I believe this feature could have
>>>>> significant applications, especially in the domain of data governance.
>>>>>
>>>>>
>>>>> Here are some examples of how this feature can be potentially used:
>>>>>
>>>>>
>>>>> * PII Classification: Indicating whether a field contains Personally
>>>>> Identifiable Information (e.g., PII -> {true, false}).
>>>>>
>>>>> * Ontology Mapping: Associating fields with specific ontology terms
>>>>> (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>>>>>
>>>>> * Sensitivity Level Setting: Defining the sensitivity level of a field
>>>>> (e.g., Sensitive -> {High, Medium, Low}).
>>>>>
>>>>>
>>>>> While current workarounds like table-level properties or column-level
>>>>> comments/docs exist, they lack the structured approach needed for these
>>>>> use cases. Table-level properties often require constant schema validation
>>>>> and can be error-prone, especially when not in sync with the table schema.
>>>>> Additionally, column-level comments, while useful, do not enforce a
>>>>> standardized format.
>>>>>
>>>>>
>>>>> I am also interested in hearing thoughts or experiences around whether
>>>>> this problem is addressed at the catalog level in any of the
>>>>> implementations (e.g., AWS Glue). My impression is that even with
>>>>> catalog-level implementations, there's still a need for continual
>>>>> validation against the table schema. Further, catalog-specific
>>>>> implementations will lack a standardized specification. A spec could be
>>>>> beneficial for areas requiring consistent and structured metadata
>>>>> management.
>>>>>
>>>>>
>>>>> I realize that introducing this feature may necessitate the
>>>>> development of APIs in various engines to set these properties or tags,
>>>>> such as extensions in Spark or Trino SQL. However, I believe it’s a
>>>>> worthwhile discussion to have, separate from whether Iceberg should
>>>>> include these features in its APIs. For the sake of this thread we can
>>>>> focus on the Iceberg APIs aspect.
>>>>>
>>>>>
>>>>> Here are some references to similar concepts in other systems:
>>>>>
>>>>>
>>>>> * Avro attributes: *Avro 1.10.2 Specification - Schemas*
>>>>> <https://avro.apache.org/docs/1.10.2/spec.html#schemas> (see
>>>>> "Attributes not defined in this document are permitted as metadata").
>>>>>
>>>>> * BigQuery policy tags: *BigQuery Column-level Security*
>>>>> <https://cloud.google.com/bigquery/docs/column-level-security#set_policy>.
>>>>>
>>>>> * Snowflake object tagging: *Snowflake Object Tagging Documentation*
>>>>> <https://docs.snowflake.com/en/user-guide/object-tagging#create-and-assign-tags>
>>>>> (see references to "MODIFY COLUMN").
>>>>>
>>>>>
>>>>> Looking forward to your insights on whether addressing this issue at
>>>>> the Iceberg specification and API level is a reasonable direction.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>
>> --
>> John Zhuge
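To make the time-travel question in this thread concrete, here is a minimal toy sketch in Python. It is not Iceberg code: ToyTable, versioned_tag, policy_tag, and the snapshot and schema ids are all hypothetical names invented for illustration. It only contrasts a tag stored in a versioned schema with a tag stored in a separate, schema-independent policy store.

```python
# Toy model only: none of these names are Iceberg APIs.
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # schema_id -> {column name -> column-level properties}
    schemas: dict = field(default_factory=dict)
    # snapshot_id -> schema_id in effect when the snapshot was written
    snapshot_schema: dict = field(default_factory=dict)
    # column name -> tags kept outside the versioned schema
    external_policy: dict = field(default_factory=dict)

    def versioned_tag(self, snapshot_id, column, key):
        """Look up a tag in the schema version tied to the given snapshot."""
        schema = self.schemas[self.snapshot_schema[snapshot_id]]
        return schema[column].get(key)

    def policy_tag(self, column, key):
        """Look up a tag in the schema-independent policy store."""
        return self.external_policy.get(column, {}).get(key)


table = ToyTable()

# Snapshot 100 was written while the column was tagged pii=false.
table.schemas[0] = {"email": {"pii": "false"}}
table.snapshot_schema[100] = 0

# Similar in spirit to:
#   ALTER TABLE ... MODIFY COLUMN email SET PROPERTIES ('pii'='true')
# which creates a new schema version used by later snapshots.
table.schemas[1] = {"email": {"pii": "true"}}
table.snapshot_schema[101] = 1

# The same change recorded as a policy applies to every schema version.
table.external_policy["email"] = {"pii": "true"}

old_view = table.versioned_tag(100, "email", "pii")  # time travel
new_view = table.versioned_tag(101, "email", "pii")  # current snapshot
policy_view = table.policy_tag("email", "pii")
print(old_view, new_view, policy_view)
```

In this toy model, reading through the older snapshot still reports the earlier tag value, while the policy lookup does not depend on the snapshot at all. Which of the two behaviors is desirable is exactly the open question in the discussion above.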