It makes sense. Agree.

Regards
JB
On Mon, Jan 8, 2024 at 11:48 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>
> JB,
>
> I would draw a distinction between the catalog and this proposed feature in
> that the catalog is actually not part of the spec, so it is entirely up to
> the engine and is optional.
>
> When it comes to the table spec, "optional" does not mean that it does not
> have to be implemented/supported. Any engine/library that produces metadata
> would need to support column-level properties so that it does not drop or
> improperly handle the metadata elements, even if it does not expose a way to
> view/manipulate them. This is why scrutiny of spec changes is critical.
>
> +1 to what you said about documentation and support.
>
> -Dan
>
> On Mon, Jan 8, 2024 at 1:38 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Dan,
>>
>> I agree: it will depend on the engine capabilities. That said, it's
>> similar to catalogs: each catalog might have different
>> approaches/features/capabilities, so engines might have different
>> capabilities as well.
>> If it's an optional feature in the spec, and each engine might or
>> might not implement it, that's ok. But it's certainly not a
>> requirement.
>> That said, we would need to clearly document the capabilities of each
>> engine (and catalog). (I'm not saying this documentation should be in
>> Iceberg, but engine "providers" would need to clearly state the
>> supported features.)
>>
>> Regards
>> JB
>>
>> On Mon, Jan 8, 2024 at 6:33 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > The main risk I see is that this adds complexity and there may be limited
>> > use of the feature, which makes me question the value. Spark seems like
>> > the most likely/obvious engine to add native support for column-level
>> > properties, but there is a wide range of engines that may never really
>> > adopt this (e.g. Trino, Dremio, Doris, StarRocks, Redshift), as there
>> > isn't a SQL specification for table/column properties to my knowledge.
>> >
>> > I do think it would be nice for engines that have similar concepts, if it
>> > really can be natively integrated, and I'm sure there are other use cases
>> > for column properties, but it still feels somewhat niche.
>> >
>> > That being said, I'm not opposed, and if there's interest in getting a
>> > proposal put together for the spec changes, we'll get a much better idea
>> > of any challenges.
>> >
>> > Thanks,
>> > -Dan
>> >
>> > On Thu, Jan 4, 2024 at 11:55 AM Walaa Eldin Moustafa
>> > <wa.moust...@gmail.com> wrote:
>> >>
>> >> Agree that it should not be use-case specific. There could be other
>> >> applications beyond governance. As John mentioned, ML is another domain,
>> >> and it is actually the case at LinkedIn as well.
>> >>
>> >> I would approach this with the understanding that the key requirement is
>> >> to add key/value properties at the column level, not necessarily solving
>> >> for compliance. Compliance is just one of the applications and can
>> >> leverage this feature in many ways. But one of the key requirements in
>> >> compliance, ML, and other applications is enriching column-level
>> >> metadata. Other systems (Avro, BigQuery, Snowflake) do that too, as
>> >> pointed out in the original message. Since Iceberg is the source of truth
>> >> for schema/column/field data, it sounds reasonable that the column-level
>> >> metadata should co-exist in the same place, hence the Iceberg-level
>> >> proposal. Other external solutions are possible of course (for
>> >> column-level metadata, not necessarily "compliance"), but with the
>> >> compromise of possible schema drift and inconsistency. For example, at
>> >> LinkedIn, we use Datahub for compliance annotations/tags (this is an
>> >> example of an external system, even outside the catalog) and use Avro
>> >> schema literals for ML column-level metadata (this is an example of a
>> >> table-level property).
>> >> In both situations, it would have been better if the tags
>> >> co-existed with the column definitions. So the tradeoff is really
>> >> between: (1) enhancing the Iceberg spec to minimize inconsistency in
>> >> this domain, or (2) letting Iceberg users come up with custom,
>> >> disparate, and potentially inconsistent solutions. What do you all
>> >> think?
>> >>
>> >> Thanks,
>> >> Walaa.
>> >>
>> >> On Thu, Jan 4, 2024 at 11:14 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >>>
>> >>> I'm not opposed to the idea of adding column-level properties, with a
>> >>> few considerations:
>> >>>
>> >>> We shouldn't explicitly tie it to a particular use case like data
>> >>> governance. You may be able to leverage this for those capabilities,
>> >>> but adding anything use-case specific gets into some really opinionated
>> >>> areas and makes the feature less generalizable.
>> >>> We need to be really explicit about the behaviors around evolution,
>> >>> tags, and branches, as this could have implications for how features
>> >>> built around this behave.
>> >>> Iceberg would need to be the source of truth for this information to
>> >>> keep external tags from misrepresenting the underlying schema
>> >>> definition.
>> >>>
>> >>> I would agree with Jack that there may be other ways to approach policy
>> >>> information, so we should explore those and see if they would render
>> >>> this functionality less useful overall (I'm sure there are ways we can
>> >>> use column-level properties, but if the main driver is policy, this may
>> >>> not be worth the investment at the moment).
>> >>>
>> >>> -Dan
>> >>>
>> >>> On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <liurenjie2...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> This proposal sounds good to me.
>> >>>>
>> >>>>> If we talk specifically about governance features, I am not sure if
>> >>>>> a column property is the best way though. Consider the case of
>> >>>>> having a column which was not PII, but becomes PII because a certain
>> >>>>> law has passed.
>> >>>>> The operation a user would perform in this case is something
>> >>>>> like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')".
>> >>>>> However, the Iceberg schema is versioned, which means that if you
>> >>>>> time travel to some time before the MODIFY COLUMN operation, the PII
>> >>>>> column is still accessible.
>> >>>>
>> >>>> This sounds like reasonable behavior to me. This is just like doing a
>> >>>> DDL like "ALTER TABLE ADD COLUMNS (new_column string)": if we time
>> >>>> travel to an older version, we should also not see the new_column.
>> >>>>
>> >>>> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <jzh...@apache.org> wrote:
>> >>>>>
>> >>>>> Hi Walaa,
>> >>>>>
>> >>>>> Netflix internal Spark and Iceberg have supported column metadata in
>> >>>>> Iceberg tables since Spark 2.4. The Spark data type is
>> >>>>> `org.apache.spark.sql.types.Metadata` in StructType. The feature is
>> >>>>> used by ML teams.
>> >>>>>
>> >>>>> It'd be great for the feature to be adopted.
>> >>>>>
>> >>>>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa
>> >>>>> <wa.moust...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Thanks Jack!
>> >>>>>>
>> >>>>>> I think generic key/value pairs are still valuable, even for data
>> >>>>>> governance.
>> >>>>>>
>> >>>>>> Regarding schema versions and PII evolution over time, I actually
>> >>>>>> think it is a good feature to keep PII and schema in sync across
>> >>>>>> versions for data reproducibility. Consistency is key in time
>> >>>>>> travel scenarios - the objective should be to replicate data states
>> >>>>>> accurately, regardless of subsequent changes in column tags. On the
>> >>>>>> other hand, organizations typically make special arrangements when
>> >>>>>> it comes to addressing compliance in the context of time travel.
>> >>>>>> For example, in the data deletion use case, special accommodation
>> >>>>>> should take place to address the fact that time travel can
>> >>>>>> facilitate restoring the data. Finally, I am not very concerned
>> >>>>>> about the case where a field evolves to PII=true while it is still
>> >>>>>> set to PII=false in the time travel window. Typically, the time
>> >>>>>> travel window is on the order of days, but the regulation
>> >>>>>> enforcement window is on the order of months. Most often, the data
>> >>>>>> versions with PII=false would have cycled out of the system before
>> >>>>>> the regulatory enforcement is in effect.
>> >>>>>>
>> >>>>>> I also think that the catalog-level example in AWS Glue still needs
>> >>>>>> to consistently ensure schema compatibility? How does it ensure
>> >>>>>> that the columns referenced in the policies are in sync with the
>> >>>>>> Iceberg table schema, especially when the Iceberg table schema is
>> >>>>>> evolved while the policies and referenced columns are not?
>> >>>>>>
>> >>>>>> Regarding bringing policy and compliance semantics into Iceberg as
>> >>>>>> a top-level construct, I agree this is taking it a bit too far and
>> >>>>>> might be out of scope. Further, compliance policies can be quite
>> >>>>>> complicated, and a predefined set of permissions/access controls
>> >>>>>> can be too restrictive and not flexible enough to capture various
>> >>>>>> compliance needs, like dynamic data masking.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Walaa.
>> >>>>>>
>> >>>>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Thanks for bringing this topic up! I can provide some perspective
>> >>>>>>> on AWS Glue's related features.
>> >>>>>>>
>> >>>>>>> The AWS Glue table definition also has a column parameters feature
>> >>>>>>> (ref).
>> >>>>>>> This does not serve any governance purpose at this moment,
>> >>>>>>> but it is a pretty convenient feature that allows users to add
>> >>>>>>> arbitrary tags to columns. As you said, it is technically just a
>> >>>>>>> fancier, more structured doc field for a column, which I don't
>> >>>>>>> have a strong opinion about adding to Iceberg or not.
>> >>>>>>>
>> >>>>>>> If we talk specifically about governance features, I am not sure
>> >>>>>>> if a column property is the best way though. Consider the case of
>> >>>>>>> having a column which was not PII, but becomes PII because a
>> >>>>>>> certain law has passed. The operation a user would perform in this
>> >>>>>>> case is something like "ALTER TABLE MODIFY COLUMN col SET
>> >>>>>>> PROPERTIES ('pii'='true')". However, the Iceberg schema is
>> >>>>>>> versioned, which means that if you time travel to some time before
>> >>>>>>> the MODIFY COLUMN operation, the PII column is still accessible.
>> >>>>>>> So what you really want is to globally set the column to be PII,
>> >>>>>>> instead of just the latest version of the column, but that becomes
>> >>>>>>> a bit incompatible with Iceberg's versioned schema model.
>> >>>>>>>
>> >>>>>>> In AWS Glue, such governance features are provided at the policy
>> >>>>>>> and table level. Information like PII and sensitivity level is
>> >>>>>>> essentially persisted as LakeFormation policies that are attached
>> >>>>>>> to the table but separated from it. After users configure
>> >>>>>>> column/row-level access to a table through LakeFormation, the
>> >>>>>>> table response received by services like EMR Spark, Athena, and
>> >>>>>>> Glue ETL will contain additional fields for authorized columns and
>> >>>>>>> cell filters (ref), which allows these engines to apply the
>> >>>>>>> authorization to any schema of the table that will be used for the
>> >>>>>>> query.
>> >>>>>>> In this approach, the user's policy setting is decoupled
>> >>>>>>> from the table's schema evolution over time, which avoids problems
>> >>>>>>> like the time travel one above, and many other types of unintended
>> >>>>>>> user configuration mistakes.
>> >>>>>>>
>> >>>>>>> So I think a full governance story would mean adding something
>> >>>>>>> similar to Iceberg's table model. For example, we could add a
>> >>>>>>> "policy" field that contains sub-fields like the table's basic
>> >>>>>>> access permission (READ/WRITE/ADMIN), authorized columns, data
>> >>>>>>> filters, etc. I am not sure if Iceberg needs its own policy spec
>> >>>>>>> though; that might go a bit too far.
>> >>>>>>>
>> >>>>>>> Any thoughts?
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Jack Ye
>> >>>>>>>
>> >>>>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa
>> >>>>>>> <wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Iceberg Developers,
>> >>>>>>>>
>> >>>>>>>> I would like to start a discussion on a potential enhancement to
>> >>>>>>>> Iceberg around the implementation of key-value style properties
>> >>>>>>>> (tags) for individual columns or fields. I believe this feature
>> >>>>>>>> could have significant applications, especially in the domain of
>> >>>>>>>> data governance.
>> >>>>>>>>
>> >>>>>>>> Here are some examples of how this feature could potentially be
>> >>>>>>>> used:
>> >>>>>>>>
>> >>>>>>>> * PII classification: indicating whether a field contains
>> >>>>>>>> Personally Identifiable Information (e.g., PII -> {true, false}).
>> >>>>>>>>
>> >>>>>>>> * Ontology mapping: associating fields with specific ontology
>> >>>>>>>> terms (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>> >>>>>>>>
>> >>>>>>>> * Sensitivity level setting: defining the sensitivity level of a
>> >>>>>>>> field (e.g., Sensitive -> {High, Medium, Low}).
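The examples above all reduce to a per-field key/value map. As a minimal sketch (not part of the Iceberg spec; the `properties` field name and the `pii`/`sensitivity` keys are hypothetical), such a map on schema fields could be modeled and queried like this:

```python
# Hypothetical sketch of column-level key/value properties.
# The "properties" map and the "pii"/"sensitivity" keys are invented
# for illustration; the Iceberg spec does not define them today.
from dataclasses import dataclass, field


@dataclass
class SchemaField:
    field_id: int        # Iceberg tracks fields by a stable numeric id
    name: str
    type: str
    properties: dict = field(default_factory=dict)  # proposed k/v tags


def fields_with_property(schema, key, value):
    """Return the names of fields whose property `key` equals `value`."""
    return [f.name for f in schema if f.properties.get(key) == value]


schema = [
    SchemaField(1, "user_id", "long", {"type": "USER_ID"}),
    SchemaField(2, "email", "string", {"pii": "true", "sensitivity": "High"}),
    SchemaField(3, "page_views", "long"),
]

print(fields_with_property(schema, "pii", "true"))  # ['email']
```

Keying the map off the stable field id rather than the name would let properties survive column renames, the same way Iceberg's own schema evolution does.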
>> >>>>>>>>
>> >>>>>>>> While current workarounds like table-level properties or
>> >>>>>>>> column-level comments/docs exist, they lack the structured
>> >>>>>>>> approach needed for these use cases. Table-level properties often
>> >>>>>>>> require constant schema validation and can be error-prone,
>> >>>>>>>> especially when not in sync with the table schema. Additionally,
>> >>>>>>>> column-level comments, while useful, do not enforce a
>> >>>>>>>> standardized format.
>> >>>>>>>>
>> >>>>>>>> I am also interested in hearing thoughts or experiences around
>> >>>>>>>> whether this problem is addressed at the catalog level in any of
>> >>>>>>>> the implementations (e.g., AWS Glue). My impression is that even
>> >>>>>>>> with catalog-level implementations, there's still a need for
>> >>>>>>>> continual validation against the table schema. Further,
>> >>>>>>>> catalog-specific implementations will lack a standardized
>> >>>>>>>> specification. A spec could be beneficial for areas requiring
>> >>>>>>>> consistent and structured metadata management.
>> >>>>>>>>
>> >>>>>>>> I realize that introducing this feature may necessitate the
>> >>>>>>>> development of APIs in various engines to set these properties or
>> >>>>>>>> tags, such as extensions in Spark or Trino SQL. However, I
>> >>>>>>>> believe it's a worthwhile discussion to have, separate from
>> >>>>>>>> whether Iceberg should include these features in its APIs. For
>> >>>>>>>> the sake of this thread we can focus on the Iceberg APIs aspect.
>> >>>>>>>>
>> >>>>>>>> Here are some references to similar concepts in other systems:
>> >>>>>>>>
>> >>>>>>>> * Avro attributes: Avro 1.10.2 Specification - Schemas (see
>> >>>>>>>> "Attributes not defined in this document are permitted as
>> >>>>>>>> metadata").
>> >>>>>>>>
>> >>>>>>>> * BigQuery policy tags: BigQuery Column-level Security.
>> >>>>>>>>
>> >>>>>>>> * Snowflake object tagging: Snowflake Object Tagging
>> >>>>>>>> documentation (see references to "MODIFY COLUMN").
>> >>>>>>>>
>> >>>>>>>> Looking forward to your insights on whether addressing this issue
>> >>>>>>>> at the Iceberg specification and API level is a reasonable
>> >>>>>>>> direction.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> John Zhuge
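On the Avro precedent cited in the references above: the Avro specification permits attributes it does not define to appear in a schema as metadata, so per-field tags can be embedded directly in the schema JSON and read back with any JSON parser. A small illustration (the `pii` and `sensitivity` attribute names are examples, not defined by Avro):

```python
import json

# An Avro record schema carrying custom per-field attributes.
# Avro permits attributes it does not define, treating them as metadata,
# so these tags travel with the schema; the "pii"/"sensitivity" names
# here are illustrative, not part of the Avro spec.
avro_schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "email", "type": "string", "pii": true, "sensitivity": "High"}
  ]
}
""")

pii_fields = [f["name"] for f in avro_schema["fields"] if f.get("pii")]
print(pii_fields)  # ['email']
```

This is the mechanism behind the Avro schema literals Walaa mentions: the tags live in the same artifact as the column definitions, so they cannot drift from the schema.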