It makes sense. Agree.

Regards
JB

On Mon, Jan 8, 2024 at 11:48 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>
> JB,
>
> I would draw a distinction between the catalog and this proposed feature: 
> the catalog is actually not part of the spec, so it is entirely up to the 
> engine and is optional.
>
> When it comes to the table spec, "optional" does not mean that it does not 
> have to be implemented/supported.  Any engine/library that produces metadata 
> would need to support column-level properties so that it does not drop or 
> improperly handle the metadata elements, even if it does not expose a way to 
> view/manipulate them.  This is why scrutiny of spec changes is critical.
>
> +1 to what you said about documentation and support.
>
> -Dan
>
>
>
> On Mon, Jan 8, 2024 at 1:38 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Dan,
>>
>> I agree: it will depend on the engine capabilities. That said, it's
>> similar to catalogs: each catalog might have different
>> approaches/features/capabilities, so engines might have different
>> capabilities as well.
>> If it's an optional feature in the spec, and each engine might or
>> might not implement it, that's OK. But it's certainly not a
>> requirement.
>> That said, we would need to clearly document the capabilities of each
>> engine (and catalog). (I'm not saying this documentation should live
>> in Iceberg, but engine "providers" would need to clearly state the
>> supported features.)
>>
>> Regards
>> JB
>>
>> On Mon, Jan 8, 2024 at 6:33 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > The main risk I see is that this adds complexity and there may be limited 
>> > use of the feature, which makes me question the value.  Spark seems like 
>> > the most likely/obvious candidate to add native support for column-level 
>> > properties, but there is a wide range of engines that may never really 
>> > adopt this (e.g. Trino, Dremio, Doris, Starrocks, Redshift), as there 
>> > isn't a SQL specification for table/column properties, to my knowledge.
>> >
>> > I do think it would be nice for engines that have similar concepts, if it 
>> > really can be natively integrated, and I'm sure there are other use cases 
>> > for column properties, but it still feels somewhat niche.
>> >
>> > That being said, I'm not opposed, and if there's interest in putting 
>> > together a proposal for the spec changes, we'll get a much better idea 
>> > of any challenges.
>> >
>> > Thanks,
>> > -Dan
>> >
>> > On Thu, Jan 4, 2024 at 11:55 AM Walaa Eldin Moustafa 
>> > <wa.moust...@gmail.com> wrote:
>> >>
>> >> Agree that it should not be use case specific. There could be other 
>> >> applications beyond governance. As John mentioned, ML is another domain, 
>> >> and it is actually the case at LinkedIn as well.
>> >>
>> >> I would approach this with the understanding that the key requirement is 
>> >> to add key/value properties at the column level, not necessarily solving 
>> >> for compliance. Compliance is just one of the applications and can 
>> >> leverage this feature in many ways. But one of the key requirements in 
>> >> compliance, ML, and other applications is enriching column-level 
>> >> metadata. Other systems (Avro, BigQuery, Snowflake) do that too as 
>> >> pointed out in the original message. Since Iceberg is the source of truth 
>> >> for schema/column/field data, it sounds reasonable that the column-level 
>> >> metadata should co-exist in the same place, hence the Iceberg-level 
>> >> proposal. Other external solutions are possible of course (for column 
>> >> level metadata, not necessarily "compliance"), but with the compromise of 
>> >> possible schema drift and inconsistency. For example, at LinkedIn, we use 
>> >> Datahub for compliance annotations/tags (this is an example of an 
>> >> external system, even outside the catalog) and use Avro schema literals 
>> >> for ML column-level metadata (this is an example of a table-level 
>> >> property). In both situations, it would have been better if the tags 
>> >> co-existed with the column definitions. So the tradeoff is really 
>> >> between (1) enhancing the Iceberg spec to minimize inconsistency in this 
>> >> domain, or (2) letting Iceberg users come up with custom, disparate, and 
>> >> potentially inconsistent solutions. What do you all think?
>> >>
>> >> Thanks,
>> >> Walaa.
>> >>
>> >> On Thu, Jan 4, 2024 at 11:14 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >>>
>> >>> I'm not opposed to the idea of adding column-level properties, with a 
>> >>> few considerations:
>> >>>
>> >>> * We shouldn't explicitly tie it to a particular use case like data 
>> >>> governance. You may be able to leverage this for those capabilities, 
>> >>> but adding anything use-case specific gets into some really opinionated 
>> >>> areas and makes the feature less generalizable.
>> >>> * We need to be really explicit about the behaviors around evolution, 
>> >>> tags, and branches, as these could have implications for how features 
>> >>> built around this behave.
>> >>> * Iceberg would need to be the source of truth for this information, to 
>> >>> keep external tags from misrepresenting the underlying schema definition.
>> >>>
>> >>> I would agree with Jack that there may be other ways to approach policy 
>> >>> information, so we should explore those and see whether they would 
>> >>> render this functionality less useful overall (I'm sure there are ways 
>> >>> we can use column-level properties, but if the main driver is policy, 
>> >>> this may not be worth the investment at the moment).
>> >>>
>> >>> -Dan
>> >>>
>> >>> On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <liurenjie2...@gmail.com> 
>> >>> wrote:
>> >>>>
>> >>>> This proposal sounds good to me.
>> >>>>
>> >>>>>> If we talk specifically about governance features, I am not sure 
>> >>>>>> that column properties are the best way, though. Consider the case 
>> >>>>>> of a column that was not PII but becomes PII because a certain law 
>> >>>>>> has passed. The operation a user would perform in this case is 
>> >>>>>> something like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES 
>> >>>>>> ('pii'='true')". However, the Iceberg schema is versioned, which 
>> >>>>>> means that if you time travel to a point before the MODIFY COLUMN 
>> >>>>>> operation, the PII column is still accessible.
>> >>>>
>> >>>>
>> >>>> This sounds like reasonable behavior to me. It is just like running 
>> >>>> DDL such as "ALTER TABLE t ADD COLUMNS (new_column string)": if we 
>> >>>> time travel to an older version, we should also not see new_column.
>> >>>>
>> >>>> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <jzh...@apache.org> wrote:
>> >>>>>
>> >>>>> Hi Walaa,
>> >>>>>
>> >>>>> Netflix internal Spark and Iceberg have supported column metadata in 
>> >>>>> Iceberg tables since Spark 2.4. The Spark data type is 
>> >>>>> `org.apache.spark.sql.types.Metadata` in StructType. The feature is 
>> >>>>> used by ML teams.
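>> >>>>>
>> >>>>> For reference, a minimal sketch of that mechanism with the
>> >>>>> open-source Spark Java API (the "pii" key is just an illustrative
>> >>>>> tag, not something the feature prescribes):
>> >>>>>
>> >>>>>   import org.apache.spark.sql.types.DataTypes;
>> >>>>>   import org.apache.spark.sql.types.Metadata;
>> >>>>>   import org.apache.spark.sql.types.MetadataBuilder;
>> >>>>>   import org.apache.spark.sql.types.StructField;
>> >>>>>   import org.apache.spark.sql.types.StructType;
>> >>>>>
>> >>>>>   // Attach arbitrary key/value metadata to a single column.
>> >>>>>   Metadata piiTag = new MetadataBuilder().putBoolean("pii", true).build();
>> >>>>>   StructType schema = new StructType(new StructField[] {
>> >>>>>       DataTypes.createStructField("user_email", DataTypes.StringType,
>> >>>>>           true, piiTag)
>> >>>>>   });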
>> >>>>>
>> >>>>> It'd be great for the feature to be adopted.
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa 
>> >>>>> <wa.moust...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Thanks Jack!
>> >>>>>>
>> >>>>>> I think generic key-value pairs are still valuable, even for data 
>> >>>>>> governance.
>> >>>>>>
>> >>>>>> Regarding schema versions and PII evolution over time, I actually 
>> >>>>>> think it is a good feature to keep PII and schema in sync across 
>> >>>>>> versions for data reproducibility. Consistency is key in time travel 
>> >>>>>> scenarios - the objective should be to replicate data states 
>> >>>>>> accurately, regardless of subsequent changes in column tags. On the 
>> >>>>>> other hand, organizations typically make special arrangements when 
>> >>>>>> it comes to addressing compliance in the context of time travel. For 
>> >>>>>> example, in the data deletion use case, special accommodation must 
>> >>>>>> be made to address the fact that time travel can facilitate 
>> >>>>>> restoring deleted data. Finally, I am not very concerned about the 
>> >>>>>> case where a field evolves to PII=true while it is still set to 
>> >>>>>> PII=false in the time travel window. Typically, the time travel 
>> >>>>>> window is on the order of days, while the regulation enforcement 
>> >>>>>> window is on the order of months. Most often, the data versions with 
>> >>>>>> PII=false would have cycled out of the system before the regulatory 
>> >>>>>> enforcement is in effect.
>> >>>>>>
>> >>>>>> I also think that the catalog-level approach in AWS Glue still needs 
>> >>>>>> to consistently ensure schema compatibility. How does it ensure that 
>> >>>>>> the columns referenced in the policies stay in sync with the Iceberg 
>> >>>>>> table schema, especially when the table schema evolves but the 
>> >>>>>> policies and referenced columns do not?
>> >>>>>>
>> >>>>>> Regarding bringing policy and compliance semantics aspects into 
>> >>>>>> Iceberg as a top level construct, I agree this is taking it a bit too 
>> >>>>>> far and might be out of scope. Further, compliance policies can be 
>> >>>>>> quite complicated, and a predefined set of permissions/access 
>> >>>>>> controls can be too restrictive and not flexible enough to capture 
>> >>>>>> various compliance needs, like dynamic data masking.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Walaa.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Thanks for bringing this topic up! I can provide some perspective 
>> >>>>>>> about AWS Glue's related features.
>> >>>>>>>
>> >>>>>>> The AWS Glue table definition also has a column parameters feature 
>> >>>>>>> (ref). It does not serve any governance purpose at this moment, 
>> >>>>>>> but it is a pretty convenient feature that allows users to add 
>> >>>>>>> arbitrary tags to columns. As you said, it is technically just a 
>> >>>>>>> fancier, more structured doc field for a column, and I don't have a 
>> >>>>>>> strong opinion about adding it to Iceberg or not.
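>> >>>>>>>
>> >>>>>>> For illustration, a minimal sketch of those column parameters
>> >>>>>>> through the AWS SDK for Java v2 (the "pii" key is arbitrary):
>> >>>>>>>
>> >>>>>>>   import java.util.Map;
>> >>>>>>>   import software.amazon.awssdk.services.glue.model.Column;
>> >>>>>>>
>> >>>>>>>   // Glue columns carry a free-form Parameters map, much like the
>> >>>>>>>   // column-level properties proposed in this thread.
>> >>>>>>>   Column column = Column.builder()
>> >>>>>>>       .name("user_email")
>> >>>>>>>       .type("string")
>> >>>>>>>       .parameters(Map.of("pii", "true"))
>> >>>>>>>       .build();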
>> >>>>>>>
>> >>>>>>> If we talk specifically about governance features, I am not sure 
>> >>>>>>> that column properties are the best way, though. Consider the case 
>> >>>>>>> of a column that was not PII but becomes PII because a certain law 
>> >>>>>>> has passed. The operation a user would perform in this case is 
>> >>>>>>> something like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES 
>> >>>>>>> ('pii'='true')". However, the Iceberg schema is versioned, which 
>> >>>>>>> means that if you time travel to a point before the MODIFY COLUMN 
>> >>>>>>> operation, the PII column is still accessible. So what you really 
>> >>>>>>> want is to set the column to be PII globally, not just in the 
>> >>>>>>> latest schema, but that is a bit incompatible with Iceberg's 
>> >>>>>>> versioned schema model.
>> >>>>>>>
>> >>>>>>> In AWS Glue, such governance features are provided at the policy 
>> >>>>>>> and table level. Information like PII and sensitivity level is 
>> >>>>>>> essentially persisted as LakeFormation policies that are attached 
>> >>>>>>> to the table but stored separately from it. After users configure 
>> >>>>>>> column/row-level access to a table through LakeFormation, the table 
>> >>>>>>> response received by services like EMR Spark, Athena, and Glue ETL 
>> >>>>>>> will contain additional fields for authorized columns and cell 
>> >>>>>>> filters (ref), which allows these engines to apply the 
>> >>>>>>> authorization to any schema of the table that will be used for the 
>> >>>>>>> query. In this approach, the user's policy settings are decoupled 
>> >>>>>>> from the table's schema evolution over time, which avoids problems 
>> >>>>>>> like the time travel one above, and many other kinds of unintended 
>> >>>>>>> user configuration mistakes.
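>> >>>>>>>
>> >>>>>>> Roughly, engines see something like the following through the AWS
>> >>>>>>> SDK for Java v2 (a sketch from memory, so treat the exact names as
>> >>>>>>> approximate; the ids are placeholders):
>> >>>>>>>
>> >>>>>>>   import java.util.List;
>> >>>>>>>   import software.amazon.awssdk.services.glue.GlueClient;
>> >>>>>>>   import software.amazon.awssdk.services.glue.model.*;
>> >>>>>>>
>> >>>>>>>   GlueClient glue = GlueClient.create();
>> >>>>>>>   GetUnfilteredTableMetadataResponse md =
>> >>>>>>>       glue.getUnfilteredTableMetadata(
>> >>>>>>>           GetUnfilteredTableMetadataRequest.builder()
>> >>>>>>>               .catalogId("123456789012")  // placeholder account id
>> >>>>>>>               .databaseName("db")
>> >>>>>>>               .name("tbl")
>> >>>>>>>               .supportedPermissionTypes(
>> >>>>>>>                   PermissionType.COLUMN_PERMISSION,
>> >>>>>>>                   PermissionType.CELL_FILTER_PERMISSION)
>> >>>>>>>               .build());
>> >>>>>>>   // The policy result is independent of any historical schema.
>> >>>>>>>   List<String> readable = md.authorizedColumns(); // visible columns
>> >>>>>>>   List<ColumnRowFilter> filters = md.cellFilters(); // cell filters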
>> >>>>>>>
>> >>>>>>> So I think a full governance story would mean adding something 
>> >>>>>>> similar to Iceberg's table model. For example, we could add a 
>> >>>>>>> "policy" field that contains sub-fields like the table's basic 
>> >>>>>>> access permissions (READ/WRITE/ADMIN), authorized columns, data 
>> >>>>>>> filters, etc. I am not sure Iceberg needs its own policy spec, 
>> >>>>>>> though; that might go a bit too far.
>> >>>>>>>
>> >>>>>>> Any thoughts?
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Jack Ye
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa 
>> >>>>>>> <wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Iceberg Developers,
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I would like to start a discussion on a potential enhancement to 
>> >>>>>>>> Iceberg around the implementation of key-value style properties 
>> >>>>>>>> (tags) for individual columns or fields. I believe this feature 
>> >>>>>>>> could have significant applications, especially in the domain of 
>> >>>>>>>> data governance.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Here are some examples of how this feature could potentially be 
>> >>>>>>>> used (a hypothetical sketch of the resulting tags follows the list):
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> * PII Classification: Indicating whether a field contains 
>> >>>>>>>> Personally Identifiable Information (e.g., PII -> {true, false}).
>> >>>>>>>>
>> >>>>>>>> * Ontology Mapping: Associating fields with specific ontology terms 
>> >>>>>>>> (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>> >>>>>>>>
>> >>>>>>>> * Sensitivity Level Setting: Defining the sensitivity level of a 
>> >>>>>>>> field (e.g., Sensitive -> {High, Medium, Low}).
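>> >>>>>>>>
>> >>>>>>>> Purely as a hypothetical sketch (nothing like this exists in
>> >>>>>>>> Iceberg today, and all names here are made up), the tags could be
>> >>>>>>>> modeled as a string map per field:
>> >>>>>>>>
>> >>>>>>>>   import java.util.Map;
>> >>>>>>>>
>> >>>>>>>>   // Hypothetical shape: field ID -> { tag key -> tag value }.
>> >>>>>>>>   // Keying by Iceberg field ID (not name) would keep tags stable
>> >>>>>>>>   // across renames.
>> >>>>>>>>   Map<Integer, Map<String, String>> columnProperties = Map.of(
>> >>>>>>>>       1, Map.of("PII", "true", "Type", "USER_ID"),
>> >>>>>>>>       2, Map.of("Sensitive", "High"));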
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> While current workarounds like table-level properties or 
>> >>>>>>>> column-level comments/docs exist, they lack the structured approach 
>> >>>>>>>> needed for these use cases. Table-level properties often require 
>> >>>>>>>> constant schema validation and can be error-prone, especially when 
>> >>>>>>>> not in sync with the table schema. Additionally, column-level 
>> >>>>>>>> comments, while useful, do not enforce a standardized format.
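>> >>>>>>>>
>> >>>>>>>> As a sketch of that table-level workaround with today's Iceberg
>> >>>>>>>> Java API (the "pii.columns" key is a made-up convention; nothing
>> >>>>>>>> validates it against the schema, which is exactly the drawback):
>> >>>>>>>>
>> >>>>>>>>   import org.apache.iceberg.Table;
>> >>>>>>>>
>> >>>>>>>>   // Encode column tags in a table-level property. If "user_email"
>> >>>>>>>>   // is later renamed or dropped, the property silently goes stale.
>> >>>>>>>>   void tagPiiColumns(Table table) {
>> >>>>>>>>     table.updateProperties()
>> >>>>>>>>         .set("pii.columns", "user_email,user_phone")
>> >>>>>>>>         .commit();
>> >>>>>>>>   }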
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I am also interested in hearing thoughts or experiences around 
>> >>>>>>>> whether this problem is addressed at the catalog level in any of 
>> >>>>>>>> the implementations (e.g., AWS Glue). My impression is that even 
>> >>>>>>>> with catalog-level implementations, there's still a need for 
>> >>>>>>>> continual validation against the table schema. Further, 
>> >>>>>>>> catalog-specific implementations will lack a standardized 
>> >>>>>>>> specification. A spec could be beneficial for areas requiring 
>> >>>>>>>> consistent and structured metadata management.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I realize that introducing this feature may necessitate the 
>> >>>>>>>> development of APIs in various engines to set these properties or 
>> >>>>>>>> tags, such as extensions in Spark or Trino SQL. However, I believe 
>> >>>>>>>> it’s a worthwhile discussion to have, separate from whether Iceberg 
>> >>>>>>>> should include these features in its APIs. For the sake of this 
>> >>>>>>>> thread, we can focus on the Iceberg APIs aspect.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Here are some references to similar concepts in other systems:
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> * Avro attributes: Avro 1.10.2 Specification - Schemas (see 
>> >>>>>>>> "Attributes not defined in this document are permitted as 
>> >>>>>>>> metadata"; a short example follows this list).
>> >>>>>>>>
>> >>>>>>>> * BigQuery policy tags: BigQuery Column-level Security.
>> >>>>>>>>
>> >>>>>>>> * Snowflake object tagging: Snowflake Object Tagging Documentation 
>> >>>>>>>> (see references to "MODIFY COLUMN").
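>> >>>>>>>>
>> >>>>>>>> To illustrate the Avro mechanism, a minimal sketch with the Avro
>> >>>>>>>> Java API (the "pii" property is an arbitrary example):
>> >>>>>>>>
>> >>>>>>>>   import org.apache.avro.Schema;
>> >>>>>>>>
>> >>>>>>>>   // Avro fields accept arbitrary extra attributes as metadata.
>> >>>>>>>>   Schema.Field field = new Schema.Field("user_email",
>> >>>>>>>>       Schema.create(Schema.Type.STRING), "user email address", null);
>> >>>>>>>>   field.addProp("pii", "true");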
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Looking forward to your insights on whether addressing this issue 
>> >>>>>>>> at the Iceberg specification and API level is a reasonable 
>> >>>>>>>> direction.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> John Zhuge
