I think as far as the spec is concerned, we can treat those tags similarly to
how doc strings are treated. Currently, the spec statement on doc
strings is the following:

"Fields may have an optional comment or doc string."

I agree that the APIs should be designed in such a way that prevents
engines that do not know about them from inadvertently resetting them, but
I think this is achievable as well (and I expect it will be similar, to a
large extent, to how the doc string APIs are designed).
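
To make that concrete, here is a rough sketch (purely illustrative, not part
of the current spec) of how an optional field-level properties map could sit
next to the existing doc attribute in a field definition. The "properties"
key and its contents are hypothetical:

    # Hypothetical sketch only: "properties" is not in the current Iceberg
    # spec; it sits next to the optional "doc" attribute on a field.
    field = {
        "id": 1,
        "name": "email",
        "required": False,
        "type": "string",
        "doc": "User email address",
        "properties": {  # hypothetical optional key/value tags
            "pii": "true",
            "sensitivity": "high",
        },
    }

Engines that do not understand the extra key would simply carry it through
unchanged, the same way they carry doc strings today.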

Thanks,
Walaa.


On Mon, Jan 8, 2024 at 2:48 PM Daniel Weeks <daniel.c.we...@gmail.com>
wrote:

> JB,
>
> I would draw a distinction between the catalog and this proposed feature in
> that the catalog is actually not part of the spec, so it is entirely up to
> the engine and is optional.
>
> When it comes to the table spec, "optional" does not mean that it does not
> have to be implemented/supported.  Any engine/library that produces
> metadata would need to support column-level properties so that it does not
> drop or improperly handle the metadata elements, even if it does not expose
> a way to view/manipulate them.  This is why scrutiny of spec changes is
> critical.
>
> +1 to what you said about documentation and support.
>
> -Dan
>
>
>
> On Mon, Jan 8, 2024 at 1:38 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Dan,
>>
>> I agree: it will depend on the engine capabilities. That said, it's
>> similar to catalogs: each catalog might have different
>> approaches/features/capabilities, so engines might have different
>> capabilities as well.
>> If it's an optional feature in the spec, and each engine might or
>> might not implement it, that's ok. But it's certainly not a
>> requirement.
>> That said, we would need to clearly document the capabilities of each
>> engine (and catalog). I'm not saying this documentation should be in
>> Iceberg, but engine "providers" would need to clearly state the
>> supported features.
>>
>> Regards
>> JB
>>
>> On Mon, Jan 8, 2024 at 6:33 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > The main risk I see is that this adds complexity and there may be
>> limited use of the feature, which makes me question the value.  Spark seems
>> like the most likely/obvious engine to add native support for column-level
>> properties, but there is a wide range of engines that may never really
>> adopt this (e.g. Trino, Dremio, Doris, Starrocks, Redshift), as there isn't
>> a SQL specification for table/column properties to my knowledge.
>> >
>> > I do think it would be nice for engines that have similar concepts if
>> it really can be natively integrated and I'm sure there are other use cases
>> for column properties, but it still feels somewhat niche.
>> >
>> > That being said, I'm not opposed and if there's interest in getting a
>> proposal put together for the spec changes, we'll get a much better idea of
>> any challenges.
>> >
>> > Thanks,
>> > -Dan
>> >
>> > On Thu, Jan 4, 2024 at 11:55 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>
>> >> Agree that it should not be use case specific. There could be other
>> applications beyond governance. As John mentioned, ML is another domain,
>> and it is actually the case at LinkedIn as well.
>> >>
>> >> I would approach this with the understanding that the key requirement
>> is to add key/value properties at the column level, not necessarily solving
>> for compliance. Compliance is just one of the applications and can leverage
>> this feature in many ways. But one of the key requirements in compliance,
>> ML, and other applications is enriching column-level metadata. Other
>> systems (Avro, BigQuery, Snowflake) do that too as pointed out in the
>> original message. Since Iceberg is the source of truth for
>> schema/column/field data, it sounds reasonable that the column-level
>> metadata should co-exist in the same place, hence the Iceberg-level
>> proposal. Other external solutions are possible of course (for column level
>> metadata, not necessarily "compliance"), but with the compromise of
>> possible schema drift and inconsistency. For example, at LinkedIn, we use
>> Datahub for compliance annotations/tags (this is an example of an external
>> system, even outside the catalog) and use Avro schema literals for ML
>> column-level metadata (this is an example of a table-level property). In
>> both situations, it would have been better if the tags co-existed with the
>> column definitions. So the tradeoff is really between: (1) Enhancing the
>> Iceberg spec to minimize inconsistency in this domain, or (2) Letting
>> Iceberg users come up with custom, disparate, and potentially inconsistent
>> solutions. What do you all think?
>> >>
>> >> Thanks,
>> >> Walaa.
>> >>
>> >> On Thu, Jan 4, 2024 at 11:14 AM Daniel Weeks <dwe...@apache.org>
>> wrote:
>> >>>
>> >>> I'm not opposed to the idea of adding column-level properties, with a
>> few considerations:
>> >>>
>> >>> We shouldn't explicitly tie it to a particular use case like data
>> governance.  You may be able to leverage this for those capabilities, but
>> adding anything use-case specific gets into some really opinionated areas
>> and makes the feature less generalizable.
>> >>> We need to be really explicit about the behaviors around evolution,
>> tags, and branches, as it could have implications for how features built
>> around this behave.
>> >>> Iceberg would need to be the source of truth for this information to
>> keep external tags from misrepresenting the underlying schema definition.
>> >>>
>> >>> I would agree with Jack that there may be other ways to approach
>> policy information, so we should explore those and see if they would render
>> this functionality less useful overall (I'm sure there are ways we can use
>> column-level properties, but if the main driver is policy, this may not be
>> worth the investment at the moment).
>> >>>
>> >>> -Dan
>> >>>
>> >>> On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>> >>>>
>> >>>> This proposal sounds good to me.
>> >>>>
>> >>>>> If we talk specifically about governance features, I am not sure if
>> a column property is the best way, though. Consider the case of a column
>> that was not PII but becomes PII because a certain law has passed. The
>> operation a user would perform in this case is something like "ALTER
>> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, the
>> Iceberg schema is versioned, which means that if you time travel to some
>> time before the MODIFY COLUMN operation, the PII column is still accessible.
>> >>>>
>> >>>>
>> >>>> This sounds like reasonable behavior to me. This is just like doing
>> a DDL like "ALTER TABLE ADD COLUMNS (new_column string)", and
>> if we time travel to an older version, we should also not see
>> new_column.
>> >>>>
>> >>>> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <jzh...@apache.org> wrote:
>> >>>>>
>> >>>>> Hi Walaa,
>> >>>>>
>> >>>>> Netflix internal Spark and Iceberg have supported column metadata
>> in Iceberg tables since Spark 2.4. The Spark data type is
>> `org.apache.spark.sql.types.Metadata` in StructType. The feature is used by
>> ML teams.
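>> >>>>>
>> >>>>> As a minimal sketch of that Spark-side concept (this is the standard
>> PySpark API, not anything Iceberg-specific), column metadata can be attached
>> through StructField's metadata argument and read back from the schema:
>>
>> from pyspark.sql.types import StructType, StructField, StringType
>>
>> # Attach arbitrary key/value metadata to a column; in PySpark the
>> # metadata argument is a plain dict.
>> schema = StructType([
>>     StructField("email", StringType(), True, metadata={"pii": True}),
>> ])
>>
>> # Read it back from the schema definition.
>> print(schema["email"].metadata)  # {'pii': True}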
>> >>>>>
>> >>>>> It'd be great for the feature to be adopted.
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Thanks Jack!
>> >>>>>>
>> >>>>>> I think generic key value pairs are still valuable, even for data
>> governance.
>> >>>>>>
>> >>>>>> Regarding schema versions and PII evolution over time, I actually
>> think it is a good feature to keep PII and schema in sync across versions
>> for data reproducibility. Consistency is key in time travel scenarios - the
>> objective should be to replicate data states accurately, regardless of
>> subsequent changes in column tags. On the other hand, organizations
>> typically make special arrangements when it comes to addressing compliance
>> in the context of time travel. For example, in the data deletion use case,
>> special accommodation is needed to address the fact that time travel
>> can facilitate restoring the data. Finally, I am not very concerned about
>> the case when a field evolves to PII=true while it is still set to
>> PII=false in the time travel window. Typically, the time travel window is
>> on the order of days, but the regulation enforcement window is on the order
>> of months. Most often, the data versions with PII=false would have cycled
>> out of the system before the regulatory enforcement takes effect.
>> >>>>>>
>> >>>>>> I also think that the catalog-level example in AWS Glue still
>> needs to consistently ensure schema compatibility. How does it ensure that
>> the columns referenced in the policies stay in sync with the Iceberg table
>> schema, especially when the Iceberg table schema evolves but the
>> policies and referenced columns do not?
>> >>>>>>
>> >>>>>> Regarding bringing policy and compliance semantics into Iceberg as
>> a top-level construct, I agree this is taking it a bit too far and might be
>> out of scope. Further, compliance policies can be quite
>> complicated, and a predefined set of permissions/access controls can be too
>> restrictive and not flexible enough to capture various compliance needs,
>> like dynamic data masking.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Walaa.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> Thanks for bringing this topic up! I can provide some perspective
>> about AWS Glue's related features.
>> >>>>>>>
>> >>>>>>> The AWS Glue table definition also has a column parameters feature
>> (ref). This does not serve any governance purpose at the moment, but it is
>> a pretty convenient feature that allows users to add arbitrary tags to
>> columns. As you said, it is technically just a fancier, more
>> structured doc field for a column, and I don't have a strong opinion
>> about whether or not to add it in Iceberg.
>> >>>>>>>
>> >>>>>>> If we talk specifically about governance features, I am not sure
>> if a column property is the best way, though. Consider the case of a column
>> that was not PII but becomes PII because a certain law has passed. The
>> operation a user would perform in this case is something like "ALTER
>> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, the
>> Iceberg schema is versioned, which means that if you time travel to some
>> time before the MODIFY COLUMN operation, the PII column is still
>> accessible. So what you really want is to globally set the column to be
>> PII, rather than just in the latest schema, but that becomes a bit
>> incompatible with Iceberg's versioned schema model.
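>> >>>>>>>
>> >>>>>>> To make the time-travel concern concrete, here is a rough PySpark
>> sketch. The column-property DDL is hypothetical and does not exist today;
>> the time-travel read itself is documented Iceberg behavior on Spark 3.3+,
>> and the catalog/table names and snapshot id are placeholders:
>>
>> from pyspark.sql import SparkSession
>>
>> # Assumes a session already configured with an Iceberg catalog named "prod".
>> spark = SparkSession.builder.getOrCreate()
>>
>> # Hypothetical column-property DDL (no such syntax exists today):
>> #   ALTER TABLE prod.db.tbl ALTER COLUMN email SET PROPERTIES ('pii'='true')
>>
>> # Time travel reads the table as of an older snapshot, whose schema (and
>> # presumably any column properties) predates the change sketched above.
>> old = spark.sql("SELECT * FROM prod.db.tbl VERSION AS OF 1234567890123")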
>> >>>>>>>
>> >>>>>>> In AWS Glue, such governance features are provided at the policy
>> and table level. Information like PII and sensitivity level is essentially
>> persisted as LakeFormation policies that are attached to the table but
>> kept separate from it. After users configure column/row-level access to
>> a table through LakeFormation, the table response received by services
>> like EMR Spark, Athena, and Glue ETL will contain additional fields for
>> authorized columns and cell filters (ref), which allows these engines to
>> apply the authorization to whichever schema of the table is used for the
>> query. In this approach, the user's policy setting is decoupled from the
>> table's schema evolution over time, which avoids problems like the
>> time-travel one above, and many other types of unintended user
>> configuration mistakes.
>> >>>>>>>
>> >>>>>>> So I think a full governance story would mean adding something
>> similar to Iceberg's table model. For example, we could add a "policy"
>> field that contains sub-fields like the table's basic access permission
>> (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if
>> Iceberg needs its own policy spec, though; that might go a bit too far.
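>> >>>>>>>
>> >>>>>>> A hypothetical shape for such a "policy" field, just to make the
>> idea concrete (the field names and values below are made up):
>>
>> policy = {
>>     "table_permission": "READ",                   # e.g. READ/WRITE/ADMIN
>>     "authorized_columns": ["user_id", "country"],
>>     "data_filters": ["country = 'US'"],           # row-level filters
>> }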
>> >>>>>>>
>> >>>>>>> Any thoughts?
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Jack Ye
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Iceberg Developers,
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I would like to start a discussion on a potential enhancement to
>> Iceberg around the implementation of key-value style properties (tags) for
>> individual columns or fields. I believe this feature could have significant
>> applications, especially in the domain of data governance.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Here are some examples of how this feature can be potentially
>> used:
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> * PII Classification: Indicating whether a field contains
>> Personally Identifiable Information (e.g., PII -> {true, false}).
>> >>>>>>>>
>> >>>>>>>> * Ontology Mapping: Associating fields with specific ontology
>> terms (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>> >>>>>>>>
>> >>>>>>>> * Sensitivity Level Setting: Defining the sensitivity level of a
>> field (e.g., Sensitive -> {High, Medium, Low}).
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> While current workarounds like table-level properties or
>> column-level comments/docs exist, they lack the structured approach needed
>> for these use cases. Table-level properties often require constant schema
>> validation and can be error-prone, especially when not in sync with the
>> table schema. Additionally, column-level comments, while useful, do not
>> enforce a standardized format.
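>> >>>>>>>>
>> >>>>>>>> As a purely hypothetical illustration of the table-level-property
>> workaround and why it drifts (the property name and layout below are made
>> up, not an existing convention):
>>
>> import json
>>
>> # Column tags serialized into a single table property. Nothing ties the
>> # keys to real columns, so a rename or drop in the schema silently
>> # invalidates the tags unless they are re-validated on every change.
>> table_properties = {
>>     "column.tags": json.dumps({
>>         "email": {"pii": "true", "sensitivity": "high"},
>>         "user_id": {"type": "USER_ID"},
>>     })
>> }
>>
>> tags = json.loads(table_properties["column.tags"])
>> print(tags["email"]["pii"])  # valid only while a column named 'email' exists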
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I am also interested in hearing thoughts or experiences around
>> whether this problem is addressed at the catalog level in any of the
>> implementations (e.g., AWS Glue). My impression is that even with
>> catalog-level implementations, there's still a need for continual
>> validation against the table schema. Further, catalog-specific
>> implementations will lack a standardized specification. A spec could be
>> beneficial for areas requiring consistent and structured metadata
>> management.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I realize that introducing this feature may necessitate the
>> development of APIs in various engines to set these properties or tags,
>> such as extensions in Spark or Trino SQL. However, I believe it’s a
>> worthwhile discussion to have, separate from whether Iceberg should include
>> these features in its APIs. For the sake of this thread we can focus on the
>> Iceberg APIs aspect.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Here are some references to similar concepts in other systems:
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> * Avro attributes: Avro 1.10.2 Specification - Schemas (see
>> "Attributes not defined in this document are permitted as metadata"); a
>> short sketch of this mechanism follows the list below.
>> >>>>>>>>
>> >>>>>>>> * BigQuery policy tags: BigQuery Column-level Security.
>> >>>>>>>>
>> >>>>>>>> * Snowflake object tagging: Snowflake Object Tagging
>> Documentation (see references to "MODIFY COLUMN").
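>> >>>>>>>>
>> >>>>>>>> For example (a sketch only), the Avro mechanism referenced above
>> simply permits extra, non-reserved keys on a field definition, which the
>> spec treats as metadata:
>>
>> import json
>>
>> # An Avro record schema with a non-reserved attribute ("pii") on a field.
>> # Per the Avro spec, attributes not defined by the spec are permitted as
>> # metadata.
>> avro_schema = json.loads("""
>> {
>>   "type": "record",
>>   "name": "User",
>>   "fields": [
>>     {"name": "email", "type": "string", "pii": "true"}
>>   ]
>> }
>> """)
>> print(avro_schema["fields"][0]["pii"])  # 'true'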
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Looking forward to your insights on whether addressing this
>> issue at the Iceberg specification and API level is a reasonable direction.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> John Zhuge
>>
>
