Re: Support Securable Objects in Iceberg REST Catalog

Dennis Huo Tue, 02 Jul 2024 19:00:08 -0700

>>
>> My personal solution to this is to add a request context, which was
>> prototyped in https://github.com/apache/iceberg/pull/10359. With this,
>> an engine can describe the privileges needed when requesting table
>> metadata. The prerequisite is that the catalog trusts the information
>> passed by the engine through some authZ mechanism, and the engine uses the
>> defined privileges here in the context. For example, if the engine requests
>> table metadata for a DELETE, then the request will loadTable(table_name,
>> context={privilege=DELETE}). Would that be something feasible to solve the
>> concern?
>>
>>
Nice! I think adding request context to help signify intent with some
extensibility would definitely go a long way in helping clean up how the
REST server handles otherwise-ambiguous requests (especially for the basic
case of whether to provide a read-only or a read-write Storage credential).


I see the concern about under-documenting an open-ended map, but I agree
with you that it's important to have the extensible plumbing in place
sooner rather than later to mitigate the slow upgrade of client libraries
everywhere.
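
To make the read-only vs read-write credential case concrete, here is a rough sketch of how a server might use such a context. Everything in it is an assumption for illustration: the `RequestContext` shape, the `privilege` key (borrowed from the loadTable example quoted above), the privilege names, and the fallback for clients that send no context are not taken from the PR or the REST spec.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class CredentialScope(Enum):
    NONE = "none"              # metadata only, no storage credential
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"


# Hypothetical privilege names, taken from examples in this thread.
READ_PRIVILEGES = {"SELECT", "TABLE_READ_DATA"}
WRITE_PRIVILEGES = {"INSERT", "UPDATE", "DELETE", "TABLE_WRITE_DATA"}


@dataclass
class RequestContext:
    # Open-ended map; keys the server does not understand are ignored.
    attributes: Dict[str, str] = field(default_factory=dict)


def credential_scope(context: RequestContext, granted: Set[str]) -> CredentialScope:
    """Choose which storage credential to vend for a loadTable-style request."""
    intent = context.attributes.get("privilege")
    if intent is None:
        # Older client with no declared intent: fall back to the widest
        # scope the caller's grants allow (an assumed policy).
        if granted & WRITE_PRIVILEGES:
            return CredentialScope.READ_WRITE
        if granted & READ_PRIVILEGES:
            return CredentialScope.READ_ONLY
        return CredentialScope.NONE
    if intent not in granted:
        # Declared intent can narrow what is vended, never widen it.
        return CredentialScope.NONE
    if intent in WRITE_PRIVILEGES:
        return CredentialScope.READ_WRITE
    if intent in READ_PRIVILEGES:
        return CredentialScope.READ_ONLY
    return CredentialScope.NONE
```

The point of the sketch is that a caller holding both SELECT and DELETE who declares `privilege=SELECT` gets only a read-only credential, while unknown keys in the map pass through harmlessly until servers learn to interpret them.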

> * Trust using PKI - registering an application/query-engine and use
> cryptographic signatures to validate that a request comes from "that
> specific trusted" application/engine
> * Trust using OAuth delegation
> * Trust using "network source" - if you know that only requests from your
> trusted applications/engines instances can come "from this IP" (overly
> simplified)
>

This is really interesting to consider, since an authenticated identity may
be orthogonal to the engine being used for a given request, and it sounds
like there are some scenarios folks have brought up where the Authorization
engine itself is also neither the REST Catalog nor the query engine.

These types of attribute-based policies do seem to be coming up more and
more as well. So in a way the incoming context is a combination of
identity/roles (RBAC-related things) along with (possibly
cryptographically-verifiable) "attributes" of the request source.

To avoid hard-coding any particular types of attributes, abstractly it
seems like it could be broken into:

   - Query Engine - responsible for supplying the authenticated
   identity/role(s) issuing the request and including a set of applicable
   context attributes (could include "intent" attributes as well as things
   like a signature attesting that this is a "trusted" query engine, etc.)
   - REST Server - responsible for resolving the concrete operation(s)
   involved in serving the request and creating the bundle of
   validated/resolved securables and possible additional "attributes" derived
   from the securables into some kind of "authorization bundle"
   - Authorization engine - crunches the bundle of request
   identities/roles/attributes + securables/securable-attributes
      - The Authorization engine might be the REST Server itself
      - The Authorization engine could be the trusted "query engine" in
      which case the resolved "authorization bundle" would be returned to the
      query engine
      - The Authorization engine might be federated out, in which case the
      REST Server would send the bundle to the remote engine for a decision and
      the REST Server's response would reflect the appropriate outcome
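
As a rough illustration of that split, the sketch below models the "authorization bundle" as plain data and the Authorization engine as a pluggable callable. All names here (`Securable`, `AuthorizationBundle`, `local_authorizer`, `serve_load_table`, the `intent` attribute and the grants layout) are hypothetical and not part of any existing Iceberg or Polaris API; only the embedded-in-the-REST-Server variant is shown, but a federated engine would simply be a different callable receiving the same bundle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Mapping, Set, Tuple


@dataclass(frozen=True)
class Securable:
    kind: str        # e.g. "TABLE"
    name: str        # e.g. "db.events"
    privilege: str   # privilege the resolved operation requires


@dataclass(frozen=True)
class AuthorizationBundle:
    principal: str
    roles: FrozenSet[str]
    attributes: Mapping[str, str]      # request context supplied by the engine
    securables: Tuple[Securable, ...]  # resolved by the REST server


# An authorization engine is anything that maps a bundle to a decision.
Authorizer = Callable[[AuthorizationBundle], bool]
Grant = Tuple[str, str, str]  # (kind, name, privilege)


def local_authorizer(grants: Dict[str, Set[Grant]]) -> Authorizer:
    """Authorization engine embedded in the REST server: role -> grants table."""
    def decide(bundle: AuthorizationBundle) -> bool:
        held: Set[Grant] = set()
        for role in bundle.roles:
            held |= grants.get(role, set())
        return all((s.kind, s.name, s.privilege) in held for s in bundle.securables)
    return decide


def serve_load_table(principal: str, roles: Iterable[str],
                     attributes: Mapping[str, str], table: str,
                     authorizer: Authorizer) -> str:
    # REST server step: resolve the concrete securable from request + intent.
    wants_write = attributes.get("intent") == "write"
    privilege = "TABLE_WRITE_DATA" if wants_write else "TABLE_READ_DATA"
    bundle = AuthorizationBundle(principal, frozenset(roles), dict(attributes),
                                 (Securable("TABLE", table, privilege),))
    # The decision could equally come from a remote (federated) engine.
    return "200 OK" if authorizer(bundle) else "403 Forbidden"
```

Keeping the bundle as inert data is what would let the same structure be evaluated locally, shipped to a remote engine, or handed back to a trusted query engine without the REST Server needing to know which attributes matter to the policy.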


On Tue, Jul 2, 2024 at 11:24 AM Robert Stupp <sn...@snazy.de> wrote:

> Oh - I'm not against having the fine(r) grained privileges per se. Just
> saying that it's at best quite complicated to enforce those "properly".
>
> The "trust" model probably deserves a separate (but related) discussion.
> There are potentially different "kinds" of how one can implement trust.
> Some things that come to my mind:
>
> * Trust using PKI - registering an application/query-engine and use
> cryptographic signatures to validate that a request comes from "that
> specific trusted" application/engine
> * Trust using OAuth delegation
> * Trust using "network source" - if you know that only requests from your
> trusted applications/engines instances can come "from this IP" (overly
> simplified)
>
> A specific (HTTP POST/PUT/DELETE) request from a trusted source could then
> indicate the finer grained privilege, like "this is an UPDATE" - and since
> the REST service can trust it, it can also rely on the indicated privilege.
>
> However, there might also be different levels of trust... (just thinking
> how complex this could become). I think, this is a really huge topic. But
> interesting :)
>
> For the scope of the securable objects improvement, I think we could
> enhance the REST spec to pass the fine(r) grained privileges plus an
> optional, opaque "blob"/HTTP header/query parameter to the REST service.
> How implementations actually implement "trust" is then rather an
> "implementation detail".
>
>
> On 02.07.24 17:19, Jack Ye wrote:
>
> > For INSERT/UPDATE/DELETE/TRUNCATE - well, that is really tricky for the
> > reasons how writes happen in Iceberg.
>
> Yes. It seems like we are arriving at the conclusion that it is easy to
> have a simple verb for all data write operations; we can call it UPDATE,
> MODIFY or WRITE_DATA. The abilities to do very specific things (e.g.
> INSERT, DELETE) are technically sub-privileges, which are more difficult
> to define and enforce in Iceberg.
>
> For defining those sub-privileges, my take in the doc is that the verbs
> can be defined to check against more fundamental concepts, rather than just
> the SQL command:
> - INSERT is the privilege to add data to securable objects like tables.
> This includes SQL commands like INSERT, COPY, append-only streaming, etc.
> - DELETE is the privilege to remove data from securable objects like
> tables. This includes SQL commands like DELETE, TRUNCATE.
>
> For enforcing it, I imagine it would be easier to achieve through the
> fine-grained metadata commit, any other approach seems to be forgeable.
>
> > Eventually there's no way around "trust" between the engine and the
> > catalog. Establishing "trust" in a secure way is not that easy IMO.
>
> Yes. Glue uses a shared responsibility model, where an engine can go
> through an onboarding workflow:
> https://docs.aws.amazon.com/lake-formation/latest/dg/Integrating-with-LakeFormation.html,
> and after that point as long as the engine talks to the service using the
> specified authZ mechanism, it is considered trusted. It is intentionally
> not an easy process to onboard. I don't know how other catalog vendors do
> this, or have similar concepts.
>
> -Jack
>
>
> On Tue, Jul 2, 2024 at 4:00 AM Robert Stupp <sn...@snazy.de> wrote:
>
>> Just some thoughts about "SELECT vs DESCRIBE": If a catalog can
>> distinguish these privileges, it can opt to return the manifest list
>> pointer only, if the caller has the SELECT privilege.
>>
>> For INSERT/UPDATE/DELETE/TRUNCATE - well, that is really tricky for the
>> reasons how writes happen in Iceberg. Especially for DELETEs, which can be
>> a "delete files" + "write new files" or "just" appending delete-files
>> (merge on read). It becomes even trickier if the engine does not use SQL
>> but for example "raw" Spark operations. I've got no real idea how to map
>> those to a SQL oriented privilege model.
>>
>> Eventually there's no way around "trust" between the engine and the
>> catalog. Establishing "trust" in a secure way is not that easy IMO.
>>
>>
>> On 02.07.24 06:30, Jack Ye wrote:
>>
>> Thanks Dennis for the detailed analysis and suggestions! Here are a few
>> questions and comments I have:
>>
>> > Consider expanding the set of privilege definitions to be type-specific
>>
>> I like this! It seems like it solves the problem about inheritance and
>> future grants as you said. I will think a bit more about it, update the doc
>> accordingly, and see what others think.
>>
>> > we could introduce separate privileges TABLE_READ_DATA vs
>> > TABLE_READ_PROPERTIES
>>
>> In my definition in the doc, anything above table's data files is
>> considered metadata, and TABLE_DESCRIBE governs all the access. There could
>> be more fine-grained DESCRIBE that could be introduced, like
>> TABLE_DESCRIBE_PROPERTIES, TABLE_DESCRIBE_HISTORY,
>> TABLE_DESCRIBE_PARTITION. But once we get into that level, things might
>> start to overlap. What if the user has TABLE_DESCRIBE_MANIFEST, but not
>> TABLE_DESCRIBE_PARTITION? Do we show partial information about the manifest
>> and remove partition information? I don't have a good solution to that yet,
>> what do you think?
>>
>> > since "loadTable" is what the Catalog server sees, but then the engine
>> > could be satisfied with just the JSON metadata or might be intending to
>> > just crack open manifest files to select some aggregate statistics, or
>> > might be going all the way to Parquet files.
>>
>> My personal solution to this is to add a request context, which was
>> prototyped in https://github.com/apache/iceberg/pull/10359. With this,
>> an engine can describe the privileges needed when requesting table
>> metadata. The prerequisite is that the catalog trusts the information
>> passed by the engine through some authZ mechanism, and the engine uses the
>> defined privileges here in the context. For example, if the engine requests
>> table metadata for a DELETE, then the request will loadTable(table_name,
>> context={privilege=DELETE}). Would that be something feasible to solve the
>> concern?
>>
>> > mapping INSERT/DELETE/UPDATE all to TABLE_WRITE_DATA since at least for
>> > now, from the Catalog's perspective, any deletes require being able to
>> > write new manifests, and anything that can do inserts by writing new
>> > manifests can also effectively "delete" data in the newest snapshot.
>>
>> Yes, I agree the privileges to insert, delete and update seem redundant
>> given the writer can commit whatever manifest list eventually. I think some
>> systems have a similar concept of just a MODIFY privilege.
>>
>> But what if it is used under the fine-grained metadata commit proposal? (
>> https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit)
>> Then in that case an insert would result in a different action type in
>> UpdateTable compared to update and delete. It seems like we should try to
>> reach a consensus on the general direction of this proposal first.
>>
>> -Jack
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 28, 2024 at 8:53 PM Dennis Huo <huoi...@gmail.com> wrote:
>>
>>> +1, Thanks Jack and team for getting the discussion started with this
>>> proposal!
>>>
>>> Much of this is well aligned with what we noticed when implementing RBAC
>>> for Polaris Catalog, namely that even if a more complicated User/Role
>>> structure exists outside of the catalog, that it's necessary to be able to
>>> express some common building blocks around "grantee" roles/principals and
>>> scoping/definitions of grants/privileges to make RBAC enforcement work well
>>> and be more standardized across engines.
>>>
>>> Your suggestions about initially trying to avoid known problems with
>>> things like "OWNER" privileges and problems depending on the "grantor" in
>>> grant records definitely seem like good ideas.
>>>
>>> One thing that came up when trying to distill catalog-enforceable
>>> privileges in Polaris was that by the nature of Iceberg's metadata model,
>>> traditional SQL-style privileges ran into rough edges when it came to
>>> distinguishing e.g. SELECT vs DESCRIBE, or UPDATE vs INSERT vs DELETE,
>>> since "loadTable" is what the Catalog server sees, but then the engine
>>> could be satisfied with just the JSON metadata or might be intending to
>>> just crack open manifest files to select some aggregate statistics, or
>>> might be going all the way to Parquet files.
>>>
>>> One way to address this is if we're willing to make privilege
>>> definitions more closely reflect the implementation semantics, e.g. mapping
>>> INSERT/DELETE/UPDATE all to TABLE_WRITE_DATA since at least for now, from
>>> the Catalog's perspective, any deletes require being able to write new
>>> manifests, and anything that can do inserts by writing new manifests can
>>> also effectively "delete" data in the newest snapshot.
>>>
>>> It also seems like there's a relationship between having more
>>> type-specific privileges, the ability to have unambiguous hierarchical
>>> grants (e.g. granting TABLE_READ_DATA on a namespace to inherit the
>>> privilege in all child tables), and also having a way to express
>>> storage-credential-vending privileges under the same model.
>>>
>>> A few suggestions relating to this:
>>>
>>>    - Consider expanding the set of privilege definitions to be
>>>    type-specific (beyond inferring the type-privilege from the object
>>>    on which a privilege is granted). Maybe there should still be a
>>>    common convention for all the "pure CRUDL" operations, but then
>>>    types might have some additional type-specific privileges too
>>>       - Example: NAMESPACE_CREATE, NAMESPACE_READ_PROPERTIES,
>>>       NAMESPACE_WRITE_PROPERTIES, NAMESPACE_DROP, NAMESPACE_LIST
>>>    - Allow/define a convention for inheriting grants in the securable
>>>    object hierarchy -- though it makes sense to also allow for
>>>    non-inheritance if an implementation wants to keep the model simple,
>>>    if we do have type-specific privileges, it at least mitigates one of
>>>    the listed concerns about accidental privileges.
>>>       - For example, if the privilege is only DESCRIBE, then granting
>>>       DESCRIBE on a namespace isn't clear whether it should also confer
>>>       DESCRIBE on tables/views underneath it. But we could say
>>>       NAMESPACE_READ_PROPERTIES on a namespace doesn't mean any kind of
>>>       TABLE/VIEW privileges, while TABLE_READ_PROPERTIES granted on a
>>>       namespace would more clearly mean to inherit the ability to read
>>>       table properties underneath that namespace.
>>>       - Hierarchical grants probably also address some of the same use
>>>       cases that people might otherwise address with FUTURE GRANTS, and
>>>       for some scenarios FUTURE GRANTS might be the more complex or
>>>       error-prone alternative
>>>    - To handle the concept of Catalog-based storage-credential vending,
>>>    we could introduce separate privileges TABLE_READ_DATA vs
>>>    TABLE_READ_PROPERTIES and the mutate counterparts TABLE_WRITE_DATA vs
>>>    TABLE_WRITE_PROPERTIES. Implementation-wise it could just mean
>>>    TABLE_READ_DATA/TABLE_WRITE_DATA enable receiving appropriately-scoped
>>>    storage credentials (e.g. read-only subscoped session token for
>>>    TABLE_READ_DATA) in things like loadTable and
>>>    createTable(stage-create=true). Whereas
>>>    TABLE_READ_PROPERTIES/TABLE_WRITE_PROPERTIES would only enable
>>>    whatever the REST Catalog server is able to handle directly in the
>>>    REST request/response.
>>>
>>> Would love to hear anyone's thoughts on these areas.
>>>
>>> Cheers,
>>> Dennis Huo
>>>
>>>
>>> On 2024/06/08 19:12:10 Walaa Eldin Moustafa wrote:
>>> > Thanks Jack and team for working on this proposal. I went over it and
>>> > it is very well written. I particularly like:
>>> >
>>> > (1) The fact that it is adopting the SQL standard and adjusting some
>>> > of its semantics to fit the Iceberg model.
>>> >
>>> > (2) It includes views from v1. Views are a very important tool for
>>> > policy enforcement. We have built a dynamic privacy and compliance
>>> > enforcement catalog extension at LinkedIn using views [1], and one of
>>> > the main improvements to that catalog extension would be securable
>>> > view objects. Admittedly, it might require further improvements to
>>> > compute engines to implement the permissions, but having an Iceberg
>>> > spec would be the first step.
>>> >
>>> > Looking forward to the next steps of the proposal discussion and
>>> > adoption.
>>> >
>>> > [1]
>>> > https://www.slideshare.net/slideshow/viewshift-hassle-free-dynamic-policy-enforcement-for-every-data-lake/269577447
>>> >
>>> > Thanks,
>>> > Walaa.
>>> >
>>> >
>>> > On Thu, May 30, 2024 at 10:35 PM Jack Ye <ye...@gmail.com> wrote:
>>> >
>>> > > Hi everyone,
>>> > >
>>> > > Me and a few colleagues at AWS would like to discuss a new proposal
>>> > > for supporting securable objects in the Iceberg REST catalog spec.
>>> > >
>>> > > Here is our proposal in Google doc:
>>> > > https://docs.google.com/document/d/1KmIDbPuN6IYF0nWs9ostXIB9F4b8iH3zZO0hjgs1lm4/edit
>>> > >
>>> > > And here is the corresponding GitHub issue:
>>> > > https://github.com/apache/iceberg/issues/10407
>>> > >
>>> > > I will also paste the intro here for an overview. There are 2 main
>>> > > reasons for us to look into this area and draft this proposal:
>>> > >
>>> > > *IRC lacks clear guidelines on access management requirements:*
>>> > >
>>> > > This is feedback we heard frequently when interviewing AWS customers
>>> > > using Iceberg and considering building an IRC. Today Iceberg objects
>>> > > (namespaces, tables, views) are not securable within the Iceberg
>>> > > catalog itself, and need to be secured using an auxiliary system.
>>> > > This means that an organization building an IRC service needs to wrap
>>> > > many important operations into custom-built APIs for downstream users
>>> > > to consume (e.g. an API to grant Iceberg table access on S3 needs to
>>> > > grant corresponding IAM users/roles the right S3 policy or ACL
>>> > > setting). A huge amount of effort needs to be spent to figure out
>>> > > which APIs are missing in IRC to satisfy enterprise-level data
>>> > > warehouse access management requirements.
>>> > >
>>> > > There are some IRC products that offer vendor-specific APIs outside
>>> > > IRC to perform those operations, but this means that users are
>>> > > locked in to this vendor’s securable object management system when
>>> > > using the IRC solution, and do not have the true freedom to easily
>>> > > switch to another solution if it offers better price-performance.
>>> > >
>>> > > We understand that Iceberg is not a security product, and it is not
>>> > > in the best interest of the community to dive too deep into
>>> > > security-related domains. However, we believe that *we should at
>>> > > least offer the right interfaces and set the right standards for how
>>> > > Iceberg catalog expresses securable objects and how Iceberg catalog
>>> > > users interact with those objects*, such that (1) users that would
>>> > > like to build IRC can have a clear guideline of what API contract to
>>> > > implement for managing access to objects in IRC, and (2) users that
>>> > > are on one IRC product do not need to be locked in due to access
>>> > > management aspects.
>>> > >
>>> > > Would really appreciate any feedback on this topic and proposal!
>>> > >
>>> > > Best,
>>> > > Jack Ye
>>> > >
>>> >
>>>
>> --
>> Robert Stupp
>> @snazy
>>
>> --
> Robert Stupp
> @snazy
>
>
