Re: Support Securable Objects in Iceberg REST Catalog

Robert Stupp Tue, 02 Jul 2024 11:24:04 -0700

Oh - I'm not against having the fine(r) grained privileges per se. Just saying that it's at best quite complicated to enforce those "properly".

The "trust" model probably deserves a separate (but related) discussion. There are potentially different "kinds" of how one can implement trust. Some things that come to my mind:

* Trust using PKI - registering an application/query-engine and using cryptographic signatures to validate that a request comes from "that specific trusted" application/engine

* Trust using OAuth delegation

* Trust using "network source" - if you know that only requests from your trusted application/engine instances can come "from this IP" (overly simplified)

A specific (HTTP POST/PUT/DELETE) request from a trusted source could then indicate the finer grained privilege, like "this is an UPDATE" - and since the REST service can trust it, it can also rely on the indicated privilege.

However, there might also be different levels of trust... (just thinking how complex this could become). I think, this is a really huge topic. But interesting :)

For the scope of the securable objects improvement, I think we could enhance the REST spec to pass the fine(r) grained privileges plus an optional, opaque "blob"/HTTP header/query parameter to the REST service. How implementations actually implement "trust" is then rather an "implementation detail".
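
A rough sketch of that idea, not anything from the REST spec: the request carries a declared privilege plus an opaque blob, and the catalog hands the trust decision to a pluggable verifier. All names here (`TrustVerifier`, `SharedSecretVerifier`, `authorize`) are hypothetical, and the HMAC-over-privilege token is only one simplified stand-in for the PKI, OAuth or network-based options listed above.

```python
import hashlib
import hmac
from typing import Optional, Protocol, Set


class TrustVerifier(Protocol):
    """Hypothetical plug-in point: how trust is established is left to the implementation."""

    def is_trusted(self, declared_privilege: str, opaque_blob: Optional[bytes]) -> bool: ...


class SharedSecretVerifier:
    """One possible verifier (an assumption, not a spec): the blob is an
    HMAC-SHA256 over the declared privilege, keyed with a secret that the
    engine obtained when it was registered as trusted."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret

    def sign(self, declared_privilege: str) -> bytes:
        return hmac.new(self._secret, declared_privilege.encode(), hashlib.sha256).digest()

    def is_trusted(self, declared_privilege: str, opaque_blob: Optional[bytes]) -> bool:
        if opaque_blob is None:
            return False
        # Constant-time comparison of the expected and the presented token.
        return hmac.compare_digest(self.sign(declared_privilege), opaque_blob)


def authorize(granted: Set[str], declared_privilege: str,
              opaque_blob: Optional[bytes], verifier: TrustVerifier) -> bool:
    """Rely on the declared privilege only if the source is trusted
    and the caller actually holds that privilege."""
    if not verifier.is_trusted(declared_privilege, opaque_blob):
        return False
    return declared_privilege in granted


verifier = SharedSecretVerifier(b"example-secret")
blob = verifier.sign("UPDATE")
print(authorize({"UPDATE"}, "UPDATE", blob, verifier))  # trusted source, privilege held
print(authorize({"UPDATE"}, "UPDATE", None, verifier))  # no blob, so not trusted
```

The catalog-side `authorize` never looks inside the blob, which is what keeps the trust mechanism an implementation detail.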



On 02.07.24 17:19, Jack Ye wrote:

> For INSERT/UPDATE/DELETE/TRUNCATE - well, that is really tricky for the reasons how writes happen in Iceberg.

Yes. It seems like we are arriving at the conclusion that it is easy to have a simple verb for all data write operations, we can call it UPDATE or MODIFY or WRITE_DATA. The ability to do very specific things (e.g. INSERT, DELETE) are technically sub-privileges, these are more difficult to define and enforce in Iceberg.

For defining those sub-privileges, my take in the doc is that the verbs can be defined to check against more fundamental concepts, rather than just the SQL command:
- INSERT is the privilege to add data to securable objects like tables. This includes SQL commands like INSERT, COPY, append-only streaming, etc.
- DELETE is the privilege to remove data from securable objects like tables. This includes SQL commands like DELETE, TRUNCATE.
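
To make the "verbs check against concepts" idea concrete, here is a minimal sketch. The enum, the operation table and `is_allowed` are illustrative assumptions, not definitions from the proposal doc. Operations are classified by what they do to the data, and the coarse write privilege implies the sub-privileges.

```python
from enum import Enum
from typing import Set


class Privilege(Enum):
    WRITE_DATA = "WRITE_DATA"  # coarse verb covering all data writes
    INSERT = "INSERT"          # add data to a securable object
    DELETE = "DELETE"          # remove data from a securable object


# Hypothetical classification of engine operations by their effect on the data,
# rather than by the SQL keyword alone.
OPERATION_TO_PRIVILEGE = {
    "INSERT": Privilege.INSERT,
    "COPY": Privilege.INSERT,
    "STREAMING_APPEND": Privilege.INSERT,
    "DELETE": Privilege.DELETE,
    "TRUNCATE": Privilege.DELETE,
}


def is_allowed(operation: str, granted: Set[Privilege]) -> bool:
    """WRITE_DATA implies every sub-privilege; otherwise the specific one is needed.
    Operations missing from the table (e.g. UPDATE) require the coarse privilege."""
    if Privilege.WRITE_DATA in granted:
        return True
    required = OPERATION_TO_PRIVILEGE.get(operation.upper())
    return required is not None and required in granted


print(is_allowed("COPY", {Privilege.INSERT}))      # adding data with INSERT only
print(is_allowed("TRUNCATE", {Privilege.INSERT}))  # removing data with INSERT only
```

This only covers the definition side. Whether the catalog can verify that a commit really matches the claimed operation is the separate enforcement question.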

For enforcing it, I imagine it would be easier to achieve through the fine-grained metadata commit, any other approach seems to be forgeable.

> Eventually there's no way around "trust" between the engine and the catalog. Establishing "trust" in a secure way is not that easy IMO.

Yes. Glue uses a shared responsibility model, where an engine can go through an onboarding workflow: https://docs.aws.amazon.com/lake-formation/latest/dg/Integrating-with-LakeFormation.html, and after that point as long as the engine talks to the service using the specified authZ mechanism, it is considered trusted. It is intentionally not an easy process to onboard. I don't know how other catalog vendors do this, or have similar concepts.


-Jack


On Tue, Jul 2, 2024 at 4:00 AM Robert Stupp <sn...@snazy.de> wrote:

    Just some thoughts about "SELECT vs DESCRIBE": If a catalog can
    distinguish these privileges, it can opt to return the manifest
    list pointer only, if the caller has the SELECT privilege.
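
A minimal sketch of that SELECT vs DESCRIBE distinction, assuming the catalog already knows the caller's privileges. The function name and the privilege strings are hypothetical. The `snapshots` and `manifest-list` keys follow the Iceberg table metadata JSON, heavily simplified.

```python
from typing import Any, Dict, Set


def filter_load_table_response(metadata: Dict[str, Any], granted: Set[str]) -> Dict[str, Any]:
    """Return table metadata, dropping each snapshot's manifest-list
    pointer unless the caller holds SELECT."""
    if "DESCRIBE" not in granted and "SELECT" not in granted:
        raise PermissionError("caller may not load this table")

    if "SELECT" in granted:
        return metadata

    # DESCRIBE only: keep schema/properties/history, hide the path to the data.
    redacted = dict(metadata)
    redacted["snapshots"] = [
        {key: value for key, value in snapshot.items() if key != "manifest-list"}
        for snapshot in metadata.get("snapshots", [])
    ]
    return redacted


table_metadata = {
    "table-uuid": "example-uuid",
    "snapshots": [{"snapshot-id": 1, "manifest-list": "s3://bucket/metadata/snap-1.avro"}],
}
print(filter_load_table_response(table_metadata, {"DESCRIBE"}))
```

Hiding the pointer only helps if the caller cannot list or read the storage location directly, so this has to go together with credential scoping.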

    For INSERT/UPDATE/DELETE/TRUNCATE - well, that is really tricky
    for the reasons how writes happen in Iceberg. Especially for
    DELETEs, which can be a "delete files" + "write new files" or
    "just" appending delete-files (merge on read). It becomes even
    trickier if the engine does not use SQL but for example "raw"
    Spark operations. I've got no real idea how to map those to a SQL
    oriented privilege model.

    Eventually there's no way around "trust" between the engine and
    the catalog. Establishing "trust" in a secure way is not that easy
    IMO.


    On 02.07.24 06:30, Jack Ye wrote:

    Thanks Dennis for the detailed analysis and suggestions! Here are
    a few questions and comments I have:

    > Consider expanding the set of privilege definitions to be
    type-specific

    I like this! It seems like it solves the problem about
    inheritance and future grants as you said. I will think a bit
    more about it, update the doc accordingly, and see what others think.

    > we could introduce separate privileges TABLE_READ_DATA vs
    TABLE_READ_PROPERTIES

    In my definition in the doc, anything above table's data files is
    considered metadata, and TABLE_DESCRIBE governs all the access.
    There could be more fine-grained DESCRIBE that could be
    introduced, like TABLE_DESCRIBE_PROPERTIES,
    TABLE_DESCRIBE_HISTORY, TABLE_DESCRIBE_PARTITION. But once we get
    into that level, things might start to overlap. What if the user
    has TABLE_DESCRIBE_MANIFEST, but not TABLE_DESCRIBE_PARTITION? Do
    we show partial information about the manifest and remove
    partition information? I don't have a good solution to that yet,
    what do you think?

    > since "loadTable" is what the Catalog server sees, but then the
    engine could be satisfied with just the JSON metadata or might be
    intending to just crack open manifest files to select some
    aggregate statistics, or might be going all the way to Parquet files.

    My personal solution to this is to add a request context, which
    was prototyped in https://github.com/apache/iceberg/pull/10359.
    With this, an engine can describe the privileges needed when
    requesting table metadata. The prerequisite is that the catalog
    trusts the information passed by the engine through some authZ
    mechanism, and the engine uses the defined privileges here in the
    context. For example, if the engine requests table metadata for a
    DELETE, then the request will be loadTable(table_name,
    context={privilege=DELETE}). Would that be something feasible to
    solve the concern?
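
For illustration only, a toy version of such a context-carrying loadTable call. `RequestContext`, `Catalog` and the in-memory grants are made-up names for this sketch and are not the API prototyped in the pull request.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RequestContext:
    """Hypothetical context: the privilege the engine declares it needs."""
    privilege: str


@dataclass
class Catalog:
    # table name -> privileges granted to the current caller (illustrative only)
    grants: Dict[str, Set[str]] = field(default_factory=dict)
    # table name -> table metadata
    tables: Dict[str, dict] = field(default_factory=dict)

    def load_table(self, table_name: str, context: RequestContext) -> dict:
        """Check the declared privilege before returning table metadata."""
        if context.privilege not in self.grants.get(table_name, set()):
            raise PermissionError(f"{context.privilege} not granted on {table_name}")
        return self.tables[table_name]


catalog = Catalog(
    grants={"db.events": {"SELECT"}},
    tables={"db.events": {"format-version": 2}},
)
print(catalog.load_table("db.events", RequestContext(privilege="SELECT")))
try:
    catalog.load_table("db.events", RequestContext(privilege="DELETE"))
except PermissionError as error:
    print("rejected:", error)
```

The check is only as good as the declared privilege, which is why the prerequisite about trusting the engine matters.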

    > mapping INSERT/DELETE/UPDATE all to TABLE_WRITE_DATA since at
    least for now, from the Catalog's perspective, any deletes
    require being able to write new manifests, and anything that can
    do inserts by writing new manifests can also effectively "delete"
    data in the newest snapshot.

    Yes, I agree the privileges to insert, delete and update seem
    redundant, given that the writer can eventually commit whatever
    manifest list it wants. I think some systems have a similar
    concept of just a MODIFY privilege.

    But what if it is used under the fine-grained metadata commit
    proposal?
    (https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit)
    Then in that case an insert would result in a different action
    type in UpdateTable compared to update and delete. It seems like
    we should try to reach a consensus on the general direction of
    this proposal first.

    -Jack






    On Fri, Jun 28, 2024 at 8:53 PM Dennis Huo <huoi...@gmail.com> wrote:

        +1, Thanks Jack and team for getting the discussion started
        with this proposal!

        Much of this is well aligned with what we noticed when
        implementing RBAC for Polaris Catalog, namely that even if a
        more complicated User/Role structure exists outside of the
        catalog, that it's necessary to be able to express some
        common building blocks around "grantee" roles/principals and
        scoping/definitions of grants/privileges to make RBAC
        enforcement work well and be more standardized across engines.

        Your suggestions about initially trying to avoid known
        problems with things like "OWNER" privileges and problems
        depending on the "grantor" in grant records definitely seem
        like good ideas.

        One thing that came up when trying to distill
        catalog-enforceable privileges in Polaris was that by the
        nature of Iceberg's metadata model, traditional SQL-style
        privileges ran into rough edges when it came to
        distinguishing e.g. SELECT vs DESCRIBE, or UPDATE vs INSERT
        vs DELETE, since "loadTable" is what the Catalog server sees,
        but then the engine could be satisfied with just the JSON
        metadata or might be intending to just crack open manifest
        files to select some aggregate statistics, or might be going
        all the way to Parquet files.

        One way to address this is if we're willing to make privilege
        definitions more closely reflect the implementation
        semantics, e.g. mapping INSERT/DELETE/UPDATE all to
        TABLE_WRITE_DATA since at least for now, from the Catalog's
        perspective, any deletes require being able to write new
        manifests, and anything that can do inserts by writing new
        manifests can also effectively "delete" data in the newest
        snapshot.

        It also seems like there's a relationship between having more
        type-specific privileges, the ability to have unambiguous
        hierarchical grants (e.g. granting TABLE_READ_DATA on a
        namespace to inherit the privilege in all child tables), and
        also having a way to express storage-credential-vending
        privileges under the same model.

        A few suggestions relating to this:

          * Consider expanding the set of privilege definitions to be
            type-specific (beyond inferring the type-privilege from
            the object on which a privilege is granted). Maybe there
            should still be a common convention for all the "pure
            CRUDL" operations, but then types might have some
            additional type-specific privileges too
              o Example: NAMESPACE_CREATE, NAMESPACE_READ_PROPERTIES,
                NAMESPACE_WRITE_PROPERTIES, NAMESPACE_DROP, 
                NAMESPACE_LIST
          * Allow/define a convention for inheriting grants in the
            securable object hierarchy -- though it makes sense to
            also allow for non-inheritance if an implementation wants
            to keep the model simple, if we do have type-specific
            privileges, it at least mitigates one of the listed
            concerns about accidental privileges.
              o For example, if the privilege is only DESCRIBE, then
                granting DESCRIBE on a namespace isn't clear whether
                it should also confer DESCRIBE on tables/views
                underneath it. But we could say
                NAMESPACE_READ_PROPERTIES on a namespace doesn't mean
                any kind of TABLE/VIEW privileges, while
                TABLE_READ_PROPERTIES granted on a namespace would
                more clearly mean to inherit the ability to read
                table properties underneath that namespace.
              o Hierarchical grants probably also address some of the
                same use cases that people might otherwise address
                with FUTURE GRANTS, and for some scenarios FUTURE
                GRANTS might be the more complex or error-prone
                alternative
          * To handle the concept of Catalog-based storage-credential
            vending, we could introduce separate privileges
            TABLE_READ_DATA vs TABLE_READ_PROPERTIES and the mutate
            counterparts TABLE_WRITE_DATA vs TABLE_WRITE_PROPERTIES.
            Implementation-wise it could just mean
            TABLE_READ_DATA/TABLE_WRITE_DATA enable receiving
            appropriately-scoped storage credentials (e.g. read-only
            subscoped session token for TABLE_READ_DATA) in things
            like loadTable and createTable(stage-create=true).
            Whereas TABLE_READ_PROPERTIES/TABLE_WRITE_PROPERTIES
            would only enable whatever the REST Catalog server is
            able to handle directly in the REST request/response.
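
The hierarchical-grant and credential-vending suggestions above can be sketched together. This is a simplified model under stated assumptions: grants are keyed by object path, inheritance walks up the path, and the scope names `read-write` and `read-only` are placeholders rather than anything Polaris or the REST spec defines.

```python
from typing import Dict, Optional, Set, Tuple

# Grants keyed by securable object path: a namespace is ("ns",), a table is ("ns", "tbl").
Grants = Dict[Tuple[str, ...], Set[str]]


def has_privilege(grants: Grants, path: Tuple[str, ...], privilege: str) -> bool:
    """A privilege holds if it is granted on the object itself or on any ancestor."""
    for depth in range(len(path), 0, -1):
        if privilege in grants.get(path[:depth], set()):
            return True
    return False


def credential_scope(grants: Grants, table: Tuple[str, ...]) -> Optional[str]:
    """Pick the storage-credential scope to vend alongside loadTable, if any."""
    if has_privilege(grants, table, "TABLE_WRITE_DATA"):
        return "read-write"
    if has_privilege(grants, table, "TABLE_READ_DATA"):
        return "read-only"
    return None  # *_PROPERTIES privileges alone never vend storage credentials


grants: Grants = {
    ("sales",): {"TABLE_READ_DATA", "NAMESPACE_READ_PROPERTIES"},
    ("sales", "orders"): {"TABLE_WRITE_PROPERTIES"},
}
print(credential_scope(grants, ("sales", "orders")))  # inherited from the namespace grant
```

Because a table check only ever asks for TABLE_* privileges, NAMESPACE_READ_PROPERTIES on the namespace confers nothing on the tables underneath it, which is the disambiguation the type-specific names are meant to provide.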

        Would love to hear anyone's thoughts on these areas.

        Cheers,
        Dennis Huo


        On 2024/06/08 19:12:10 Walaa Eldin Moustafa wrote:
        > Thanks Jack and team for working on this proposal. I went over it and
        > it is very well written. I particularly like:
        >
        > (1) The fact that it is adopting the SQL standard and adjusting some
        > of its semantics to fit the Iceberg model.
        >
        > (2) It includes views from v1. Views are a very important tool for
        > policy enforcement. We have built a dynamic privacy and compliance
        > enforcement catalog extension at LinkedIn using views [1], and one of
        > the main improvements to that catalog extension would be securable
        > view objects. Admittedly, it might require further improvements to
        > compute engines to implement the permissions, but having an Iceberg
        > spec would be the first step.
        >
        > Looking forward to the next steps of the proposal discussion and
        > adoption.
        >
        > [1] https://www.slideshare.net/slideshow/viewshift-hassle-free-dynamic-policy-enforcement-for-every-data-lake/269577447
        >
        > Thanks,
        > Walaa.
        >
        >
        > On Thu, May 30, 2024 at 10:35 PM Jack Ye <ye...@gmail.com> wrote:
        >
        > > Hi everyone,
        > >
        > > Me and a few colleagues at AWS would like to discuss a new proposal
        > > for supporting securable objects in the Iceberg REST catalog spec.
        > >
        > > Here is our proposal in Google doc:
        > > https://docs.google.com/document/d/1KmIDbPuN6IYF0nWs9ostXIB9F4b8iH3zZO0hjgs1lm4/edit
        > >
        > > And here is the corresponding GitHub issue:
        > > https://github.com/apache/iceberg/issues/10407
        > >
        > > I will also paste the intro here for an overview. There are 2 main
        > > reasons for us to look into this area and draft this proposal:
        > >
        > > *IRC lacks clear guidelines on access management requirements:*
        > >
        > > This is feedback we heard frequently when interviewing AWS customers
        > > using Iceberg and considering building an IRC. Today Iceberg objects
        > > (namespaces, tables, views) are not securable within the Iceberg
        > > catalog itself, and need to be secured using an auxiliary system.
        > > This means that an organization building an IRC service needs to
        > > wrap many important operations into custom-built APIs for downstream
        > > users to consume (e.g. an API to grant Iceberg table access on S3
        > > needs to grant corresponding IAM users/roles the right S3 policy or
        > > ACL setting). Huge amount of effort needs to be spent to figure out
        > > what are the missing APIs in IRC to satisfy enterprise level data
        > > warehouse access management requirements.
        > >
        > > There are some IRC products that offer vendor-specific APIs outside
        > > IRC to perform those operations, but this means that users are
        > > locked-in to this vendor’s securable object management system when
        > > using the IRC solution, and do not have the true freedom to easily
        > > switch to another solution if it offers better price-performance.
        > >
        > > We understand that Iceberg is not a security product, and it is not
        > > in the best interest of the community to dive too deep into
        > > security-related domains. However, we believe that *we should at
        > > least offer the right interfaces and set the right standards for how
        > > Iceberg catalog expresses securable objects and how Iceberg catalog
        > > users interact with those objects*, such that (1) users that would
        > > like to build IRC can have a clear guideline of what API contract to
        > > implement for managing access to objects in IRC, and (2) users that
        > > are on one IRC product do not need to be locked-in due to access
        > > management aspects.
        > >
        > > Would really appreciate any feedback on this topic and proposal!
        > >
        > > Best,
        > > Jack Ye
        > >
        >

--
Robert Stupp
@snazy
