Thanks Dennis for the detailed analysis and suggestions! Here are
a few questions and comments I have:
> Consider expanding the set of privilege definitions to be
type-specific
I like this! It seems like it solves the problem about
inheritance and future grants as you said. I will think a bit
more about it, update the doc accordingly, and see what others think.
> we could introduce separate privileges TABLE_READ_DATA vs
TABLE_READ_PROPERTIES
In my definition in the doc, anything above table's data files is
considered metadata, and TABLE_DESCRIBE governs all the access.
There could be more fine-grained DESCRIBE that could be
introduced, like TABLE_DESCRIBE_PROPERTIES,
TABLE_DESCRIBE_HISTORY, TABLE_DESCRIBE_PARTITION. But once we get
into that level, things might start to overlap. What if the user
has TABLE_DESCRIBE_MANIFEST, but not TABLE_DESCRIBE_PARTITION? Do
we show partial information about the manifest and remove
partition information? I don't have a good solution to that yet,
what do you think?
> since "loadTable" is what the Catalog server sees, but then the
engine could be satisfied with just the JSON metadata or might be
intending to just crack open manifest files to select some
aggregate statistics, or might be going all the way to Parquet files.
My personal solution to this is to add a request context, which
was prototyped in https://github.com/apache/iceberg/pull/10359.
With this, an engine can describe the privileges needed when
requesting table metadata. The prerequisite is that the catalog
trusts the information passed by the engine through some authZ
mechanism, and the engine uses the defined privileges here in the
context. For example, if the engine requests table metadata for a
DELETE, then the request will loadTable(table_name,
context={privilege=DELETE}). Would that be something feasible to
solve the concern?
> mapping INSERT/DELETE/UPDATE all to TABLE_WRITE_DATA since at
least for now, from the Catalog's perspective, any deletes
require being able to write new manifests, and anything that can
do inserts by writing new manifests can also effectively "delete"
data in the newest snapshot.
Yes I agree the privileges to insert, delete and update seems
redundant given the writer can commit whatever manifest list
eventually. I think some systems have a similar concept of just
MODIFY privilege.
But what if it is used under the fine-grained metadata commit
proposal?
(https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit)
Then in that case an insert would result in a different action
type in UpdateTable compared to update and delete. It seems like
we should try to reach a consensus on the general direction of
this proposal first.
-Jack
On Fri, Jun 28, 2024 at 8:53 PM Dennis Huo <huoi...@gmail.com> wrote:
+1, Thanks Jack and team for getting the discussion started
with this proposal!
Much of this is well aligned with what we noticed when
implementing RBAC for Polaris Catalog, namely that even if a
more complicated User/Role structure exists outside of the
catalog, that it's necessary to be able to express some
common building blocks around "grantee" roles/principals and
scoping/definitions of grants/privileges to make RBAC
enforcement work well and be more standardized across engines.
Your suggestions about initially trying to avoid known
problems with things like "OWNER" privileges and problems
depending on the "grantor" in grant records definitely seem
like good ideas.
One thing that came up when trying to distill
catalog-enforceable privileges in Polaris was that by the
nature of Iceberg's metadata model, traditional SQL-style
privileges ran into rough edges when it came to
distinguishing e.g. SELECT vs DESCRIBE, or UPDATE vs INSERT
vs DELETE, since "loadTable" is what the Catalog server sees,
but then the engine could be satisfied with just the JSON
metadata or might be intending to just crack open manifest
files to select some aggregate statistics, or might be going
all the way to Parquet files.
One way to address this is if we're willing to make privilege
definitions more closely reflect the implementation
semantics, e.g. mapping INSERT/DELETE/UPDATE all to
TABLE_WRITE_DATA since at least for now, from the Catalog's
perspective, any deletes require being able to write new
manifests, and anything that can do inserts by writing new
manifests can also effectively "delete" data in the newest
snapshot.
It also seems like there's a relationship between having more
type-specific privileges, the ability to have unambiguous
hierarchical grants (e.g. granting TABLE_READ_DATA on a
namespace to inherit the privilege in all child tables), and
also having a way to express storage-credential-vending
privileges under the same model.
A few suggestions relating to this:
* Consider expanding the set of privilege definitions to be
type-specific (beyond inferring the type-privilege from
the object on which a privilege is granted). Maybe there
should still be a common convention for all the "pure
CRUDL" operations, but then types might have some
additional type-specific privileges too
o Example: NAMESPACE_CREATE, NAMESPACE_READ_PROPERTIES,
NAMESPACE_WRITE_PROPERTIES, NAMESPACE_DROP,
NAMESPACE_LIST
* Allow/define a convention for inheriting grants in the
securable object hierarchy -- though it makes sense to
also allow for non-inheritance if an implementation wants
to keep the model simple, if we do have type-specific
privileges, it at least mitigates one of the listed
concerns about accidental privileges.
o For example, if the privilege is only DESCRIBE, then
granting DESCRIBE on a namespace isn't clear whether
it should also confer DESCRIBE on tables/views
underneath it. But we could say
NAMESPACE_READ_PROPERTIES on a namespace doesn't mean
any kind of TABLE/VIEW privileges, while
TABLE_READ_PROPERTIES granted on a namespace would
more clearly mean to inherit the ability to read
table properties underneath that namespace.
o Hierarchical grants probably also address some of the
same use cases that people might otherwise address
with FUTURE GRANTS, and for some scenarios FUTURE
GRANTS might be the more complex or error-prone
alternative
* To handle the concept of Catalog-based storage-credential
vending, we could introduce separate privileges
TABLE_READ_DATA vs TABLE_READ_PROPERTIES and the mutate
counter parts TABLE_WRITE_DATA vs TABLE_WRITE_PROPERTIES.
Implementation-wise it could just mean
TABLE_READ_DATA/TABLE_WRITE_DATA enable receiving
appropriately-scoped storage credentials (e.g. read-only
subscoped session token for TABLE_READ_DATA) in things
like loadTable and createTable(stage-create=true).
Whereas TABLE_READ_PROPERTIES/TABLE_WRITE_PROPERTIES
would only enable whatever the REST Catalog server is
able to handle directly in the REST request/response.
Would love to hear anyone's thoughts on these areas.
Cheers,
Dennis Huo
On 2024/06/08 19:12:10 Walaa Eldin Moustafa wrote:
> Thanks Jack and team for working on this proposal. I went
over it and it is
> very well written. I particularly like:
>
> (1) The fact that it is adopting the SQL standard and
adjusting some of its
> semantics to fit the Iceberg model.
>
> (2) It includes views from v1. Views are a very important
tool for policy
> enforcement. We have built a dynamic privacy and compliance
enforcement
> catalog extension at LinkedIn using views [1], and one of
the main
> improvements to that catalog extension would be securable
view objects.
> Admittedly, it might require further improvements to
compute engines to
> implement the permissions, but having an Iceberg spec would
be the first
> step.
>
> Looking forward to the next steps of the proposal
discussion and adoption.
>
> [1]
>
https://www.slideshare.net/slideshow/viewshift-hassle-free-dynamic-policy-enforcement-for-every-data-lake/269577447
>
> Thanks,
> Walaa.
>
>
> On Thu, May 30, 2024 at 10:35 PM Jack Ye <ye...@gmail.com>
wrote:
>
> > Hi everyone,
> >
> > Me and a few colleagues at AWS would like to discuss a
new proposal for
> > supporting securable objects in the Iceberg REST catalog
spec.
> >
> > Here is our proposal in Google doc:
> >
https://docs.google.com/document/d/1KmIDbPuN6IYF0nWs9ostXIB9F4b8iH3zZO0hjgs1lm4/edit
> >
> > And here is the corresponding GitHub issue:
> > https://github.com/apache/iceberg/issues/10407
> >
> > I will also paste the intro here for an overview. There
are 2 main reasons
> > for us to look into this area and draft this proposal:
> >
> > *IRC lacks clear guidelines on access management
requirements:*
> >
> > This is feedback we heard frequently when interviewing
AWS customers using
> > Iceberg and considering building an IRC. Today Iceberg
objects (namespaces,
> > tables, views) are not securable within the Iceberg
catalog itself, and
> > need to be secured using an auxiliary system. This means
that an
> > organization building an IRC service needs to wrap many
important
> > operations into custom-built APIs for downstream users to
consume (e.g. an
> > API to grant Iceberg table access on S3 needs to grant
corresponding IAM
> > users/roles the right S3 policy or ACL setting). Huge
amount of effort
> > needs to be spent to figure out what are the missing APIs
in IRC to satisfy
> > enterprise level data warehouse access management
requirements.
> >
> > There are some IRC products that offer vendor-specific
APIs outside IRC to
> > perform those operations, but this means that users are
locked-in to this
> > vendor’s securable object management system when using
the IRC solution,
> > and do not have the true freedom to easily switch to
another solution if it
> > offers better price-performance.
> >
> > We understand that Iceberg is not a security product, and
it is not the
> > best interest of the community to dive too deep into
security-related
> > domains. However, we believe that *we should at least
offer the right
> > interfaces and set the right standards for how Iceberg
catalog expresses
> > securable objects and how Iceberg catalog users interact
with those objects*,
> > such that (1) users that would like to build IRC can have
a clear guideline
> > of what API constract to implement for managing access to
objects in IRC,
> > and (2) users that are on one IRC product do not need to
be locked-in due
> > to access management aspects.
> >
> > Would really appreciate any feedback on this topic and
proposal!
> >
> > Best,
> > Jack Ye
> >
>