Re: There is no easy way to secure Iceberg data. How can we improve?

Steve Loughran Fri, 03 Jan 2025 04:50:21 -0800

Actually, there is a way for the catalog to return S3 objects without
granting access to the entire bucket: AWS presigned URLs.


https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html
This offers time-bounded access to a single object.

The catalog will need to generate and return the presigned URLs, and then
the applications will use these URLs to load the files.

All other access to the bucket (list etc.) would have to be locked down.

I have used the s3a fs to download artifacts with signatures, but never
generated the signatures myself. s3a does not have the capability to write
objects to a presigned URL, and I don't see that in S3FileIO either.

Signature creation will need to be homework for the catalog.
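
As a rough illustration of what that catalog-side homework could look like,
here is a minimal sketch of SigV4 query-string presigning for a GET, using
only the Python standard library. The function name presign_get, the bucket,
the key and the credentials are all made up for the example; a real catalog
would more likely call its AWS SDK's presigner (for instance boto3's
generate_presigned_url) than hand-roll the signing.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_get(bucket, key, region, access_key, secret_key,
                expires=900, now=None):
    """Hypothetical helper: build a time-bounded GET URL for one S3 object."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    # Assumption: virtual-hosted-style regional endpoint.
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"

    canonical_uri = "/" + quote(key, safe="/-_.~")
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),  # lifetime of the URL in seconds
        "X-Amz-SignedHeaders": "host",
    }
    # Query parameters must be URI-encoded and sorted by name.
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(params.items())
    )
    canonical_request = "\n".join([
        "GET",
        canonical_uri,
        canonical_query,
        f"host:{host}\n",      # canonical headers, each ends with a newline
        "host",                # signed headers
        "UNSIGNED-PAYLOAD",    # presigned URLs do not hash the body
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Derive the signing key: date -> region -> service -> "aws4_request".
    k = _hmac(("AWS4" + secret_key).encode("utf-8"), datestamp)
    k = _hmac(k, region)
    k = _hmac(k, "s3")
    k = _hmac(k, "aws4_request")
    signature = hmac.new(k, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    return (f"https://{host}{canonical_uri}"
            f"?{canonical_query}&X-Amz-Signature={signature}")


if __name__ == "__main__":
    # Placeholder credentials and object path, for illustration only.
    print(presign_get("example-bucket", "warehouse/db/t/data/f1.parquet",
                      "us-east-1", "AKIDEXAMPLE", "not-a-real-secret"))
```

The catalog would hand out one such URL per file, and the client would issue
a plain HTTPS GET against it before the X-Amz-Expires window runs out, with
no bucket-wide credentials on the client side.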



On Thu, 2 Jan 2025 at 17:35, Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Vladimir,
>
> Thanks for starting this discussion.
>
> I agree with you that the REST catalog "should" be the centralized
> security mechanism (Polaris is a good example). However, we have two
> challenges today:
> - there's no enforcement to use the REST catalog. Some engines are
> still directly accessing the metadata.json without going through a
> catalog. Without "enforcing" catalog use (and especially REST
> catalog), it's not really possible to have a centralized security
> mechanism across engines.
> - the "entity" permission model (table, view, namespace) is implemented
> on the REST catalog side (server side).
>
> I think we are mixing two security layers here: the REST and entity
> security (RBAC, etc) and the storage (credential vending).
>
> Thinking aloud, I would consider the storage as "internal security"
> and REST catalog as "user facing security". Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem? It would
> "standardize" the "user facing security" (and the implementation can
> implement credential vending for the storage).
>
> Just my $0.01 :)
>
> Regards
> JB
>
> On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov <[email protected]>
> wrote:
> >
> > Hi,
> >
> > Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
> >
> > - Controlled environment. E.g., Google BigQuery has special
> readers/writers for open formats, tightly integrated with managed engines.
> Doesn't work outside of a specific cloud vendor.
> > - Securing storage. E.g., various S3 access policies. Works for individual
> files/buckets but can hardly address important access restrictions, such as
> column access permissions, masking, and filtering. Tightly integrated
> solutions, such as AWS S3 Tables, can potentially solve these, but this
> implies a cloud vendor lock-in.
> > - Catalog-level permissions. For example, a Tabular/Polaris role model,
> possibly with vended credentials or remote request signing. Works for
> coarse-grained access permissions but fails to deliver proper access
> control for individual columns, as well as masking and filtering.
> > - Centralized security service. E.g., Apache Ranger, OPA. It can provide
> arbitrary security permissions, but each engine must provide its own
> integration with the service. Also, admins of such services usually
> have to duplicate access permissions between different engines. For
> example, the column masking policy for Trino in Apache Ranger will not work
> for Apache Spark.
> > - Securing data with virtual views. Works for individual engines, but not
> across engines. There is an ongoing discussion about common IR with
> Substrait, but given the complexity of engine dialects, we can hardly
> expect truly reusable views any time soon. Moreover, similarly to Apache
> Ranger, this shifts security decisions towards the engine, which is not
> good.
> >
> > To the best of my knowledge, the above-mentioned strategies are some of
> the "state-of-the-art" techniques for secure lakehouse access. I would
> argue that none of these strategies are open, secure, interoperable, and
> convenient for end users simultaneously. Compare it with security
> management in monolithic systems, such as Vertica: execute a couple of SQL
> statements, done.
> >
> > Having a solid vision of a secure lakehouse could be a major advantage
> for Apache Iceberg. I would like to kindly ask the community for your
> thoughts on what the current major pain points are with the security of
> your Iceberg-based deployments and what could be done at the Iceberg
> level to further improve it.
> >
> > My 5 cents. REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine that
> accesses data. However, the security capabilities of the current REST
> protocol are limited. We can secure individual catalogs, namespaces, and
> tables. But we cannot:
> >
> > - Define individual column permissions
> > - Apply column masking
> > - Apply row-level filtering
> >
> > Without solutions to these requirements, Iceberg will not be able to
> provide complete and coherent data access without resorting to third-party
> solutions or closed cloud vendor ecosystems.
> >
> > Given that data is organized in a columnar fashion in Parquet/ORC, which
> is oblivious to catalog and store, and Iceberg itself cannot evaluate
> additional filters, what can we do? Are there any iterative improvements
> that we can make to the Iceberg protocol to improve these? And is it
> an Iceberg concern in the first place, or shall we refrain from going into
> this security rabbit hole?
> >
> > Several very rough examples of potential improvements:
> >
> > - We can think about splitting table data into multiple files for
> column-level security and masking. For example, instead of storing columns
> [a, b, c] in the same Parquet file, we split them into three files: [a, b],
> [c], [c_masked]. Then, individual policies could be applied to these files
> at the catalog or storage layer. This requires a spec change.
> > - For row-level filtering, we can think of a table redirection. That is, a
> user asks for table "A", and we return the table metadata for "A_filtered"
> with different data. It is not an ideal solution at all: it is not flexible
> enough, requires data duplication, requires extensive support at the engine
> level, etc. But it might be better than nothing.
> >
> > Many potential Iceberg users still do not understand how to secure the
> lakehouse. I would appreciate your feedback on the matter.
> >
> > Regards,
> > --
> > Vladimir Ozerov
> >
>
