Hi Vladimir and JB,

There have been some previous discussions on security [1].


> We can think about splitting table data into multiple files for
> column-level security and masking. For example, instead of storing columns
> [a, b, c] in the same Parquet file, we split them into three files: [a, b],
> [c], [c_masked]. Then, individual policies could be applied to these files
> at the catalog or storage layer.


IMO, this would add too much complexity to the specification.
Parquet, in theory, has metadata available to split columns across files,
but the Parquet community has chosen not to implement this in any of its
readers (mostly due to complexity and compatibility concerns).
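
For concreteness, the write-time half of that proposal is easy to
sketch; the hard part a spec change would have to solve is reader-side
reassembly of rows across the resulting files. A minimal PyArrow
illustration (file names and the masking rule are mine, purely
illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy table with columns [a, b, c], where "c" is sensitive.
    table = pa.table({
        "a": [1, 2, 3],
        "b": ["x", "y", "z"],
        "c": ["secret-1", "secret-2", "secret-3"],
    })

    # Split into per-policy files so catalog/storage ACLs can differ
    # per file.
    pq.write_table(table.select(["a", "b"]), "ab.parquet")  # broadly readable
    pq.write_table(table.select(["c"]), "c.parquet")        # restricted
    # Masked variant of "c" for principals without access to raw values.
    masked = pa.table({"c_masked": ["***"] * table.num_rows})
    pq.write_table(masked, "c_masked.parquet")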

> For row-level filtering, we can think of a table redirection. That is,
> a user asks for table "A", and we return the table metadata for
> "A_filtered" with different data. It is not an ideal solution at all: it is
> not flexible enough, requires data duplication, requires extensive support
> at the engine level, etc. But it might be better than nothing.
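
For illustration, the redirection amounts to a per-principal lookup in
front of metadata loading. A toy sketch (the policy table and names are
hypothetical, not an existing Iceberg API):

    from typing import Dict, Tuple

    # Hypothetical redirection map: (principal, requested table) -> the
    # physical table whose metadata is actually served.
    REDIRECTS: Dict[Tuple[str, str], str] = {
        ("analyst", "db.A"): "db.A_filtered",  # analysts get filtered copy
    }

    def resolve_table(principal: str, identifier: str) -> str:
        # Fall back to the requested table when no row policy applies.
        return REDIRECTS.get((principal, identifier), identifier)

    assert resolve_table("analyst", "db.A") == "db.A_filtered"
    assert resolve_table("admin", "db.A") == "db.A"

Every drawback quoted above (duplication, engine support) still applies;
this only shows how little catalog-side machinery the idea needs.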


Based on the prior discussions, there are potentially other models to
consider:
1.  A shared responsibility model, where compute engines can be
registered as "trusted" to implement the access controls registered in
the REST API. This was already touched on above as not necessarily being
desirable.
2.  For non-trusted engines, provide a table data service that acts as a
secure proxy to the data and enforces access controls (e.g., an Arrow
Flight or Flight SQL service [2][3], or an extension beyond this [4]);
see the sketch after this list. The scan planning APIs in the REST
service are already a step in this direction.
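
To make the second model concrete, below is a minimal sketch of such a
proxy using pyarrow.flight. The policy map, the principal-in-ticket
auth, and the in-memory table are stand-ins for a real policy engine
and an actual Iceberg scan; the point is only that masking and
filtering happen server-side, before any bytes reach an untrusted
engine:

    import pyarrow as pa
    import pyarrow.flight as flight

    class SecureDataProxy(flight.FlightServerBase):
        """Toy Flight service applying a column policy before serving."""

        # Stand-in policy: the columns each principal may read. A real
        # deployment would consult the catalog or policy service instead.
        POLICY = {"analyst": ["a", "b"], "admin": ["a", "b", "c"]}

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)

        def do_get(self, context, ticket):
            # Toy auth: the principal name travels in the ticket bytes.
            principal = ticket.ticket.decode()
            # Stand-in for an actual Iceberg table scan.
            table = pa.table({"a": [1, 2], "b": ["x", "y"],
                              "c": ["s1", "s2"]})
            allowed = self.POLICY.get(principal, [])
            return flight.RecordBatchStream(table.select(allowed))

    if __name__ == "__main__":
        SecureDataProxy().serve()

An engine that speaks Flight then consumes the already-policed stream,
which keeps enforcement out of the engines entirely.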

Between these two, there should be an incremental path for handling
secure tables. A large number of use cases can be supported by ensuring
trusted Spark/Trino clusters are available. Other engines can either add
the necessary support on their own timeline or, if data access for those
is a requirement, data administrators can set up the proxy service.


> Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem?


For security purposes, I think this makes sense. That said, as a general
requirement, having the flexibility of different catalogs depending on
implementation needs still makes sense to me.


[1] https://lists.apache.org/thread/4swop72zgcr8rrmwvb51rlk0vnb8joyz
[2] https://arrow.apache.org/docs/format/Flight.html
[3] https://arrow.apache.org/docs/format/FlightSql.html
[4] https://lists.apache.org/thread/g4jkyh4o8rqk16cl3mo3wb2h00y92z9j

On Thu, Jan 2, 2025 at 9:36 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Vladimir,
>
> Thanks for starting this discussion.
>
> I agree with you that the REST catalog "should" be the centralized
> security mechanism (Polaris is a good example). However, we have two
> challenges today:
> - there's no enforcement to use the REST catalog. Some engines are
> still directly accessing the metadata.json without going through a
> catalog. Without "enforcing" catalog use (and especially REST
> catalog), it's not really possible to have a centralized security
> mechanism across engines.
> - the "entity" permission model (table, view, namespace) is REST
> catalog impl side (server side).
>
> I think we are mixing two security layers here: the REST/entity
> security (RBAC, etc.) and the storage security (credential vending).
>
> Thinking aloud, I would consider the storage as "internal security"
> and the REST catalog as "user-facing security". Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem? It would
> "standardize" the "user-facing security" (and the implementation can
> implement credential vending for the storage).
>
> Just my $0.01 :)
>
> Regards
> JB
>
> On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov <voze...@querifylabs.com>
> wrote:
> >
> > Hi,
> >
> > Apache Iceberg can address multiple analytical scenarios, including
> > ETL, streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> > integration nowadays is secure access to Iceberg tables across
> > multiple tools and engines. There are several typical approaches to
> > lakehouse security:
> >
> > - Controlled environment. E.g., Google BigQuery has special
> > readers/writers for open formats, tightly integrated with managed
> > engines. Doesn't work outside of a specific cloud vendor.
> > - Securing storage. E.g., various S3 access policies. Works for
> > individual files/buckets but can hardly address important access
> > restrictions, such as column access permissions, masking, and
> > filtering. Tightly integrated solutions, such as AWS S3 Tables, can
> > potentially solve these, but this implies a cloud vendor lock-in.
> > - Catalog-level permissions. For example, a Tabular/Polaris role
> > model, possibly with vended credentials or remote request signing.
> > Works for coarse-grained access permissions but fails to deliver
> > proper access control for individual columns, as well as masking and
> > filtering.
> > - Centralized security service. E.g., Apache Ranger, OPA. It could
> > provide whatever security permissions are needed, but each engine
> > must provide its own integration with the service. Also, admins of
> > such services usually have to duplicate access permissions between
> > different engines. For example, a column masking policy for Trino in
> > Apache Ranger will not work for Apache Spark.
> > - Securing data with virtual views. Works for individual engines, but
> > not across engines. There is an ongoing discussion about a common IR
> > with Substrait, but given the complexity of engine dialects, we can
> > hardly expect truly reusable views any time soon. Moreover, similarly
> > to Apache Ranger, this shifts security decisions towards the engine,
> > which is not good.
> >
> > To the best of my knowledge, the above-mentioned strategies are some
> > of the "state-of-the-art" techniques for secure lakehouse access. I
> > would argue that none of these strategies are open, secure,
> > interoperable, and convenient for end users simultaneously. Compare
> > this with security management in monolithic systems, such as Vertica:
> > execute a couple of SQL statements, done.
> >
> > Having a solid vision of a secure lakehouse could be a major
> > advantage for Apache Iceberg. I would like to kindly ask the community
> > for your thoughts on the current major pain points with security in
> > your Iceberg-based deployments, and on what could be done at the
> > Iceberg level to further improve it.
> >
> > My 5 cents: the REST catalog is a very good candidate for a
> > centralized security mechanism for the whole lakehouse, irrespective
> > of the engine that accesses data. However, the security capabilities
> > of the current REST protocol are limited. We can secure individual
> > catalogs, namespaces, and tables. But we cannot:
> >
> > - Define individual column permissions
> > - Apply column masking
> > - Apply row-level filtering
> >
> > Without solutions to these requirements, Iceberg will not be able to
> > provide complete and coherent data access without resorting to
> > third-party solutions or closed cloud vendor ecosystems.
> >
> > Given that data is organized in a columnar fashion in Parquet/ORC,
> > which is oblivious to the catalog and store, and that Iceberg itself
> > cannot evaluate additional filters, what can we do? Are there any
> > iterative improvements that we can make to the Iceberg protocol to
> > address these? And is this an Iceberg concern in the first place, or
> > shall we refrain from going down this security rabbit hole?
> >
> > Several very rough examples of potential improvements:
> >
> > - We can think about splitting table data into multiple files for
> > column-level security and masking. For example, instead of storing
> > columns [a, b, c] in the same Parquet file, we split them into three
> > files: [a, b], [c], [c_masked]. Then, individual policies could be
> > applied to these files at the catalog or storage layer. This requires
> > a spec change.
> > - For row-level filtering, we can think of a table redirection. That
> > is, a user asks for table "A", and we return the table metadata for
> > "A_filtered" with different data. It is not an ideal solution at all:
> > it is not flexible enough, requires data duplication, requires
> > extensive support at the engine level, etc. But it might be better
> > than nothing.
> >
> > Many potential Iceberg users still do not understand how to secure
> > the lakehouse. I would appreciate your feedback on the matter.
> >
> > Regards,
> > --
> > Vladimir Ozerov
> >
>
