Hi Vladimir,

Thanks for starting this discussion.
I agree with you that the REST catalog "should" be the centralized security
mechanism (Polaris is a good example). However, we have two challenges today:

- there is no enforcement to use the REST catalog. Some engines still access
  the metadata.json directly without going through a catalog. Without
  "enforcing" catalog use (and especially REST catalog use), it is not really
  possible to have a centralized security mechanism across engines.
- the "entity" permission model (table, view, namespace) lives on the REST
  catalog implementation side (server side).

I think we are mixing two security layers here: the REST and entity security
(RBAC, etc.) and the storage security (credential vending). Thinking aloud, I
would consider the storage as "internal security" and the REST catalog as
"user-facing security".

Why not consider "enforcing" the REST catalog in the Iceberg ecosystem? It
would "standardize" the "user-facing security" (and the implementation can
provide credential vending for the storage).

Just my $0.01 :)

Regards
JB

On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>
> Hi,
>
> Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle to Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
>
> - Controlled environment. E.g., Google BigQuery has special readers/writers
>   for open formats, tightly integrated with managed engines. Doesn't work
>   outside of a specific cloud vendor.
> - Securing storage. E.g., various S3 access policies. Works for individual
>   files/buckets but can hardly address important access restrictions, such
>   as column access permissions, masking, and filtering. Tightly integrated
>   solutions, such as AWS S3 Tables, can potentially solve these, but this
>   implies cloud vendor lock-in.
> - Catalog-level permissions. For example, a Tabular/Polaris role model,
>   possibly with vended credentials or remote request signing. Works for
>   coarse-grained access permissions but fails to deliver proper access
>   control for individual columns, as well as masking and filtering.
> - Centralized security service. E.g., Apache Ranger, OPA. It could provide
>   arbitrary security permissions, but each engine must provide its own
>   integration with the service. Also, admins of such services usually have
>   to duplicate access permissions between different engines. For example, a
>   column masking policy for Trino in Apache Ranger will not work for Apache
>   Spark.
> - Securing data with virtual views. Works for individual engines, but not
>   across engines. There is an ongoing discussion about a common IR with
>   Substrait, but given the complexity of engine dialects, we can hardly
>   expect truly reusable views any time soon. Moreover, similarly to Apache
>   Ranger, this shifts security decisions towards the engine, which is not
>   good.
>
> To the best of my knowledge, the above-mentioned strategies are the
> "state-of-the-art" techniques for secure lakehouse access. I would argue
> that none of them is simultaneously open, secure, interoperable, and
> convenient for end users. Compare this with security management in
> monolithic systems, such as Vertica: execute a couple of SQL statements,
> done.
>
> Having a solid vision of a secure lakehouse could be a major advantage for
> Apache Iceberg. I would like to kindly ask the community for your thoughts
> on the current major pain points with the security of your Iceberg-based
> deployments, and on what could be done at the Iceberg level to improve it.
>
> My 5 cents: the REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine that
> accesses the data. However, the security capabilities of the current REST
> protocol are limited. We can secure individual catalogs, namespaces, and
> tables. But we cannot:
>
> - Define individual column permissions
> - Apply column masking
> - Apply row-level filtering
>
> Without solutions to these requirements, Iceberg will not be able to
> provide complete and coherent data access control without resorting to
> third-party solutions or closed cloud vendor ecosystems.
>
> Given that data is organized in a columnar fashion in Parquet/ORC, which is
> oblivious to catalog and store, and that Iceberg itself cannot evaluate
> additional filters, what can we do? Are there any iterative improvements we
> can make to the Iceberg protocol to address these gaps? And is this an
> Iceberg concern in the first place, or shall we refrain from going into
> this security rabbit hole?
>
> Several very rough examples of potential improvements:
>
> - For column-level security and masking, we could split table data into
>   multiple files. For example, instead of storing columns [a, b, c] in the
>   same Parquet file, we split them into three files: [a, b], [c],
>   [c_masked]. Then, individual policies could be applied to these files at
>   the catalog or storage layer. This requires a spec change.
> - For row-level filtering, we could think of table redirection. That is, a
>   user asks for table "A", and we return the table metadata for
>   "A_filtered" with different data. It is not an ideal solution at all: it
>   is not flexible enough, requires data duplication, requires extensive
>   support at the engine level, etc. But it might be better than nothing.
>
> Many potential Iceberg users still do not understand how to secure the
> lakehouse. I would appreciate your feedback on the matter.
>
> Regards,
> --
> Vladimir Ozerov
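P.S. To make the two "rough examples" in the quoted mail concrete, here is a
toy, engine-agnostic sketch. All names (files, roles, tables) are invented for
illustration; this uses plain Python dicts, not real Iceberg or Parquet APIs:

```python
# Idea 1: instead of one file with columns [a, b, c], store [a, b], [c],
# and [c_masked], and let the catalog/storage layer grant a different
# file set per role. The "files" below are toy in-memory stand-ins.
FILES = {
    "part-ab.parquet": {"a": [1, 2], "b": ["x", "y"]},
    "part-c.parquet": {"c": ["secret1", "secret2"]},
    "part-c-masked.parquet": {"c": ["***", "***"]},
}

# Per-role file grants, as a catalog or storage policy might vend them.
GRANTS = {
    "admin": ["part-ab.parquet", "part-c.parquet"],
    "analyst": ["part-ab.parquet", "part-c-masked.parquet"],
}

def read_table(role: str) -> dict:
    """Assemble the columns visible to a role from its granted files."""
    table = {}
    for file_name in GRANTS[role]:
        table.update(FILES[file_name])
    return table

# Idea 2: row-level filtering via table redirection -- the catalog answers
# a request for "A" from a restricted role with a pre-filtered copy.
REDIRECTS = {"A": "A_filtered"}

def resolve(table_name: str, role: str) -> str:
    """Return the (possibly redirected) table a role actually loads."""
    if role == "admin":
        return table_name
    return REDIRECTS.get(table_name, table_name)

print(read_table("analyst")["c"])  # the masked copy of column c
print(resolve("A", "analyst"))     # A_filtered
```

The point of the sketch is that both mechanisms stay invisible to the engine:
it just reads whatever files or table metadata the catalog hands back, which
is why they could work across engines without per-engine policy integration.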