My understanding is that if Apache Iceberg supported column-level encryption (currently encryption is only at the table level), then that plus credential vending should allow fine-grained read access to particular columns.
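For what it's worth, Parquet modular encryption already supports per-column keys, so the combination could work roughly like this: the catalog vends a column key only to principals authorized to read that column, and readers without the key simply cannot decrypt it. Below is a minimal pyarrow sketch with a toy in-memory KMS standing in for a real key service; the key names and KMS class are invented for illustration, and this is not how Iceberg's current table-level encryption works:

    import base64

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.parquet.encryption as pe

    # Toy KMS: a real key service would check the caller's credentials and
    # refuse to unwrap "pii-key" for principals not cleared for the column.
    class ToyKmsClient(pe.KmsClient):
        def __init__(self, config):
            super().__init__()
            self.keys = {"footer-key": b"0123456789112345",
                         "pii-key":    b"1234567890123450"}

        def wrap_key(self, key_bytes, master_key_identifier):
            # No real protection here -- just tag the key for the demo.
            return base64.b64encode(self.keys[master_key_identifier] + key_bytes)

        def unwrap_key(self, wrapped_key, master_key_identifier):
            raw = base64.b64decode(wrapped_key)
            assert raw[:16] == self.keys[master_key_identifier]
            return raw[16:]

    kms_config = pe.KmsConnectionConfig()
    factory = pe.CryptoFactory(lambda cfg: ToyKmsClient(cfg))
    enc_config = pe.EncryptionConfiguration(
        footer_key="footer-key",
        column_keys={"pii-key": ["salary"]})  # only "salary" uses the PII key

    table = pa.table({"name": ["a", "b"], "salary": [100, 200]})
    props = factory.file_encryption_properties(kms_config, enc_config)
    with pq.ParquetWriter("t.parquet", table.schema,
                          encryption_properties=props) as writer:
        writer.write_table(table)

    # Reading back requires unwrapping the column key via the KMS; a reader
    # the KMS refuses would be unable to decrypt the "salary" column.
    dec = factory.file_decryption_properties(kms_config)
    print(pq.ParquetFile("t.parquet", decryption_properties=dec).read())
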
On Wed, Jan 1, 2025 at 1:51 PM Vladimir Ozerov <voze...@querifylabs.com> wrote:
> Hi,
>
> Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle to Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
>
> 1. Controlled environment. E.g., Google BigQuery has special
> readers/writers for open formats, tightly integrated with managed engines.
> This doesn't work outside of a specific cloud vendor.
> 2. Securing storage. E.g., various S3 access policies. Works for
> individual files/buckets but can hardly address important access
> restrictions, such as column access permissions, masking, and filtering.
> Tightly integrated solutions, such as AWS S3 Tables, can potentially solve
> these, but this implies cloud vendor lock-in.
> 3. Catalog-level permissions. For example, the Tabular/Polaris role
> model, possibly with vended credentials or remote request signing. Works
> for coarse-grained access permissions but fails to deliver proper access
> control for individual columns, as well as masking and filtering.
> 4. Centralized security service. E.g., Apache Ranger, OPA. It could
> provide whatever security permissions are needed, but each engine must
> provide its own integration with the service. Also, admins of such
> services usually have to duplicate access permissions between different
> engines. For example, a column masking policy for Trino in Apache Ranger
> will not work for Apache Spark.
> 5. Securing data with virtual views. Works for individual engines, but
> not across engines. There is an ongoing discussion about a common IR with
> Substrait, but given the complexity of engine dialects, we can hardly
> expect truly reusable views any time soon. Moreover, similarly to Apache
> Ranger, this shifts security decisions towards the engine, which is not
> good.
>
> To the best of my knowledge, the above-mentioned strategies are the
> "state-of-the-art" techniques for secure lakehouse access. I would argue
> that none of these strategies are open, secure, interoperable, and
> convenient for end users simultaneously. Compare this with security
> management in monolithic systems, such as Vertica: execute a couple of SQL
> statements, done.
>
> Having a solid vision of a secure lakehouse could be a major advantage for
> Apache Iceberg. I would like to kindly ask the community: what are the
> current major pain points with security in your Iceberg-based deployments,
> and what could be done at the Iceberg level to improve it?
>
> My 5 cents: the REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine
> that accesses the data. However, the security capabilities of the current
> REST protocol are limited. We can secure individual catalogs, namespaces,
> and tables. But we cannot:
>
> 1. Define individual column permissions
> 2. Apply column masking
> 3. Apply row-level filtering
>
> Without solutions to these requirements, Iceberg will not be able to
> provide complete and coherent data access without resorting to third-party
> solutions or closed cloud vendor ecosystems.
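To make that gap concrete: if the REST catalog could attach a policy like the one below to a table load response, every engine could enforce the same rules from one place. This is a hypothetical payload shape, not anything in the current REST spec, and a real protocol would carry declarative expressions rather than Python callables; the sketch only shows the mechanics an engine-side enforcement layer would need:

    import pyarrow as pa
    import pyarrow.compute as pc

    # Hypothetical per-principal policy a catalog could vend with a table:
    # visible columns, per-column masks, and a row-level filter.
    policy = {
        "columns": ["name", "region"],                        # no "salary"
        "masks": {"name": lambda c: pc.utf8_slice_codeunits(c, 0, 1)},
        "row_filter": lambda t: pc.equal(t["region"], "EU"),
    }

    def apply_policy(table: pa.Table, policy) -> pa.Table:
        # Filter rows first, then project/mask the permitted columns.
        table = table.filter(policy["row_filter"](table))
        out = {}
        for name in policy["columns"]:
            mask = policy["masks"].get(name)
            out[name] = mask(table[name]) if mask else table[name]
        return pa.table(out)

    rows = pa.table({"name": ["alice", "bob"], "region": ["EU", "US"],
                     "salary": [100, 200]})
    print(apply_policy(rows, policy))  # EU rows only, masked name, no salary
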
> Given that data is organized in a columnar fashion in Parquet/ORC, which
> is oblivious to catalog and store, and Iceberg itself cannot evaluate
> additional filters, what can we do? Are there any iterative improvements
> that we can make to the Iceberg protocol to address these? And is this an
> Iceberg concern in the first place, or shall we refrain from going into
> this security rabbit hole?
>
> Several very rough examples of potential improvements:
>
> 1. We can think about splitting table data into multiple files for
> column-level security and masking. For example, instead of storing columns
> [a, b, c] in the same Parquet file, we split them into three files:
> [a, b], [c], [c_masked]. Then individual policies could be applied to
> these files at the catalog or storage layer. This requires a spec change.
> 2. For row-level filtering, we can think of table redirection. That is, a
> user asks for table "A", and we return the table metadata for "A_filtered"
> with different data. It is not an ideal solution at all: it is not
> flexible enough, requires data duplication, requires extensive support at
> the engine level, etc. But it might be better than nothing.
>
> Many potential Iceberg users still do not understand how to secure the
> lakehouse. I would appreciate your feedback on the matter.
>
> Regards,
> --
> Vladimir Ozerov
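Regarding rough example 1 above: the file-splitting idea can be mocked up today outside the spec, which at least shows what readers would have to do. A toy sketch, with file names and layout invented for illustration; real support would need spec-level guarantees about row alignment across the split files:

    import pyarrow as pa
    import pyarrow.parquet as pq

    rows = pa.table({"a": [1, 2], "b": ["x", "y"], "c": [10.0, 20.0]})

    # Split one logical table into per-policy files so that storage- or
    # catalog-level ACLs can be granted per file.
    pq.write_table(rows.select(["a", "b"]), "part-ab.parquet")  # public
    pq.write_table(rows.select(["c"]), "part-c.parquet")        # restricted

    # A reader granted only part-ab.parquet sees [a, b]; a privileged
    # reader stitches the files back together, relying on identical row
    # order in both files.
    public = pq.read_table("part-ab.parquet")
    full = public.append_column("c", pq.read_table("part-c.parquet")["c"])
    print(full)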