Hi,

Apache Iceberg can address multiple analytical scenarios, including ETL, streaming, ad-hoc queries, etc. One important obstacle in Iceberg integration nowadays is secure access to Iceberg tables across multiple tools and engines. There are several typical approaches to lakehouse security:

1. Controlled environment. E.g., Google BigQuery has special readers/writers for open formats, tightly integrated with managed engines. Doesn't work outside of a specific cloud vendor.
2. Securing storage. E.g., various S3 access policies. Works for individual files/buckets but can hardly address important access restrictions, such as column access permissions, masking, and filtering. Tightly integrated solutions, such as AWS S3 Tables, can potentially solve these, but this implies cloud vendor lock-in.
3. Catalog-level permissions. For example, a Tabular/Polaris role model, possibly with vended credentials or remote request signing. Works for coarse-grained access permissions but fails to deliver proper access control for individual columns, as well as masking and filtering.
4. Centralized security service. E.g., Apache Ranger, OPA. It can express arbitrary permissions, but each engine must provide its own integration with the service. Also, admins of such services usually have to duplicate access permissions across engines. For example, a column masking policy defined for Trino in Apache Ranger will not work for Apache Spark.
5. Securing data with virtual views. Works for individual engines, but not across engines. There is an ongoing discussion about a common IR with Substrait, but given the complexity of engine dialects, we can hardly expect truly reusable views any time soon. Moreover, similarly to Apache Ranger, this shifts security decisions towards the engine, which is undesirable.

To the best of my knowledge, the above-mentioned strategies are some of the "state-of-the-art" techniques for secure lakehouse access. I would argue that none of these strategies is open, secure, interoperable, and convenient for end users at the same time. Compare this with security management in monolithic systems, such as Vertica: execute a couple of SQL statements, done. Having a solid vision of a secure lakehouse could be a major advantage for Apache Iceberg.

I would like to kindly ask the community for your thoughts on what the current major security pain points are in your Iceberg-based deployments, and what could be done at the Iceberg level to improve the situation.

My 5 cents. The REST catalog is a very good candidate for a centralized security mechanism for the whole lakehouse, irrespective of the engine that accesses the data. However, the security capabilities of the current REST protocol are limited. We can secure individual catalogs, namespaces, and tables. But we cannot:

1. Define individual column permissions
2. Apply column masking
3. Apply row-level filtering

Without solutions to these requirements, Iceberg will not be able to provide complete and coherent data access control without resorting to third-party solutions or closed cloud vendor ecosystems.

Given that data is organized in a columnar fashion in Parquet/ORC files, which are oblivious to the catalog and the storage, and that Iceberg itself cannot evaluate additional filters, what can we do? Are there any iterative improvements that we can make to the Iceberg protocol to address these? And is it an Iceberg concern in the first place, or shall we refrain from going down this security rabbit hole?

Several very rough examples of potential improvements:

1. We can think about splitting table data into multiple files for column-level security and masking. For example, instead of storing columns [a, b, c] in the same Parquet file, we split them into three files: [a, b], [c], [c_masked]. Then, individual policies could be applied to these files at the catalog or storage layer. This requires a spec change.
2. For row-level filtering, we can think of table redirection. That is, a user asks for table "A", and we return the table metadata for "A_filtered" with different data. It is not an ideal solution at all: it is not flexible enough, requires data duplication, requires extensive support at the engine level, etc. But it might be better than nothing.
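
To make the two rough ideas above a bit more concrete, here is a minimal Python sketch of how a catalog might decide what to hand out. Everything in it is an assumption made for illustration: the `Policy` class, the group names, and both helper functions are hypothetical and are not part of the Iceberg spec or the REST catalog protocol. It only models the decision logic; it does not read or write any Parquet files.

```python
from dataclasses import dataclass, field

# Hypothetical layout for idea 1: columns [a, b, c] are split into three
# column groups, each of which would live in its own Parquet file.
COLUMN_GROUPS = {
    "ab": ["a", "b"],
    "c_raw": ["c"],
    "c_masked": ["c_masked"],
}


@dataclass
class Policy:
    """Hypothetical per-principal policy kept by the catalog."""
    # Column groups (i.e. files) the principal is allowed to read.
    allowed_groups: set = field(default_factory=set)
    # Idea 2: table name -> name of a pre-filtered copy to serve instead.
    redirects: dict = field(default_factory=dict)


def visible_columns(policy):
    """Return the columns reachable through the allowed column groups."""
    columns = []
    for group, group_columns in COLUMN_GROUPS.items():
        if group in policy.allowed_groups:
            columns.extend(group_columns)
    return columns


def resolve_table(policy, requested):
    """Return the table whose metadata the catalog would actually serve."""
    return policy.redirects.get(requested, requested)


# An analyst sees only the masked variant of c and a filtered copy of A.
analyst = Policy(allowed_groups={"ab", "c_masked"},
                 redirects={"A": "A_filtered"})
# An admin sees every column group and the original table.
admin = Policy(allowed_groups=set(COLUMN_GROUPS))

print(visible_columns(analyst), resolve_table(analyst, "A"))
print(visible_columns(admin), resolve_table(admin, "A"))
```

Even in this toy form, the drawbacks mentioned above are visible: the masked column and the filtered table are separate physical copies that someone has to keep in sync, and the engine has to understand both the split layout and the redirect.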

Many potential Iceberg users still do not understand how to secure the lakehouse. I would appreciate your feedback on the matter.

Regards,
--
*Vladimir Ozerov*