Hi,

Apache Iceberg can address multiple analytical scenarios, including ETL,
streaming, and ad-hoc queries. One important obstacle to Iceberg
integration today is securing access to Iceberg tables across multiple
tools and engines. There are several typical approaches to lakehouse
security:

   1. Controlled environment. E.g., Google BigQuery has special
   readers/writers for open formats, tightly integrated with managed engines.
   This does not work outside of a specific cloud vendor.
   2. Securing storage. E.g., various S3 access policies. Works for
   individual files/buckets but can hardly address important access
   restrictions, such as column access permissions, masking, and filtering.
   Tightly integrated solutions, such as AWS S3 Tables, can potentially solve
   these, but this implies a cloud vendor lock-in.
   3. Catalog-level permissions. For example, a Tabular/Polaris role model,
   possibly with vended credentials or remote request signature. Works for
   coarse-grained access permissions but fails to deliver proper access
   control for individual columns, as well as masking and filtering.
   4. Centralized security service. E.g., Apache Ranger, OPA. It can
   express arbitrary security permissions, but each engine must provide its
   own integration with the service. Also, admins of such services usually
   have to duplicate access permissions between different engines. For
   example, the column masking policy for Trino in Apache Ranger will not work
   for Apache Spark.
   5. Securing data with virtual views. Works for individual engines, but
   not across engines. There is an ongoing discussion about a common IR based
   on Substrait, but given the complexity of engine dialects, we can hardly
   expect truly reusable views any time soon. Moreover, similarly to Apache
   Ranger, this shifts security decisions towards the engine, which is
   undesirable.

To the best of my knowledge, the above-mentioned strategies are some of the
"state-of-the-art" techniques for secure lakehouse access. I would argue
that none of these strategies is open, secure, interoperable, and
convenient for end users simultaneously. Compare this with security
management in monolithic systems, such as Vertica: execute a couple of SQL
statements, done.

Having a solid vision of a secure lakehouse could be a major advantage for
Apache Iceberg. I would like to kindly ask the community for your
thoughts on the current major security pain points in your Iceberg-based
deployments and on what could be done at the Iceberg level to further
improve the situation.

My 5 cents. The REST catalog is a very good candidate for a centralized
security mechanism for the whole lakehouse, irrespective of the engine that
accesses data. However, the security capabilities of the current REST
protocol are limited. We can secure individual catalogs, namespaces, and
tables. But we cannot:

   1. Define individual column permissions
   2. Apply column masking
   3. Apply row-level filtering

Without solutions to these requirements, Iceberg will not be able to
provide complete and coherent data access without resorting to third-party
solutions or closed cloud vendor ecosystems.
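
To make the three missing capabilities concrete, here is a minimal sketch of
what an engine-neutral policy record might need to carry. Everything in it is
hypothetical: ColumnRule, TablePolicy, the mask names, and the row_filter
string are illustrative assumptions and are not part of the Iceberg spec or
the REST catalog protocol.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnRule:
    # Hypothetical per-column rule; not part of any Iceberg API.
    column: str
    allowed: bool = True
    mask: Optional[str] = None  # assumed mask name, e.g. "null" or "sha256"


@dataclass(frozen=True)
class TablePolicy:
    # Hypothetical policy a catalog could attach to (principal, table).
    table: str
    principal: str
    columns: Tuple[ColumnRule, ...] = ()
    # How to represent this predicate across engines is the open question.
    row_filter: Optional[str] = None


def visible_columns(policy, schema):
    """Columns the principal may read, in schema order."""
    denied = {r.column for r in policy.columns if not r.allowed}
    return [c for c in schema if c not in denied]


def masked_columns(policy):
    """Mapping of readable columns to the mask that must be applied."""
    return {r.column: r.mask for r in policy.columns if r.allowed and r.mask}


policy = TablePolicy(
    table="db.orders",
    principal="analyst",
    columns=(ColumnRule("ssn", allowed=False), ColumnRule("email", mask="sha256")),
    row_filter="region = 'EU'",
)
```

Expressing such a record is the easy part. The hard part, as discussed next,
is who enforces it when the engine reads the data files directly.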

Given that data is organized in a columnar fashion in Parquet/ORC, which is
oblivious to the catalog and the storage layer, and Iceberg itself cannot
evaluate additional filters, what can we do? Are there any iterative
improvements that we can make to the Iceberg protocol to address these? And
is this an Iceberg concern in the first place, or should we refrain from
going into this security rabbit hole?

Several very rough examples of potential improvements:

   1. We can think about splitting table data into multiple files for
   column-level security and masking. For example, instead of storing columns
   [a, b, c] in the same Parquet file, we split them into three files: [a, b],
   [c], [c_masked]. Then, individual policies could be applied to these files
   at the catalog or storage layer. This requires a spec change.
   2. For row-level filtering, we can think of table redirection. That
   is, a user asks for table "A", and we return the table metadata for
   "A_filtered" with different data. It is not an ideal solution at all: it is
   not flexible enough, requires data duplication, requires extensive support
   at the engine level, etc. But it might be better than nothing.
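
The two ideas above can be sketched roughly as follows. This is only an
illustration under stated assumptions: DataFile, files_for, resolve_table,
the file paths, and the "analyst" principal are all made up for the example,
and nothing like this exists in the Iceberg spec today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical descriptor of a file holding one column group.
    path: str
    columns: frozenset


# Idea 1: columns [a, b, c] split into three files, as in the example above.
FILES = [
    DataFile("data/part-0-ab.parquet", frozenset({"a", "b"})),
    DataFile("data/part-0-c.parquet", frozenset({"c"})),
    DataFile("data/part-0-c_masked.parquet", frozenset({"c_masked"})),
]


def files_for(allowed_columns, files):
    """Return paths of files whose columns are all permitted."""
    allowed = set(allowed_columns)
    return [f.path for f in files if f.columns <= allowed]


# Idea 2: table redirection keyed by (principal, requested table).
REDIRECTS = {("analyst", "A"): "A_filtered"}


def resolve_table(principal, table, redirects=REDIRECTS):
    """Return the table whose metadata the catalog would actually serve."""
    return redirects.get((principal, table), table)
```

For instance, a principal allowed to read a, b, and c_masked would be handed
the [a, b] and [c_masked] files but never the raw [c] file, and an "analyst"
asking for "A" would receive the metadata of "A_filtered".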

Many potential Iceberg users still do not understand how to secure the
lakehouse. I would appreciate your feedback on the matter.

Regards,
-- 

*Vladimir Ozerov*
