Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-01 Thread Joshua Kolash
My understanding is that if Apache Iceberg supported column-level
encryption (currently encryption is only table-level), then column
keys combined with credential vending should allow fine-grained read
access to particular columns.
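
For context, Parquet modular encryption already supports per-column
keys one level below Iceberg. Below is a minimal pyarrow sketch of the
mechanism such a feature could build on; the in-memory KMS and the key
identifiers are illustrative stand-ins, not a real deployment.

    import base64

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.parquet.encryption as pe

    class ToyKmsClient(pe.KmsClient):
        """Illustrative in-memory KMS; a real setup would call out to
        an actual KMS service."""

        def wrap_key(self, key_bytes, master_key_identifier):
            # A real KMS would encrypt key_bytes under the master key.
            return base64.b64encode(
                master_key_identifier.encode() + b":" + key_bytes)

        def unwrap_key(self, wrapped_key, master_key_identifier):
            return base64.b64decode(wrapped_key).split(b":", 1)[1]

    crypto_factory = pe.CryptoFactory(lambda config: ToyKmsClient())
    encryption_config = pe.EncryptionConfiguration(
        footer_key="footer_key_id",
        column_keys={"sensitive_key_id": ["c"]},  # per-column key for "c"
    )
    props = crypto_factory.file_encryption_properties(
        pe.KmsConnectionConfig(), encryption_config)

    table = pa.table({"a": [1, 2], "b": ["x", "y"], "c": ["s1", "s2"]})
    with pq.ParquetWriter(
            "t.parquet", table.schema, encryption_properties=props) as w:
        w.write_table(table)

A reader whose key service refuses to unwrap "sensitive_key_id" can
still decrypt columns a and b but not c; credential vending would then
reduce to gating access to the column keys.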

On Wed, Jan 1, 2025 at 1:51 PM Vladimir Ozerov wrote:



There is no easy way to secure Iceberg data. How can we improve?

2025-01-01 Thread Vladimir Ozerov
Hi,

Apache Iceberg can address multiple analytical scenarios, including ETL,
streaming, ad-hoc queries, etc. One important obstacle in Iceberg
integration nowadays is secure access to Iceberg tables across multiple
tools and engines. There are several typical approaches to lakehouse
security:

   1. Controlled environment. E.g., Google BigQuery has special
   readers/writers for open formats, tightly integrated with its managed
   engines. This doesn't work outside of a specific cloud vendor.
   2. Securing storage. E.g., various S3 access policies. This works for
   individual files and buckets but can hardly address important access
   restrictions, such as column-level permissions, masking, and
   filtering. Tightly integrated solutions, such as AWS S3 Tables, can
   potentially solve these, but they imply cloud vendor lock-in.
   3. Catalog-level permissions. For example, the Tabular/Polaris role
   model, possibly with vended credentials or remote request signing
   (see the sketch after this list). This works for coarse-grained
   access permissions but fails to deliver proper access control for
   individual columns, as well as masking and filtering.
   4. Centralized security service. E.g., Apache Ranger or OPA. Such a
   service can express virtually any permission model, but each engine
   must provide its own integration with it. Moreover, administrators
   usually have to duplicate access policies across engines: a
   column-masking policy defined for Trino in Apache Ranger will not
   work for Apache Spark.
   5. Securing data with virtual views. This works within individual
   engines, but not across engines. There is an ongoing discussion about
   a common IR with Substrait, but given the complexity of engine
   dialects, we can hardly expect truly reusable views any time soon.
   Moreover, similarly to Apache Ranger, this shifts security decisions
   onto the engine, which is undesirable.
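
To make approach 3 concrete: the sketch below shows roughly what
catalog-level security with vended credentials looks like from an
engine's point of view today, using pyiceberg. The endpoint and
credential values are placeholders.

    from pyiceberg.catalog import load_catalog

    # Connect to a REST catalog that authenticates the caller and vends
    # short-lived, table-scoped storage credentials.
    catalog = load_catalog(
        "prod",
        **{
            "type": "rest",
            "uri": "https://catalog.example.com",
            "credential": "client_id:client_secret",
        },
    )

    table = catalog.load_table("analytics.orders")
    # The vended credentials are scoped to this table's storage
    # location, so the engine cannot touch other tables' files. But the
    # grant is all-or-nothing per table: nothing here can express "this
    # principal may read columns a and b, but not c".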

To the best of my knowledge, the above-mentioned strategies are the
"state-of-the-art" techniques for secure lakehouse access. I would
argue that none of them is simultaneously open, secure, interoperable,
and convenient for end users. Compare this with security management in
a monolithic system such as Vertica: execute a couple of SQL
statements, and you are done.

Having a solid vision of a secure lakehouse could be a major advantage
for Apache Iceberg. I would like to kindly ask the community: what are
the current major pain points with the security of your Iceberg-based
deployments, and what could be done at the Iceberg level to improve it?

My two cents: the REST catalog is a very good candidate for a
centralized security mechanism for the whole lakehouse, irrespective of
the engine that accesses the data. However, the security capabilities
of the current REST protocol are limited. We can secure individual
catalogs, namespaces, and tables. But we cannot:

   1. Define individual column permissions
   2. Apply column masking
   3. Apply row-level filtering
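
To illustrate the gap, here is one hypothetical shape such policies
could take if the REST protocol returned them alongside table metadata.
Nothing like this exists in the spec today; all names below are
invented for illustration.

    from dataclasses import dataclass, field

    # Hypothetical objects a REST catalog could return with a
    # load-table response. Invented for illustration; not part of the
    # Iceberg spec.
    @dataclass
    class ColumnPolicy:
        column: str
        action: str  # "allow" | "deny" | "mask"
        mask_expression: str | None = None  # e.g. "sha2(value, 256)"

    @dataclass
    class TableAccessPolicy:
        columns: list[ColumnPolicy] = field(default_factory=list)
        # engine-agnostic row predicate, e.g. "region = 'EU'"
        row_filter: str | None = None

Even with such metadata, enforcement would still happen inside the
engine unless the data layout itself changes, which is what the rough
ideas later in this mail try to address.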

Without these capabilities, Iceberg cannot provide complete and
coherent access control without resorting to third-party solutions or
closed cloud vendor ecosystems.

Given that data is organized in a columnar fashion in Parquet/ORC
files, which are opaque to both the catalog and the storage layer, and
that Iceberg itself cannot evaluate additional filters, what can we do?
Are there iterative improvements we can make to the Iceberg protocol to
address these gaps? And is this an Iceberg concern in the first place,
or should we refrain from going down this security rabbit hole?

Several very rough examples of potential improvements:

   1. We can think about splitting table data into multiple files for
   column-level security and masking. For example, instead of storing
   columns [a, b, c] in the same Parquet file, we split them into three
   files: [a, b], [c], and [c_masked] (a sketch follows this list).
   Individual policies could then be applied to these files at the
   catalog or storage layer. This requires a spec change.
   2. For row-level filtering, we can think of table redirection: a
   user asks for table "A", and we return the table metadata for
   "A_filtered" with different data. This is not an ideal solution: it
   is not flexible enough, requires data duplication, requires extensive
   support at the engine level, etc. But it might be better than
   nothing.
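
A minimal sketch of idea 1 with pyarrow: one logical table written as
three physical Parquet files, so that catalog- or storage-level
policies can target them independently. The paths and the masking
scheme are illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "a": [1, 2],
        "b": ["x", "y"],
        "c": ["secret-1", "secret-2"],
    })

    # Broadly readable columns in one file, the sensitive column in a
    # second, and a masked variant for principals without access to the
    # raw values.
    pq.write_table(table.select(["a", "b"]), "/data/t/ab.parquet")
    pq.write_table(table.select(["c"]), "/data/t/c.parquet")
    masked = pa.table({"c_masked": ["***"] * table.num_rows})
    pq.write_table(masked, "/data/t/c_masked.parquet")

An engine reassembling rows across these files needs a guarantee that
row order and row count match across them, which is exactly the kind of
guarantee the spec change would have to provide.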

Many potential Iceberg users still do not understand how to secure the
lakehouse. I would appreciate your feedback on the matter.

Regards,
-- 

*Vladimir Ozerov*