If the data is stored in S3 and someone has unrestricted access to a
single store containing all the data (the default without S3 Access
Grants, Cloudera Ranger extensions, or some other access control
mechanism that grants access to clients without sharing credentials),
then it's effectively impossible to stop those clients from reading it.

Encryption of the Parquet data is about all you can do. I know Parquet
encryption has always cited cloud KMS hardware as a keystore (
https://parquet.apache.org/docs/file-format/data-pages/encryption/ ) but I
don't know of any implementations of that. Do that and you can secure
column access by restricting which IAM roles have decrypt permissions:
these do *not* have to be the same roles which can encrypt the data.
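
For what it's worth, if you do wire a KMS up, the properties-driven crypto
factory in parquet-mr is the natural hook. A rough sketch of the Hadoop
configuration, assuming a hypothetical KmsClient implementation and
placeholder master key IDs:

  parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
  # hypothetical class implementing parquet-mr's KmsClient interface
  parquet.encryption.kms.client.class=com.example.MyCloudKmsClient
  # master key (held in the KMS) that protects the file footers
  parquet.encryption.footer.key=footer-master-key
  # per-column-group master keys: PII columns get their own key
  parquet.encryption.column.keys=pii-master-key:ssn,salary;general-master-key:name,address

With one master key per column group, column-level read access then falls
out of which IAM roles each KMS key policy allows to decrypt.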

On Wed, 1 Jan 2025 at 18:51, Vladimir Ozerov <voze...@querifylabs.com>
wrote:

> Hi,
>
> Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle to Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
>
>    1. Controlled environment. E.g., Google BigQuery has special
>    readers/writers for open formats, tightly integrated with managed engines.
>    Doesn't work outside of a specific cloud vendor.
>    2. Securing storage. E.g., various S3 access policies. Works for
>    individual files/buckets but can hardly address important access
>    restrictions, such as column access permissions, masking, and filtering
>    (see the bucket policy sketch after this list). Tightly integrated
>    solutions, such as AWS S3 Tables, can potentially solve these, but this
>    implies cloud vendor lock-in.
>    3. Catalog-level permissions. For example, a Tabular/Polaris role
>    model, possibly with vended credentials or remote request signing. Works
>    for coarse-grained access permissions but fails to deliver proper access
>    control for individual columns, as well as masking and filtering.
>    4. Centralized security service. E.g., Apache Ranger, OPA. It can
>    express arbitrary security permissions, but each engine must provide its
>    own integration with the service. Also, admins of such services usually
>    have to duplicate access policies across engines. For example, a column
>    masking policy for Trino in Apache Ranger will not work for Apache
>    Spark.
>    5. Securing data with virtual views. Works for individual engines, but
>    not across engines. There is an ongoing discussion about a common IR with
>    Substrait, but given the complexity of engine dialects, we can hardly
>    expect truly reusable views any time soon. Moreover, similarly to Apache
>    Ranger, this shifts security decisions towards the engine, which is not
>    good.
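>
> To illustrate (2): an S3 bucket policy can only scope access to buckets
> and object key prefixes; nothing in the policy language can reference a
> column inside a Parquet file. A minimal sketch of a policy statement,
> with an illustrative account ID, role, and path:
>
>    {
>      "Effect": "Allow",
>      "Principal": { "AWS": "arn:aws:iam::111122223333:role/analyst" },
>      "Action": "s3:GetObject",
>      "Resource": "arn:aws:s3:::lake/db/tableA/data/*"
>    }
>
> The finest granularity here is the object, so any column-level rule has to
> live somewhere else.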
>
> To the best of my knowledge, the above-mentioned strategies are some of
> the "state-of-the-art" techniques for secure lakehouse access. I would
> argue that none of these strategies are open, secure, interoperable, and
> convenient for end users simultaneously. Compare this with security
> management in monolithic systems, such as Vertica: execute a couple of SQL
> statements, done.
>
> Having a solid vision of a secure lakehouse could be a major advantage for
> Apache Iceberg. I would like to kindly ask the community for your thoughts:
> what are the current major pain points with security in your Iceberg-based
> deployments, and what could be done at the Iceberg level to further
> improve it?
>
> My 5 cents: the REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine that
> accesses the data. However, the security capabilities of the current REST
> protocol are limited. We can secure individual catalogs, namespaces, and
> tables. But we cannot:
>
>    1. Define individual column permissions
>    2. Apply column masking
>    3. Apply row-level filtering
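>
> To make this concrete, here is a purely hypothetical sketch (not part of
> the current REST protocol; all field names are invented) of policy
> metadata that a catalog could return alongside the table metadata:
>
>    {
>      "column-permissions": { "ssn": ["pii_reader"] },
>      "column-masks": { "ssn": "sha256" },
>      "row-filter": "region = 'EU'"
>    }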
>
> Without solutions to these requirements, Iceberg will not be able to
> provide complete and coherent data access control without resorting to
> third-party solutions or closed cloud vendor ecosystems.
>
> Given that data is organized in a columnar fashion in Parquet/ORC, which
> is oblivious to the catalog and the store, and that Iceberg itself cannot
> evaluate additional filters, what can we do? Are there any iterative
> improvements we can make to the Iceberg protocol to address these? And is
> it Iceberg's concern in the first place, or shall we refrain from going
> down this security rabbit hole?
>
> Several very rough examples of potential improvements:
>
>    1. We can think about splitting table data into multiple files for
>    column-level security and masking. For example, instead of storing columns
>    [a, b, c] in the same Parquet file, we split them into three files: [a, b],
>    [c], [c_masked]. Then, individual policies could be applied to these files
>    at the catalog or storage layer (see the layout sketch after this list).
>    This would require a spec change.
>    2. For row-level filtering, we can think of table redirection. That
>    is, a user asks for table "A", and we return the table metadata for
>    "A_filtered" with different data. It is not an ideal solution at all: it
>    is not flexible enough, requires data duplication, requires extensive
>    support at the engine level, etc. But it might be better than nothing.
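>
> A rough illustration of the split layout from (1), with purely
> illustrative paths and policies:
>
>    s3://lake/db/tableA/data/part-00000-ab.parquet        <- [a, b], all readers
>    s3://lake/db/tableA/data/part-00000-c.parquet         <- [c], restricted roles only
>    s3://lake/db/tableA/data/part-00000-c_masked.parquet  <- [c_masked], all readers
>
> The catalog or storage layer would then hand a given reader either the [c]
> file or the [c_masked] file when planning the scan.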
>
> Many potential Iceberg users still do not understand how to secure the
> lakehouse. I would appreciate your feedback on the matter.
>
> Regards,
> --
>
> *Vladimir Ozerov*
>
>
