Hi Vladimir,

Thanks for starting this discussion.
I agree with you that the REST catalog "should" be the centralized security
mechanism (Polaris is a good example). However, we have two challenges today:

- there is no enforcement to use the REST catalog. Some engines still access
  the metadata.json directly without going through a catalog. Without
  "enforcing" catalog use (and especially REST catalog use), it is not really
  possible to have a centralized security mechanism across engines.
- the "entity" permission model (table, view, namespace) lives on the REST
  catalog implementation side (server side).

I think we are mixing two security layers here: the REST and entity security
(RBAC, etc.) and the storage security (credential vending). Thinking aloud, I
would consider the storage as "internal security" and the REST catalog as
"user-facing security".

Why not consider "enforcing" the REST catalog in the Iceberg ecosystem? It
would "standardize" the "user-facing security" (and the implementation can
provide credential vending for the storage).

Just my $0.01 :)

Regards
JB

On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>
> Hi,
>
> Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle to Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
>
> - Controlled environment. E.g., Google BigQuery has special readers/writers
>   for open formats, tightly integrated with managed engines. Doesn't work
>   outside of a specific cloud vendor.
> - Securing storage. E.g., various S3 access policies. Works for individual
>   files/buckets but can hardly address important access restrictions, such
>   as column access permissions, masking, and filtering. Tightly integrated
>   solutions, such as AWS S3 Tables, can potentially solve these, but this
>   implies cloud vendor lock-in.
> - Catalog-level permissions. For example, a Tabular/Polaris role model,
>   possibly with vended credentials or remote request signing. Works for
>   coarse-grained access permissions but fails to deliver proper access
>   control for individual columns, as well as masking and filtering.
> - Centralized security service. E.g., Apache Ranger, OPA. It could provide
>   arbitrary security permissions, but each engine must provide its own
>   integration with the service. Also, admins of such services usually have
>   to duplicate access permissions between different engines. For example, a
>   column masking policy for Trino in Apache Ranger will not work for Apache
>   Spark.
> - Securing data with virtual views. Works for individual engines, but not
>   across engines. There is an ongoing discussion about a common IR with
>   Substrait, but given the complexity of engine dialects, we can hardly
>   expect truly reusable views any time soon. Moreover, similarly to Apache
>   Ranger, this shifts security decisions towards the engine, which is not
>   good.
>
> To the best of my knowledge, the above-mentioned strategies are the
> "state-of-the-art" techniques for secure lakehouse access. I would argue
> that none of them is simultaneously open, secure, interoperable, and
> convenient for end users. Compare this with security management in
> monolithic systems, such as Vertica: execute a couple of SQL statements,
> done.
>
> Having a solid vision of a secure lakehouse could be a major advantage for
> Apache Iceberg. I would like to kindly ask the community for your thoughts
> on the current major pain points with the security of your Iceberg-based
> deployments, and on what could be done at the Iceberg level to improve it.
>
> My 5 cents: the REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine that
> accesses the data. However, the security capabilities of the current REST
> protocol are limited. We can secure individual catalogs, namespaces, and
> tables. But we cannot:
>
> - Define individual column permissions
> - Apply column masking
> - Apply row-level filtering
>
> Without solutions to these requirements, Iceberg will not be able to
> provide complete and coherent data access control without resorting to
> third-party solutions or closed cloud vendor ecosystems.
>
> Given that data is organized in a columnar fashion in Parquet/ORC, which is
> oblivious to catalog and store, and that Iceberg itself cannot evaluate
> additional filters, what can we do? Are there any iterative improvements we
> can make to the Iceberg protocol to address these gaps? And is this an
> Iceberg concern in the first place, or shall we refrain from going into
> this security rabbit hole?
>
> Several very rough examples of potential improvements:
>
> - For column-level security and masking, we could split table data into
>   multiple files. For example, instead of storing columns [a, b, c] in the
>   same Parquet file, we split them into three files: [a, b], [c],
>   [c_masked]. Then, individual policies could be applied to these files at
>   the catalog or storage layer. This requires a spec change.
> - For row-level filtering, we could think of table redirection. That is, a
>   user asks for table "A", and we return the table metadata for
>   "A_filtered" with different data. It is not an ideal solution at all: it
>   is not flexible enough, requires data duplication, requires extensive
>   support at the engine level, etc. But it might be better than nothing.
>
> Many potential Iceberg users still do not understand how to secure the
> lakehouse. I would appreciate your feedback on the matter.
>
> Regards,
> --
> Vladimir Ozerov
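P.S. To make the two "rough examples" in the quoted mail concrete, here is a
toy, engine-agnostic sketch. All names (files, roles, tables) are invented for
illustration; this uses plain Python dicts, not real Iceberg or Parquet APIs:

```python
# Idea 1: instead of one file with columns [a, b, c], store [a, b], [c],
# and [c_masked], and let the catalog/storage layer grant a different
# file set per role. The "files" below are toy in-memory stand-ins.
FILES = {
    "part-ab.parquet": {"a": [1, 2], "b": ["x", "y"]},
    "part-c.parquet": {"c": ["secret1", "secret2"]},
    "part-c-masked.parquet": {"c": ["***", "***"]},
}

# Per-role file grants, as a catalog or storage policy might vend them.
GRANTS = {
    "admin": ["part-ab.parquet", "part-c.parquet"],
    "analyst": ["part-ab.parquet", "part-c-masked.parquet"],
}

def read_table(role: str) -> dict:
    """Assemble the columns visible to a role from its granted files."""
    table = {}
    for file_name in GRANTS[role]:
        table.update(FILES[file_name])
    return table

# Idea 2: row-level filtering via table redirection -- the catalog answers
# a request for "A" from a restricted role with a pre-filtered copy.
REDIRECTS = {"A": "A_filtered"}

def resolve(table_name: str, role: str) -> str:
    """Return the (possibly redirected) table a role actually loads."""
    if role == "admin":
        return table_name
    return REDIRECTS.get(table_name, table_name)

print(read_table("analyst")["c"])  # the masked copy of column c
print(resolve("A", "analyst"))     # A_filtered
```

The point of the sketch is that both mechanisms stay invisible to the engine:
it just reads whatever files or table metadata the catalog hands back, which
is why they could work across engines without per-engine policy integration.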