Re: [DISCUSS] REST Catalog bulk object lookup
A motivational example: Trino recently had to implement parallel table metadata fetching (https://github.com/trinodb/trino/pull/23909) because metadata queries (e.g., against INFORMATION_SCHEMA) were otherwise slow. Parallel metadata retrieval boosted metadata query performance significantly. But this solution is far from ideal:

1. Catalogs will now experience request bursts whenever a user or a tool attempts to list Iceberg objects in Trino. This may induce unpredictable latency spikes, especially for large schemas.
2. Each such request imposes a constant catalog overhead for request dispatching, serde, security checks, etc., which could easily be avoided with a bulk metadata lookup.
3. The aforementioned fix addresses only parallel table retrieval. The engine will have to support the same thing for views and materialized views, producing even more request bursts, with a considerable number of requests returning error responses because we cannot get an object's type and its metadata in one shot.

On Tue, Dec 24, 2024 at 10:29 PM Vladimir Ozerov wrote:

> Hi,
>
> Following the discussion [1], I'd like to formally propose an extension to
> the REST catalog API that allows efficient lookup of multiple catalog
> objects without knowing their types in advance.
>
> When a query is submitted, the engine needs to resolve referenced objects.
> The current REST API requires multiple catalog calls per query, because it
> (1) assumes prior knowledge of the object type (not the case for
> virtually all query engines), and (2) lacks a bulk object lookup operation.
> This leads to increased query latency and increased REST catalog load.
>
> The proposal aims to solve the problem by introducing an optional endpoint
> that returns information about several catalog objects, including their
> type (table, view) and metadata.
>
> Note that the proposal attempts to solve two distinct issues via a single
> endpoint:
>
> 1. Inability to look up an object without knowing its type
> 2. Inability to look up multiple objects in a single request
>
> If the community finds the proposal too complicated, we can reduce the
> scope to point 1 and introduce an endpoint for object lookup without
> knowing its type. Even without bulk lookup, this can help engine developers
> minimize SQL query planning latency.
>
> Proposal:
> https://docs.google.com/document/d/1KfzdQT8Q2xiV_yPNvICROCepz-Qqpm0npob7hmb40Fc/edit?usp=sharing
>
> [1] https://lists.apache.org/thread/g44czzpjqqhdvronqfyckw4mnxvlpn3s
>
> Regards,
> --
> *Vladimir Ozerov*

--
*Vladimir Ozerov*
Founder
querifylabs.com
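To make the request-count argument concrete, here is a minimal Python sketch that counts catalog round trips for typed one-by-one lookups versus a single bulk lookup. It is an illustration under stated assumptions, not the proposed API: FakeCatalog, load_table, load_view, and lookup_objects are hypothetical names, and an in-memory dictionary stands in for a real REST catalog.

```python
from dataclasses import dataclass


@dataclass
class FakeCatalog:
    """In-memory stand-in for a REST catalog that counts requests (hypothetical)."""
    objects: dict  # name -> (type, metadata)
    requests: int = 0

    def load_table(self, name):
        # Typed lookup: fails when the object exists but is not a table.
        self.requests += 1
        kind, meta = self.objects[name]
        if kind != "table":
            raise LookupError(f"{name} is not a table")
        return meta

    def load_view(self, name):
        # Typed lookup: fails when the object exists but is not a view.
        self.requests += 1
        kind, meta = self.objects[name]
        if kind != "view":
            raise LookupError(f"{name} is not a view")
        return meta

    def lookup_objects(self, names):
        # Hypothetical bulk endpoint: one request returns type and metadata together.
        self.requests += 1
        return {n: self.objects[n] for n in names if n in self.objects}


def resolve_one_by_one(catalog, names):
    """Resolve each name with typed calls: try table first, then fall back to view."""
    resolved = {}
    for name in names:
        try:
            resolved[name] = ("table", catalog.load_table(name))
        except LookupError:
            resolved[name] = ("view", catalog.load_view(name))
    return resolved


objects = {
    "orders": ("table", {"location": "s3://warehouse/orders"}),
    "customers": ("table", {"location": "s3://warehouse/customers"}),
    "daily_sales": ("view", {"sql": "SELECT 1"}),
}

typed = FakeCatalog(dict(objects))
typed_result = resolve_one_by_one(typed, list(objects))

bulk = FakeCatalog(dict(objects))
bulk_result = bulk.lookup_objects(list(objects))

print(typed.requests, bulk.requests)
```

In this toy setup the typed path issues four requests (one of them an error response for the view) while the bulk path issues one, and both return the same type and metadata pairs. Real numbers depend on the engine's probing order and on how many referenced objects are views.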
Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
The proposal looks great to me. Thanks, Gabor, for working on it. Have we created a spec change PR yet?

Yufei

On Thu, Dec 19, 2024 at 2:11 AM Gabor Kaszab wrote:

> Hi All,
>
> Just an update that the proposal went through some iterations based on the
> comments from Daniel Weeks. Thanks for taking a look, Daniel!
>
> In a nutshell this is what changed compared to the original proposal:
> - The Catalog API will remain intact; there is no proposed new API function
> now. With this, the freshness-aware functionality and the ETags in
> particular will not be exposed to the clients of the API.
> - Instead of storing the ETags in TableMetadata we propose to store them in
> RESTTableOperations since the proposal only focuses on the REST catalog.
> The very same changes can be done on other TableOperations implementations
> if there is going to be a need to have this for other catalogs too.
> - A SoftReference cache of (TableIdentifier -> Table object) is introduced
> on the RESTSessionCatalog level. This can be used for providing previous
> ETags to the HTTPClient and also to answer Catalog API calls with the
> latest table metadata if the REST server returns a '304 Not Modified'.
>
> The doc is updated with the above now:
>
> https://docs.google.com/document/d/1rnVSP_iv2I47giwfAe-Z3DYhKkKwWCVvCkC9rEvtaLA
>
> While I keep the discussion still open, I think I'll move on to take care
> of the changes required for the REST spec. Will send a PR for this soon.
>
> Regards,
> Gabor
>
>
> On Thu, Dec 12, 2024 at 4:07 PM Jean-Baptiste Onofré wrote:
>
>> Hi Gabor
>>
>> Thanks for the update! I will take a look.
>>
>> Regards
>> JB
>>
>> On Thu, Dec 12, 2024 at 2:52 PM Gabor Kaszab wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > It took me a while but I finally managed to upload the proposal for
>> > this as an official 'Iceberg improvement proposal'. Thanks for the
>> > feedback so far!
>> >
>> > https://github.com/apache/iceberg/issues/11766
>> >
>> > Regards,
>> > Gabor
>> >
>> >
>> > On Fri, Nov 22, 2024 at 4:51 PM Taeyun Kim wrote:
>> >>
>> >> Hi,
>> >>
>> >> Since ETags are opaque values to the client, attributing any semantic
>> >> meaning to them in the interaction between the client and server
>> >> would, in my opinion, constitute a misuse/abuse of the HTTP
>> >> specification.
>> >> On the other hand, the server can generate the ETag value as any
>> >> string, as long as it conforms to the grammar defined in
>> >> https://httpwg.org/specs/rfc9110.html#field.etag . Using the metadata
>> >> location is likely the simplest option. For reference, based on the
>> >> grammar, ETag values cannot include spaces. Therefore, if the metadata
>> >> location contains spaces, it may need to be encoded. The same goes for
>> >> double quotation marks. (I just found this out after looking it up.)
>> >> Anyway, in my opinion, the client must ignore any semantic meaning
>> >> associated with the value.
>> >>
>> >> Thank you.
>> >>
>> >> -Original Message-
>> >> From: "Zoltán Borók-Nagy"
>> >> To: ;
>> >> Cc:
>> >> Sent: 2024-11-22 (Fri) 19:57:08 (UTC+09:00)
>> >> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the
>> >> latest
>> >>
>> >> Hi,
>> >>
>> >> Separate version information forces the clients to manage a Table ->
>> >> VersionIdentifier mapping which adds unnecessary complexity and can be
>> >> error-prone.
>> >>
>> >> If the VersionIdentifier is embedded in the Table object then the
>> >> application logic is much simpler, and the Catalog interface is not
>> >> only simpler, but also hard to use incorrectly.
>> >> This approach slightly increases the size of the Table objects, though.
>> >> And touching the Table interface might encounter some resistance, even
>> >> if it is only an extension.
>> >>
>> >> Yeah, VersionIdentifier doesn't need to be a String, it could be an
>> >> Object, or an empty interface, and the Catalog implementation could
>> >> cast it to some catalog-specific VersionIdentifierImpl.
>> >>
>> >> loadTableIfChanged() throwing UnsupportedOperationException is
>> >> reasonable, as clients can easily fall back to loadTable. In my mind I
>> >> had a use case where we cache tables without any refresh checks for a
>> >> configured TTL, and after expiration we invoke reloadTable() anyway.
>> >> But this use case can also be implemented even if loadTableIfChanged()
>> >> throws exceptions, making this approach more flexible.
>> >>
>> >> About metadata_location as ETag: I don't have a strong opinion here,
>> >> not sure what could go wrong if we do this. If we start with this
>> >> approach we don't even need a VersionIdentifier for Tables, making the
>> >> whole proposal more lightweight.
>> >>
>> >> Thanks Gabor for driving this and putting together a proposal!
>> >>
>> >> Cheers,
>> >> Zoltan
>> >>
>> >> On Fri, Nov 22, 2024 at 11:42 AM Gabor Kaszab wrote:
>> >> >
>> >> > Hi Taeyun,
>> >> >
>> >> > Thanks for the writeup! Let me reflect on some areas:
>> >> >
>> >> >> the caller manages the version i
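The ETag flow discussed in this thread (a server-generated ETag derived from the metadata location, a client-side cache keyed by table identifier, and a '304 Not Modified' short-circuit) can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the Iceberg implementation: FakeRestServer, CachingClient, and etag_from_metadata_location are hypothetical names, a plain dict stands in for the SoftReference cache, and method calls stand in for HTTP requests carrying If-None-Match.

```python
from urllib.parse import quote


def etag_from_metadata_location(location: str) -> str:
    """Build an entity-tag from a metadata location (hypothetical helper).

    Spaces and double quotes are not allowed inside an ETag value, so they
    are percent-encoded before the value is wrapped in double quotes.
    """
    return '"' + quote(location, safe="/:") + '"'


class FakeRestServer:
    """In-memory stand-in for a REST catalog server (hypothetical)."""

    def __init__(self):
        self.tables = {}      # table name -> current metadata location
        self.full_loads = 0   # number of responses that carried a full body

    def commit(self, name, metadata_location):
        self.tables[name] = metadata_location

    def load_table(self, name, if_none_match=None):
        etag = etag_from_metadata_location(self.tables[name])
        if if_none_match == etag:
            return 304, etag, None  # Not Modified: no body is sent
        self.full_loads += 1
        return 200, etag, {"metadata-location": self.tables[name]}


class CachingClient:
    """Client that treats ETags as opaque and reuses cached tables on 304."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # table name -> (etag, table)

    def load_table(self, name):
        cached = self.cache.get(name)
        previous_etag = cached[0] if cached else None
        status, etag, body = self.server.load_table(name, previous_etag)
        if status == 304:
            return cached[1]
        self.cache[name] = (etag, body)
        return body


server = FakeRestServer()
server.commit("db.t", "s3://bucket/db/t/metadata/v1 final.metadata.json")
client = CachingClient(server)

first = client.load_table("db.t")
second = client.load_table("db.t")   # answered from the cache after a 304
server.commit("db.t", "s3://bucket/db/t/metadata/v2.metadata.json")
third = client.load_table("db.t")    # ETag changed, full load again
```

Note that the client only stores and echoes the ETag; it never parses it, which matches the point above that clients must not attach meaning to the value even when the server happens to derive it from the metadata location.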
Re: [DISCUSS] Hive Support
That sounds really interesting, in a bad way :) :(

This basically means that we need to support every exact Hive version used by Spark, and we need to exclude our own Hive version from the Spark runtime.

On Thu, Dec 19, 2024, 04:00 Manu Zhang wrote:

> Hi Peter,
>
>> I think we should make sure that the Iceberg Hive version is independent
>> from the version used by Spark
>
> I'm afraid that is not how it works currently. When Spark is deployed
> with hive libraries (I suppose this is common), iceberg-spark runtime must
> be compatible with them.
> Otherwise, we need to ask users to exclude hive libraries from Spark and
> ship iceberg-spark runtime with Iceberg's hive dependencies.
>
> Regards,
> Manu
>
> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry wrote:
>
>> @Manu: What will be the end result? Do we have to use the same Hive
>> version in Iceberg as the one defined by Spark? I think we should make sure
>> that the Iceberg Hive version is independent from the version used by Spark
>>
>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com wrote:
>>
>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>
>>> We can at least separate the concerns. We can remove the runtime modules
>>> that are the main issue. If we compile against an older version of the Hive
>>> metastore module (leaving it unchanged) that at least has a dramatically
>>> reduced surface area for Java version issues. As long as the API is
>>> compatible (and we haven't heard complaints that it is not) then I think
>>> users can override the version in their environments.
>>>
>>> Ryan
>>>
>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang wrote:
>>>

Hi Daniel,

I'll start a vote once I get the PR ready.

Hi Ryan,

Sorry, I wasn't clear in the last email that the consensus is to upgrade Hive metastore support.

Well, I was too optimistic about the upgrade. Spark has only added hive 4.0 metastore support recently for Spark 4.0[1] and there will be conflicts between Spark's hive 2.3.9 and our hive 4.0 dependencies.
I'm not sure there's an upgrade path before Spark 4.0. Any ideas?

1. https://issues.apache.org/jira/browse/SPARK-45265

Thanks,
Manu

On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com wrote:

> Oh, I think I see. The upgrade to Hive 4 is just for the Hive
> metastore support? When I read the thread, I thought that we weren't going
> to change the metastore. That seems reasonable to me. Sorry for
> the confusion.
>
> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com wrote:
>
>> Sorry, I must have missed something. I don't think that we should
>> upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive
>> support entirely? Why would anyone need Hive 4 support from Iceberg when
>> it is built into Hive 4?
>>
>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks wrote:
>>
>>> Hey Manu,
>>>
>>> I agree with the direction here, but we should probably hold a quick
>>> procedural vote just to confirm since this is a significant change in
>>> support for Hive.
>>>
>>> -Dan
>>>
>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang wrote:
>>>

Thanks all for sharing your thoughts. It looks like there's a consensus on upgrading to Hive 4 and dropping hive-runtime.
I've submitted a PR[1] as the first step. Please help review.

1. https://github.com/apache/iceberg/pull/11750

Thanks,
Manu

On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya wrote:

> Hi all,
>
> I also prefer option 1. I have some initiatives[1] to improve
> integrations between Hive and Iceberg. The current style allows us to
> develop both Hive's core and HiveIcebergStorageHandler simultaneously.
> That would help us enhance integrations.
>
> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>
> Regards,
> Okumin
>
> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong wrote:
> >
> > Hey Cheng,
> >
> > Thanks for the suggestion. The nightly snapshots are available:
> > https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
> > which might help when working on features that are not released yet
> > (e.g., nanosecond timestamps). Besides that, we should run RCs against
> > Hive to check if everything works as expected.
> >
> > I'm leaning toward removing Hive 2 and 3 as well.
> >
> > Kind regards,
> > Fokko
> >
> > On Wed, Nov 27, 2024 at 20:05, rdb...@gmail.com <rdb...@gmail.com> wrote:
> >>
> >> I think that we sho
Re: There is no easy way to secure Iceberg data. How can we improve?
Actually, there is a way for the catalog to return S3 objects without granting access to the entire bucket: AWS presigning:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html

This offers time-bounded access to an object.

The catalog will need to generate and return the presigned URLs, and the applications will then use these URLs to load the files. All other access to the bucket (list, etc.) would have to be locked down.

I have used the s3a fs to download artifacts with signatures, but never generated the signatures myself. It does not have the capability to write objects to a presigned URL, and I don't see that in S3FileIO either.

Signature creation will need to be homework for the catalog.

On Thu, 2 Jan 2025 at 17:35, Jean-Baptiste Onofré wrote:

> Hi Vladimir,
>
> Thanks for starting this discussion.
>
> I agree with you that the REST catalog "should" be the centralized
> security mechanism (Polaris is a good example). However, we have two
> challenges today:
> - there's no enforcement to use the REST catalog. Some engines are
> still directly accessing the metadata.json without going through a
> catalog. Without "enforcing" catalog use (and especially REST
> catalog), it's not really possible to have a centralized security
> mechanism across engines.
> - the "entity" permission model (table, view, namespace) is REST
> catalog impl side (server side).
>
> I think we are mixing two security layers here: the REST and entity
> security (RBAC, etc) and the storage (credential vending).
>
> Thinking aloud, I would consider the storage as "internal security"
> and REST catalog as "user facing security". Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem? It would
> "standardize" the "user facing security" (and the implementation can
> implement credentials vending for the storage).
>
> Just my $0.01 :)
>
> Regards
> JB
>
> On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov wrote:
> >
> > Hi,
> >
> > Apache Iceberg can address multiple analytical scenarios, including ETL,
> > streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> > integration nowadays is secure access to Iceberg tables across multiple
> > tools and engines. There are several typical approaches to lakehouse
> > security:
> >
> > Controlled environment. E.g., Google BigQuery has special
> > readers/writers for open formats, tightly integrated with managed
> > engines. Doesn't work outside of a specific cloud vendor.
> > Securing storage. E.g., various S3 access policies. Works for individual
> > files/buckets but can hardly address important access restrictions, such
> > as column access permissions, masking, and filtering. Tightly integrated
> > solutions, such as AWS S3 Tables, can potentially solve these, but this
> > implies a cloud vendor lock-in.
> > Catalog-level permissions. For example, a Tabular/Polaris role model,
> > possibly with vended credentials or remote request signature. Works for
> > coarse-grained access permissions but fails to deliver proper access
> > control for individual columns, as well as masking and filtering.
> > Centralized security service. E.g., Apache Ranger, OPA. It could provide
> > whatever security permissions, but each engine must provide its own
> > integration with the service. Also, some admins of such services usually
> > have to duplicate access permissions between different engines. For
> > example, the column masking policy for Trino in Apache Ranger will not
> > work for Apache Spark.
> > Securing data with virtual views. Works for individual engines, but not
> > across engines. There is an ongoing discussion about common IR with
> > Substrait, but given the complexity of engine dialects, we can hardly
> > expect truly reusable views any time soon. Moreover, similarly to Apache
> > Ranger, this shifts security decisions towards the engine, which is not
> > good.
> >
> > To the best of my knowledge, the above-mentioned strategies are some of
> > the "state-of-the-art" techniques for secure lakehouse access. I would
> > argue that none of these strategies are open, secure, interoperable, and
> > convenient for end users simultaneously. Compare it with security
> > management in monolithic systems, such as Vertica: execute a couple of
> > SQL statements, done.
> >
> > Having a solid vision of a secure lakehouse could be a major advantage
> > for Apache Iceberg. I would like to kindly ask the community about your
> > thoughts on what are the current major pain points with your
> > Iceberg-based deployments security and what could be done at the Iceberg
> > level to further improve it.
> >
> > My 5 cents. REST catalog is a very good candidate for a centralized
> > security mechanism for the whole lakehouse, irrespective of the engine
> > that accesses data. However, the security capabilities of the current
> > REST protocol are limited. We can secure individual catalogs,
> > namespaces, and tables. But we cannot:
> >
> > Define individual column permissions
> > Apply column masking
> > Apply row-level filtering
> >
> >
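The presigned-URL idea at the top of this message (the catalog signs a URL that grants time-bounded access to exactly one object, and storage rejects anything unsigned or expired) can be illustrated with a toy Python sketch. This is deliberately not AWS Signature Version 4 and not how S3 computes presigned URLs; it is a simplified HMAC scheme with hypothetical names (SECRET, presign, verify, storage.example.com) that only shows the shape of the interaction. A real catalog would delegate signing to the storage vendor's SDK.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical signing key shared by the catalog (signer) and storage (verifier).
SECRET = b"catalog-signing-key"


def _signature(path: str, expires_at: int) -> str:
    # Sign the object path together with its expiry so neither can be altered.
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path: str, expires_at: int) -> str:
    """Catalog side: return a URL valid for one object until expires_at (epoch seconds)."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at)})
    return f"https://storage.example.com{path}?{query}"


def verify(url: str, now: int) -> bool:
    """Storage side: accept only unexpired URLs whose signature matches the path."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires_at)
    return hmac.compare_digest(sig, expected) and now < expires_at


url = presign("/warehouse/db/t/data/00000.parquet", expires_at=1000)
ok_before_expiry = verify(url, now=999)
ok_after_expiry = verify(url, now=1000)
ok_other_object = verify(url.replace("00000", "00001"), now=999)
```

Here the URL verifies before its expiry, fails at or after it, and fails when the path is changed to a different object, which is the property that lets the bucket itself stay locked down. As noted above, this covers reads; writing through presigned URLs is a separate capability that the clients mentioned in this message do not appear to offer.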
Re: There is no easy way to secure Iceberg data. How can we improve?
Hi Vladimir and JB,

There have been some previous discussions on security [1].

> We can think about splitting table data into multiple files for
> column-level security and masking. For example, instead of storing columns
> [a, b, c] in the same Parquet file, we split them into three files: [a, b],
> [c], [c_masked]. Then, individual policies could be applied to these files
> at the catalog or storage layer.

IMO, this would add too much complexity to the specification. Parquet, in theory, has metadata available to split columns across files, but the Parquet community has chosen not to actually implement this in any of its readers (mostly due to complexity and compatibility reasons).

> For row-level filtering, we can think of a table redirection. That is,
> a user asks for table "A", and we return the table metadata for
> "A_filtered" with different data. It is not an ideal solution at all: it is
> not flexible enough, requires data duplication, requires extensive support
> at the engine level, etc. But might be better than nothing.

Based on the prior discussions there are potentially other models to consider:

1. A shared responsibility model, where compute engines can be registered as "trusted" to implement the access controls registered in the REST API. This was already touched on above as not necessarily being desirable.
2. For non-trusted engines, provide a table data service that acts as a secure proxy to the data to enforce access controls (e.g., an Arrow Flight or Flight SQL service [2][3] or an extension beyond this [4]). The scan planning APIs in the REST service are already a step in this direction.

I think these two together should provide an incremental path for handling secure tables. A large number of use-cases can be supported by ensuring trusted Spark/Trino clusters are available. Other engines can either add the necessary support on their own timeline or, if data access for those is a requirement, data administrators can set up the proxy service.

> Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem ?

I think for security purposes this makes sense. As a general requirement, having the flexibility of different catalogs depending on implementation needs still makes sense to me.

[1] https://lists.apache.org/thread/4swop72zgcr8rrmwvb51rlk0vnb8joyz
[2] https://arrow.apache.org/docs/format/Flight.html
[3] https://arrow.apache.org/docs/format/FlightSql.html
[4] https://lists.apache.org/thread/g4jkyh4o8rqk16cl3mo3wb2h00y92z9j

On Thu, Jan 2, 2025 at 9:36 AM Jean-Baptiste Onofré wrote:

> Hi Vladimir,
>
> Thanks for starting this discussion.
>
> I agree with you that the REST catalog "should" be the centralized
> security mechanism (Polaris is a good example). However, we have two
> challenges today:
> - there's no enforcement to use the REST catalog. Some engines are
> still directly accessing the metadata.json without going through a
> catalog. Without "enforcing" catalog use (and especially REST
> catalog), it's not really possible to have a centralized security
> mechanism across engines.
> - the "entity" permission model (table, view, namespace) is REST
> catalog impl side (server side).
>
> I think we are mixing two security layers here: the REST and entity
> security (RBAC, etc) and the storage (credential vending).
>
> Thinking aloud, I would consider the storage as "internal security"
> and REST catalog as "user facing security". Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem? It would
> "standardize" the "user facing security" (and the implementation can
> implement credentials vending for the storage).
>
> Just my $0.01 :)
>
> Regards
> JB
>
> On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov wrote:
> >
> > Hi,
> >
> > Apache Iceberg can address multiple analytical scenarios, including ETL,
> > streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> > integration nowadays is secure access to Iceberg tables across multiple
> > tools and engines. There are several typical approaches to lakehouse
> > security:
> >
> > Controlled environment. E.g., Google BigQuery has special
> > readers/writers for open formats, tightly integrated with managed
> > engines. Doesn't work outside of a specific cloud vendor.
> > Securing storage. E.g., various S3 access policies. Works for individual
> > files/buckets but can hardly address important access restrictions, such
> > as column access permissions, masking, and filtering. Tightly integrated
> > solutions, such as AWS S3 Tables, can potentially solve these, but this
> > implies a cloud vendor lock-in.
> > Catalog-level permissions. For example, a Tabular/Polaris role model,
> > possibly with vended credentials or remote request signature. Works for
> > coarse-grained access permissions but fails to deliver proper access
> > control for individual columns, as well as masking and filtering.
> > Centralized security service. E.g., Apache Ranger, OPA. It could provide
> > whatever security permissi