Re: [DISCUSS] REST Catalog bulk object lookup

2025-01-03 Thread Vladimir Ozerov
A motivating example: Trino recently had to implement parallel table
metadata fetching (https://github.com/trinodb/trino/pull/23909) because
metadata queries (e.g., against INFORMATION_SCHEMA) were otherwise slow.
Parallel metadata retrieval boosted metadata query performance
significantly, but this solution is far from ideal:

   1. Catalogs will now experience request bursts whenever a user or a tool
   attempts to list Iceberg objects in Trino. This may induce unpredictable
   latency spikes, especially for large schemas.
   2. Each such request imposes a constant catalog overhead for request
   dispatching, serde, security checks, etc., which could easily be avoided
   with a bulk metadata lookup (see the sketch below).
   3. The aforementioned fix addresses only parallel table retrieval. The
   engine will then have to support the same thing for views and
   materialized views, producing even more request bursts, with a
   considerable number of requests returning error responses because we
   cannot get an object's type and its metadata in one shot.
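
To make the contrast concrete, here is a rough engine-side sketch. The
probe pattern uses the existing catalog APIs; the bulk API names are
illustrative only, the actual shape is described in the proposal doc:

    // Status quo: probe per object and per type; the object type is only
    // discovered through a failed call.
    Table table = null;
    View view = null;
    try {
      table = catalog.loadTable(ident);       // round trip #1
    } catch (NoSuchTableException e) {
      view = viewCatalog.loadView(ident);     // round trip #2: only now we
    }                                         // learn the object is a view

    // Hypothetical bulk lookup: one request resolving all referenced
    // objects, each result carrying its type (TABLE/VIEW) and metadata.
    Map<TableIdentifier, LoadedObject> resolved =
        restCatalog.lookupObjects(List.of(orders, lineitem, salesView));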


On Tue, Dec 24, 2024 at 10:29 PM Vladimir Ozerov 
wrote:

> Hi,
>
> Following the discussion [1], I'd like to formally propose an extension to
> the REST catalog API that allows efficient lookup of multiple catalog
> objects without knowing their types in advance.
>
> When a query is submitted, the engine needs to resolve the referenced
> objects. The current REST API requires multiple catalog calls per query
> because it (1) assumes prior knowledge of the object type (which virtually
> no query engine has), and (2) lacks a bulk object lookup operation. This
> leads to increased query latency and increased REST catalog load.
>
> The proposal aims to solve the problem by introducing an optional endpoint
> that returns information about several catalog objects, including their
> type (table, view) and metadata.
>
> Note that the proposal attempts to solve two distinct issues via a single
> endpoint:
>
>1. Inability to look up an object without knowing its type
>2. Inability to look up multiple objects in a single request
>
> If the community finds the proposal too complicated, we can minimize the
> scope to point 1 and introduce an endpoint for object lookup without
> knowing its type. Even without bulk lookup, this can help engine developers
> minimize SQL query planning latency.
>
> Proposal:
> https://docs.google.com/document/d/1KfzdQT8Q2xiV_yPNvICROCepz-Qqpm0npob7hmb40Fc/edit?usp=sharing
>
> [1] https://lists.apache.org/thread/g44czzpjqqhdvronqfyckw4mnxvlpn3s
>
> Regards,
> --
> *Vladimir Ozerov*
>
>

-- 
*Vladimir Ozerov*
Founder
querifylabs.com


Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

2025-01-03 Thread Yufei Gu
The proposal looks great to me. Thanks, Gabor, for working on it. Have we
created a spec change PR yet?

Yufei


On Thu, Dec 19, 2024 at 2:11 AM Gabor Kaszab  wrote:

> Hi All,
>
> Just an update that the proposal went through some iterations based on the
> comments from Daniel Weeks. Thanks for taking a look, Daniel!
>
> In a nutshell, this is what changed compared to the original proposal:
> - The Catalog API stays intact; there is no proposed new API function
> now. With this, the freshness-aware functionality, and ETags in
> particular, will not be exposed to the clients of the API.
> - Instead of storing the ETags in TableMetadata, we propose to store them
> in RESTTableOperations, since the proposal only focuses on the REST
> catalog. The very same changes can be made in other TableOperations
> implementations if there is a need to have this for other catalogs too.
> - A SoftReference cache of (TableIdentifier -> Table object) is introduced
> at the RESTSessionCatalog level. This can be used for providing previous
> ETags to the HTTPClient and also to answer Catalog API calls with the
> latest table metadata if the REST server returns a '304 Not Modified'
> (see the sketch below).
>
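> For illustration, the client-side flow would roughly look like this (a
> minimal sketch; all names below are simplified placeholders, not the
> actual implementation):
>
>   String cachedEtag = etagCache.get(ident);        // SoftReference-backed
>   Map<String, String> headers = cachedEtag == null
>       ? Map.of()
>       : Map.of("If-None-Match", cachedEtag);
>   HttpResponse resp = httpClient.get(tablePath(ident), headers);
>   if (resp.code() == 304) {                        // Not Modified
>     return tableCache.get(ident);                  // serve cached Table
>   }
>   etagCache.put(ident, resp.header("ETag"));       // remember new ETag
>   return parseAndCacheTable(ident, resp.body());
>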
> The doc is updated with the above now:
>
> https://docs.google.com/document/d/1rnVSP_iv2I47giwfAe-Z3DYhKkKwWCVvCkC9rEvtaLA
>
> While I keep the discussion open, I think I'll move on to take care of
> the changes required for the REST spec. Will send a PR for this soon.
>
> Regards,
> Gabor
>
>
> On Thu, Dec 12, 2024 at 4:07 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Gabor
>>
>> Thanks for the update ! I will take a look.
>>
>> Regards
>> JB
>>
>> On Thu, Dec 12, 2024 at 2:52 PM Gabor Kaszab 
>> wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > It took me a while but I finally managed to upload the proposal for
>> this as an official 'Iceberg improvement proposal'. Thanks for the feedback
>> so far!
>> >
>> > https://github.com/apache/iceberg/issues/11766
>> >
>> > Regards,
>> > Gabor
>> >
>> >
>> > On Fri, Nov 22, 2024 at 4:51 PM Taeyun Kim 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Since ETags are opaque values to the client, attributing any semantic
>> meaning to them in the interaction between the client and server would, in
>> my opinion, constitute a misuse/abuse of the HTTP specification.
>> >> On the other hand, the server can generate the ETag value as any
>> string, as long as it conforms to the grammar defined in
>> https://httpwg.org/specs/rfc9110.html#field.etag . Using the metadata
>> location is likely the simplest option. For reference, based on the
>> grammar, ETag values cannot include spaces. Therefore, if the metadata
>> location contains spaces, it may need to be encoded. The same goes for
>> double quotation marks. (I just found this out after looking it up.)
>> >> Anyway, in my opinion, the client must ignore any semantic meaning
>> associated with the value.
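>> >>
>> >> For example, the server could derive a grammar-safe ETag from the
>> >> metadata location like this (just a sketch; it assumes percent-encoding
>> >> is an acceptable choice, since percent-encoded output contains neither
>> >> spaces nor double quotes):
>> >>
>> >>   // opaque-tag = DQUOTE *etagc DQUOTE (RFC 9110)
>> >>   String safe = URLEncoder.encode(metadataLocation, StandardCharsets.UTF_8);
>> >>   String etag = "\"" + safe + "\"";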
>> >>
>> >> Thank you.
>> >>
>> >> -Original Message-
>> >> From:  "Zoltán Borók-Nagy" 
>> >> To:  ;
>> >> Cc:
>> >> Sent:  2024-11-22 (Fri) 19:57:08 (UTC+09:00)
>> >> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the
>> latest
>> >>
>> >> Hi,
>> >>
>> >> Separate version information forces the clients to manage a Table ->
>> >> VersionIdentifier mapping, which adds unnecessary complexity and can be
>> >> error-prone.
>> >>
>> >> If the VersionIdentifier is embedded in the Table object, then the
>> >> application logic is much simpler, and the Catalog interface is not
>> >> only simpler but also hard to use incorrectly. This approach slightly
>> >> increases the size of the Table objects, though, and touching the Table
>> >> interface might encounter some resistance, even if it is only an
>> >> extension.
>> >>
>> >> Yeah, VersionIdentifier doesn't need to be a String, it could be an
>> >> Object, or an empty interface, and the Catalog implementation could
>> >> cast it to some catalog-specific VersionIdentifierImpl.
>> >>
>> >> loadTableIfChanged() throwing UnsupportedOperationException is
>> >> reasonable, as clients can easily fall back to loadTable(). In my mind I
>> >> had a use case where we cache tables without any refresh checks for a
>> >> configured TTL, and after expiration we invoke reloadTable() anyway.
>> >> But this use case can also be implemented even if loadTableIfChanged()
>> >> throws exceptions, making this approach more flexible (see the sketch
>> >> below).
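>> >>
>> >> E.g., a caller-side sketch of that fallback (using the API names from
>> >> this discussion, nothing final):
>> >>
>> >>   Table table;
>> >>   try {
>> >>     table = catalog.loadTableIfChanged(ident, knownVersion);
>> >>   } catch (UnsupportedOperationException e) {
>> >>     table = catalog.loadTable(ident);   // plain reload fallback
>> >>   }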
>> >>
>> >> About metadata_location as ETag: I don't have a strong opinion here,
>> >> not sure what could go wrong if we do this. If we start with this
>> >> approach we don't even need a VersionIdentifier for Tables, making the
>> >> whole proposal more lightweight.
>> >>
>> >> Thanks Gabor for driving this and putting together a proposal!
>> >>
>> >> Cheers,
>> >> Zoltan
>> >>
>> >> On Fri, Nov 22, 2024 at 11:42 AM Gabor Kaszab 
>> wrote:
>> >> >
>> >> > Hi Taeyun,
>> >> >
>> >> > Thanks for the writeup! Let me reflect to some areas:
>> >> >
>> >> >> the caller manages the version i

Re: [DISCUSS] Hive Support

2025-01-03 Thread Péter Váry
That sounds really interesting in a bad way :) :(

This basically means that we need to support every exact Hive version that
is used by Spark, and we need to exclude our own Hive version from the
Spark runtime.

On Thu, Dec 19, 2024, 04:00 Manu Zhang  wrote:

> Hi Peter,
>
>> I think we should make sure that the Iceberg Hive version is independent
>> from the version used by Spark
>
>  I'm afraid that is not how it works currently. When Spark is deployed
> with Hive libraries (I suppose this is common), the iceberg-spark runtime
> must be compatible with them. Otherwise, we need to ask users to exclude
> Hive libraries from Spark and ship the iceberg-spark runtime with
> Iceberg's Hive dependencies.
>
> Regards,
> Manu
>
> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry 
> wrote:
>
>> @Manu: What will be the end result? Do we have to use the same Hive
>> version in Iceberg as is defined by Spark? I think we should make sure
>> that the Iceberg Hive version is independent of the version used by Spark.
>>
>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com  wrote:
>>
>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>
>>> We can at least separate the concerns. We can remove the runtime modules
>>> that are the main issue. If we compile against an older version of the
>>> Hive metastore module (leaving it unchanged), that at least gives a
>>> dramatically reduced surface area for Java version issues. As long as the
>>> API is compatible (and we haven't heard complaints that it is not), I
>>> think users can override the version in their environments.
>>>
>>> Ryan
>>>
>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang 
>>> wrote:
>>>
 Hi Daniel,
 I'll start a vote once I get the PR ready.

 Hi Ryan,
 Sorry, I wasn't clear in the last email: the consensus is to upgrade the
 Hive metastore support.

 Well, I was too optimistic about the upgrade. Spark has only recently
 added Hive 4.0 metastore support, for Spark 4.0 [1], and there will be
 conflicts between Spark's Hive 2.3.9 and our Hive 4.0 dependencies.
 I'm not sure there's an upgrade path before Spark 4.0. Any ideas?

 1. https://issues.apache.org/jira/browse/SPARK-45265

 Thanks,
 Manu


 On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com 
 wrote:

> Oh, I think I see. The upgrade to Hive 4 is just for the Hive
> metastore support? When I read the thread, I thought that we weren't going
> to change the metastore. That seems reasonable to me. Sorry for
> the confusion.
>
> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com 
> wrote:
>
>> Sorry, I must have missed something. I don't think that we should
>> upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive
>> support entirely? Why would anyone need Hive 4 support from Iceberg when
>> it is built into Hive 4?
>>
>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks 
>> wrote:
>>
>>> Hey Manu,
>>>
>>> I agree with the direction here, but we should probably hold a quick
>>> procedural vote just to confirm since this is a significant change in
>>> support for Hive.
>>>
>>> -Dan
>>>
>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang 
>>> wrote:
>>>
 Thanks all for sharing your thoughts. It looks like there's a consensus
 on upgrading to Hive 4 and dropping hive-runtime.
 I've submitted a PR [1] as the first step. Please help review.

 1. https://github.com/apache/iceberg/pull/11750

 Thanks,
 Manu

 On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya 
 wrote:

> Hi all,
>
> I also prefer option 1. I have some initiatives[1] to improve
> integrations between Hive and Iceberg. The current style allows us
> to
> develop both Hive's core and HiveIcebergStorageHandler
> simultaneously.
> That would help us enhance integrations.
>
> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>
> Regards,
> Okumin
>
> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong 
> wrote:
> >
> > Hey Cheng,
> >
> > Thanks for the suggestion. The nightly snapshots are available:
> https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
which might help when working on features that are not released yet (e.g.
Nanosecond timestamps). Besides that, we should run RCs against Hive to
> Nanosecond timestamps). Besides that, we should run RCs against Hive 
> to
> check if everything works as expected.
> >
> > I'm leaning toward removing Hive 2 and 3 as well.
> >
> > Kind regards,
> > Fokko
> >
> > On Wed, 27 Nov 2024 at 20:05, rdb...@gmail.com <
> rdb...@gmail.com> wrote:
> >>
> >> I think that we sho

Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-03 Thread Steve Loughran
Actually, there is a way for the catalog to return S3 objects without
granting access to the entire bucket: AWS presigning:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html
This offers time-bounded access to an object.

The catalog will need to generate and return the presigned URLs, and the
applications will then use these URLs to load the files.

All other access to the bucket (list, etc.) would have to be locked down.

I have used the s3a FS to download artifacts with signatures, but never
generated the signatures myself. It does not have the capability to write
objects to a presigned URL, and I don't see that in S3FileIO either.

Signature creation will need to be homework for the catalog.
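
For reference, generating such a URL on the catalog side is straightforward
with the v2 AWS SDK presigner. A minimal sketch (the bucket, key, and
15-minute TTL are placeholders):

    import java.net.URL;
    import java.time.Duration;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.S3Presigner;
    import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;

    static URL presignRead(String bucket, String key) {
      try (S3Presigner presigner = S3Presigner.create()) {
        return presigner.presignGetObject(GetObjectPresignRequest.builder()
                .signatureDuration(Duration.ofMinutes(15)) // time-bounded access
                .getObjectRequest(GetObjectRequest.builder()
                    .bucket(bucket).key(key).build())
                .build())
            .url();
      }
    }

The catalog would hand such URLs back to the engine (e.g., alongside a
load-table response), and the engine then fetches the files with plain
HTTPS GETs.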



On Thu, 2 Jan 2025 at 17:35, Jean-Baptiste Onofré  wrote:

> Hi Vladimir,
>
> Thanks for starting this discussion.
>
> I agree with you that the REST catalog "should" be the centralized
> security mechanism (Polaris is a good example). However, we have two
> challenges today:
> - there's no enforcement to use the REST catalog. Some engines are
> still directly accessing the metadata.json without going through a
> catalog. Without "enforcing" catalog use (and especially REST
> catalog), it's not really possible to have a centralized security
> mechanism across engines.
> - the "entity" permission model (table, view, namespace) is REST
> catalog impl side (server side).
>
> I think we are mixing two security layers here: the REST and entity
> security (RBAC, etc) and the storage (credential vending).
>
> Thinking aloud, I would consider the storage as "internal security"
> and REST catalog as "user facing security". Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem ? It would
> "standardize" the "user facing security" (and the implementation can
> implement credentials vending for the storage).
>
> Just my $0.01 :)
>
> Regards
> JB
>
> On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov 
> wrote:
> >
> > Hi,
> >
> > Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
> >
> > Controlled environment. E.g., Google BigQuery has special
> readers/writers for open formats, tightly integrated with managed engines.
> Doesn't work outside of a specific cloud vendor.
> > Securing storage. E.g., various S3 access policies. Works for individual
> files/buckets but can hardly address important access restrictions, such as
> column access permissions, masking, and filtering. Tightly integrated
> solutions, such as AWS S3 Tables, can potentially solve these, but this
> implies a cloud vendor lock-in.
> > Catalog-level permissions. For example, a Tabular/Polaris role model,
> possibly with vended credentials or remote request signature. Works for
> coarse-grained access permissions but fails to deliver proper access
> control for individual columns, as well as masking and filtering.
> Centralized security service. E.g., Apache Ranger, OPA. It could provide
> whatever security permissions are needed, but each engine must provide its
> own integration with the service. Also, admins of such services usually
> have to duplicate access permissions between different engines. For
> example, a column masking policy for Trino in Apache Ranger will not work
> for Apache Spark.
> > Securing data with virtual views. Works for individual engines, but not
> across engines. There is an ongoing discussion about a common IR with
> Substrait, but given the complexity of engine dialects, we can hardly
> expect truly reusable views any time soon. Moreover, similarly to Apache
> Ranger, this shifts security decisions towards the engine, which is not
> good.
> >
> To the best of my knowledge, the above-mentioned strategies are some of
> the "state-of-the-art" techniques for secure lakehouse access. I would
> argue that none of these strategies is open, secure, interoperable, and
> convenient for end users simultaneously. Compare this with security
> management in monolithic systems, such as Vertica: execute a couple of SQL
> statements, done.
> >
> Having a solid vision of a secure lakehouse could be a major advantage
> for Apache Iceberg. I would like to kindly ask the community for your
> thoughts on the current major pain points with the security of your
> Iceberg-based deployments, and on what could be done at the Iceberg level
> to further improve it.
> >
> My 5 cents: the REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine that
> accesses the data. However, the security capabilities of the current REST
> protocol are limited. We can secure individual catalogs, namespaces, and
> tables. But we cannot:
> >
> Define individual column permissions
> Apply column masking
> Apply row-level filtering
> >
>

Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-03 Thread Micah Kornfield
Hi Vladimir and JB,

There have been some previous discussions on security [1].


> We can think about splitting table data into multiple files for
> column-level security and masking. For example, instead of storing columns
> [a, b, c] in the same Parquet file, we split them into three files: [a, b],
> [c], [c_masked]. Then, individual policies could be applied to these files
> at the catalog or storage layer.


IMO, I think this would add too much complexity to the specification.
Parquet, in theory, has the metadata needed to split columns across files,
but the Parquet community has chosen not to implement this in any of its
readers (mostly due to complexity and compatibility reasons).

For row-level filtering, we can think of a table redirection. That is,
> a user asks for table "A", and we return the table metadata for
> "A_filtered" with different data. It is not an ideal solution at all: it is
> not flexible enough, requires data duplication, requires extensive support
> at the engine level, etc. But might be better than nothing.


Based on the prior discussions, there are potentially other models to
consider:
1. A shared responsibility model, where compute engines can be registered
as "trusted" to implement the access controls registered in the REST API.
This was already touched on above as not necessarily being desirable.
2. For non-trusted engines, provide a table data service that acts as a
secure proxy to the data and enforces access controls (e.g., an Arrow
Flight or Flight SQL service [2][3], or an extension beyond this [4]). The
scan planning APIs in the REST service are already a step in this
direction.

I think between these two approaches we get an incremental path for
handling secure tables. A large number of use cases can be supported by
ensuring trusted Spark/Trino clusters are available. Other engines can
either add the necessary support on their own timeline or, if data access
for those is a requirement, data administrators can set up the proxy
service (see the sketch below).
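
As a purely hypothetical illustration of option 2, an untrusted engine
would talk to the proxy instead of the storage, e.g. via Flight SQL
(classes from Arrow's flight-core/flight-sql modules; the endpoint, query,
and process() are placeholders):

    // The proxy authenticates the caller, applies column masking and row
    // filters, and streams back only the data this principal may see.
    try (RootAllocator allocator = new RootAllocator()) {
      FlightSqlClient client = new FlightSqlClient(
          FlightClient.builder(allocator,
              Location.forGrpcTls("proxy.example.com", 443)).build());
      FlightInfo info = client.execute("SELECT * FROM db.orders");
      for (FlightEndpoint endpoint : info.getEndpoints()) {
        try (FlightStream stream = client.getStream(endpoint.getTicket())) {
          while (stream.next()) {
            process(stream.getRoot());  // one Arrow batch (VectorSchemaRoot)
          }
        }
      }
    }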


>  Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem ?


I think for security purposes this makes sense.  As a general requirement,
having the flexibility of different catalogs depending on implementation
needs still makes sense to me.


[1] https://lists.apache.org/thread/4swop72zgcr8rrmwvb51rlk0vnb8joyz
[2] https://arrow.apache.org/docs/format/Flight.html
[3] https://arrow.apache.org/docs/format/FlightSql.html
[4] https://lists.apache.org/thread/g4jkyh4o8rqk16cl3mo3wb2h00y92z9j

On Thu, Jan 2, 2025 at 9:36 AM Jean-Baptiste Onofré  wrote:

> Hi Vladimir,
>
> Thanks for starting this discussion.
>
> I agree with you that the REST catalog "should" be the centralized
> security mechanism (Polaris is a good example). However, we have two
> challenges today:
> - there's no enforcement to use the REST catalog. Some engines are
> still directly accessing the metadata.json without going through a
> catalog. Without "enforcing" catalog use (and especially REST
> catalog), it's not really possible to have a centralized security
> mechanism across engines.
> - the "entity" permission model (table, view, namespace) is REST
> catalog impl side (server side).
>
> I think we are mixing two security layers here: the REST and entity
> security (RBAC, etc) and the storage (credential vending).
>
> Thinking aloud, I would consider the storage as "internal security"
> and REST catalog as "user facing security". Why not consider
> "enforcing" REST Catalog in the Iceberg ecosystem ? It would
> "standardize" the "user facing security" (and the implementation can
> implement credentials vending for the storage).
>
> Just my $0.01 :)
>
> Regards
> JB
>
> On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov 
> wrote:
> >
> > Hi,
> >
> > Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
> >
> > Controlled environment. E.g., Google BigQuery has special
> readers/writers for open formats, tightly integrated with managed engines.
> Doesn't work outside of a specific cloud vendor.
> > Securing storage. E.g., various S3 access policies. Works for individual
> files/buckets but can hardly address important access restrictions, such as
> column access permissions, masking, and filtering. Tightly integrated
> solutions, such as AWS S3 Tables, can potentially solve these, but this
> implies a cloud vendor lock-in.
> > Catalog-level permissions. For example, a Tabular/Polaris role model,
> possibly with vended credentials or remote request signature. Works for
> coarse-grained access permissions but fails to deliver proper access
> control for individual columns, as well as masking and filtering.
> > Centralized security service. E.g., Apache Ranger, OPA. It could provide
> whatever security permissi