There are three reasons why we want to use S3FileIO over HadoopFileIO:

1. We want access to the S3Client in our service to support some special handling of the auth. This is not possible with the HadoopFileIO because the S3Client is not exposed.
2. We would like to improve upon the S3FileIO in the future by introducing a vectorized IO mechanism, and it makes it easier if we are already using S3FileIO. I'll post my thoughts about the vectorized IO in a later email in the upcoming weeks.

3. As Ryan mentioned earlier, we are seeing very high memory usage with the HadoopFileIO in the case of highly concurrent commits. I reported that in another thread.

To move forward: can we start by adding 'gs' to the S3URI's valid prefixes? One of Jack's suggestions was to remove any scheme check from the S3URI. Given we are building ResolvingFileIO, I think removing the scheme check in the individual implementations is not a bad idea. Either solution will work for us.

Thanks,
Mayur

From: Ryan Blue <b...@tabular.io>
Sent: Thursday, December 2, 2021 11:37 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

I think the advantage of S3FileIO over HadoopFileIO with s3a is that it doesn't hit the memory consumption problem that Mayur posted to the list. That's a fairly big advantage, so I think it's reasonable to try to support this in 0.13.0. It should be easy enough to add the gs scheme and then we can figure out how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I guess we'll be adding scheme-to-implementation customization sooner than I thought!

Ryan

On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

Hi

I agree that endpoint, credentials, path-style access, etc. should be configurable. There are storages which are primarily used as "S3 compatible" and they need these settings to make them work. We've seen these being used to access MinIO, Ceph, and even S3 with some gateway (I am light on details, sorry). In all these cases, users seem to use s3:// URLs even if not talking to the actual AWS S3 service.

If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just by accepting the gs:// protocol and delegating to S3FileIO for now. In the long term, I would recommend using a native GCS client though, or the Hadoop file system implementation provided by Google.

BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs HadoopFileIO?

BR,
PF

On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:

And here is a proposal of what I think could be the best way to go for both worlds:

(1) remove URI restrictions in S3FileIO (or allow configuration of additional accepted schemes), and allow direct user configuration of endpoint, credentials, etc. to make S3 configuration simpler without the need to reconfigure the entire client.
(2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, others -> HadoopFileIO
(3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize S3FileIO differently, and users should be able to configure them differently in catalog properties
(4) for users that need special GCS-unique features, a GCSFileIO could eventually be developed, and then people can choose to map gs -> GCSFileIO in ResolvingFileIO

-Jack

On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:

Thanks for the confirmation, this is as I expected. We had a similar case for Dell EMC ECS recently, where they published a version of their FileIO that works through S3FileIO (https://github.com/apache/iceberg/pull/2807) and the only thing needed was to override the endpoint, region, and credentials.
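As a rough illustration of how such an override might surface to users (following point (1) of the proposal above), here is a minimal Java sketch. The property keys s3.endpoint, s3.access-key-id, s3.secret-access-key, and s3.path-style-access are assumed names for the kind of settings being proposed, not properties guaranteed to exist in any particular release; only io-impl is an existing catalog property.

import java.util.Map;
import org.apache.iceberg.aws.s3.S3FileIO;

public class EndpointOverrideSketch {
  public static void main(String[] args) {
    // Hypothetical catalog properties pointing S3FileIO at an S3-compatible
    // endpoint with static credentials; the s3.* key names are illustrative.
    Map<String, String> properties = Map.of(
        "io-impl", "org.apache.iceberg.aws.s3.S3FileIO",
        "s3.endpoint", "https://object-store.example.com",
        "s3.access-key-id", "my-access-key",
        "s3.secret-access-key", "my-secret-key",
        "s3.path-style-access", "true");

    // A catalog (or a caller) would then initialize the FileIO from properties
    // instead of constructing and injecting an S3Client by hand.
    S3FileIO io = new S3FileIO();
    io.initialize(properties);
  }
}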
They also proposed some specialization because their object storage service supports an Append operation when writing data. However, in the end they just created another FileIO (https://github.com/apache/iceberg/pull/3376) using their own SDK to better support that specialization.

I believe the recent addition of ResolvingFileIO was to support using multiple FileIOs and switching between them based on the file scheme. If we continue down that path, it feels more reasonable to me that we will have specialized FileIOs for each implementation and allow them to evolve independently. Users will be able to set whatever specialized configurations each implementation needs and take advantage of all of them.

On the other hand, if we can support using S3FileIO as the new standard FileIO that works with multiple storage providers, the advantages I see are:
(1) it is simpler from the user's perspective, because S3 is the least common denominator supported by many cloud storage providers, and it is more work to configure and maintain multiple FileIOs.
(2) we can avoid ResolvingFileIO's current check of the file scheme for each file path string, which might lead to some performance gain, although I do not know how much we would gain.

From a technical perspective I prefer having dedicated FileIOs and an overall ResolvingFileIO, because Iceberg's FileIO interface is simple enough for people to build specialized and proper support for different storage systems. But it's also very tempting to just reuse the same thing instead of building another one, especially when that feature is lacking and the current functionality could easily be extended to support it. The concern is that we will end up like Hadoop, which had to develop another sub-layer of the FileSystem interface to accommodate the unique features of different storage providers once specialized feature requests came in, and at that point there is no difference from the dedicated FileIO + ResolvingFileIO architecture.

I wonder what Daniel thinks about this, since I believe he is more interested in multi-cloud support.

-Jack

On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi Jack, Daniel,

We use several S3-compatible backends with Iceberg; these include S3, GCS, and others. Currently, S3FileIO provides all the functionality we need for Iceberg to talk to these backends.

The way we create S3FileIO is via the constructor, providing the S3Client as the constructor param; we do not use the initialize(Map<String, String>) method in FileIO. Our custom catalog accepts the FileIO object at creation time. To talk to GCS, we create the S3Client with a few overrides (described below) and pass it to S3FileIO. After that, the rest of the S3FileIO code works as is. The only exception is that "gs" (used by GCS URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the email.

The reason why we want to use S3FileIO to talk to GCS is that S3FileIO almost works out of the box and contains all the functionality needed to talk to GCS. The only special requirements are the creation of the S3Client and allowing the "gs" prefix in the URIs. Based on our early experiments and benchmarks, S3FileIO provides all the functionality we need and performs well, so we didn't see a need to create a native GCS FileIO. The Iceberg operations we need are creating, dropping, reading, and writing objects on S3, and S3FileIO provides this functionality.
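A minimal, illustrative sketch of the construction pattern just described: build an S3Client pointed at the GCS endpoint, install an interceptor for the auth headers, and hand the client to S3FileIO through its supplier constructor. The endpoint URL, region value, header handling, and token lookup below are assumptions for illustration, not our actual code.

import java.net.URI;
import org.apache.iceberg.aws.s3.S3FileIO;
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.http.SdkHttpRequest;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class GcsBackedS3FileIO {
  public static S3FileIO create() {
    // Interceptor that rewrites request headers (e.g. injects a GCS auth token);
    // the header name and token source here are placeholders.
    ExecutionInterceptor gcsAuthInterceptor = new ExecutionInterceptor() {
      @Override
      public SdkHttpRequest modifyHttpRequest(Context.ModifyHttpRequest context,
                                              ExecutionAttributes attributes) {
        return context.httpRequest().toBuilder()
            .putHeader("Authorization", "Bearer " + fetchGcsToken())
            .build();
      }
    };

    S3Client gcsClient = S3Client.builder()
        .endpointOverride(URI.create("https://storage.googleapis.com")) // assumed GCS endpoint
        .region(Region.US_EAST_1)                                       // required by the SDK; value is arbitrary here
        .credentialsProvider(AnonymousCredentialsProvider.create())     // stand-in for a provider returning no credentials
        .overrideConfiguration(ClientOverrideConfiguration.builder()
            .addExecutionInterceptor(gcsAuthInterceptor)
            .build())
        .build();

    // Hand the pre-built client to S3FileIO via its supplier constructor.
    return new S3FileIO(() -> gcsClient);
  }

  private static String fetchGcsToken() {
    return "<token>"; // placeholder: obtain a real OAuth token in practice
  }
}

Note that the constructor takes a serializable supplier; capturing a pre-built client as above is fine in a single JVM, but for distributed engines the supplier would need to construct the client lazily.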
We are managing ACLs (IAM in the case of GCS) at the bucket level, and that happens in our custom catalog. GCS has ACLs, but IAM is preferred. I've not experimented with ACLs or encryption with S3FileIO, and it is a good question whether they work with GCS. But if these features are not enabled via default settings, S3FileIO works just fine with GCS.

I think there is a case for supporting S3-compatible backends in S3FileIO because a lot of the code is common. The question is whether we can cleanly expose the common S3FileIO code to work with these backends and separate out any specialization (if required), or whether we want a different FileIO implementation for each of the other S3-compatible backends such as GCS. I'm eager to hear more from the community about this. I'm happy to discuss and follow the long-term design direction of the Iceberg community.

The S3Client for GCS is created as follows (currently the code is not open source, so I'm sharing the steps only):
1. Create the S3ClientBuilder.
2. Set the GCS endpoint URI and region.
3. Set a credentials provider that returns null. You can set credentials here if you have static credentials.
4. Set a ClientOverrideConfiguration with interceptors via overrideConfiguration(). The interceptors are used to set up the authorization header in requests (setting the projectId, auth tokens, etc.) and do header translation for requests and responses.
5. Build the S3Client.
6. Pass the S3Client to S3FileIO.

Thanks,
Mayur

From: Jack Ye <yezhao...@gmail.com>
Sent: Wednesday, December 1, 2021 1:16 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

Hi Mayur,

I know many object storage services have allowed communication using the Amazon S3 client by implementing the same protocol, like recently Dell EMC ECS and Aliyun OSS. But ultimately there are functionality differences that could be optimized with a native FileIO, and the two examples I listed before both contributed their own FileIO implementations to Iceberg recently. I would imagine some native S3 features like ACLs or SSE will not work for GCS, and some GCS features will not be supported in S3FileIO, so I think a specific GCS FileIO would likely be better for GCS support in the long term.

Could you describe how you configure S3FileIO to talk to GCS? Do you need to override the S3 endpoint or have any other configurations?

And I am not an expert on GCS: do you see using S3FileIO for GCS as a feasible long-term solution? Are there any GCS-specific features that you might need that could not be done through S3FileIO, and how widely used are those features?

Best,
Jack Ye

On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

The S3FileIO does use the AWS S3 v2 client libraries, and while there appears to be some level of compatibility, it's not clear to me how far that currently extends (some AWS features like encryption, IAM, etc. may not have full support).

I think it's great that there may be a path for more native GCS FileIO support, but it might be a little early to rename the classes and expect that everything will work cleanly.

Thanks for pointing this out, Mayur. It's really an interesting development.

-Dan

On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

If S3FileIO is supposed to be used with other file systems, we should consider proper class renames.
Just my 2c.

On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with the AWS S3 SDKs, and if they are added to the list of supported prefixes, they work with S3FileIO.

Thanks,
Mayur

From: Piotr Findeisen <pi...@starburstdata.com>
Sent: Wednesday, December 1, 2021 10:58 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

Hi

Just curious: S3URI seems AWS S3-specific. What would be the goal of using S3URI with Google Cloud Storage URLs? What problem are we solving?

PF

On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Sounds reasonable to me if they are compatible.

On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn't support the gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to adding "gs" to the list of S3 prefixes?

Thanks,
Mayur

--
Ryan Blue
Tabular
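For context on what the requested change amounts to, here is an illustrative Java sketch of the kind of scheme validation S3URI performs (not the actual S3URI source, and the set of accepted schemes is an assumption): accepting gs:// boils down to adding "gs" to the set of valid schemes, or, per Jack's suggestion earlier in the thread, dropping the check entirely.

import java.net.URI;
import java.util.Locale;
import java.util.Set;

public class SchemeCheckSketch {
  // "gs" added alongside the existing S3 schemes; the exact set in S3URI may differ.
  private static final Set<String> VALID_SCHEMES = Set.of("https", "s3", "s3a", "s3n", "gs");

  static void validateScheme(String location) {
    String scheme = URI.create(location).getScheme();
    if (scheme == null || !VALID_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
      throw new IllegalArgumentException("Invalid scheme " + scheme + " in location: " + location);
    }
  }

  public static void main(String[] args) {
    validateScheme("s3://bucket/path/data.parquet");   // accepted today
    validateScheme("gs://bucket/path/metadata.json");  // accepted once "gs" is in the set
  }
}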