Hey Mayur and Laurent,

As an alternative to using S3FileIO to talk to GCS, I just posted a native GCSFileIO implementation <https://github.com/apache/iceberg/pull/3711> and would really appreciate feedback. I'd prefer to go this route, which has a number of advantages (like using gRPC eventually) and more native support for some of the GCS features (like streaming transport).

It would be great if someone has a chance to try this out in a real Google Cloud environment and help improve it.
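If you want to kick the tires, a minimal smoke test along these lines should be close. The class name, package, and no-arg construction are assumptions based on the current state of the PR (which is why the class is loaded reflectively), so adjust as it evolves:

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.PositionOutputStream;

public class GCSFileIOSmokeTest {
  public static void main(String[] args) throws Exception {
    // Assumption: the PR exposes org.apache.iceberg.gcp.gcs.GCSFileIO with a
    // no-arg constructor; loaded reflectively so this compiles without the PR.
    FileIO io = (FileIO) Class.forName("org.apache.iceberg.gcp.gcs.GCSFileIO")
        .getDeclaredConstructor().newInstance();
    io.initialize(Collections.emptyMap()); // assumes default GCP credentials

    String path = "gs://my-test-bucket/iceberg/smoke-test.txt"; // your bucket here

    // Write a small object through the FileIO interface...
    OutputFile out = io.newOutputFile(path);
    try (PositionOutputStream os = out.create()) {
      os.write("hello from GCSFileIO".getBytes(StandardCharsets.UTF_8));
    }

    // ...read it back...
    InputFile in = io.newInputFile(path);
    System.out.println("exists=" + in.exists() + " length=" + in.getLength());

    // ...and clean up.
    io.deleteFile(path);
  }
}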
-Dan

On Fri, Dec 3, 2021 at 7:48 AM Laurent Goujon <laur...@dremio.com> wrote:

> To be clear, the reasons for using S3FileIO over HadoopFileIO are totally reasonable. My issue is with making gs:// an alias of s3://, which I don't believe it is. Even assuming that GCS has an endpoint so one can use an S3 API to access data, you would need to configure this endpoint, and you would need to create an S3 access key/secret (which is not the regular mode of operations for GCS) in order to access the data. So personally, if I were interested in accessing GCS data through the S3 endpoint, I would be better off using an s3:// URL and configuring the endpoint in the properties (although I have to say I didn't find any property for it, so does any alternative S3 server need to provide a specific AWS S3 client to use with S3FileIO?)
>
> I also noticed that https:// is an alias of s3://, but again, isn't this breaking expectations about what the URI is supposed to represent?
>
> On Fri, Dec 3, 2021 at 6:44 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Echoing Laurent and Igor, I wonder what the consequence of adding the 'gs://' scheme to S3FileIO is if that scheme is already used by the Hadoop GCS connector. Do we want to overload that scheme? I would almost think it should be an s3:// scheme, right?
>>
>> Best,
>> Ryan
>>
>> On Fri, Dec 3, 2021 at 9:26 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>
>>> Jack, https://github.com/apache/iceberg/pull/3656 is enough for my use case (because we are creating our own S3Client).
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Igor Dvorzhak <i...@google.com.INVALID>
>>> *Sent:* Thursday, December 2, 2021 8:12 PM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> As long as the proposed changes will not prevent Iceberg from using the GCS connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) via HCFS/HadoopFileIO to access GCS, I think that it is OK to allow users to use S3FileIO with GCS.
>>>
>>> On Thu, Dec 2, 2021 at 3:15 PM Laurent Goujon <laur...@dremio.com> wrote:
>>>
>>> What about credentials? Sure, GCS has an S3 compatibility mode, but the gs:// URI used by Hadoop is native GCS support with Google authentication mechanisms (the GCS Hadoop filesystem is actually out of tree -> https://github.com/GoogleCloudDataproc/hadoop-connectors).
>>>
>>> Laurent
>>>
>>> On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Also https://github.com/apache/iceberg/pull/3658.
>>>
>>> Please let me know if these are enough; we can discuss in the PRs. It would also be great if users of systems like MinIO could confirm.
>>>
>>> -Jack
>>>
>>> On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack!
>>>
>>> *From:* Mayur Srivastava <mayur.srivast...@twosigma.com>
>>> *Sent:* Thursday, December 2, 2021 4:16 PM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> There are three reasons why we want to use S3FileIO over HadoopFileIO:
>>> 1. We want access to the S3Client in our service to support some special handling of auth. This is not possible with HadoopFileIO because the S3Client is not exposed.
>>> 2. We would like to improve upon S3FileIO in the future by introducing a vectorized IO mechanism, and that is easier if we are already using S3FileIO. I’ll post my thoughts about vectorized IO in a later email in the coming weeks.
>>> 3. As Ryan mentioned earlier, we are seeing very high memory usage with HadoopFileIO in the case of highly concurrent commits. I reported that in another thread.
>>>
>>> Moving forward:
>>>
>>> Can we start by adding ‘gs’ to S3URI’s valid prefixes?
>>>
>>> One of Jack’s suggestions was to remove any scheme check from S3URI. Given we are building ResolvingFileIO, I think removing the scheme check in the individual implementations is not a bad idea.
>>>
>>> Either solution will work for us.
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Ryan Blue <b...@tabular.io>
>>> *Sent:* Thursday, December 2, 2021 11:37 AM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> I think the advantage of S3FileIO over HadoopFileIO with s3a is that it doesn't hit the memory consumption problem that Mayur posted to the list. That's a fairly big advantage, so I think it's reasonable to try to support this in 0.13.0.
>>>
>>> It should be easy enough to add the gs scheme, and then we can figure out how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I guess we'll be adding scheme-to-implementation customization sooner than I thought!
>>>
>>> Ryan
>>>
>>> On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>
>>> Hi,
>>>
>>> I agree that endpoint, credentials, path-style access, etc. should be configurable. There are storages which are primarily used as "S3 compatible", and they need these settings to work. We've seen these being used to access MinIO, Ceph, and even S3 behind some gateway (I am light on details, sorry). In all these cases, users seem to use s3:// URLs even when not talking to the actual AWS S3 service.
>>>
>>> If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just by accepting the gs:// protocol and delegating to S3FileIO for now. In the long term, though, I would recommend using the native GCS client, or the Hadoop file system implementation provided by Google.
>>>
>>> BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs. HadoopFileIO?
>>>
>>> BR,
>>> PF
>>>
>>> On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> And here is a proposal of what I think could be the best way to go for both worlds:
>>>
>>> (1) remove URI restrictions in S3FileIO (or allow configuration of additional accepted schemes), and allow direct user configuration of endpoint, credentials, etc. to make S3 configuration simpler without the need to reconfigure the entire client.
>>> (2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, others -> HadoopFileIO
>>> (3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize S3FileIO differently, and users should be able to configure them differently in catalog properties
>>> (4) for users that need special GCS-unique features, a GCSFileIO could eventually be developed, and then people can choose to map gs -> GCSFileIO in ResolvingFileIO
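>>>
>>> In catalog properties, (1)-(3) could look something like the sketch below. Only the io-impl key exists today; the per-scheme mapping and endpoint keys are hypothetical, purely to illustrate the shape of the proposal:
>>>
>>> io-impl=org.apache.iceberg.io.ResolvingFileIO
>>> # hypothetical: scheme-to-FileIO mapping
>>> resolving.scheme.s3.impl=org.apache.iceberg.aws.s3.S3FileIO
>>> resolving.scheme.gs.impl=org.apache.iceberg.aws.s3.S3FileIO
>>> # hypothetical: per-scheme client overrides
>>> resolving.scheme.gs.s3.endpoint=https://storage.googleapis.com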
>>>
>>> -Jack
>>>
>>> On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Thanks for the confirmation; this is as I expected. We had a similar case for Dell EMC ECS recently, where they published a version of their FileIO that works through S3FileIO (https://github.com/apache/iceberg/pull/2807), and the only thing needed was to override the endpoint, region, and credentials. They also proposed some specialization because their object storage service supports a specialized Append operation when writing data. However, in the end they created another FileIO (https://github.com/apache/iceberg/pull/3376) using their own SDK to better support the specialization.
>>>
>>> I believe the recent addition of ResolvingFileIO was to support using multiple FileIOs and switching between them based on the file scheme. If we continue down that path, it feels more reasonable to me that we will have specialized FileIOs for each implementation and allow them to evolve independently. Users will be able to set whatever specialized configurations each implementation needs and take advantage of all of them.
>>>
>>> On the other hand, if we can support using S3FileIO as the new standard FileIO that works with multiple storage providers, the advantages I see are:
>>> (1) it is simple from the user's perspective, because the least common denominator across cloud storage providers is the S3 protocol, and it's more work to configure and maintain multiple FileIOs.
>>> (2) we can avoid ResolvingFileIO's current check of the file scheme for each file path string, which might lead to some performance gain, although I do not know how much we gain in this process.
>>>
>>> From a technical perspective, I prefer having dedicated FileIOs and an overall ResolvingFileIO, because Iceberg's FileIO interface is simple enough for people to build specialized and proper support for different storage systems. But it's also very tempting to just reuse the same thing instead of building another one, especially when that feature is lacking and the current functionality could easily be extended to support it. The concern is that we will end up like Hadoop, which had to develop another sub-layer of the FileSystem interface to accommodate the unique features of different storage providers once the specialized feature requests came, and at that point there is no difference from the dedicated FileIO + ResolvingFileIO architecture.
>>>
>>> I wonder what Daniel thinks about this, since I believe he is more interested in multi-cloud support.
>>>
>>> -Jack
>>>
>>> On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Hi Jack, Daniel,
>>>
>>> We use several S3-compatible backends with Iceberg; these include S3, GCS, and others.
>>> Currently, S3FileIO provides all the functionality we need for Iceberg to talk to these backends. The way we create S3FileIO is via the constructor, providing the S3Client as a constructor param; we do not use the initialize(Map<String, String>) method in FileIO. Our custom catalog accepts the FileIO object at creation time. To talk to GCS, we create the S3Client with a few overrides (described below) and pass it to S3FileIO. After that, the rest of the S3FileIO code works as is. The only exception is that “gs” (used by GCS URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the email.
>>>
>>> The reason we want to use S3FileIO to talk to GCS is that S3FileIO almost works out of the box and contains all the functionality needed to talk to GCS. The only special requirements are the creation of the S3Client and allowing the “gs” prefix in URIs. Based on our early experiments and benchmarks, S3FileIO provides all the functionality we need and performs well, so we didn’t see a need to create a native GCS FileIO. The Iceberg operations we need are creating, dropping, reading, and writing objects, and S3FileIO provides this functionality.
>>>
>>> We are managing ACLs (IAM in the case of GCS) at the bucket level, and that happens in our custom catalog. GCS has ACLs, but IAM is preferred. I’ve not experimented with ACLs or encryption in S3FileIO, and whether they work with GCS is a good question. But if these features are not enabled via default settings, S3FileIO works just fine with GCS.
>>>
>>> I think there is a case for supporting S3-compatible backends in S3FileIO because a lot of the code is common. The question is whether we can cleanly expose the common S3FileIO code to work with these backends and separate out any specialization (if required), or whether we want a different FileIO implementation for each of the other S3-compatible backends such as GCS. I’m eager to hear more from the community about this. I’m happy to discuss and follow the long-term design direction of the Iceberg community.
>>>
>>> The S3Client for GCS is created as follows (currently the code is not open source, so I’m sharing the steps only; a rough sketch follows the list):
>>>
>>> 1. Create the S3ClientBuilder.
>>> 2. Set the GCS endpoint URI and region.
>>> 3. Set a credentials provider that returns null. You can set credentials here if you have static credentials.
>>> 4. Set a ClientOverrideConfiguration with interceptors via overrideConfiguration(). The interceptors are used to set up the authorization header in requests (setting the projectId, auth tokens, etc.) and do header translation for requests and responses.
>>> 5. Build the S3Client.
>>> 6. Pass the S3Client to S3FileIO.
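>>>
>>> Since I can’t share our actual code, here is a rough sketch of those steps; the endpoint, region, interceptor body, and the fetchGoogleAccessToken() helper are illustrative only:
>>>
>>> import java.net.URI;
>>> import org.apache.iceberg.aws.s3.S3FileIO;
>>> import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
>>> import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
>>> import software.amazon.awssdk.core.interceptor.Context;
>>> import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
>>> import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
>>> import software.amazon.awssdk.http.SdkHttpRequest;
>>> import software.amazon.awssdk.regions.Region;
>>> import software.amazon.awssdk.services.s3.S3Client;
>>>
>>> public class GcsBackedS3FileIO {
>>>   public static S3FileIO create() {
>>>     // Step 4: interceptor that rewrites the Authorization header so GCS
>>>     // accepts the request (our real header translation is more involved).
>>>     ExecutionInterceptor gcsAuth = new ExecutionInterceptor() {
>>>       @Override
>>>       public SdkHttpRequest modifyHttpRequest(
>>>           Context.ModifyHttpRequest ctx, ExecutionAttributes attrs) {
>>>         return ctx.httpRequest().toBuilder()
>>>             .putHeader("Authorization", "Bearer " + fetchGoogleAccessToken())
>>>             .build();
>>>       }
>>>     };
>>>
>>>     S3Client client = S3Client.builder() // step 1
>>>         .endpointOverride(URI.create("https://storage.googleapis.com")) // step 2
>>>         .region(Region.US_EAST_1) // required by the SDK; not meaningful for GCS
>>>         .credentialsProvider(AnonymousCredentialsProvider.create()) // step 3: stand-in for our null-returning provider
>>>         .overrideConfiguration(ClientOverrideConfiguration.builder() // step 4
>>>             .addExecutionInterceptor(gcsAuth)
>>>             .build())
>>>         .build(); // step 5
>>>
>>>     return new S3FileIO(() -> client); // step 6
>>>   }
>>>
>>>   // Hypothetical helper; in practice the token would come from Google's
>>>   // OAuth2 libraries rather than an environment variable.
>>>   private static String fetchGoogleAccessToken() {
>>>     return System.getenv("GCS_ACCESS_TOKEN");
>>>   }
>>> }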
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Jack Ye <yezhao...@gmail.com>
>>> *Sent:* Wednesday, December 1, 2021 1:16 PM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> Hi Mayur,
>>>
>>> I know many object storage services allow communication using the Amazon S3 client by implementing the same protocol, like Dell EMC ECS and Aliyun OSS recently. But ultimately there are functionality differences that could be optimized with a native FileIO, and the two examples I listed both contributed their own FileIO implementations to Iceberg recently. I would imagine some native S3 features like ACLs or SSE will not work for GCS, and some GCS features will not be supported in S3FileIO, so I think a specific GCS FileIO would likely be better for GCS support in the long term.
>>>
>>> Could you describe how you configure S3FileIO to talk to GCS? Do you need to override the S3 endpoint or have any other configurations?
>>>
>>> And I am not an expert on GCS: do you see using S3FileIO for GCS as a feasible long-term solution? Are there any GCS-specific features that you might need that could not be supported through S3FileIO, and how widely used are those features?
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>
>>> The S3FileIO does use the AWS S3 v2 client libraries, and while there appears to be some level of compatibility, it's not clear to me how far that currently extends (some AWS features like encryption, IAM, etc. may not have full support).
>>>
>>> I think it's great that there may be a path for more native GCS FileIO support, but it might be a little early to rename the classes and expect that everything will work cleanly.
>>>
>>> Thanks for pointing this out, Mayur. It's really an interesting development.
>>>
>>> -Dan
>>>
>>> On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>
>>> If S3FileIO is supposed to be used with other file systems, we should consider proper class renames. Just my 2c.
>>>
>>> On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Hi,
>>>
>>> We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with the AWS S3 SDKs, and if they are added to the list of supported prefixes, they work with S3FileIO.
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Piotr Findeisen <pi...@starburstdata.com>
>>> *Sent:* Wednesday, December 1, 2021 10:58 AM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> Hi,
>>>
>>> Just curious: S3URI seems AWS S3-specific. What would be the goal of using S3URI with Google Cloud Storage URLs? What problem are we solving?
>>>
>>> PF
>>>
>>> On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>> Sounds reasonable to me if they are compatible.
>>>
>>> On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Hi,
>>>
>>> We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn’t support the gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to adding “gs” to the list of S3 prefixes?
>>>
>>> Thanks,
>>> Mayur
>>>
>>> --
>>> Ryan Blue
>>> Tabular
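For reference, the change Mayur's original message asks about amounts to roughly the following in S3URI. This is a sketch only; the real field name and validation logic live at the S3URI.java link in his first message and may differ:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch of accepting "gs" alongside the schemes S3URI already allows.
class S3URISchemes {
  private static final Set<String> VALID_SCHEMES =
      new HashSet<>(Arrays.asList("https", "s3", "s3a", "s3n", "gs"));

  static boolean isValidScheme(String scheme) {
    return VALID_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
  }
}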