Hey Mayur and Laurent,

As an alternative to using S3FileIO to talk to GCS, I just posted a native GCSFileIO implementation <https://github.com/apache/iceberg/pull/3711> and would really appreciate feedback. I'd prefer to go this route, which has a number of advantages (like using gRPC eventually) and more native support for some of the GCS features (like streaming transport).

It would be great if someone has a chance to try this out in a real Google Cloud environment and help improve it.
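If you want to kick the tires, a minimal smoke test along these lines should be close. The class name, package, and no-arg construction are assumptions based on the current state of the PR (which is why the class is loaded reflectively), so adjust as it evolves:

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.PositionOutputStream;

public class GCSFileIOSmokeTest {
  public static void main(String[] args) throws Exception {
    // Assumption: the PR exposes org.apache.iceberg.gcp.gcs.GCSFileIO with a
    // no-arg constructor; loaded reflectively so this compiles without the PR.
    FileIO io = (FileIO) Class.forName("org.apache.iceberg.gcp.gcs.GCSFileIO")
        .getDeclaredConstructor().newInstance();
    io.initialize(Collections.emptyMap()); // assumes default GCP credentials

    String path = "gs://my-test-bucket/iceberg/smoke-test.txt"; // your bucket here

    // Write a small object through the FileIO interface...
    OutputFile out = io.newOutputFile(path);
    try (PositionOutputStream os = out.create()) {
      os.write("hello from GCSFileIO".getBytes(StandardCharsets.UTF_8));
    }

    // ...read it back...
    InputFile in = io.newInputFile(path);
    System.out.println("exists=" + in.exists() + " length=" + in.getLength());

    // ...and clean up.
    io.deleteFile(path);
  }
}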
-Dan

On Fri, Dec 3, 2021 at 7:48 AM Laurent Goujon <laur...@dremio.com> wrote:

> To be clear, the reasons for using S3FileIO over HadoopFileIO are totally reasonable. My issue is with making gs:// an alias of s3://, which I don't believe it is. Even assuming that GCS has an endpoint so one can use an S3 API to access data, you would need to configure this endpoint, and you would need to create an S3 access key/secret (which is not the regular mode of operations for GCS) in order to access the data. So personally, if I were interested in accessing GCS data through the S3 endpoint, I would be better off using an s3:// URL and configuring the endpoint in the properties (although I have to say I didn't find any property for it, so does any alternative S3 server need to provide a specific AWS S3 client to use with S3FileIO?)
>
> I also noticed that https:// is an alias of s3://, but again, isn't this breaking expectations about what the URI is supposed to represent?
>
> On Fri, Dec 3, 2021 at 6:44 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Echoing Laurent and Igor, I wonder what the consequence of adding the 'gs://' scheme to S3FileIO is if that scheme is already used by the Hadoop GCS connector. Do we want to overload that scheme? I would almost think it should be an s3:// scheme, right?
>>
>> Best,
>> Ryan
>>
>> On Fri, Dec 3, 2021 at 9:26 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>
>>> Jack, https://github.com/apache/iceberg/pull/3656 is enough for my use case (because we are creating our own S3Client).
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Igor Dvorzhak <i...@google.com.INVALID>
>>> *Sent:* Thursday, December 2, 2021 8:12 PM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> As long as the proposed changes will not prevent Iceberg from using the GCS connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) via HCFS/HadoopFileIO to access GCS, I think that it is OK to allow users to use S3FileIO with GCS.
>>>
>>> On Thu, Dec 2, 2021 at 3:15 PM Laurent Goujon <laur...@dremio.com> wrote:
>>>
>>> What about credentials? Sure, GCS has an S3 compatibility mode, but the gs:// URI used by Hadoop is native GCS support with Google authentication mechanisms (the GCS Hadoop filesystem is actually out of tree -> https://github.com/GoogleCloudDataproc/hadoop-connectors).
>>>
>>> Laurent
>>>
>>> On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Also https://github.com/apache/iceberg/pull/3658.
>>>
>>> Please let me know if these are enough; we can discuss in the PRs. It would also be great if users of systems like MinIO could confirm.
>>>
>>> -Jack
>>>
>>> On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack!
>>>
>>> *From:* Mayur Srivastava <mayur.srivast...@twosigma.com>
>>> *Sent:* Thursday, December 2, 2021 4:16 PM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> There are three reasons why we want to use S3FileIO over HadoopFileIO:
>>> 1. We want access to the S3Client in our service to support some special handling of auth. This is not possible with HadoopFileIO because the S3Client is not exposed.
>>> 2. We would like to improve upon S3FileIO in the future by introducing a vectorized IO mechanism, and that is easier if we are already using S3FileIO. I’ll post my thoughts about vectorized IO in a later email in the coming weeks.
>>> 3. As Ryan mentioned earlier, we are seeing very high memory usage with HadoopFileIO in the case of highly concurrent commits. I reported that in another thread.
>>>
>>> Moving forward:
>>>
>>> Can we start by adding ‘gs’ to S3URI’s valid prefixes?
>>>
>>> One of Jack’s suggestions was to remove any scheme check from S3URI. Given we are building ResolvingFileIO, I think removing the scheme check in the individual implementations is not a bad idea.
>>>
>>> Either solution will work for us.
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Ryan Blue <b...@tabular.io>
>>> *Sent:* Thursday, December 2, 2021 11:37 AM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> I think the advantage of S3FileIO over HadoopFileIO with s3a is that it doesn't hit the memory consumption problem that Mayur posted to the list. That's a fairly big advantage, so I think it's reasonable to try to support this in 0.13.0.
>>>
>>> It should be easy enough to add the gs scheme, and then we can figure out how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I guess we'll be adding scheme-to-implementation customization sooner than I thought!
>>>
>>> Ryan
>>>
>>> On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>
>>> Hi,
>>>
>>> I agree that endpoint, credentials, path-style access, etc. should be configurable. There are storages which are primarily used as "S3 compatible", and they need these settings to work. We've seen these being used to access MinIO, Ceph, and even S3 behind some gateway (I am light on details, sorry). In all these cases, users seem to use s3:// URLs even when not talking to the actual AWS S3 service.
>>>
>>> If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just by accepting the gs:// protocol and delegating to S3FileIO for now. In the long term, though, I would recommend using the native GCS client, or the Hadoop file system implementation provided by Google.
>>>
>>> BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs. HadoopFileIO?
>>>
>>> BR,
>>> PF
>>>
>>> On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> And here is a proposal of what I think could be the best way to go for both worlds:
>>>
>>> (1) remove URI restrictions in S3FileIO (or allow configuration of additional accepted schemes), and allow direct user configuration of endpoint, credentials, etc. to make S3 configuration simpler without the need to reconfigure the entire client.
>>> (2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, others -> HadoopFileIO
>>> (3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize S3FileIO differently, and users should be able to configure them differently in catalog properties
>>> (4) for users that need special GCS-unique features, a GCSFileIO could eventually be developed, and then people can choose to map gs -> GCSFileIO in ResolvingFileIO
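>>>
>>> In catalog properties, (1)-(3) could look something like the sketch below. Only the io-impl key exists today; the per-scheme mapping and endpoint keys are hypothetical, purely to illustrate the shape of the proposal:
>>>
>>> io-impl=org.apache.iceberg.io.ResolvingFileIO
>>> # hypothetical: scheme-to-FileIO mapping
>>> resolving.scheme.s3.impl=org.apache.iceberg.aws.s3.S3FileIO
>>> resolving.scheme.gs.impl=org.apache.iceberg.aws.s3.S3FileIO
>>> # hypothetical: per-scheme client overrides
>>> resolving.scheme.gs.s3.endpoint=https://storage.googleapis.com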
>>>
>>> -Jack
>>>
>>> On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Thanks for the confirmation; this is as I expected. We had a similar case for Dell EMC ECS recently, where they published a version of their FileIO that works through S3FileIO (https://github.com/apache/iceberg/pull/2807), and the only thing needed was to override the endpoint, region, and credentials. They also proposed some specialization because their object storage service supports a specialized Append operation when writing data. However, in the end they created another FileIO (https://github.com/apache/iceberg/pull/3376) using their own SDK to better support the specialization.
>>>
>>> I believe the recent addition of ResolvingFileIO was to support using multiple FileIOs and switching between them based on the file scheme. If we continue down that path, it feels more reasonable to me that we will have specialized FileIOs for each implementation and allow them to evolve independently. Users will be able to set whatever specialized configurations each implementation needs and take advantage of all of them.
>>>
>>> On the other hand, if we can support using S3FileIO as the new standard FileIO that works with multiple storage providers, the advantages I see are:
>>> (1) it is simple from the user's perspective, because the least common denominator across cloud storage providers is the S3 protocol, and it's more work to configure and maintain multiple FileIOs.
>>> (2) we can avoid ResolvingFileIO's current check of the file scheme for each file path string, which might lead to some performance gain, although I do not know how much we gain in this process.
>>>
>>> From a technical perspective, I prefer having dedicated FileIOs and an overall ResolvingFileIO, because Iceberg's FileIO interface is simple enough for people to build specialized and proper support for different storage systems. But it's also very tempting to just reuse the same thing instead of building another one, especially when that feature is lacking and the current functionality could easily be extended to support it. The concern is that we will end up like Hadoop, which had to develop another sub-layer of the FileSystem interface to accommodate the unique features of different storage providers once the specialized feature requests came, and at that point there is no difference from the dedicated FileIO + ResolvingFileIO architecture.
>>>
>>> I wonder what Daniel thinks about this, since I believe he is more interested in multi-cloud support.
>>>
>>> -Jack
>>>
>>> On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Hi Jack, Daniel,
>>>
>>> We use several S3-compatible backends with Iceberg; these include S3, GCS, and others.
>>> Currently, S3FileIO provides all the functionality we need for Iceberg to talk to these backends. The way we create S3FileIO is via the constructor, providing the S3Client as a constructor param; we do not use the initialize(Map<String, String>) method in FileIO. Our custom catalog accepts the FileIO object at creation time. To talk to GCS, we create the S3Client with a few overrides (described below) and pass it to S3FileIO. After that, the rest of the S3FileIO code works as is. The only exception is that “gs” (used by GCS URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the email.
>>>
>>> The reason we want to use S3FileIO to talk to GCS is that S3FileIO almost works out of the box and contains all the functionality needed to talk to GCS. The only special requirements are the creation of the S3Client and allowing the “gs” prefix in URIs. Based on our early experiments and benchmarks, S3FileIO provides all the functionality we need and performs well, so we didn’t see a need to create a native GCS FileIO. The Iceberg operations we need are creating, dropping, reading, and writing objects, and S3FileIO provides this functionality.
>>>
>>> We are managing ACLs (IAM in the case of GCS) at the bucket level, and that happens in our custom catalog. GCS has ACLs, but IAM is preferred. I’ve not experimented with ACLs or encryption in S3FileIO, and whether they work with GCS is a good question. But if these features are not enabled via default settings, S3FileIO works just fine with GCS.
>>>
>>> I think there is a case for supporting S3-compatible backends in S3FileIO because a lot of the code is common. The question is whether we can cleanly expose the common S3FileIO code to work with these backends and separate out any specialization (if required), or whether we want a different FileIO implementation for each of the other S3-compatible backends such as GCS. I’m eager to hear more from the community about this. I’m happy to discuss and follow the long-term design direction of the Iceberg community.
>>>
>>> The S3Client for GCS is created as follows (currently the code is not open source, so I’m sharing the steps only; a rough sketch follows the list):
>>>
>>> 1. Create the S3ClientBuilder.
>>> 2. Set the GCS endpoint URI and region.
>>> 3. Set a credentials provider that returns null. You can set credentials here if you have static credentials.
>>> 4. Set a ClientOverrideConfiguration with interceptors via overrideConfiguration(). The interceptors are used to set up the authorization header in requests (setting the projectId, auth tokens, etc.) and do header translation for requests and responses.
>>> 5. Build the S3Client.
>>> 6. Pass the S3Client to S3FileIO.
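>>>
>>> Since I can’t share our actual code, here is a rough sketch of those steps; the endpoint, region, interceptor body, and the fetchGoogleAccessToken() helper are illustrative only:
>>>
>>> import java.net.URI;
>>> import org.apache.iceberg.aws.s3.S3FileIO;
>>> import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
>>> import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
>>> import software.amazon.awssdk.core.interceptor.Context;
>>> import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
>>> import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
>>> import software.amazon.awssdk.http.SdkHttpRequest;
>>> import software.amazon.awssdk.regions.Region;
>>> import software.amazon.awssdk.services.s3.S3Client;
>>>
>>> public class GcsBackedS3FileIO {
>>>   public static S3FileIO create() {
>>>     // Step 4: interceptor that rewrites the Authorization header so GCS
>>>     // accepts the request (our real header translation is more involved).
>>>     ExecutionInterceptor gcsAuth = new ExecutionInterceptor() {
>>>       @Override
>>>       public SdkHttpRequest modifyHttpRequest(
>>>           Context.ModifyHttpRequest ctx, ExecutionAttributes attrs) {
>>>         return ctx.httpRequest().toBuilder()
>>>             .putHeader("Authorization", "Bearer " + fetchGoogleAccessToken())
>>>             .build();
>>>       }
>>>     };
>>>
>>>     S3Client client = S3Client.builder() // step 1
>>>         .endpointOverride(URI.create("https://storage.googleapis.com")) // step 2
>>>         .region(Region.US_EAST_1) // required by the SDK; not meaningful for GCS
>>>         .credentialsProvider(AnonymousCredentialsProvider.create()) // step 3: stand-in for our null-returning provider
>>>         .overrideConfiguration(ClientOverrideConfiguration.builder() // step 4
>>>             .addExecutionInterceptor(gcsAuth)
>>>             .build())
>>>         .build(); // step 5
>>>
>>>     return new S3FileIO(() -> client); // step 6
>>>   }
>>>
>>>   // Hypothetical helper; in practice the token would come from Google's
>>>   // OAuth2 libraries rather than an environment variable.
>>>   private static String fetchGoogleAccessToken() {
>>>     return System.getenv("GCS_ACCESS_TOKEN");
>>>   }
>>> }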
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Jack Ye <yezhao...@gmail.com>
>>> *Sent:* Wednesday, December 1, 2021 1:16 PM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> Hi Mayur,
>>>
>>> I know many object storage services allow communication using the Amazon S3 client by implementing the same protocol, like Dell EMC ECS and Aliyun OSS recently. But ultimately there are functionality differences that could be optimized with a native FileIO, and the two examples I listed both contributed their own FileIO implementations to Iceberg recently. I would imagine some native S3 features like ACLs or SSE will not work for GCS, and some GCS features will not be supported in S3FileIO, so I think a specific GCS FileIO would likely be better for GCS support in the long term.
>>>
>>> Could you describe how you configure S3FileIO to talk to GCS? Do you need to override the S3 endpoint or have any other configurations?
>>>
>>> And I am not an expert on GCS: do you see using S3FileIO for GCS as a feasible long-term solution? Are there any GCS-specific features that you might need that could not be supported through S3FileIO, and how widely used are those features?
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>
>>> The S3FileIO does use the AWS S3 v2 client libraries, and while there appears to be some level of compatibility, it's not clear to me how far that currently extends (some AWS features like encryption, IAM, etc. may not have full support).
>>>
>>> I think it's great that there may be a path for more native GCS FileIO support, but it might be a little early to rename the classes and expect that everything will work cleanly.
>>>
>>> Thanks for pointing this out, Mayur. It's really an interesting development.
>>>
>>> -Dan
>>>
>>> On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>
>>> If S3FileIO is supposed to be used with other file systems, we should consider proper class renames. Just my 2c.
>>>
>>> On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Hi,
>>>
>>> We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with the AWS S3 SDKs, and if they are added to the list of supported prefixes, they work with S3FileIO.
>>>
>>> Thanks,
>>> Mayur
>>>
>>> *From:* Piotr Findeisen <pi...@starburstdata.com>
>>> *Sent:* Wednesday, December 1, 2021 10:58 AM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
>>>
>>> Hi,
>>>
>>> Just curious: S3URI seems AWS S3-specific. What would be the goal of using S3URI with Google Cloud Storage URLs? What problem are we solving?
>>>
>>> PF
>>>
>>> On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>> Sounds reasonable to me if they are compatible.
>>>
>>> On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:
>>>
>>> Hi,
>>>
>>> We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn’t support the gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to adding “gs” to the list of S3 prefixes?
>>>
>>> Thanks,
>>> Mayur
>>>
>>> --
>>> Ryan Blue
>>> Tabular
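For reference, the change Mayur's original message asks about amounts to roughly the following in S3URI. This is a sketch only; the real field name and validation logic live at the S3URI.java link in his first message and may differ:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch of accepting "gs" alongside the schemes S3URI already allows.
class S3URISchemes {
  private static final Set<String> VALID_SCHEMES =
      new HashSet<>(Arrays.asList("https", "s3", "s3a", "s3n", "gs"));

  static boolean isValidScheme(String scheme) {
    return VALID_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
  }
}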