Hi, why are you trying to access data in S3 over another network? Doesn't that cause huge network overhead, data transmission losses (since the data travels over the public internet), and other inconsistencies?
Have you tried the AWS CLI? With the "aws s3 sync" command you can copy all the files under s3://bucket/ or s3://bucket/key/ to your local system, then point your Spark cluster at the local data store and run the queries (a rough sketch of that workflow is at the bottom of this message, below your quoted mail). Of course, that depends on the data volume as well.

Regards,
Gourav Sengupta

On Fri, Feb 26, 2016 at 7:29 PM, Joshua Buss <joshua.b...@gmail.com> wrote:
> Hello,
>
> I'm trying to use Spark with Google Cloud Storage, but from a network where
> I cannot talk to the outside internet directly. This means we have a proxy
> set up for all requests heading towards GCS.
>
> So far, I've had good luck with projects that talk to S3 through libraries
> like boto (for Python) and the AWS SDK (for Node.js), because these both
> appear compatible with what Google calls "interoperability mode
> <https://cloud.google.com/storage/docs/migrating>".
>
> Spark (or whatever it uses for S3 connectivity under the hood, maybe
> JetS3t?), on the other hand, doesn't appear to be compatible.
> Furthermore, I can't use the Hadoop Google Cloud Storage connector because
> it doesn't expose any properties for setting up a proxy host.
>
> When I set the following core-site.xml values for the s3a connector, I get
> an AmazonS3Exception:
>
> Caused by: com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception:
> The provided security credentials are not valid. (Service: Amazon S3;
> Status Code: 403; Error Code: InvalidSecurity; Request ID: null), S3
> Extended Request ID: null
>
> I know this isn't real XML, I just condensed it a bit for readability:
> <name>fs.s3a.access.key</name>
> <value>....</value>
> <name>fs.s3a.secret.key</name>
> <value>....</value>
> <name>fs.s3a.endpoint</name>
> <value>https://storage.googleapis.com</value>
> <name>fs.s3a.connection.ssl.enabled</name>
> <value>True</value>
> <name>fs.s3a.proxy.host</name>
> <value>proxyhost</value>
> <name>fs.s3a.proxy.port</name>
> <value>12345</value>
>
> I'd inspect the traffic manually to learn more, but it's all encrypted
> with SSL, of course. Any suggestions?
>
> I feel like I should also mention that, since this is critical to get
> working ASAP, I'm also going down the route of using another local proxy
> like this one <https://github.com/reversefold/python-s3proxy>, written in
> Python, because boto handles interoperability mode correctly and the proxy
> is designed to "look like actual s3" to its clients.
>
> Using this approach I can get a basic read to work with Spark, but it has
> numerous drawbacks: it is fragile, it doesn't support writes, and I'm sure
> it will become a huge performance bottleneck once we start running larger
> workloads. So I'd like to get the native s3a/s3n connectors working if at
> all possible, but I need some help.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/s3-access-through-proxy-tp26347.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
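
Here is the rough "sync then read locally" sketch I mentioned above. The bucket, prefix, and local path are made-up placeholders, and the read assumes plain text files, so adjust the reader to your actual format:

    # from a shell on the machine(s) that will hold the data
    # (placeholder bucket/prefix/path, substitute your own):
    aws s3 sync s3://your-bucket/your-prefix/ /data/local-copy/

    # then, e.g. from a PySpark shell, read the local copy:
    rdd = sc.textFile("file:///data/local-copy/")
    print(rdd.count())

Note that a plain file:// path only works if every executor can see the same local path (local mode, a shared mount, or the data synced to every node), which is part of why this only makes sense for modest data volumes.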
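Also, since the snippet in your mail was condensed, here is roughly what those same settings look like as real core-site.xml properties (values are the placeholders from your mail; I have only lowercased the boolean, which is the form Hadoop configuration conventionally uses):

    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>....</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>....</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>https://storage.googleapis.com</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <!-- lowercase "true" is the conventional form -->
        <value>true</value>
      </property>
      <property>
        <name>fs.s3a.proxy.host</name>
        <value>proxyhost</value>
      </property>
      <property>
        <name>fs.s3a.proxy.port</name>
        <value>12345</value>
      </property>
    </configuration>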