Hello,

I'm trying to use Spark with Google Cloud Storage (GCS), but from a network
where I cannot talk to the outside internet directly. All requests heading
towards GCS therefore have to go through a proxy.

So far, I've had good luck with projects that talk to S3 through libraries
like boto (for Python) and the AWS SDK (for Node.js), because both appear
compatible with what Google calls "interoperability mode"
<https://cloud.google.com/storage/docs/migrating>.

Spark (or whatever it uses for S3 connectivity under the hood, maybe
JetS3t?), on the other hand, doesn't appear to be compatible.
Furthermore, I can't use the Hadoop Google Cloud Storage connector because it
doesn't expose any properties for setting up a proxy host.

When I set the core-site.xml values listed below for the s3a connector, I get
an AmazonS3Exception:

Caused by: com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception:
The provided security credentials are not valid. (Service: Amazon S3; Status
Code: 403; Error Code: InvalidSecurity; Request ID: null), S3 Extended
Request ID: null

I know this isn't real XML; I just condensed it a bit for readability:
  <name>fs.s3a.access.key</name>
  <value>....</value>
  <name>fs.s3a.secret.key</name>
  <value>....</value>
  <name>fs.s3a.endpoint</name>
  <value>https://storage.googleapis.com</value>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>True</value>
  <name>fs.s3a.proxy.host</name>
  <value>proxyhost</value>
  <name>fs.s3a.proxy.port</name>
  <value>12345</value>
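
For anyone who wants the uncondensed form, here is a minimal sketch that
expands the name/value pairs above into the usual Hadoop
<configuration>/<property> layout. The property names and values are the ones
from the listing (the "...." keys are left as placeholders); the helper name
render_core_site is just an illustration, not part of Hadoop or Spark.

```python
import xml.etree.ElementTree as ET

# Names and values copied from the condensed listing in this post.
# The "...." entries are placeholders for the real access/secret keys.
S3A_PROPERTIES = {
    "fs.s3a.access.key": "....",
    "fs.s3a.secret.key": "....",
    "fs.s3a.endpoint": "https://storage.googleapis.com",
    "fs.s3a.connection.ssl.enabled": "True",
    "fs.s3a.proxy.host": "proxyhost",
    "fs.s3a.proxy.port": "12345",
}


def render_core_site(properties):
    """Return a Hadoop-style <configuration> document as a string.

    Hypothetical helper for illustration: each entry becomes one
    <property> element with <name> and <value> children.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_core_site(S3A_PROPERTIES))
```

This only reproduces the settings I tried; it does not change the 403 result
described above.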

I'd inspect the traffic manually to learn more, but it's all encrypted with
SSL, of course. Any suggestions?

I should also mention that, since this is critical to get working ASAP, I'm
also going down the route of using another local proxy, like this one
<https://github.com/reversefold/python-s3proxy>, written in Python, because
boto handles the interoperability mode correctly and the proxy is designed to
"look like actual S3" to its clients.

Using this approach I can get a basic read to work with Spark, but it has
numerous drawbacks: it is very fragile, it doesn't support writes, and I'm
sure it will be a huge performance bottleneck once we start running larger
workloads. So I'd like to get the native s3a/s3n connectors working if at
all possible, but I need some help.

Thanks!


