Hello, I'm trying to use Spark with Google Cloud Storage, but from a network where I cannot talk to the outside internet directly. This means all requests heading towards GCS have to go through a proxy we've set up.
So far, I've had good luck with projects that talk to S3 through libraries like boto (for Python) and the AWS SDK (for Node.js), because both appear compatible with what Google calls "interoperability mode" <https://cloud.google.com/storage/docs/migrating>. Spark (or whatever it uses for S3 connectivity under the hood, maybe JetS3t?), on the other hand, doesn't appear to be compatible. Furthermore, I can't use the Hadoop Google Cloud Storage connector, because it doesn't expose any properties for configuring a proxy host.

When I set the following core-site.xml values for the s3a connector, I get an AmazonS3Exception:

Caused by: com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception: The provided security credentials are not valid. (Service: Amazon S3; Status Code: 403; Error Code: InvalidSecurity; Request ID: null), S3 Extended Request ID: null

These are the properties I set:

<property>
  <name>fs.s3a.access.key</name>
  <value>....</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>....</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://storage.googleapis.com</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxyhost</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>12345</value>
</property>

I'd inspect the traffic manually to learn more, but of course it's all encrypted with SSL. Any suggestions?

I should also mention that, since getting this working ASAP is critical, I'm also going down the route of running another local proxy like this one <https://github.com/reversefold/python-s3proxy>, written in Python, because boto handles interoperability mode correctly and the proxy is designed to "look like actual S3" to its clients.
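In case it helps anyone reproduce this: rather than editing core-site.xml, the same s3a settings can also be handed to Spark at submit time, since Spark copies any spark.hadoop.* configuration entries into the underlying Hadoop Configuration. A minimal sketch of building those arguments (the helper name and all values here are placeholders of mine, not anything from a real deployment):

```python
def s3a_conf_args(access_key, secret_key, proxy_host, proxy_port,
                  endpoint="https://storage.googleapis.com"):
    """Build spark-submit --conf arguments for the s3a connector settings."""
    settings = {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.connection.ssl.enabled": "true",
        "fs.s3a.proxy.host": proxy_host,
        "fs.s3a.proxy.port": str(proxy_port),
    }
    args = []
    for key, value in settings.items():
        # Spark forwards spark.hadoop.* entries into the Hadoop Configuration.
        args.append("--conf")
        args.append("spark.hadoop.{}={}".format(key, value))
    return args

# e.g. spark-submit $(s3a_conf_args(...)) my_job.py
print(" ".join(s3a_conf_args("AKID", "SECRET", "proxyhost", 12345)))
```

Same settings either way, of course, so I'd expect the same 403, but it makes it easier to flip values per job while debugging.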
Using this approach I can get a basic read to work with Spark, but it has numerous drawbacks: it's very fragile, it doesn't support writes, and I'm sure it will become a huge performance bottleneck once we start running larger workloads. So I'd like to get the native s3a/s3n connectors working if at all possible, but I need some help. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/s3-access-through-proxy-tp26347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.