If you're using Hadoop 2.7 or below, you may also need to set the following Hadoop properties:
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

Hadoop 2.8 and above have these set by default. (A sketch of passing these properties via spark-shell --conf flags is included below the quoted message.)

Thanks,
Hariharan

On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer <devin.bo...@mapbox.com.invalid> wrote:
>
> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of
> eventually running Spark on Kubernetes. Nearly all the data we currently
> process with Spark is stored in S3, so I need to be able to interface with
> it using the S3A filesystem.
>
> I feel like I've gotten close to getting this working, but for some reason
> I cannot get my local Spark installations to interface with S3 correctly yet.
>
> A basic example of what I've tried:
>
> - Build Kubernetes Docker images by downloading the
>   spark-2.4.5-bin-hadoop2.7.tgz archive and building the
>   kubernetes/dockerfiles/spark/Dockerfile image.
> - Run an interactive Docker container using the image built above.
> - Within that container, run spark-shell. This command passes valid AWS
>   credentials by setting spark.hadoop.fs.s3a.access.key and
>   spark.hadoop.fs.s3a.secret.key with --conf flags, and downloads the
>   hadoop-aws package by specifying the --packages
>   org.apache.hadoop:hadoop-aws:2.7.3 flag.
> - Try to access the simple public file described in the "Integration with
>   Cloud Infrastructures" documentation by running:
>   sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> - Observe this fail with a 403 Forbidden exception thrown by S3.
>
> I've tried a variety of other means of setting credentials (like exporting
> the standard AWS_ACCESS_KEY_ID environment variable before launching
> spark-shell), and other means of building a Spark image and including the
> appropriate libraries (see this GitHub repo:
> https://github.com/drboyer/spark-s3a-demo), all with the same results. I've
> also tried accessing objects within our AWS account, rather than the object
> from the public landsat-pds bucket, with the same 403 error being thrown.
>
> Can anyone help explain why I can't seem to connect to S3 successfully
> using Spark, or point me to where I could look for additional clues as to
> what's misconfigured? I've tried turning up the logging verbosity and
> didn't see much that was particularly useful, but I'm happy to share
> additional log output too.
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer
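Below is a minimal sketch of a spark-shell invocation that applies the four properties above (using the spark.hadoop. prefix so they reach the Hadoop configuration) together with the S3A credentials. YOUR_ACCESS_KEY and YOUR_SECRET_KEY are placeholders, and the hadoop-aws version should match the Hadoop build in your image:

  spark-shell \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A \
    --conf spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A \
    --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
    --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY

  # then, inside the shell, retry the read from the original message:
  # scala> sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)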