Thanks for the input, Steven and Hariharan. I think this ended up being a combination of bad configuration with the credential providers I was using *and* using the wrong set of credentials for the test data I was trying to access.
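
For the record, the shape of a working invocation is roughly the following. This is only a sketch: the hadoop-aws version has to match the Hadoop version your Spark build uses (3.1.0 below is just an example), and resolving it with --packages should pull the matching AWS SDK jar in transitively.

    # example only: match hadoop-aws to the Hadoop version of your Spark build
    spark-shell \
      --packages org.apache.hadoop:hadoop-aws:3.1.0 \
      --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider

    scala> sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)

The anonymous credentials provider is just one way to test against the public landsat-pds file; for private buckets, leave it out and pass spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key with --conf as in the original mail below.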
I was able to get this working with both Hadoop 2.8 and 3.1 by pulling down the correct hadoop-aws and aws-java-sdk[-bundle] bundles and fixing the credential provider I was using for testing. It's probably the same for the Spark distribution compiled for Hadoop 2.7, but since I already have a build with a more modern Hadoop version working, I may just stick with that.

Best,
Devin

On Wed, Mar 4, 2020 at 11:02 PM Hariharan <hariharan...@gmail.com> wrote:
> If you're using hadoop 2.7 or below, you may also need to use the following hadoop settings:
>
> fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
> fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
>
> Hadoop 2.8 and above would have these set by default.
>
> Thanks,
> Hariharan
>
> On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer <devin.bo...@mapbox.com.invalid> wrote:
> >
> > Hello,
> >
> > I'm attempting to run Spark within a Docker container with the hope of eventually running Spark on Kubernetes. Nearly all the data we currently process with Spark is stored in S3, so I need to be able to interface with it using the S3A filesystem.
> >
> > I feel like I've gotten close to getting this working, but for some reason I cannot get my local Spark installations to correctly interface with S3 yet.
> >
> > A basic example of what I've tried:
> >
> > - Build Kubernetes Docker images by downloading the spark-2.4.5-bin-hadoop2.7.tgz archive and building the kubernetes/dockerfiles/spark/Dockerfile image.
> > - Run an interactive Docker container using the image built above.
> > - Within that container, run spark-shell. This command passes valid AWS credentials by setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the hadoop-aws package by specifying the --packages org.apache.hadoop:hadoop-aws:2.7.3 flag.
> > - Try to access the simple public file as outlined in the "Integration with Cloud Infrastructures" documentation by running: sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> > - Observe this fail with a 403 Forbidden exception thrown by S3.
> >
> > I've tried a variety of other means of setting credentials (like exporting the standard AWS_ACCESS_KEY_ID environment variable before launching spark-shell) and other means of building a Spark image that includes the appropriate libraries (see this GitHub repo: https://github.com/drboyer/spark-s3a-demo), all with the same results. I've also tried accessing objects within our AWS account, rather than the object from the public landsat-pds bucket, with the same 403 error being thrown.
> >
> > Can anyone help explain why I can't seem to connect to S3 successfully using Spark, or even point me to where I could look for additional clues as to what's misconfigured? I've tried turning up the logging verbosity and didn't see much that was particularly useful, but I'm happy to share additional log output too.
> >
> > Thanks for any help you can provide!
> >
> > Best,
> > Devin Boyer
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
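
(A note for anyone who lands on this thread while stuck on a Hadoop 2.7 build: the settings Hariharan listed above are ordinary Hadoop configuration keys, so besides putting them in core-site.xml you can pass them through Spark by prefixing each with spark.hadoop., for example:

    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

or the equivalent entries in spark-defaults.conf.)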