If you're using Hadoop 2.7 or below, you may also need to set the
following Hadoop properties:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

Hadoop 2.8 and above have these set by default.
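
For example, with spark-shell these can be passed straight on the command
line via the spark.hadoop. prefix (same values as above; adjust as needed
for your deployment):

spark-shell \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A \
  --conf spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

The same spark.hadoop.* entries can also go in conf/spark-defaults.conf if
you'd rather not repeat them on every invocation.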

Thanks,
Hariharan

On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer
<devin.bo...@mapbox.com.invalid> wrote:
>
> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of 
> eventually running Spark on Kubernetes. Nearly all the data we currently 
> process with Spark is stored in S3, so I need to be able to interface with it 
> using the S3A filesystem.
>
> I feel like I've gotten close to getting this working but for some reason 
> cannot get my local Spark installations to correctly interface with S3 yet.
>
> A basic example of what I've tried:
>
> 1. Build the Kubernetes Docker image by downloading the
>    spark-2.4.5-bin-hadoop2.7.tgz archive and building the
>    kubernetes/dockerfiles/spark/Dockerfile image.
> 2. Run an interactive Docker container using the image built above.
> 3. Within that container, run spark-shell, passing valid AWS credentials
>    by setting spark.hadoop.fs.s3a.access.key and
>    spark.hadoop.fs.s3a.secret.key with --conf flags, and pulling in the
>    hadoop-aws package with the --packages org.apache.hadoop:hadoop-aws:2.7.3
>    flag (full invocation sketched below).
> 4. Try to access the simple public file mentioned in the "Integration with
>    Cloud Infrastructures" documentation by running:
>    sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> 5. Observe this fail with a 403 Forbidden exception thrown by S3.
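>
> For reference, the invocation in step 3 looks roughly like this (the key
> values are placeholders, not my real credentials):
>
> # key values below are placeholders
> spark-shell \
>   --packages org.apache.hadoop:hadoop-aws:2.7.3 \
>   --conf spark.hadoop.fs.s3a.access.key=AKIAEXAMPLEKEY \
>   --conf spark.hadoop.fs.s3a.secret.key=EXAMPLESECRETKEY
>
> followed by, inside the shell:
>
> sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)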
>
>
> I've tried a variety of other ways of setting credentials (like exporting
> the standard AWS_ACCESS_KEY_ID environment variable before launching
> spark-shell) and other ways of building a Spark image that includes the
> appropriate libraries (see this GitHub repo:
> https://github.com/drboyer/spark-s3a-demo), all with the same result. I've
> also tried accessing objects within our own AWS account, rather than the
> object in the public landsat-pds bucket, and the same 403 error is thrown.
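>
> (Another variant of the same thing, for concreteness, is setting the keys
> directly on the Hadoop configuration from inside spark-shell; the key values
> below are placeholders:)
>
> // inside spark-shell; key values are placeholders, not real credentials
> sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIAEXAMPLEKEY")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "EXAMPLESECRETKEY")
> sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)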
>
> Can anyone help explain why I can't seem to connect to S3 successfully from
> Spark, or suggest where I could look for additional clues about what's
> misconfigured? I've tried turning up the logging verbosity and didn't see
> much that was particularly useful, but I'm happy to share additional log
> output too.
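>
> (In case it's relevant, by turning up verbosity I mean something along the
> lines of the following in conf/log4j.properties; is there a more useful set
> of loggers to enable?)
>
> # loggers covering the S3A connector and the AWS SDK client
> log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
> log4j.logger.com.amazonaws=DEBUG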
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer

