To successfully read from S3 using s3a, I've had to also set
```
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
in addition to `spark.hadoop.fs.s3a.access.key` and
`spark.hadoop.fs.s3a.secret.key`. I've also needed to ensure Spark has
access to the AWS SDK jar: I downloaded `aws-java-sdk-1.7.4.jar` (from Maven,
<https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar>)
and placed it alongside `hadoop-aws-2.7.3.jar` in `$SPARK_HOME/jars`.
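Putting those pieces together, a spark-shell launch with these settings looks
roughly like the following. This is only a sketch: the bucket and object names
are placeholders, the credentials are read from environment variables, and it
assumes the two jars above are already in `$SPARK_HOME/jars`.

```
# Rough sketch, not an exact command. Assumes hadoop-aws-2.7.3.jar and
# aws-java-sdk-1.7.4.jar are already present in $SPARK_HOME/jars and that
# AWS credentials are exported as environment variables.
spark-shell \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY"

# Quick smoke test inside the shell (replace with any s3a:// path you can read):
#   sc.textFile("s3a://some-bucket/some-object.txt").take(5)
```

The same properties can also go in `conf/spark-defaults.conf` instead of being
passed as `--conf` flags on every launch.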

These additional configurations don't seem related to credentials and
security (and may not even be needed in my case), but perhaps they will help
you.

Thanks,
Steven

On Wed, Mar 4, 2020 at 1:11 PM Devin Boyer <devin.bo...@mapbox.com.invalid>
wrote:

> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of
> eventually running Spark on Kubernetes. Nearly all the data we currently
> process with Spark is stored in S3, so I need to be able to interface with
> it using the S3A filesystem.
>
> I feel like I've gotten close to getting this working but for some reason
> cannot get my local Spark installations to correctly interface with S3 yet.
>
> A basic example of what I've tried:
>
>    - Build Kubernetes docker images by downloading the
>    spark-2.4.5-bin-hadoop2.7.tgz archive and building the
>    kubernetes/dockerfiles/spark/Dockerfile image.
>    - Run an interactive docker container using the above built image.
>    - Within that container, run spark-shell. This command passes valid
>    AWS credentials by setting spark.hadoop.fs.s3a.access.key and
>    spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
>    hadoop-aws package by specifying the --packages
>    org.apache.hadoop:hadoop-aws:2.7.3 flag.
>    - Try to access the simple public file as outlined in the "Integration
>    with Cloud Infrastructures
>    <https://spark.apache.org/docs/latest/cloud-integration.html#installation>"
>    documentation by running:
>    sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
>    - Observe this to fail with a 403 Forbidden exception thrown by S3
>
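> For concreteness, a rough sketch of that spark-shell invocation (the key
> values below are placeholders, not the real credentials):
>
> ```
> # Sketch only: credentials passed with --conf flags and hadoop-aws pulled
> # in with --packages, as described in the steps above.
> spark-shell \
>   --packages org.apache.hadoop:hadoop-aws:2.7.3 \
>   --conf spark.hadoop.fs.s3a.access.key=EXAMPLE_ACCESS_KEY \
>   --conf spark.hadoop.fs.s3a.secret.key=EXAMPLE_SECRET_KEY
> ```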
>
> I've tried a variety of other means of setting credentials (like exporting
> the standard AWS_ACCESS_KEY_ID environment variable before launching
> spark-shell), and other means of building a Spark image and including the
> appropriate libraries (see this GitHub repo:
> https://github.com/drboyer/spark-s3a-demo), all with the same results.
> I've also tried accessing objects within our AWS account, rather than the
> object from the public landsat-pds bucket, with the same 403 error being
> thrown.
>
> Can anyone help explain why I can't seem to connect to S3 successfully
> using Spark, or even explain where I could look for additional clues as to
> what's misconfigured? I've tried turning up the logging verbosity and
> didn't see much that was particularly useful, but happy to share additional
> log output too.
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer
>
