[
https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449830#comment-15449830
]
Steve Loughran commented on SPARK-17307:
----------------------------------------
I think this is a subset of SPARK-7481, where I am doing the docs:
https://github.com/steveloughran/spark/blob/f39018eee40ef463ebfdfb0f6a7ba6384b46c459/docs/cloud-integration.md
I haven't done the bit on authentication setup though; I'm planning to point
to the [Hadoop docs
there|https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html],
because as well as covering how to configure the latest Hadoop s3x
clients, it's got a troubleshooting section.
Looking at the code:
# It's dangerous to put AWS secrets in the source file; it's too easy to leak
them. Stick them in your Spark configuration file instead, prefixed with
{{spark.hadoop}}.
# If you are using Hadoop 2.7+, please use s3a:// paths
instead of s3n://. Your life will be better.
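As a sketch of point 1 (property names here are the standard Hadoop ones for the s3a and s3n connectors; the key values are placeholders), the credentials can go in {{conf/spark-defaults.conf}} rather than in source code. Anything prefixed with {{spark.hadoop.}} is copied into the Hadoop {{Configuration}} at startup:

{code}
# spark-defaults.conf
# Spark strips the spark.hadoop. prefix and passes the rest to Hadoop,
# so the keys never have to appear in application source code.

# s3a (Hadoop 2.7+, preferred):
spark.hadoop.fs.s3a.access.key        YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key        YOUR_SECRET_KEY

# s3n (older connector):
spark.hadoop.fs.s3n.awsAccessKeyId    YOUR_ACCESS_KEY
spark.hadoop.fs.s3n.awsSecretAccessKey YOUR_SECRET_KEY
{code}

Keep this file out of version control, since it now holds secrets.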
Anyway, can you have a look at the cloud integration doc I've linked to, and
comment on the [pull request|https://github.com/apache/spark/pull/12004] on
where it could be improved? I'll do my best.
> Document what all access is needed on S3 bucket when trying to save a model
> ---------------------------------------------------------------------------
>
> Key: SPARK-17307
> URL: https://issues.apache.org/jira/browse/SPARK-17307
> Project: Spark
> Issue Type: Documentation
> Reporter: Aseem Bansal
> Priority: Minor
>
> I faced this lack of documentation when I was trying to save a model to S3.
> Initially I thought it should need only write access. Then I found it also
> needs delete access to remove temporary files. I requested delete access,
> tried again, and got the error:
> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
> org.jets3t.service.S3ServiceException: S3 PUT failed for
> '/dev-qa_%24folder%24' XML Error Message
> To reproduce this error the below can be used
> {code}
> SparkSession sparkSession = SparkSession
>     .builder()
>     .appName("my app")
>     .master("local")
>     .getOrCreate();
> JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
> jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESS_KEY>);
> jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <SECRET ACCESS KEY>);
> // Create a PipelineModel
> pipelineModel.write().overwrite().save("s3n://<BUCKET>/dev-qa/modelTest");
> {code}
> This back and forth could be avoided if it were clearly documented what
> access Spark needs to write to S3. It would also be great to explain why
> each kind of access is needed.
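>
> To illustrate what I mean (a sketch only, not official guidance: the bucket
> name is a placeholder and the exact action list is my assumption): saving a
> model writes objects, lists the destination prefix, and deletes temporary
> files, so an IAM policy along these lines appears to be needed:
>
> {code}
> {
>   "Version": "2012-10-17",
>   "Statement": [
>     {
>       "Sid": "ListDestinationBucket",
>       "Effect": "Allow",
>       "Action": ["s3:ListBucket"],
>       "Resource": "arn:aws:s3:::<BUCKET>"
>     },
>     {
>       "Sid": "ReadWriteDeleteModelObjects",
>       "Effect": "Allow",
>       "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
>       "Resource": "arn:aws:s3:::<BUCKET>/dev-qa/*"
>     }
>   ]
> }
> {code}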
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)