[
https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449830#comment-15449830
]
Steve Loughran commented on SPARK-17307:
----------------------------------------
I think this is a subset of SPARK-7481, where I am doing the docs:
https://github.com/steveloughran/spark/blob/f39018eee40ef463ebfdfb0f6a7ba6384b46c459/docs/cloud-integration.md
I haven't done the bit on authentication setup though; I'm planning to point
to the [Hadoop docs
there|https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html],
because as well as covering how to configure the latest Hadoop s3x
clients, it's got a troubleshooting section.
Looking at the code:
# It's dangerous to put AWS secrets in the source file; it's too easy to leak
them. Stick them in your Spark configuration file instead, prefixed with
{{spark.hadoop}}.
# If you are using Hadoop 2.7+, please use s3a:// paths
instead of s3n://. Your life will be better.
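As a sketch of point 1 (property names here are the standard Hadoop ones for the s3a and s3n connectors; the key values are placeholders), the credentials can go in {{conf/spark-defaults.conf}} rather than in source code. Anything prefixed with {{spark.hadoop.}} is copied into the Hadoop {{Configuration}} at startup:

{code}
# spark-defaults.conf
# Spark strips the spark.hadoop. prefix and passes the rest to Hadoop,
# so the keys never have to appear in application source code.

# s3a (Hadoop 2.7+, preferred):
spark.hadoop.fs.s3a.access.key        YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key        YOUR_SECRET_KEY

# s3n (older connector):
spark.hadoop.fs.s3n.awsAccessKeyId    YOUR_ACCESS_KEY
spark.hadoop.fs.s3n.awsSecretAccessKey YOUR_SECRET_KEY
{code}

Keep this file out of version control, since it now holds secrets.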
Anyway, can you have a look at the cloud integration doc I've linked to, and
comment on the [pull request|https://github.com/apache/spark/pull/12004] on
where it could be improved? I'll do my best.
> Document what all access is needed on S3 bucket when trying to save a model
> ---------------------------------------------------------------------------
>
> Key: SPARK-17307
> URL: https://issues.apache.org/jira/browse/SPARK-17307
> Project: Spark
> Issue Type: Documentation
> Reporter: Aseem Bansal
> Priority: Minor
>
> I faced this lack of documentation when I was trying to save a model to S3.
> Initially I thought it should need only write access. Then I found it also
> needs delete access to remove temporary files. I requested delete access,
> tried again, and got the error:
> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
> org.jets3t.service.S3ServiceException: S3 PUT failed for
> '/dev-qa_%24folder%24' XML Error Message
> To reproduce this error the below can be used
> {code}
> SparkSession sparkSession = SparkSession
>     .builder()
>     .appName("my app")
>     .master("local")
>     .getOrCreate();
> JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
> jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESS_KEY>);
> jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <SECRET ACCESS KEY>);
> // Create a PipelineModel
> pipelineModel.write().overwrite().save("s3n://<BUCKET>/dev-qa/modelTest");
> {code}
> This back and forth could be avoided if it were clearly documented what
> access Spark needs to write to S3. It would also be great to explain why
> each kind of access is needed.
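>
> To illustrate what I mean (a sketch only, not official guidance: the bucket
> name is a placeholder and the exact action list is my assumption): saving a
> model writes objects, lists the destination prefix, and deletes temporary
> files, so an IAM policy along these lines appears to be needed:
>
> {code}
> {
>   "Version": "2012-10-17",
>   "Statement": [
>     {
>       "Sid": "ListDestinationBucket",
>       "Effect": "Allow",
>       "Action": ["s3:ListBucket"],
>       "Resource": "arn:aws:s3:::<BUCKET>"
>     },
>     {
>       "Sid": "ReadWriteDeleteModelObjects",
>       "Effect": "Allow",
>       "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
>       "Resource": "arn:aws:s3:::<BUCKET>/dev-qa/*"
>     }
>   ]
> }
> {code}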
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)