Can you test by enabling EMRFS consistent view and using an s3:// URI? http://docs.aws.amazon.com/emr/latest/ManagementGuide/enable-consistent-view.html
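In case it helps, here is a rough sketch of enabling it at cluster launch with boto3. The emrfs-site classification and the fs.s3.consistent property come from the page above; the cluster name, region, release label, instance types, and roles are placeholders to adapt:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    response = emr.run_job_flow(
        Name="spark-s3-write-test",   # placeholder cluster name
        ReleaseLabel="emr-5.2.1",     # a release shipping Spark 2.0.2
        Applications=[{"Name": "Spark"}],
        Configurations=[{
            # The emrfs-site classification turns on EMRFS consistent view.
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.consistent": "true"},
        }],
        Instances={
            "MasterInstanceType": "m3.xlarge",  # placeholder instance types
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

The same classification can also be set from the console or via the aws emr create-cluster --configurations flag. With consistent view on, EMRFS tracks object metadata in DynamoDB, so a listing done right after a delete agrees with what was actually deleted.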
-------- Original message --------
From: Steve Loughran <ste...@hortonworks.com>
Date: 20/01/2017 21:17 (GMT+02:00)
To: "VND Tremblay, Paul" <tremblay.p...@bcg.com>
Cc: Takeshi Yamamuro <linguin....@gmail.com>, user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call may still show it. You may be seeing that effect: even after the DELETE has got rid of the files, a listing still sees something there. I suspect the time it takes for the listing to "go away" will depend on the total number of entries underneath, as there are more deletion markers ("tombstones") to propagate around S3.

Try deleting the path and then waiting a short period.

On 20 Jan 2017, at 18:54, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote:

I am using an EMR cluster, and the latest Spark version offered is 2.0.2. The link below indicates that that user had the same problem, which seems unresolved.

Thanks,

Paul

From: Takeshi Yamamuro [mailto:linguin....@gmail.com]
Sent: Thursday, January 19, 2017 9:27 PM
To: VND Tremblay, Paul
Cc: user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

Hi,

Do you get the same exception in v2.1.0 as well? Anyway, I saw another guy reporting the same error, I think:
https://www.mail-archive.com/user@spark.apache.org/msg60882.html

// maropu

On Fri, Jan 20, 2017 at 5:15 AM, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote:

I have come across a problem when writing CSV files to S3 in Spark 2.0.2. The problem does not exist in Spark 1.6.

19:09:20 Caused by: java.io.IOException: File already exists:s3://stx-apollo-pr-datascience-internal/revenue_model/part-r-00025-c48a0d52-9600-4495-913c-64ae6bf888bd.csv

My code is this:

new_rdd\
    .map(add_date_diff)\
    .map(sid_offer_days)\
    .groupByKey()\
    .map(custom_sort)\
    .map(before_rev_date)\
    .map(lambda x, num_weeks=args.num_weeks: create_columns(x, num_weeks))\
    .toDF()\
    .write.csv(
        sep="|",
        header=True,
        nullValue='',
        quote=None,
        path=path
    )

In order to get the path (the last argument), I call this function:

def _get_s3_write(test):
    if s3_utility.s3_data_already_exists(_get_write_bucket_name(), _get_s3_write_dir(test)):
        s3_utility.remove_s3_dir(_get_write_bucket_name(), _get_s3_write_dir(test))
    return make_s3_path(_get_write_bucket_name(), _get_s3_write_dir(test))

In other words, I am removing the directory if it exists before I write.

Notes:

* If I use a small set of data, I don't get the error.
* If I use Spark 1.6, I don't get the error.
* If I read in a simple dataframe and then write to S3, I still get the error (without doing any transformations).
* If I do the previous step with a smaller set of data, I don't get the error.
* I am using pyspark with Python 2.7.
* The thread at https://forums.aws.amazon.com/thread.jspa?threadID=152470 indicates the problem is caused by a sync problem: with large datasets, Spark tries to write multiple times, which causes the error. The suggestion is to turn off speculation, but I believe speculation is turned off by default in pyspark.

Thanks!

Paul
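Following up on Steve's point about tombstones: one way to make that helper defensive is to delete and then poll until the listing stops showing the old objects before handing the path to the writer. A rough sketch, reusing the same (hypothetical) s3_utility helpers from the snippet above; the timeout and poll interval are arbitrary:

    import time

    def _get_s3_write(test, timeout=60, interval=5):
        # Delete any existing output, then wait until the eventually
        # consistent LIST agrees it is gone before we write to it.
        bucket = _get_write_bucket_name()
        key = _get_s3_write_dir(test)
        if s3_utility.s3_data_already_exists(bucket, key):
            s3_utility.remove_s3_dir(bucket, key)
            deadline = time.time() + timeout
            while s3_utility.s3_data_already_exists(bucket, key):
                if time.time() > deadline:
                    raise RuntimeError(
                        "listing still shows %s/%s after delete" % (bucket, key))
                time.sleep(interval)
        return make_s3_path(bucket, key)

It also cannot hurt to turn speculation off explicitly rather than relying on the default, e.g. SparkConf().set("spark.speculation", "false") when building the context. spark.speculation does default to false, so this only makes the assumption visible.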
---
Takeshi Yamamuro