Can you test by enabling EMRFS consistent view and using an s3:// URI? http://docs.aws.amazon.com/emr/latest/ManagementGuide/enable-consistent-view.html
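In case it helps, here is a rough sketch of enabling it at cluster launch with boto3. The emrfs-site classification and the fs.s3.consistent property come from the page above; the cluster name, region, release label, instance types, and roles are placeholders to adapt:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    response = emr.run_job_flow(
        Name="spark-s3-write-test",   # placeholder cluster name
        ReleaseLabel="emr-5.2.1",     # a release shipping Spark 2.0.2
        Applications=[{"Name": "Spark"}],
        Configurations=[{
            # The emrfs-site classification turns on EMRFS consistent view.
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.consistent": "true"},
        }],
        Instances={
            "MasterInstanceType": "m3.xlarge",  # placeholder instance types
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

The same classification can also be set from the console or via the aws emr create-cluster --configurations flag. With consistent view on, EMRFS tracks object metadata in DynamoDB, so a listing done right after a delete agrees with what was actually deleted.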
-------- Original message --------
From: Steve Loughran <ste...@hortonworks.com>
Date: 20/01/2017 21:17 (GMT+02:00)
To: "VND Tremblay, Paul" <tremblay.p...@bcg.com>
Cc: Takeshi Yamamuro <linguin....@gmail.com>, user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call may still show it. You may be seeing that effect: even after the DELETE has got rid of the files, a listing still sees something there. I suspect the time it takes for the listing to "go away" will depend on the total number of entries underneath, as there are more deletion markers ("tombstones") to propagate around S3.

Try deleting the path and then waiting a short period.

On 20 Jan 2017, at 18:54, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote:

I am using an EMR cluster, and the latest Spark version offered is 2.0.2. The link below indicates that that user had the same problem, which seems unresolved.

Thanks,

Paul

From: Takeshi Yamamuro [mailto:linguin....@gmail.com]
Sent: Thursday, January 19, 2017 9:27 PM
To: VND Tremblay, Paul
Cc: user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

Hi,

Do you get the same exception in v2.1.0 as well? Anyway, I saw another guy reporting the same error, I think:
https://www.mail-archive.com/user@spark.apache.org/msg60882.html

// maropu

On Fri, Jan 20, 2017 at 5:15 AM, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote:

I have come across a problem when writing CSV files to S3 in Spark 2.0.2. The problem does not exist in Spark 1.6.

19:09:20 Caused by: java.io.IOException: File already exists:s3://stx-apollo-pr-datascience-internal/revenue_model/part-r-00025-c48a0d52-9600-4495-913c-64ae6bf888bd.csv

My code is this:

new_rdd\
    .map(add_date_diff)\
    .map(sid_offer_days)\
    .groupByKey()\
    .map(custom_sort)\
    .map(before_rev_date)\
    .map(lambda x, num_weeks=args.num_weeks: create_columns(x, num_weeks))\
    .toDF()\
    .write.csv(
        sep="|",
        header=True,
        nullValue='',
        quote=None,
        path=path
    )

In order to get the path (the last argument), I call this function:

def _get_s3_write(test):
    if s3_utility.s3_data_already_exists(_get_write_bucket_name(), _get_s3_write_dir(test)):
        s3_utility.remove_s3_dir(_get_write_bucket_name(), _get_s3_write_dir(test))
    return make_s3_path(_get_write_bucket_name(), _get_s3_write_dir(test))

In other words, I am removing the directory if it exists before I write.

Notes:

* If I use a small set of data, I don't get the error.
* If I use Spark 1.6, I don't get the error.
* If I read in a simple dataframe and then write to S3, I still get the error (without doing any transformations).
* If I do the previous step with a smaller set of data, I don't get the error.
* I am using pyspark with Python 2.7.
* The thread at https://forums.aws.amazon.com/thread.jspa?threadID=152470 indicates the problem is caused by a sync problem: with large datasets, Spark tries to write multiple times, which causes the error. The suggestion is to turn off speculation, but I believe speculation is turned off by default in pyspark.

Thanks!

Paul
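Following up on Steve's point about tombstones: one way to make that helper defensive is to delete and then poll until the listing stops showing the old objects before handing the path to the writer. A rough sketch, reusing the same (hypothetical) s3_utility helpers from the snippet above; the timeout and poll interval are arbitrary:

    import time

    def _get_s3_write(test, timeout=60, interval=5):
        # Delete any existing output, then wait until the eventually
        # consistent LIST agrees it is gone before we write to it.
        bucket = _get_write_bucket_name()
        key = _get_s3_write_dir(test)
        if s3_utility.s3_data_already_exists(bucket, key):
            s3_utility.remove_s3_dir(bucket, key)
            deadline = time.time() + timeout
            while s3_utility.s3_data_already_exists(bucket, key):
                if time.time() > deadline:
                    raise RuntimeError(
                        "listing still shows %s/%s after delete" % (bucket, key))
                time.sleep(interval)
        return make_s3_path(bucket, key)

It also cannot hurt to turn speculation off explicitly rather than relying on the default, e.g. SparkConf().set("spark.speculation", "false") when building the context. spark.speculation does default to false, so this only makes the assumption visible.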
---
Takeshi Yamamuro