Fwd: Missing output partition file in S3

Richard Catlin Mon, 19 Sep 2016 11:49:18 -0700

> Begin forwarded message:
> 
> From: "Chen, Kevin" <kevin.c...@neustar.biz>
> Subject: Re: Missing output partition file in S3
> Date: September 19, 2016 at 10:54:44 AM PDT
> To: Steve Loughran <ste...@hortonworks.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> 
> Hi Steve,
> 
> Our S3 bucket is in US East, but this issue also occurred when we were using 
> an S3 bucket in US West. We are using s3n. We use a Spark standalone 
> deployment and run the job in EC2. The datasets are about 25 GB. We did not 
> have speculative execution turned on, and we did not use DirectCommitter.
> 
> Thanks,
> Kevin
> 
> From: Steve Loughran <ste...@hortonworks.com>
> Date: Friday, September 16, 2016 at 3:46 AM
> To: Chen Kevin <kevin.c...@neustar.biz>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Missing output partition file in S3
> 
> 
>> On 15 Sep 2016, at 19:37, Chen, Kevin <kevin.c...@neustar.biz> wrote:
>> 
>> Hi,
>> 
>> Has anyone encountered an issue with a missing output partition file in S3? 
>> My Spark job writes output to an S3 location. Occasionally, I notice that 
>> one partition file is missing and, as a result, one chunk of data is lost. 
>> If I rerun the same job, the problem usually goes away. This has been 
>> happening fairly randomly: I have observed it once or twice a week on a job 
>> that runs daily. I am using Spark 1.2.1.
>> 
>> Any input or suggestion for a fix or workaround would be much appreciated.
>> 
>> 
>> 
> 
> This doesn't sound good.
> 
> Without making any promises about being able to fix this, I would like to 
> understand the setup to see if there is something that could be done to 
> address this:
> 
> - Which S3 installation? US East or elsewhere?
> - Which S3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if 
>   you haven't already? (Exception: if you are using AWS EMR, you have to 
>   stick with their s3:// client.)
> - Are you running in EC2 or remotely?
> - How big are the datasets being generated?
> - Do you have speculative execution turned on?
> - Which committer: the external "DirectCommitter", or the classic Hadoop 
>   FileOutputCommitter? If the latter and you are using Hadoop 2.7.x, can you 
>   try the v2 algorithm 
>   (spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2)?
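
[Editorial sketch, not part of the original message.] The Hadoop property behind the v2 commit algorithm is mapreduce.fileoutputcommitter.algorithm.version. From Spark it can be supplied with the spark.hadoop. prefix, which Spark copies into the job's Hadoop configuration. The snippet below is a minimal illustration, assuming Hadoop 2.7.x on the classpath. The helper name committer_settings is hypothetical, and whether v2 helps with the missing-file symptom described in this thread is not established here.

```python
def committer_settings(algorithm_version=2):
    """Return Spark conf pairs selecting the FileOutputCommitter algorithm.

    The "spark.hadoop." prefix makes Spark forward the property to the
    Hadoop Configuration used by the job. Assumes Hadoop 2.7.x, where
    mapreduce.fileoutputcommitter.algorithm.version is available.
    """
    if algorithm_version not in (1, 2):
        raise ValueError("algorithm_version must be 1 or 2")
    return [
        ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version",
         str(algorithm_version)),
    ]


if __name__ == "__main__":
    # With PySpark installed, these pairs could be applied with
    # SparkConf().setAll(committer_settings()); here they are only
    # printed in the form accepted by spark-submit.
    for key, value in committer_settings():
        print("--conf %s=%s" % (key, value))
```

Keeping the setting in one helper makes it easy to flip between algorithm 1 and 2 when comparing runs.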
> 
> I should warn that the stance of myself and my colleagues is "don't commit 
> directly to S3": write to HDFS and run distcp when you finally copy out the 
> data. S3 itself doesn't have enough consistency for committing output to work 
> in the presence of all race conditions and failure modes. At least here 
> you've noticed the problem; the thing people fear is not noticing that a 
> problem has arisen.
> 
> -Steve
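
[Editorial sketch, not part of the original thread.] Since the main worry above is a lost partition going unnoticed, one option is to compare the output listing against the expected partition count after each run. The function below is a minimal sketch under stated assumptions: every partition writes a Hadoop-style file named part-NNNNN or part-r-NNNNN, and the caller already has a listing of the output directory. The name missing_partitions and the sample keys are hypothetical.

```python
import re

# Matches Hadoop-style output names such as part-00003 or part-r-00003.
PART_FILE = re.compile(r"^part-(?:[mr]-)?(\d+)")


def missing_partitions(keys, expected_count):
    """Return sorted partition indices with no matching part file in keys.

    keys: iterable of object keys or paths from a listing of the output
    directory; expected_count: number of partitions the job should write.
    """
    found = set()
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # strip any directory prefix
        match = PART_FILE.match(name)
        if match:
            found.add(int(match.group(1)))
    return [i for i in range(expected_count) if i not in found]


if __name__ == "__main__":
    # Hypothetical listing in which partition 1 was never written.
    listing = ["out/_SUCCESS", "out/part-00000", "out/part-00002"]
    print(missing_partitions(listing, 3))
```

An S3 listing taken right after the job finishes may itself be stale, so a reported gap is best treated as a prompt to re-check or rerun rather than as proof of data loss.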