> Begin forwarded message:
>
> From: "Chen, Kevin" <kevin.c...@neustar.biz>
> Subject: Re: Missing output partition file in S3
> Date: September 19, 2016 at 10:54:44 AM PDT
> To: Steve Loughran <ste...@hortonworks.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
>
> Hi Steve,
>
> Our S3 bucket is in US East, but the issue also occurred when we were using a
> bucket in US West. We are using s3n. We use a Spark standalone deployment and
> run the job in EC2. The datasets are about 25GB. We did not have speculative
> execution turned on, and we did not use DirectCommitter.
>
> Thanks,
> Kevin
>
> From: Steve Loughran <ste...@hortonworks.com>
> Date: Friday, September 16, 2016 at 3:46 AM
> To: Chen Kevin <kevin.c...@neustar.biz>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Missing output partition file in S3
>
>
>> On 15 Sep 2016, at 19:37, Chen, Kevin <kevin.c...@neustar.biz> wrote:
>>
>> Hi,
>>
>> Has anyone encountered an issue with a missing output partition file in S3?
>> My Spark job writes its output to an S3 location. Occasionally, I notice that
>> one partition file is missing and, as a result, one chunk of data is lost.
>> If I rerun the same job, the problem usually goes away. It happens fairly
>> randomly: I have observed it once or twice a week on a job that runs daily.
>> I am using Spark 1.2.1.
>>
>> Any input or suggestions for a fix or workaround would be very much
>> appreciated.
>>
>>
>>
>
> This doesn't sound good
>
> Without making any promises about being able to fix this, I would like to
> understand the setup to see if there is something that could be done to
> address this
> - Which S3 installation: US East or elsewhere?
> - Which S3 client: s3n or s3a? If you are on Hadoop 2.7+, can you switch to
>   s3a if you haven't already? (The exception: if you are using AWS EMR, you
>   have to stick with their s3:// client.)
> - Are you running in EC2 or remotely?
> - How big are the datasets being generated?
> - Do you have speculative execution turned on?
> - Which committer: the external "DirectCommitter" or the classic Hadoop
>   FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you
>   try the v2 algorithm? (That is
>   mapreduce.fileoutputcommitter.algorithm.version = 2, which Spark accepts
>   with a spark.hadoop. prefix.)
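For reference, the speculation and committer settings asked about above can be passed to spark-submit as `--conf` options. The sketch below is illustrative only: the helper name `committer_conf_args` and the script name `my_job.py` are hypothetical, and it assumes a Hadoop 2.7+ build, since the v2 commit algorithm is not available in older releases. Spark copies any `spark.hadoop.*` property into the Hadoop configuration, so the Hadoop key `mapreduce.fileoutputcommitter.algorithm.version` is set with that prefix.

```python
def committer_conf_args(algorithm_version=2, speculation=False):
    """Build spark-submit --conf arguments for the settings discussed above.

    Hypothetical helper for illustration; only the property names are real.
    """
    settings = {
        # Speculative execution can launch duplicate attempts of the same task.
        "spark.speculation": str(speculation).lower(),
        # spark.hadoop.* keys are copied into the Hadoop configuration.
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version":
            str(algorithm_version),
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # my_job.py is a placeholder for the actual application.
    print(" ".join(["spark-submit"] + committer_conf_args() + ["my_job.py"]))
```

Whether the v2 algorithm helps in this particular case is not established by the thread; it is one of the things being asked, not a confirmed fix.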
>
> I should warn that the stance of myself and my colleagues is "don't commit
> directly to S3": write to HDFS and do a distcp when you finally copy out the
> data. S3 itself doesn't have enough consistency for committing output to work
> in the presence of all race conditions and failure modes. At least here you've
> noticed the problem; the thing people fear is not noticing that a problem has
> arisen.
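One way to reduce the risk of a loss going unnoticed is to check the output listing against the number of partitions the job was expected to write. The following is a minimal sketch, not something proposed in the thread: the function name `missing_partitions` is hypothetical, it assumes the job's partition count is known in advance, and it assumes output files are named `part-00000`, `part-00001`, and so on (optionally with an `r-` or `m-` infix and a suffix). Obtaining the listing itself is left to the caller. Given the consistency caveat above, a listing taken right after the job may itself be incomplete, so a reported gap is a prompt to re-check rather than proof of loss.

```python
import re

# Assumed naming: part-00000, part-00001, ... optionally with r-/m- and a suffix.
PART_PATTERN = re.compile(r"^part-(?:r-|m-)?(\d{5})")


def missing_partitions(file_names, expected_count):
    """Return the sorted partition indices that have no matching output file."""
    seen = set()
    for name in file_names:
        base = name.rsplit("/", 1)[-1]  # accept full keys or bare file names
        match = PART_PATTERN.match(base)
        if match:
            seen.add(int(match.group(1)))
    return sorted(set(range(expected_count)) - seen)


if __name__ == "__main__":
    # Example listing in which the file for partition 1 is absent.
    listing = ["out/_SUCCESS", "out/part-00000", "out/part-00002"]
    print(missing_partitions(listing, expected_count=3))
```

A daily job could run such a check as a final step and fail loudly when the returned list is non-empty.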
>
> -Steve