Re: DirectFileOutputCommiter

Steve Loughran Mon, 29 Feb 2016 02:27:07 -0800

> On 26 Feb 2016, at 06:24, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> 
> Hi,
> 
> Great work!
> What is the concrete performance gain of the committer on s3?
> I'd like to know.
> 
> I think there is no direct committer for files because these kinds of 
> committer has risks
> to loss data (See: SPARK-10063).
> Until this resolved, ISTM files cannot support direct commits.
>


That's speculative output via any committer; you cannot use S3 as a speculative 
destination for Spark, MR, Hive, etc.

Speculative output relies on the commit operation, whether a file create (with 
overwrite==false), a file rename or a directory rename, being atomic with 
respect to the check for the destination existing and the operation of creating 
or renaming. There's also a tendency to assume that file and directory rename 
operations are O(1).
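
To make the atomicity requirement concrete, here is a minimal Python sketch against a local POSIX filesystem (not HDFS or S3). The helper names commit_by_create and commit_by_rename are made up for this illustration and are not part of any Hadoop or Spark API. The point is only that the existence check and the create/rename happen as one step, so exactly one of two competing attempts can win.

```python
import os
import tempfile


def commit_by_create(dest: str, data: bytes) -> bool:
    """Commit by creating dest with overwrite disabled.

    O_CREAT | O_EXCL makes "check that dest is absent" and "create dest"
    a single atomic operation on a local filesystem.
    """
    try:
        fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False  # another attempt committed first
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True


def commit_by_rename(attempt_dir: str, dest_dir: str) -> bool:
    """Commit by renaming a non-empty attempt directory onto dest_dir.

    On POSIX the rename fails if dest_dir already exists and is non-empty,
    so a second attempt cannot replace committed output. For brevity this
    sketch treats any OSError as "lost the race".
    """
    try:
        os.rename(attempt_dir, dest_dir)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    dest = os.path.join(root, "part-00000")
    print(commit_by_create(dest, b"attempt 1"))  # first attempt wins
    print(commit_by_create(dest, b"attempt 2"))  # second attempt is rejected
```

Neither helper has an equivalent single-step operation on a blob store, which is where the trouble starts.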

S3 (and OpenStack Swift) don't offer those semantics. The check for existence 
is done client-side before the operation, against a remote store whose metadata 
may not be consistent anyway (i.e. it says the blob isn't there when it is, and 
vice versa). With rename() and delete() being done client-side, they are 
O(files * len(files)), and can fail partway through.
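
A rough sketch of why a client-side rename is expensive and non-atomic, using an in-memory stand-in for a blob store. FakeObjectStore, its fail_after_copies knob and client_side_rename are all hypothetical names for this example; this is not how any real S3 client is written, only a model of "copy every blob, then delete the original".

```python
class FakeObjectStore:
    """In-memory stand-in for a blob store: flat keys, no rename primitive."""

    def __init__(self, fail_after_copies=None):
        self.blobs = {}
        self.copies = 0
        self.bytes_copied = 0
        # Hypothetical knob: raise once this many copies have succeeded.
        self.fail_after_copies = fail_after_copies

    def put(self, key, data):
        self.blobs[key] = data

    def copy(self, src, dst):
        if (self.fail_after_copies is not None
                and self.copies >= self.fail_after_copies):
            raise OSError("simulated failure during copy")
        self.blobs[dst] = self.blobs[src]
        self.copies += 1
        self.bytes_copied += len(self.blobs[src])

    def delete(self, key):
        del self.blobs[key]


def client_side_rename(store, src_prefix, dst_prefix):
    """Emulate a directory rename: one copy plus one delete per blob.

    Work grows with the number of blobs and their sizes, and a failure
    midway leaves some blobs under each prefix.
    """
    keys = sorted(k for k in store.blobs if k.startswith(src_prefix))
    for key in keys:
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)


if __name__ == "__main__":
    store = FakeObjectStore(fail_after_copies=1)
    for name, size in (("a", 10), ("b", 20), ("c", 30)):
        store.put("tmp/" + name, b"x" * size)
    try:
        client_side_rename(store, "tmp/", "out/")
    except OSError:
        pass
    print(sorted(store.blobs))  # output is split across both prefixes
```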

What the direct committer does is bypass any attempt to write and then commit 
by renaming; that rename is the performance killer.
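
As a back-of-the-envelope comparison, the sketch below counts bytes written or copied within the store under the two strategies, with a plain dict standing in for the blob store. The function names, the "_temporary/" prefix and the byte counting are assumptions for illustration only; they are not the actual committer code.

```python
def write_then_rename(store: dict, outputs: dict, dest: str) -> int:
    """Write under a temporary prefix, then 'rename' by copy and delete.

    Returns the number of bytes written plus bytes copied.
    """
    moved = 0
    for name, data in outputs.items():
        store["_temporary/" + name] = data
        moved += len(data)  # initial write
    for name in outputs:
        data = store.pop("_temporary/" + name)  # delete the source blob
        store[dest + name] = data               # copy to the final key
        moved += len(data)  # every byte is handled a second time
    return moved


def write_direct(store: dict, outputs: dict, dest: str) -> int:
    """Write straight to the final keys; there is no commit step at all."""
    moved = 0
    for name, data in outputs.items():
        store[dest + name] = data
        moved += len(data)
    return moved


if __name__ == "__main__":
    outputs = {"part-0": b"x" * 100, "part-1": b"y" * 50}
    print(write_then_rename({}, outputs, "out/"))
    print(write_direct({}, outputs, "out/"))
```

Both strategies end with the same blobs in place; the direct path simply skips the second pass. The flip side is that without a commit step, output from a failed or duplicate attempt is visible at the destination as soon as it is written.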

