> On 26 Feb 2016, at 06:24, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
> Hi,
>
> Great work!
> What is the concrete performance gain of the committer on s3?
> I'd like to know.
>
> I think there is no direct committer for files because these kinds of
> committers risk losing data (See: SPARK-10063).
> Until this is resolved, ISTM files cannot support direct commits.
>

That's speculative output via any committer: you cannot use S3 as a speculative destination for Spark, MR, Hive, etc.

Speculative output relies on being able to commit atomically: a file create (with overwrite==false), a file rename, or a directory rename has to be atomic with respect to both the check for the destination existing and the create or rename itself. There's also a tendency to assume that file and directory rename operations are O(1).

S3 (and OpenStack Swift) don't offer those semantics. The check for existence is done client-side before the operation, against a remote store whose metadata may not be consistent anyway (i.e. it says the blob isn't there when it is, and vice versa). With rename() and delete() being done client-side, they are O(files * len(files)), that is, proportional to the number of files and the amount of data in them, and they can fail partway through.

What the direct committer does is bypass any attempt to write and then commit by renaming, which is the performance killer.
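
To make the rename cost and the partial-failure problem concrete, here is a rough sketch in Python. It is a toy in-memory model, not the S3 or Hadoop API: ToyBlobStore, client_side_rename and the fail_after knob are all made-up names for illustration. The only assumption it encodes is that a "directory rename" against a blob store is emulated client-side as a copy plus a delete per object.

```python
class ToyBlobStore:
    """In-memory stand-in for an object store: flat keys, no real directories."""

    def __init__(self):
        self.blobs = {}        # key -> bytes
        self.bytes_copied = 0  # running total, to show the cost of a rename

    def put(self, key, data):
        self.blobs[key] = data

    def copy(self, src, dst):
        data = self.blobs[src]
        self.bytes_copied += len(data)  # cost grows with the data size
        self.blobs[dst] = data

    def delete(self, key):
        del self.blobs[key]


def client_side_rename(store, src_prefix, dst_prefix, fail_after=None):
    """Emulate a directory rename as one copy + delete per object.

    fail_after is a hypothetical knob: raise once that many objects have
    been moved, to mimic a client dying partway through the rename.
    """
    moved = 0
    # Snapshot the matching keys first, since the loop mutates the store.
    for key in sorted(k for k in store.blobs if k.startswith(src_prefix)):
        if fail_after is not None and moved == fail_after:
            raise IOError("simulated failure partway through rename")
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)
        moved += 1
    return moved


if __name__ == "__main__":
    store = ToyBlobStore()
    for i in range(4):
        store.put(f"_temporary/part-{i}", b"x" * 1000)

    try:
        client_side_rename(store, "_temporary/", "out/", fail_after=2)
    except IOError:
        pass

    # Half the output is now under out/, half is still under _temporary/.
    print(sorted(store.blobs))
    print(store.bytes_copied)
```

In this model the work done is the sum of the object sizes, and a failure in the middle leaves the destination half-populated, with nothing to roll it back. A real filesystem rename avoids both; a direct committer sidesteps the rename altogether by writing to the final destination.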