Hi,

I think the essential culprit is that these committers are not idempotent, so retry attempts will fail.
See the code linked below for details:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L130
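A toy sketch of that failure mode, in Python on the local filesystem rather than in Spark or on S3. It is not the Spark code linked above; all names (direct_attempt, staged_attempt, run_with_retry) are made up for illustration, and it assumes the writer refuses to overwrite an existing output file. A real object store may behave differently, for example by silently overwriting.

```python
import os
import tempfile


class TaskFailed(Exception):
    """Simulated crash of a task attempt in the middle of a write."""


def write_records(path, records, fail_after=None):
    # Mode "x" creates the file and raises FileExistsError if it already exists.
    with open(path, "x") as out:
        for i, rec in enumerate(records):
            if fail_after is not None and i == fail_after:
                raise TaskFailed(f"simulated crash after {i} records")
            out.write(rec + "\n")


def direct_attempt(dest_dir, part, records, fail_after=None):
    # Direct model: write straight to the final location, no separate commit.
    write_records(os.path.join(dest_dir, part), records, fail_after)


def staged_attempt(dest_dir, part, attempt_id, records, fail_after=None):
    # Rename-based model: write under _temporary/<attempt>, move on success.
    staging = os.path.join(dest_dir, "_temporary", str(attempt_id))
    os.makedirs(staging, exist_ok=True)
    tmp = os.path.join(staging, part)
    write_records(tmp, records, fail_after)
    os.replace(tmp, os.path.join(dest_dir, part))


def run_with_retry(attempt_fn, max_attempts=2):
    # The first attempt crashes after one record; later attempts run cleanly.
    errors = []
    for attempt_id in range(max_attempts):
        fail_after = 1 if attempt_id == 0 else None
        try:
            attempt_fn(attempt_id, fail_after)
            return "ok", errors
        except (TaskFailed, FileExistsError) as exc:
            errors.append(type(exc).__name__)
    return "failed", errors


if __name__ == "__main__":
    records = ["a", "b", "c"]
    with tempfile.TemporaryDirectory() as d:
        # Prints ('failed', ['TaskFailed', 'FileExistsError'])
        print(run_with_retry(
            lambda a, f: direct_attempt(d, "part-00000", records, f)))
    with tempfile.TemporaryDirectory() as d:
        # Prints ('ok', ['TaskFailed'])
        print(run_with_retry(
            lambda a, f: staged_attempt(d, "part-00000", a, records, f)))
```

In the direct model the crashed first attempt leaves a partial part-00000 in the destination, so the retry cannot create it again. In the staged model each attempt writes to its own directory under _temporary, so the retry is unaffected by the earlier failure.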
On Sat, Feb 27, 2016 at 7:38 PM, Igor Berman <igor.ber...@gmail.com> wrote:

> Hi Reynold,
> thanks for the response
> Yes, speculation mode needs some coordination.
> Regarding job failure :
> correct me if I wrong - if one of jobs fails - client code will be sort of
> "notified" by exception or something similar, so the client can decide to
> re-submit action(job), i.e. it won't be "silent" failure.
>
> On 26 February 2016 at 11:50, Reynold Xin <r...@databricks.com> wrote:
>
>> It could lose data in speculation mode, or if any job fails.
>>
>> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <igor.ber...@gmail.com>
>> wrote:
>>
>>> Takeshi, do you know the reason why they wanted to remove this commiter
>>> in SPARK-10063?
>>> the jira has no info inside
>>> as far as I understand the direct committer can't be used when either of
>>> two is true
>>> 1. speculation mode
>>> 2. append mode(ie. not creating new version of data but appending to
>>> existing data)
>>>
>>> On 26 February 2016 at 08:24, Takeshi Yamamuro <linguin....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Great work!
>>>> What is the concrete performance gain of the committer on s3?
>>>> I'd like to know.
>>>>
>>>> I think there is no direct committer for files because these kinds of
>>>> committer has risks
>>>> to loss data (See: SPARK-10063).
>>>> Until this resolved, ISTM files cannot support direct commits.
>>>>
>>>> thanks,
>>>>
>>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <teng...@gmail.com> wrote:
>>>>
>>>>> yes, should be this one
>>>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>>>
>>>>> then need to set it in spark-defaults.conf :
>>>>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>>>>
>>>>> On Friday, 26 February 2016, Yin Yang wrote:
>>>>> > The header of DirectOutputCommitter.scala says Databricks.
>>>>> > Did you get it from Databricks ?
>>>>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
>>>>> >>
>>>>> >> interesting in this topic as well, why the DirectFileOutputCommitter not included?
>>>>> >> we added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>>>> >> moreover, this DirectFileOutputCommitter is not working for the insert operations in HiveContext, since the Committer is called by hive (means uses dependencies in hive package)
>>>>> >> we made some hack to fix this, you can take a look:
>>>>> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>>>> >>
>>>>> >> may bring some ideas to other spark contributors to find a better way to use s3.
>>>>> >>
>>>>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>>>>> >>>
>>>>> >>> Hi,
>>>>> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or alikes
>>>>> >>> especially when working with s3?
>>>>> >>> I know that there is one impl in spark distro for parquet format, but not
>>>>> >>> for files - why?
>>>>> >>>
>>>>> >>> Imho, it can bring huge performance boost.
>>>>> >>> Using default FileOutputCommiter with s3 has big overhead at commit stage
>>>>> >>> when all parts are copied one-by-one to destination dir from _temporary,
>>>>> >>> which is bottleneck when number of partitions is high.
>>>>> >>>
>>>>> >>> Also, wanted to know if there are some problems when using
>>>>> >>> DirectFileOutputCommitter?
>>>>> >>> If writing one partition directly will fail in the middle is spark will
>>>>> >>> notice this and will fail job(say after all retries)?
>>>>> >>>
>>>>> >>> thanks in advance
>>>>> >>>
>>>>> >>> --
>>>>> >>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
>>>>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>> >>>
>>>>> >>> ---------------------------------------------------------------------
>>>>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro

--
---
Takeshi Yamamuro