It could lose data in speculation mode, or if any job fails.

On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <igor.ber...@gmail.com> wrote:
> Takeshi, do you know the reason why they wanted to remove this committer in
> SPARK-10063? The JIRA has no info inside.
> As far as I understand, the direct committer can't be used when either of
> the two is true:
> 1. speculation mode
> 2. append mode (i.e. not creating a new version of the data but appending to
> existing data)
>
> On 26 February 2016 at 08:24, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Hi,
>>
>> Great work!
>> What is the concrete performance gain of the committer on S3?
>> I'd like to know.
>>
>> I think there is no direct committer for files because these kinds of
>> committers carry a risk of data loss (see SPARK-10063).
>> Until this is resolved, ISTM files cannot support direct commits.
>>
>> thanks,
>>
>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <teng...@gmail.com> wrote:
>>
>>> yes, it should be this one:
>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>
>>> then it needs to be set in spark-defaults.conf:
>>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>>
>>> On Friday, 26 February 2016, Yin Yang wrote:
>>> > The header of DirectOutputCommitter.scala says Databricks.
>>> > Did you get it from Databricks?
>>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
>>> >>
>>> >> interested in this topic as well; why is the DirectFileOutputCommitter
>>> >> not included?
>>> >> we added it in our fork, under
>>> >> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>> >> moreover, this DirectFileOutputCommitter does not work for insert
>>> >> operations in HiveContext, since the committer is called by Hive
>>> >> (i.e. it uses dependencies from the hive package).
>>> >> we made some hack to fix this, you can take a look:
>>> >>
>>> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>> >>
>>> >> it may bring some ideas to other spark contributors to find a better way
>>> >> to use s3.
>>> >>
>>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>>> >>>
>>> >>> Hi,
>>> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or the like,
>>> >>> especially when working with S3?
>>> >>> I know that there is one impl in the Spark distro for the parquet format,
>>> >>> but not for plain files - why?
>>> >>>
>>> >>> Imho, it can bring a huge performance boost.
>>> >>> Using the default FileOutputCommitter with S3 has a big overhead at the
>>> >>> commit stage, when all parts are copied one-by-one to the destination dir
>>> >>> from _temporary, which is a bottleneck when the number of partitions is high.
>>> >>>
>>> >>> Also, wanted to know if there are problems when using
>>> >>> DirectFileOutputCommitter?
>>> >>> If writing one partition directly fails in the middle, will Spark
>>> >>> notice this and fail the job (say, after all retries)?
>>> >>>
>>> >>> thanks in advance
>>> >>>
>>> >>> --
>>> >>> View this message in context:
>>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
>>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>>
>>> >>
>>> >
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
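
For anyone following along who can't open the gist Teng linked: a "direct" committer simply skips the _temporary staging directory and lets each task write straight to the final output path, which is why the slow copy/rename step on S3 disappears. Below is a minimal sketch of that idea against the old mapred API; it is only an illustration of the technique discussed in this thread, not the exact gist or Databricks code, and the package/class names are just placeholders.

    package org.apache.spark.mapred

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    // Sketch only: every hook is a no-op because tasks write directly to the
    // final output location, so there is no _temporary directory to rename or
    // copy from at commit time.
    class DirectOutputCommitter extends OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      // Nothing is staged per task, so no task-level commit is needed.
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}
      override def abortTask(taskContext: TaskAttemptContext): Unit = {}
    }

The trade-off is exactly the one raised at the top of this thread: with no staging directory, a failed job or a speculative duplicate task attempt leaves whatever it managed to write sitting in the destination, and there is nothing for the framework to discard.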
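As for wiring it up, Teng's zalando commit is the authority for their exact settings; as a rough sketch (assuming a class like the one above is on the driver and executor classpath), the spark.hadoop.* prefix can be used to pass the Hadoop property through in spark-defaults.conf:

    # spark-defaults.conf (sketch; property names per the old mapred API)
    spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter
    # keep speculation off: duplicate attempts would write to the same final path
    spark.speculation                           false

The Parquet path mentioned in the original question is configured separately (spark.sql.parquet.output.committer.class in the 1.x releases), which is why a direct committer can exist for Parquet output while plain file output still goes through FileOutputCommitter.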