yes, should be this one https://gist.github.com/aarondav/c513916e72101bbe14ec
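for reference, the idea in that gist boils down to roughly this (a from-memory sketch, not a verbatim copy of the gist):

import org.apache.hadoop.mapred._

// "Direct" committer: tasks write straight to the final output path,
// so there is no _temporary directory and nothing to move at commit time.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}

  // No per-task commit step: the output already sits in its final location.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}

  // Nothing to clean up either; the flip side is that a failed or retried
  // task cannot roll back data it has already written to S3.
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}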
then you need to set it in spark-defaults.conf: https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13 (a sketch of that setting is at the end of this mail)

On Friday, 26 February 2016, Yin Yang wrote:
> The header of DirectOutputCommitter.scala says Databricks.
> Did you get it from Databricks?
>
> On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
>>
>> I am interested in this topic as well. Why is the DirectFileOutputCommitter not included?
>> We added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>
>> Moreover, this DirectFileOutputCommitter does not work for insert operations in HiveContext, since the committer is called by Hive (i.e. it uses the dependencies in the hive package).
>> We made a hack to fix this; you can take a look:
>> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>
>> It may give other Spark contributors some ideas for finding a better way to use S3.
>>
>> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>>>
>>> Hi,
>>> I wanted to understand whether anybody uses DirectFileOutputCommitter or the like, especially when working with S3.
>>> I know that there is one implementation in the Spark distribution for the Parquet format, but not for plain files - why?
>>>
>>> IMHO, it can bring a huge performance boost. Using the default FileOutputCommitter with S3 has a big overhead at the commit stage, when all parts are copied one by one from _temporary to the destination dir, which becomes the bottleneck when the number of partitions is high.
>>>
>>> Also, I wanted to know whether there are any problems when using DirectFileOutputCommitter. If writing one partition directly fails in the middle, will Spark notice this and fail the job (say, after all retries)?
>>>
>>> Thanks in advance
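as promised above, here is roughly what the spark-defaults.conf change looks like (assuming the committer is compiled into the Spark build under org.apache.spark.mapred.DirectOutputCommitter, as in our fork):

# spark-defaults.conf
# Route the old mapred API to the direct committer; "spark.hadoop.*"
# properties are passed through into the Hadoop configuration.
spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter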