The header of DirectOutputCommitter.scala says Databricks. Did you get it from Databricks?
On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:

> I'm interested in this topic as well. Why is DirectFileOutputCommitter not
> included?
>
> We added it in our fork, under
> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>
> Moreover, this DirectFileOutputCommitter does not work for insert
> operations in HiveContext, since the committer is called by Hive (that is,
> it uses dependencies in the hive package).
>
> We made some hacks to fix this; you can take a look:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> It may give other Spark contributors some ideas for finding a better way to
> use S3.
>
> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>
>> Hi,
>> I wanted to understand whether anybody uses DirectFileOutputCommitter or
>> the like, especially when working with S3.
>> I know there is one implementation in the Spark distro for the Parquet
>> format, but not for files. Why?
>>
>> IMHO, it can bring a huge performance boost.
>> Using the default FileOutputCommitter with S3 has a big overhead at the
>> commit stage, when all parts are copied one by one from _temporary to the
>> destination dir, which is a bottleneck when the number of partitions is
>> high.
>>
>> Also, I wanted to know whether there are any problems with using
>> DirectFileOutputCommitter.
>> If writing one partition directly fails midway, will Spark notice this
>> and fail the job (say, after all retries)?
>>
>> Thanks in advance
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html