The header of DirectOutputCommitter.scala says Databricks.

Did you get it from Databricks?

On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:

> I'm interested in this topic as well. Why is the DirectFileOutputCommitter
> not included?
>
> We added it in our fork, under
> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>
> Moreover, this DirectFileOutputCommitter does not work for insert
> operations in HiveContext, since the committer is called by Hive (meaning
> it uses dependencies in the hive package).
>
> We made some hacks to fix this; you can take a look:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> It may give other Spark contributors some ideas for finding a better way
> to use S3.
>
>
> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>
>> Hi,
>> I wanted to understand whether anybody uses DirectFileOutputCommitter or
>> the like, especially when working with S3.
>> I know that there is one implementation in the Spark distro for the
>> Parquet format, but not for files. Why?
>>
>> IMHO, it can bring a huge performance boost.
>> Using the default FileOutputCommitter with S3 has a big overhead at the
>> commit stage, when all parts are copied one by one from _temporary to the
>> destination dir, which is a bottleneck when the number of partitions is high.
>>
>> Also, I wanted to know whether there are any problems with using
>> DirectFileOutputCommitter.
>> If writing one partition directly fails in the middle, will Spark
>> notice this and fail the job (say, after all retries)?
>>
>> Thanks in advance.
>
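
For anyone following along, here is a rough sketch of the trade-off discussed above. It is only a local-filesystem simulation, not the Hadoop or Spark committer code: the function names (write_with_rename_commit, write_direct, demo) are made up for illustration. On S3 a rename is carried out as a copy plus a delete, so each per-file move counted below stands in for a full copy of that part.

```python
import os
import shutil
import tempfile


def write_with_rename_commit(dest, parts):
    """Write each part under dest/_temporary, then move it into dest at commit.

    Returns the number of per-file moves performed at commit time.
    """
    tmp = os.path.join(dest, "_temporary")
    os.makedirs(tmp)  # also creates dest
    for name, data in parts.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(data)

    # Commit stage: one move per part, done one by one.
    moves = 0
    for name in sorted(os.listdir(tmp)):
        shutil.move(os.path.join(tmp, name), os.path.join(dest, name))
        moves += 1
    os.rmdir(tmp)
    return moves


def write_direct(dest, parts):
    """Write each part straight into dest, so commit has nothing to move.

    The flip side: a task that dies mid-write can leave a partial file
    visible in the output directory.
    """
    os.makedirs(dest, exist_ok=True)
    for name, data in parts.items():
        with open(os.path.join(dest, name), "w") as f:
            f.write(data)
    return 0


def demo(num_parts=4):
    """Run both strategies on the same parts and report what happened."""
    parts = {f"part-{i:05d}": f"rows for partition {i}\n" for i in range(num_parts)}
    root = tempfile.mkdtemp()
    try:
        rename_dir = os.path.join(root, "rename")
        direct_dir = os.path.join(root, "direct")
        rename_moves = write_with_rename_commit(rename_dir, parts)
        direct_moves = write_direct(direct_dir, parts)
        return (rename_moves, direct_moves,
                sorted(os.listdir(rename_dir)), sorted(os.listdir(direct_dir)))
    finally:
        shutil.rmtree(root)
```

Both strategies end with the same files in the destination; the difference is that the commit-time work of the first grows with the number of partitions, while the second does no commit-time work but gives up the isolation that _temporary provides.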
