yes, should be this one https://gist.github.com/aarondav/c513916e72101bbe14ec
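for reference, the idea in that gist boils down to roughly this (a from-memory sketch, not a verbatim copy of the gist):

import org.apache.hadoop.mapred._

// "Direct" committer: tasks write straight to the final output path,
// so there is no _temporary directory and nothing to move at commit time.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}

  // No per-task commit step: the output already sits in its final location.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}

  // Nothing to clean up either; the flip side is that a failed or retried
  // task cannot roll back data it has already written to S3.
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}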
then you need to set it in spark-defaults.conf: https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13 (a sketch of that setting is at the end of this mail)

On Friday, 26 February 2016, Yin Yang wrote:
> The header of DirectOutputCommitter.scala says Databricks.
> Did you get it from Databricks?
>
> On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
>>
>> I am interested in this topic as well. Why is the DirectFileOutputCommitter not included?
>> We added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>
>> Moreover, this DirectFileOutputCommitter does not work for insert operations in HiveContext, since the committer is called by Hive (i.e. it uses the dependencies in the hive package).
>> We made a hack to fix this; you can take a look:
>> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>
>> It may give other Spark contributors some ideas for finding a better way to use S3.
>>
>> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>>>
>>> Hi,
>>> I wanted to understand whether anybody uses DirectFileOutputCommitter or the like, especially when working with S3.
>>> I know that there is one implementation in the Spark distribution for the Parquet format, but not for plain files - why?
>>>
>>> IMHO, it can bring a huge performance boost. Using the default FileOutputCommitter with S3 has a big overhead at the commit stage, when all parts are copied one by one from _temporary to the destination dir, which becomes the bottleneck when the number of partitions is high.
>>>
>>> Also, I wanted to know whether there are any problems when using DirectFileOutputCommitter. If writing one partition directly fails in the middle, will Spark notice this and fail the job (say, after all retries)?
>>>
>>> Thanks in advance
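as promised above, here is roughly what the spark-defaults.conf change looks like (assuming the committer is compiled into the Spark build under org.apache.spark.mapred.DirectOutputCommitter, as in our fork):

# spark-defaults.conf
# Route the old mapred API to the direct committer; "spark.hadoop.*"
# properties are passed through into the Hadoop configuration.
spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter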