Re: DirectFileOutputCommiter

Alexander Pivovarov Fri, 26 Feb 2016 12:28:51 -0800

The DirectOutputCommitter doc says:
The FileOutputCommitter is required for HDFS + speculation, which allows
only one writer at a time for a file (so two people racing to write the
same file would not work). However, S3 supports multiple writers
outputting to the same file, where visibility is guaranteed to be atomic.
This is a monotonic operation: all writers should be writing the same
data, so which one wins is immaterial.


The AWS implementation is better because it uses DirectFileOutputCommitter
only for s3n:// files:
https://gist.github.com/apivovarov/bb215f08318318570567

But for some reason it does not work for me.
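
A minimal sketch of the scheme-based selection idea described above, assuming
the choice depends only on the URI scheme of the output path. It does not use
the Hadoop or Spark APIs; the constants and the choose_committer function are
hypothetical names for illustration, and the set of "direct" schemes is an
assumption taken from the s3n:// remark.

```python
from urllib.parse import urlparse

# Hypothetical labels; the real committer classes live in Hadoop/Spark
# and are not loaded here.
DIRECT = "DirectFileOutputCommitter"
RENAME_BASED = "FileOutputCommitter"

# Assumption: only s3n:// outputs get the direct committer.
DIRECT_SCHEMES = {"s3n"}


def choose_committer(output_path: str) -> str:
    """Return the committer label for an output path, based on its URI scheme."""
    scheme = urlparse(output_path).scheme.lower()
    return DIRECT if scheme in DIRECT_SCHEMES else RENAME_BASED


if __name__ == "__main__":
    for path in ("s3n://bucket/out", "hdfs://namenode:8020/out", "/tmp/out"):
        print(path, "->", choose_committer(path))
```

Keeping the rename-based committer as the default for every other scheme
matches the quoted doc: HDFS with speculation still needs it.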

On Fri, Feb 26, 2016 at 11:50 AM, Reynold Xin <[email protected]> wrote:

> It could lose data in speculation mode, or if any job fails.
>
> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <[email protected]>
> wrote:
>
>> Takeshi, do you know why they wanted to remove this committer in
>> SPARK-10063?
>> The JIRA has no details.
>> As far as I understand, the direct committer can't be used when either
>> of these is true:
>> 1. speculation mode
>> 2. append mode (i.e. appending to existing data rather than creating a
>> new version of the data)
>>
>> On 26 February 2016 at 08:24, Takeshi Yamamuro <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Great work!
>>> What is the concrete performance gain of the committer on s3?
>>> I'd like to know.
>>>
>>> I think there is no direct committer for files because these kinds of
>>> committers risk losing data (see SPARK-10063).
>>> Until this is resolved, it seems to me that files cannot support
>>> direct commits.
>>>
>>> thanks,
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <[email protected]> wrote:
>>>
>>>> Yes, it should be this one:
>>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>>
>>>> Then you need to set it in spark-defaults.conf:
>>>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>>>
>>>> On Friday, 26 February 2016, Yin Yang wrote:
>>>> > The header of DirectOutputCommitter.scala says Databricks.
>>>> > Did you get it from Databricks?
>>>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <[email protected]> wrote:
>>>> >>
>>>> >> Interested in this topic as well. Why is the
>>>> >> DirectFileOutputCommitter not included?
>>>> >> We added it in our fork, under
>>>> >> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>>> >> Moreover, this DirectFileOutputCommitter does not work for insert
>>>> >> operations in HiveContext, since the committer is called by Hive
>>>> >> (meaning it uses dependencies in the hive package).
>>>> >> We made a hack to fix this; you can take a look:
>>>> >>
>>>> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>>> >>
>>>> >> It may give other Spark contributors some ideas for a better way to
>>>> >> use S3.
>>>> >>
>>>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <[email protected]>:
>>>> >>>
>>>> >>> Hi,
>>>> >>> I wanted to understand whether anybody uses
>>>> >>> DirectFileOutputCommitter or the like, especially when working
>>>> >>> with S3.
>>>> >>> I know there is one implementation in the Spark distribution for
>>>> >>> the Parquet format, but not for files. Why?
>>>> >>>
>>>> >>> IMHO, it can bring a huge performance boost.
>>>> >>> Using the default FileOutputCommitter with S3 has a big overhead
>>>> >>> at the commit stage, when all parts are copied one by one from
>>>> >>> _temporary to the destination dir, which is a bottleneck when the
>>>> >>> number of partitions is high.
>>>> >>>
>>>> >>> Also, I wanted to know whether there are any problems with using
>>>> >>> DirectFileOutputCommitter.
>>>> >>> If writing one partition directly fails in the middle, will Spark
>>>> >>> notice this and fail the job (say, after all retries)?
>>>> >>>
>>>> >>> thanks in advance
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
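
The trade-off discussed in this thread (no copy step out of _temporary versus
leftover output when a job fails) can be illustrated with a toy in-memory
model. This is only a sketch under stated assumptions: plain dicts stand in
for the destination directory and the _temporary area, each part becomes
visible atomically once its task finishes, and retries and speculation are
not modelled. All names are hypothetical; nothing here is the Hadoop or
Spark API.

```python
class TaskFailed(Exception):
    """Raised when a simulated task attempt dies."""


def run_task(rows, fail=False):
    """Return the finished part, or raise if the simulated attempt dies."""
    if fail:
        raise TaskFailed()
    return list(rows)


def rename_based_job(tasks):
    """Write parts to a temporary area; publish them only if every task succeeds."""
    temporary, final = {}, {}
    try:
        for part, rows, fail in tasks:
            temporary[part] = run_task(rows, fail)
    except TaskFailed:
        return final  # job failed: nothing was published
    final.update(temporary)  # stands in for the copy out of _temporary
    return final


def direct_job(tasks):
    """Write each finished part straight into the destination."""
    final = {}
    try:
        for part, rows, fail in tasks:
            final[part] = run_task(rows, fail)
    except TaskFailed:
        pass  # job failed, but earlier parts are already visible
    return final


# Second task fails: the two strategies leave different destination states.
tasks = [("part-00000", [1, 2, 3], False), ("part-00001", [4, 5, 6], True)]

if __name__ == "__main__":
    print("rename-based:", rename_based_job(tasks))
    print("direct:      ", direct_job(tasks))
```

In this model the rename-based job leaves the destination empty after the
failure, while the direct job leaves part-00000 behind. In append mode that
leftover part would sit next to existing data, which is one way to read the
concerns raised above.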
