Re: DirectFileOutputCommiter

Takeshi Yamamuro Mon, 29 Feb 2016 00:07:56 -0800

Hi,

I think the essential culprit is that these committers are not idempotent, so
retry attempts will fail.
See the code below for details:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L130
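
To make the idempotency point concrete, here is a minimal sketch in plain Python.
It is a toy simulation on the local filesystem, not Spark's or Hadoop's actual
committer code. All names in it (direct_attempt, staged_attempt, run_with_retry,
the part-00000 file name, the use of exclusive-create mode) are assumptions made
for illustration. It only shows why a retry behaves differently when attempts
write straight to the final path versus to a per-attempt temporary directory.

```python
import os
import tempfile


class TaskFailed(Exception):
    """Simulated mid-write task failure."""


def write_records(path, records, fail_after=None):
    # Mode "x" refuses to overwrite an existing file, mimicking a writer that
    # expects to be the only one ever creating its output path (assumption).
    with open(path, "x") as f:
        for i, rec in enumerate(records):
            if fail_after is not None and i >= fail_after:
                raise TaskFailed(f"crashed after {i} records")
            f.write(rec + "\n")


def direct_attempt(out_dir, attempt, records, fail_after=None):
    # Direct commit: every attempt writes straight to the final path.
    write_records(os.path.join(out_dir, "part-00000"), records, fail_after)


def staged_attempt(out_dir, attempt, records, fail_after=None):
    # Staged commit: each attempt writes into its own temporary directory and
    # only a fully written file is moved into the destination.
    tmp_dir = os.path.join(out_dir, "_temporary", f"attempt_{attempt}")
    os.makedirs(tmp_dir)
    tmp_path = os.path.join(tmp_dir, "part-00000")
    write_records(tmp_path, records, fail_after)
    os.replace(tmp_path, os.path.join(out_dir, "part-00000"))


def run_with_retry(attempt_fn, out_dir, records):
    """Run attempt 0 (crashes after one record), then one clean retry."""
    outcomes = []
    for attempt, fail_after in enumerate([1, None]):
        try:
            attempt_fn(out_dir, attempt, records, fail_after)
            outcomes.append("ok")
        except (TaskFailed, FileExistsError) as e:
            outcomes.append(type(e).__name__)
    return outcomes


def read_output(out_dir):
    with open(os.path.join(out_dir, "part-00000")) as f:
        return f.read().splitlines()


records = ["a", "b", "c"]
with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    direct_outcomes = run_with_retry(direct_attempt, d1, records)
    direct_output = read_output(d1)
    staged_outcomes = run_with_retry(staged_attempt, d2, records)
    staged_output = read_output(d2)

print(direct_outcomes, direct_output)
print(staged_outcomes, staged_output)
```

In the direct case the retry hits the partial file left by the failed attempt and
the destination ends up holding only the first record. In the staged case the
failed attempt leaves nothing in the destination, so the retry succeeds and the
output is complete.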


On Sat, Feb 27, 2016 at 7:38 PM, Igor Berman <[email protected]> wrote:

> Hi Reynold,
> thanks for the response
> Yes, speculation mode needs some coordination.
> Regarding job failure:
> correct me if I'm wrong - if one of the jobs fails, the client code will be
> "notified" by an exception or something similar, so the client can decide to
> re-submit the action (job), i.e. it won't be a "silent" failure.
>
>
> On 26 February 2016 at 11:50, Reynold Xin <[email protected]> wrote:
>
>> It could lose data in speculation mode, or if any job fails.
>>
>> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <[email protected]>
>> wrote:
>>
>>> Takeshi, do you know the reason why they wanted to remove this committer
>>> in SPARK-10063?
>>> The JIRA has no info inside.
>>> As far as I understand, the direct committer can't be used when either of
>>> these two is true:
>>> 1. speculation mode
>>> 2. append mode (i.e. not creating a new version of the data but appending to
>>> existing data)
>>>
>>> On 26 February 2016 at 08:24, Takeshi Yamamuro <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Great work!
>>>> What is the concrete performance gain of the committer on s3?
>>>> I'd like to know.
>>>>
>>>> I think there is no direct committer for files because these kinds of
>>>> committers risk losing data (see SPARK-10063).
>>>> Until this is resolved, ISTM files cannot support direct commits.
>>>>
>>>> thanks,
>>>>
>>>>
>>>>
>>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <[email protected]> wrote:
>>>>
>>>>> yes, should be this one
>>>>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>>>>
>>>>> then you need to set it in spark-defaults.conf:
>>>>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>>>>
>>>>> On Friday, 26 February 2016, Yin Yang wrote:
>>>>> > The header of DirectOutputCommitter.scala says Databricks.
>>>>> > Did you get it from Databricks ?
>>>>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <[email protected]> wrote:
>>>>> >>
>>>>> >> I'm interested in this topic as well. Why is the
>>>>> >> DirectFileOutputCommitter not included?
>>>>> >> We added it in our fork, under
>>>>> >> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>>>>> >> Moreover, this DirectFileOutputCommitter does not work for
>>>>> >> insert operations in HiveContext, since the committer is called by Hive
>>>>> >> (meaning it uses dependencies in the hive package).
>>>>> >> We made a hack to fix this; you can take a look:
>>>>> >>
>>>>> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>>>>> >>
>>>>> >> It may give other Spark contributors some ideas for finding a better
>>>>> >> way to use S3.
>>>>> >>
>>>>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <[email protected]>:
>>>>> >>>
>>>>> >>> Hi,
>>>>> >>> I wanted to find out whether anybody uses DirectFileOutputCommitter or
>>>>> >>> the like, especially when working with S3?
>>>>> >>> I know that there is one implementation in the Spark distro for the
>>>>> >>> Parquet format, but not for files - why?
>>>>> >>>
>>>>> >>> IMHO, it can bring a huge performance boost.
>>>>> >>> Using the default FileOutputCommitter with S3 has a big overhead at the
>>>>> >>> commit stage, when all parts are copied one by one from _temporary to
>>>>> >>> the destination dir, which is a bottleneck when the number of
>>>>> >>> partitions is high.
>>>>> >>>
>>>>> >>> Also, I wanted to know whether there are any problems with using
>>>>> >>> DirectFileOutputCommitter.
>>>>> >>> If writing one partition directly fails in the middle, will Spark
>>>>> >>> notice this and fail the job (say, after all retries)?
>>>>> >>>
>>>>> >>> thanks in advance
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>


-- 
---
Takeshi Yamamuro
