Yes, we deployed Spark on top of YARN.

What you suggested was very helpful: I increased the YARN memory overhead
option and it fixed most cases. (It sometimes still fails when the amount of
data to be shuffled is large, but I expect that increasing the overhead
further will solve that too, at the expense of consuming more memory.)
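
For anyone following the thread, here is a rough sketch of the memory arithmetic behind that trade-off. It assumes the Spark 1.x property name spark.yarn.executor.memoryOverhead (newer releases call it spark.executor.memoryOverhead) and an assumed default of max(384 MB, 10% of executor memory); both vary by Spark version, so please check the docs for yours. The helper names and the 8 GB heap are hypothetical.

```python
# Sketch: estimate what YARN is asked for per executor, and build the
# spark-submit flag that raises the overhead. Defaults below are assumptions.

def default_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Overhead requested when the option is unset (assumed defaults)."""
    return max(minimum_mb, int(executor_memory_mb * factor))


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total memory requested from YARN per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def overhead_conf_flag(overhead_mb, key="spark.yarn.executor.memoryOverhead"):
    """spark-submit arguments that set the overhead explicitly (value in MB)."""
    return ["--conf", f"{key}={overhead_mb}"]


if __name__ == "__main__":
    heap_mb = 8 * 1024  # hypothetical 8 GB executor heap
    print(container_request_mb(heap_mb))        # heap + assumed default overhead
    print(container_request_mb(heap_mb, 2048))  # heap + explicit 2 GB overhead
    print(" ".join(overhead_conf_flag(2048)))
```

The point is that YARN enforces the container limit on heap plus overhead, so off-heap use during a large shuffle can get an executor killed even when the JVM heap looks fine. Raising the overhead grows every container by the same amount.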

Thank you!

On Fri, Jun 26, 2015 at 1:34 PM, Eugen Cepoi <[email protected]> wrote:

> Are you using YARN?
> If so, increase the YARN memory overhead option. YARN is probably killing
> your executors.
> On 26 June 2015 at 20:43, "XianXing Zhang" <[email protected]> wrote:
>
>> Do we have any update on this thread? Has anyone met and solved similar
>> problems before?
>>
>> Any pointers will be greatly appreciated!
>>
>> Best,
>> XianXing
>>
>> On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu <[email protected]> wrote:
>>
>>> Hi Peng,
>>>
>>> I got exactly the same error! My shuffle data is also very large. Have you
>>> figured out a way to solve it?
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng <[email protected]> wrote:
>>>
>>>> I'm deploying a Spark data processing job on an EC2 cluster. The job is
>>>> small for the cluster (16 cores with 120G RAM in total): the largest RDD
>>>> has only 76k+ rows, but it is heavily skewed in the middle (and thus
>>>> requires repartitioning), and each row has around 100k of data after
>>>> serialization. The job always gets stuck in repartitioning. Namely, the
>>>> job constantly hits the following errors and retries:
>>>>
>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>>> location for shuffle
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException: Error in opening
>>>> FileSegmentManagedBuffer
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException:
>>>> java.io.FileNotFoundException: /tmp/spark-...
>>>>
>>>> I've tried to identify the problem, but both memory and disk consumption
>>>> on the machines throwing these errors seem to be below 50%. I've also
>>>> tried different configurations, including:
>>>>
>>>> letting driver/executor memory use 60% of total memory
>>>> letting Netty prioritize the JVM shuffling buffer
>>>> increasing the shuffling streaming buffer to 128m
>>>> using KryoSerializer and maxing out all buffers
>>>> increasing the shuffling memoryFraction to 0.4
>>>>
>>>> But none of them worked. The small job always triggers the same series of
>>>> errors and maxes out retries (up to 1000 times). How can I troubleshoot
>>>> this in such a situation?
>>>>
>>>> Thanks a lot if you have any clue.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/What-are-the-likely-causes-of-org-apache-spark-shuffle-MetadataFetchFailedException-Missing-an-outpu-tp22646.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
