Yes, we deployed Spark on top of YARN. Your suggestion was very helpful: I increased the YARN memory overhead option and it helped in most cases. (Sometimes there are still failures when the amount of data to be shuffled is large, but I guess that if I keep increasing the YARN memory overhead option the problem should be solved, at the expense of consuming more memory.)
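
For anyone finding this thread later, here is a rough sketch of what raising the overhead amounts to. It assumes Spark 1.x on YARN, where the setting is spark.yarn.executor.memoryOverhead (in MB; later releases renamed it spark.executor.memoryOverhead). The helper names, the 8 GB executor size, and the 10% / 384 MB default are illustrative assumptions and not values from our cluster; the default fraction differs between Spark versions.

```python
# Sketch only: estimate the YARN container size requested for one Spark
# executor and build the matching spark-submit flags.
# Assumption: Spark 1.x on YARN, overhead set via
# spark.yarn.executor.memoryOverhead (value in MB).

MIN_OVERHEAD_MB = 384      # assumed default floor
OVERHEAD_FRACTION = 0.10   # assumed default fraction; varies by Spark version


def default_overhead_mb(executor_memory_mb):
    """Overhead Spark would request if the option is left unset (assumed rule)."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))


def container_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN per executor: JVM heap plus off-heap overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def submit_args(executor_memory_mb, overhead_mb):
    """spark-submit flags that set the heap size and an explicit overhead."""
    return [
        "--executor-memory", "%dm" % executor_memory_mb,
        "--conf", "spark.yarn.executor.memoryOverhead=%d" % overhead_mb,
    ]


if __name__ == "__main__":
    heap_mb = 8192  # hypothetical executor heap
    print("default container (MB):", container_mb(heap_mb))
    print("with 2048 MB overhead: ", container_mb(heap_mb, 2048))
    print(" ".join(submit_args(heap_mb, 2048)))
```

YARN kills a container whose total memory use exceeds the requested size, so the overhead is the only headroom for off-heap usage such as shuffle network buffers. That is also why each increase costs memory: the extra megabytes are reserved per executor whether or not they are used.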
Thank you!

On Fri, Jun 26, 2015 at 1:34 PM, Eugen Cepoi <[email protected]> wrote:

> Are you using YARN?
> If yes, increase the YARN memory overhead option. YARN is probably killing
> your executors.
>
> On 26 June 2015 at 20:43, "XianXing Zhang" <[email protected]> wrote:
>
>> Do we have any update on this thread? Has anyone met and solved similar
>> problems before?
>>
>> Any pointers will be greatly appreciated!
>>
>> Best,
>> XianXing
>>
>> On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu <[email protected]> wrote:
>>
>>> Hi Peng,
>>>
>>> I got exactly the same error! My shuffle data is also very large. Have
>>> you figured out a method to solve that?
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng <[email protected]> wrote:
>>>
>>>> I'm deploying a Spark data processing job on an EC2 cluster. The job is
>>>> small for the cluster (16 cores with 120G RAM in total); the largest RDD
>>>> has only 76k+ rows, but it is heavily skewed in the middle (and thus
>>>> requires repartitioning), and each row has around 100k of data after
>>>> serialization. The job always gets stuck in repartitioning. Namely, the
>>>> job constantly gets the following errors and retries:
>>>>
>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...
>>>>
>>>> I've tried to identify the problem, but both memory and disk consumption
>>>> on the machines throwing these errors seem to be below 50%. I've also
>>>> tried different configurations, including:
>>>>
>>>> let driver/executor memory use 60% of total memory.
>>>> let netty prioritize the JVM shuffling buffer.
>>>> increase the shuffling streaming buffer to 128m.
>>>> use KryoSerializer and max out all buffers
>>>> increase shuffling memoryFraction to 0.4
>>>>
>>>> But none of them works. The small job always triggers the same series of
>>>> errors and maxes out retries (up to 1000 times). How can I troubleshoot
>>>> this in such a situation?
>>>>
>>>> Thanks a lot if you have any clue.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/What-are-the-likely-causes-of-org-apache-spark-shuffle-MetadataFetchFailedException-Missing-an-outpu-tp22646.html
