Hi Tomer,

Are you able to look in your NodeManager logs to see whether the NodeManagers are killing any executors for exceeding memory limits? When YARN kills an executor, its shuffle output is lost, which produces exactly the "Failed to connect" and "Missing an output location for shuffle" errors you're seeing. If you observe this, you can solve the problem by bumping up spark.yarn.executor.memoryOverhead.
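If you do see kills, the telltale NodeManager line reads something like "is running beyond physical memory limits. Killing container" (wording from memory). The overhead can then be raised at submit time; a minimal sketch, where the class and jar names are placeholders and 1024 MB is just an illustrative starting point to tune for your executors:

    spark-submit --master yarn-cluster \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      --class com.example.HeavyJob heavy-job.jar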
-Sandy

On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini <[email protected]> wrote:

> Hi all,
>
> I'm running Spark 1.2.0 on a 20-node YARN EMR cluster. I've noticed that
> whenever I run a heavy computation job in parallel with other running
> jobs, I get these kinds of exceptions:
>
> * [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager -
>   Lost task 820.0 in stage 175.0 (TID 11327) on executor xxxxxxx:
>   java.io.IOException (Failed to connect to xxxxxxxxxx:35194) [duplicate 12]
>
> * org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>   location for shuffle 12
>
> * org.apache.spark.shuffle.FetchFailedException: Failed to connect to
>   xxxxxxxxxxxxxxxxx:35194
>   Caused by: java.io.IOException: Failed to connect
>   to xxxxxxxxxxxxxxxxx:35194
>
> When I run the heavy job alone on the cluster, I don't get any errors.
> My guess is that Spark contexts from different apps do not share
> information about taken ports and therefore collide on specific ports,
> causing the job/stage to fail. Is there a way to assign a specific set of
> executors to a specific Spark job via "spark-submit", or is there a way
> to define a range of ports to be used by the application?
>
> Thanks!
> Tomer
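P.S. On the port question: as far as I know there's no way to hand Spark an explicit port range in 1.2, and the ephemeral ports in your errors (like 35194) are picked at random per executor, so a killed executor is a more likely culprit than a collision. That said, if I recall the 1.x config names correctly, the fixed-port services can be pinned, and spark.port.maxRetries controls how many successive ports Spark tries when a configured port is taken. A sketch, with illustrative port numbers and the same placeholder class/jar names as above:

    spark-submit \
      --conf spark.blockManager.port=40000 \
      --conf spark.executor.port=40010 \
      --conf spark.port.maxRetries=32 \
      --class com.example.HeavyJob heavy-job.jar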
