I am running the job on 500 executors, each with 8G and 1 core. I see lots of fetch failures on the reduce stage when running a simple reduceByKey.
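For concreteness, a minimal sketch of what the job looks like (the input path and key extraction are placeholders; the partition counts match the stage sizes noted below):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-repro"))

    // ~4000 map tasks read the input, as noted below.
    val pairs = sc.textFile("/path/to/input", 4000)
      .map(line => (line.split("\t")(0), 1L))

    // 200 reduce tasks; each reducer fetches shuffle blocks from all
    // 4000 map outputs, which is where the fetch failures show up.
    val counts = pairs.reduceByKey(_ + _, 200)
    counts.count()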
map tasks -> 4000
reduce tasks -> 200

On Mon, Sep 22, 2014 at 12:22 PM, Chen Song <chen.song...@gmail.com> wrote:
> I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
> following exception.
>
> java.io.IOException: sendMessageReliably failed because ack was not
> received within 60 sec
>         at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
>         at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
>
> I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
> The situation is relieved, but not by much. I am fairly confident it is not
> due to GC on the executors. What could be the reason for this?
>
> Chen
>
> --
> Chen Song
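For reference, raising the ack timeout mentioned above is done through SparkConf (a sketch; whether 120 seconds is enough depends on the cluster, and raising it only masks slow shuffle fetches rather than removing the cause):

    import org.apache.spark.SparkConf

    // spark.core.connection.ack.wait.timeout is in seconds; the exception
    // above shows the Spark 1.1.0 default of 60.
    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "120")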