I am running the job on 500 executors, each with 8 GB of memory and 1 core.

I see lots of fetch failures in the reduce stage when running a simple
reduceByKey:

map tasks -> 4000
reduce tasks -> 200
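
For concreteness, the job is essentially of the following shape. This is
only a sketch: the app name, input/output paths, and key extraction are
hypothetical stand-ins, and the explicit partition count of 200 matches
the reduce task count above.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical app name and input path, for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("reduce-test"))
    val pairs = sc.textFile("hdfs:///tmp/input").map(line => (line, 1))
    // An explicit numPartitions of 200 gives the 200 reduce tasks above;
    // each reduce task fetches shuffle blocks from all ~4000 map outputs.
    val counts = pairs.reduceByKey(_ + _, 200)
    counts.saveAsTextFile("hdfs:///tmp/output")  // hypothetical output path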



On Mon, Sep 22, 2014 at 12:22 PM, Chen Song <chen.song...@gmail.com> wrote:

> I am using Spark 1.1.0 and have been seeing a lot of fetch failures caused
> by the following exception:
>
> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>         at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
>         at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
>
> I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
> That relieved the problem somewhat, but not by much. I am fairly confident
> it is not caused by GC pauses on the executors. What could be the reason
> for this?
>
> Chen
>
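
For reference, the timeout in the quoted trace can also be set per job
rather than cluster-wide. A minimal sketch, assuming the job builds its
own SparkConf; the 300-second value below is only an illustrative guess,
not a verified fix:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.core.connection.ack.wait.timeout (in seconds) is how long the
    // ConnectionManager waits for an ack before failing the fetch; the
    // 1.1.0 default is 60, and 120 has already been tried above.
    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "300")
    val sc = new SparkContext(conf)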



-- 
Chen Song
