Hi Chen,

The fetch failures seem to be happening a lot more to people on 1.1.0. There's a bug tracking fetch failures at https://issues.apache.org/jira/browse/SPARK-3633 that might be the same as what you're seeing. Can you take a peek at that bug and, if it matches what you're observing, follow it and vote for it?
There currently seem to be three things causing FetchFailures in 1.1:

1) long GCs on an executor (longer than spark.core.connection.ack.wait.timeout, default 60 sec)
2) too many files open (hitting the kernel limit shown by ulimit -n)
3) some undetermined issue being tracked on that ticket

For 1) and 2), a rough sketch of the relevant settings is at the bottom of this message.

Hope that helps!

Andrew

On Tue, Sep 23, 2014 at 11:14 AM, Chen Song <chen.song...@gmail.com> wrote:

> I am running the job on 500 executors, each with 8G and 1 core.
>
> See lots of fetch failures on the reduce stage, when running a simple
> reduceByKey
>
> map tasks -> 4000
> reduce tasks -> 200
>
> On Mon, Sep 22, 2014 at 12:22 PM, Chen Song <chen.song...@gmail.com>
> wrote:
>
>> I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
>> following exception.
>>
>> java.io.IOException: sendMessageReliably failed because ack was not
>> received within 60 sec
>>     at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
>>     at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
>>     at scala.Option.foreach(Option.scala:236)
>>     at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
>>     at java.util.TimerThread.mainLoop(Timer.java:555)
>>     at java.util.TimerThread.run(Timer.java:505)
>>
>> I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
>> The situation is relieved, but not by much. I am pretty confident it was
>> not due to GC on executors. What could be the reason for this?
>>
>> Chen
>
> --
> Chen Song
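P.S. In case it's useful, here is a minimal sketch of how causes 1) and 2) could be worked around while the root cause on that ticket is investigated. It assumes you build the SparkConf yourself before creating the SparkContext; the app name and the 120-second value are only illustrative, not recommendations.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduceByKey-job")  // hypothetical app name
  // cause 1): allow more than the default 60 sec for an executor to ack a fetch
  .set("spark.core.connection.ack.wait.timeout", "120")
  // log GC activity on executors so long pauses show up in the executor stderr,
  // which helps confirm or rule out cause 1)
  .set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

val sc = new SparkContext(conf)

// cause 2) has to be fixed at the OS level on each worker, e.g. by raising the
// open-file limit (ulimit -n) for the user running the executors; there is no
// Spark property for it.

If the failures persist after both changes, that likely points at cause 3) and the JIRA above.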