I think I emailed about a similar issue, but in standalone mode. I haven't investigated much, so I don't know what a good fix would be.
On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou <dearji...@gmail.com> wrote:

> Hi,
>
> I am having this FetchFailed issue when the driver is about to collect
> about 2.5M lines of short strings (about 10 characters each line) from a
> YARN cluster with 400 nodes:
>
> 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 0.0 (TID 1228, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com, 37899, 0), shuffleId=0, mapId=420, reduceId=205)
> 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 603.0 in stage 0.0 (TID 1626, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com, 37899, 0), shuffleId=0, mapId=420, reduceId=603)
>
> Other than this FetchFailed, I am not able to see anything else in the
> log file (no OOM errors shown).
>
> This does not happen when there are only 2M lines. I guess it might be
> because of the akka message size, so I used the following:
>
> spark.akka.frameSize 100
> spark.akka.timeout 200
>
> That does not help either. Has anyone experienced similar problems?
>
> Thanks,
> Jiayu
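
For a rough sense of scale, here is a back-of-the-envelope sketch in plain Python (no Spark needed) that estimates the raw size of the collected strings from the numbers quoted above. The helper name `estimate_collect_mb` and the one-byte-per-character figure are my own assumptions for illustration. It ignores JVM object and serialization overhead, so it only gives a lower bound and says nothing definite about the cause of the FetchFailed.

```python
# Back-of-the-envelope estimate only. The line counts, line length, and frame
# size are taken from the quoted message; bytes_per_char=1 is an assumption
# (plain ASCII text) and real serialized sizes will be larger.

def estimate_collect_mb(num_lines, avg_chars_per_line, bytes_per_char=1):
    """Raw payload size in MB (1 MB = 1024 * 1024 bytes), no overhead included."""
    return num_lines * avg_chars_per_line * bytes_per_char / (1024.0 * 1024.0)

FRAME_SIZE_MB = 100  # the spark.akka.frameSize value tried in the message

for lines in (2000000, 2500000):
    size_mb = estimate_collect_mb(lines, 10)
    print("%d lines: ~%.1f MB raw (frame size set to %d MB)"
          % (lines, size_mb, FRAME_SIZE_MB))
```

Under these assumptions both the 2M-line and 2.5M-line cases come out well below the 100 MB frame size, which would be consistent with raising spark.akka.frameSize not making a difference, though overhead could change the picture.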