Hi Chen,

The fetch failures seem to be happening much more often to people on 1.1.0 --
there's a bug tracking fetch failures at
https://issues.apache.org/jira/browse/SPARK-3633 that may be the same issue
you're seeing.  Can you take a look at that ticket and, if it matches what
you're observing, follow it and vote for it?

There currently seem to be 3 things causing FetchFailures in 1.1:

1) long GC pauses on an executor (longer than
spark.core.connection.ack.wait.timeout, which defaults to 60 seconds) -- see
the config sketch after this list
2) too many open files (hitting the kernel limit reported by ulimit -n)
3) some as-yet-undetermined issue being tracked on that ticket
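
For (1), here is a minimal sketch of raising the ack timeout when building the
SparkConf -- the property name is the real one from your logs, but the
120-second value is only an illustration and should be tuned to your actual GC
pauses (the app name below is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow executors more time to acknowledge block fetches before the
    // connection manager declares the fetch failed (default is 60 seconds).
    // 120 is an example value, not a recommendation.
    val conf = new SparkConf()
      .setAppName("fetch-failure-debug")  // placeholder app name
      .set("spark.core.connection.ack.wait.timeout", "120")
    val sc = new SparkContext(conf)

The same property should also be settable at launch time with
spark-submit --conf spark.core.connection.ack.wait.timeout=120. For (2), check
ulimit -n on the worker machines and raise it if it's still at the
distribution default (often 1024).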

Hope that helps!
Andrew

On Tue, Sep 23, 2014 at 11:14 AM, Chen Song <chen.song...@gmail.com> wrote:

> I am running the job on 500 executors, each with 8G and 1 core.
>
> I see lots of fetch failures in the reduce stage when running a simple
> reduceByKey:
>
> map tasks -> 4000
> reduce tasks -> 200
>
>
>
> On Mon, Sep 22, 2014 at 12:22 PM, Chen Song <chen.song...@gmail.com>
> wrote:
>
>> I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
>> following exception.
>>
>> java.io.IOException: sendMessageReliably failed because ack was not
>> received within 60 sec
>>         at
>> org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
>>         at
>> org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
>>         at scala.Option.foreach(Option.scala:236)
>>         at
>> org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
>>         at java.util.TimerThread.mainLoop(Timer.java:555)
>>         at java.util.TimerThread.run(Timer.java:505)
>>
>> I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
>> The situation improved somewhat, but not by much. I am pretty confident it
>> was not caused by GC on the executors. What could be the reason for this?
>>
>> Chen
>>
>
>
>
> --
> Chen Song
>
>
