*Hi all,*
*My job is CPU-intensive, and its resource configuration is 400 workers
x 1 core x 3G. There are many fetch failures, like:*
14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave1:33500)
14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure
14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission
14-08-23 08:34:53 INFO [spark-akka.actor.default-dispatcher-71]
DAGScheduler: Resubmitting failed stages
14-08-23 08:35:06 WARN [Result resolver thread-2] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave2:34792)
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: Executor lost: 118 (epoch 3)
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-38]
BlockManagerMasterActor: Trying to remove executor 118 from
BlockManagerMaster.
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
BlockManagerMaster: Removed 118 successfully in removeExecutor
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-43]
DAGScheduler: Resubmitting failed stages
*Stage 4 is marked for resubmission. After a while, block manager
slave1:33500 is registered again:*
14-08-23 08:36:16 INFO [spark-akka.actor.default-dispatcher-58]
BlockManagerInfo: Registering block manager slave1:33500 with 1766.4
MB RAM
*Unfortunately, stage 4 is resubmitted again and again and keeps
hitting fetch failures. After 14-08-23 09:03:37 the master log goes
silent, and it only starts printing again at 14-08-24 00:43:15:*
14-08-23 09:03:37 INFO [Result resolver thread-3]
YarnClusterScheduler: Removed TaskSet 4.0, whose tasks have all
completed, from pool
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-71]
DAGScheduler: Resubmitting failed stages
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed
container container_1400565786114_133451_01_000171 (state: COMPLETE,
exit status: -100)
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container
marked as failed: container_1400565786114_133451_01_000171
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed
container container_1400565786114_133451_01_000172 (state: COMPLETE,
exit status: -100)
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container
marked as failed: container_1400565786114_133451_01_000172
14-08-24 00:43:20 INFO [Thread-854] ApplicationMaster: Allocating 2
containers to make up for (potentially) lost containers
14-08-24 00:43:20 INFO [Thread-854] YarnAllocationHandler: Will
Allocate 2 executor containers, each with 3456 memory
*Strangely, TaskSet 4.0 is removed because its tasks have all
completed, even though Stage 4 was marked for resubmission. In the
executor logs there are many "java.net.ConnectException: Connection
timed out" errors, like:*
14-08-23 08:19:14 WARN [pool-3-thread-1] SendingConnection: Error
finishing connection to java.net.ConnectException: Connection timed
out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
*I often run into this kind of problem, i.e. BlockManager connection
failures. Spark cannot recover effectively, and the job either hangs
or fails outright.*
*Any suggestions? Also, are there any guides on sizing resources for a
job with respect to computation, caching, shuffle, etc.?*
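*For context on the sizing question, here is a rough sketch of how I
read the numbers above. It assumes the 3456 MB container size in the
YarnAllocationHandler log is my 3G executor memory plus a fixed 384 MB
per-container overhead; the 384 figure is my assumption, it is not
printed anywhere in these logs. The object and field names are only
for illustration.*

```scala
// Sketch only: relates the requested executor memory to the YARN
// container size seen in the log ("each with 3456 memory").
object ContainerSizing {
  val numExecutors: Int = 400            // "400 worker" from the job config
  val executorMemoryMb: Int = 3 * 1024   // "3G" per worker
  val memoryOverheadMb: Int = 384        // ASSUMPTION: per-container overhead

  // Memory requested from YARN for one executor container, in MB.
  def containerMb: Int = executorMemoryMb + memoryOverheadMb

  // Memory requested from YARN for all executors together, in MB.
  def totalClusterMb: Long = numExecutors.toLong * containerMb

  def main(args: Array[String]): Unit = {
    println(s"per-container request: $containerMb MB")
    println(s"total for $numExecutors executors: $totalClusterMb MB")
  }
}
```

*If that reading is right, each executor has only one core and about
3G of heap to cover computation, cache, and shuffle together, which is
why I am asking how to divide it.*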
*Thank You!*