I am testing my Spark job on YARN (spark: 1.0.0-cdh5.1.0, yarn: cdh5.1.0).
Once in a while the Spark job hangs (stuck in some stage without any progress on the driver or the executors) after some failures. Below is a list of typical failures on the driver and the executors.

*On master/driver*

14/09/16 06:42:28 WARN TaskSetManager: Loss was due to fetch failure from null
14/09/16 06:42:28 INFO DAGScheduler: Marking Stage 0 (saveAsSequenceFile at FraudUsers.scala:144) for resubmision due to a fetch failure
14/09/16 06:42:28 INFO DAGScheduler: The failed fetch was from Stage 1 (reduceByKey at FraudUsers.scala:105); marking it for resubmission
14/09/16 06:42:28 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.NullPointerException
        at org.apache.spark.util.JsonProtocol$.blockManagerIdToJson(JsonProtocol.scala:267)
        at org.apache.spark.util.JsonProtocol$.taskEndReasonToJson(JsonProtocol.scala:249)
        at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:103)
        at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:52)
        at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:84)
        at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:102)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$7.apply(SparkListenerBus.scala:58)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$7.apply(SparkListenerBus.scala:58)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
        at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:58)
        at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)

*On executor*

14/09/16 06:42:15 WARN SendingConnection: Error finishing connection to 369.bm-hadoopc-datanode.prod.lax1/10.0.81.19:42251
java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
        at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/09/16 06:42:15 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(369.bm-hadoopc-datanode.prod.lax1,42251)
14/09/16 06:42:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(369.bm-hadoopc-datanode.prod.lax1,42251)
14/09/16 06:42:15 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@257cbf16
14/09/16 06:42:15 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(369.bm-hadoopc-datanode.prod.lax1,42251)

*On driver/master*

14/09/16 06:42:14 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(767, 369.bm-hadoopc-datanode.prod.lax1, 42251, 0)

*On executor, 369.bm-hadoopc-datanode.prod.lax1*

14/09/16 06:48:22 INFO BlockManager: BlockManager re-registering with master
14/09/16 06:48:22 INFO BlockManagerMaster: Trying to register BlockManager
14/09/16 06:48:22 INFO BlockManagerMaster: Registered BlockManager
14/09/16 06:48:22 INFO BlockManager: Reporting 63 blocks to the master.

--
Chen Song