I have this problem too. Eventually the job fails (in the UI) and the terminal hangs until I hit CTRL + C. (Logs below.)
Now, the Spark docs say the heartbeat / failure-detector configuration can be tweaked to tolerate GC pauses. I'm wondering whether this is symptomatic of pushing the cluster a little too hard (we were also running an HDFS balance, which died of an OOM). What sort of values should I try increasing these settings to? (A rough sketch of what I'm considering is pasted after the logs.)

14/03/22 21:45:47 ERROR scheduler.TaskSchedulerImpl: Lost executor 6 on ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:45:47 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 1.0
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 720 (task 1.0:248)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 722 (task 1.0:250)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 686 (task 1.0:214)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 698 (task 1.0:226)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 707 (task 1.0:235)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 709 (task 1.0:237)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 694 (task 1.0:222)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 688 (task 1.0:216)
14/03/22 21:45:47 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 5)
14/03/22 21:45:47 INFO storage.BlockManagerMasterActor: Trying to remove executor 6 from BlockManagerMaster.
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated: app-20140322213226-0044/6 is now FAILED (Command exited with code 137)
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140322213226-0044/6 removed: Command exited with code 137
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor added: app-20140322213226-0044/9 on worker-20140321161205-ip-172-31-0-126.ec2.internal-50034 (ip-172-31-0-126.ec2.internal:50034) with 8 cores
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140322213226-0044/9 on hostPort ip-172-31-0-126.ec2.internal:50034 with 8 cores, 13.5 GB RAM
14/03/22 21:45:47 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated: app-20140322213226-0044/9 is now RUNNING
14/03/22 21:45:49 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_6_236 in memory on ec2-54-84-166-37.compute-1.amazonaws.com:56804 (size: 72.5 MB, free: 3.4 GB)
14/03/22 21:45:49 INFO scheduler.TaskSetManager: Starting task 1.0:216 as TID 729 on executor 8: ec2-54-84-166-37.compute-1.amazonaws.com (PROCESS_LOCAL)

.... more stuff happens ....
14/03/22 21:52:09 ERROR scheduler.TaskSchedulerImpl: Lost executor 12 on ip-172-31-8-63.ec2.internal: remote Akka client disassociated
14/03/22 21:52:09 INFO scheduler.TaskSetManager: Re-queueing tasks for 12 from TaskSet 1.0
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 828 (task 1.0:339)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 830 (task 1.0:305)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 824 (task 1.0:302)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 827 (task 1.0:313)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 826 (task 1.0:338)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 829 (task 1.0:311)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 823 (task 1.0:314)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 825 (task 1.0:312)
14/03/22 21:52:09 INFO scheduler.DAGScheduler: Executor lost: 12 (epoch 10)
14/03/22 21:52:09 INFO storage.BlockManagerMasterActor: Trying to remove executor 12 from BlockManagerMaster.
14/03/22 21:52:09 INFO storage.BlockManagerMaster: Removed 12 successfully in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 11 disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 11 on ec2-54-84-151-18.compute-1.amazonaws.com: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 11 from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 837 (task 1.0:331)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 831 (task 1.0:341)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 839 (task 1.0:347)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 836 (task 1.0:284)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 794 (task 1.0:271)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 838 (task 1.0:273)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 841 (task 1.0:296)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 840 (task 1.0:276)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 11)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster: Removed 11 successfully in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 9 disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 9 on ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 9 from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 812 (task 1.0:324)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 815 (task 1.0:330)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 833 (task 1.0:345)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 809 (task 1.0:317)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 818 (task 1.0:334)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 811 (task 1.0:321)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.0:344)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 834 (task 1.0:346)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 9 (epoch 12)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove executor 9 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster: Removed 9 successfully in removeExecutor

(It HANGS here; requires CTRL + C.)
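For reference, here is roughly what I was planning to try. This is only a minimal sketch in Scala, assuming the Akka heartbeat / failure-detector settings described in the configuration docs apply here; the specific numbers are guesses on my part, not values I've tested:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the Akka heartbeat / failure-detector tolerances so that a
// long GC pause is less likely to get an executor marked as disassociated.
// The app name, master URL, and all numeric values below are placeholders.
val conf = new SparkConf()
  .setMaster("spark://<master-host>:7077")                 // hypothetical master URL
  .setAppName("my-job")                                    // hypothetical app name
  .set("spark.akka.timeout", "300")                        // seconds
  .set("spark.akka.heartbeat.interval", "10000")           // seconds
  .set("spark.akka.heartbeat.pauses", "60000")             // seconds
  .set("spark.akka.failure-detector.threshold", "3000.0")

val sc = new SparkContext(conf)

If those are the wrong knobs (or the wrong orders of magnitude), I'd appreciate a pointer to what values people actually run with.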