I have this problem too.  Eventually the job fails (on the UI) and hangs the
terminal until I hit CTRL + C.  (Logs below.)

Now the Spark docs explain that the heartbeat configuration can be tweaked
to handle GC pauses.  I'm wondering if this is symptomatic of pushing the
cluster a little too hard (we were also running an HDFS balance, which died
of an OOM).

What sort of values should I try increasing these configurables to?
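For reference, these are the Akka-level settings from the 0.9 configuration
docs that I believe are the relevant knobs.  The values below are only
illustrative guesses on my part, not tested recommendations:

    // Illustrative only: raise the Akka timeouts/heartbeats that control how
    // quickly a GC-paused executor is declared dead and disassociated.
    // The numbers here are guesses, not tested recommendations.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-job")
      .set("spark.akka.timeout", "300")                      // seconds, default 100
      .set("spark.akka.heartbeat.interval", "10000")         // seconds, default 1000
      .set("spark.akka.heartbeat.pauses", "60000")           // seconds, default 600
      .set("spark.akka.failure-detector.threshold", "3000")  // default 300.0

(Setting the interval and pauses very high effectively disables the Akka
failure detector, if I'm reading the docs right.)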

14/03/22 21:45:47 ERROR scheduler.TaskSchedulerImpl: Lost executor 6 on
ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:45:47 INFO scheduler.TaskSetManager: Re-queueing tasks for 6
from TaskSet 1.0
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 720 (task 1.0:248)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 722 (task 1.0:250)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 686 (task 1.0:214)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 698 (task 1.0:226)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 707 (task 1.0:235)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 709 (task 1.0:237)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 694 (task 1.0:222)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 688 (task 1.0:216)
14/03/22 21:45:47 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 5)
14/03/22 21:45:47 INFO storage.BlockManagerMasterActor: Trying to remove
executor 6 from BlockManagerMaster.
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated:
app-20140322213226-0044/6 is now FAILED (Command exited with code 137)
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Executor
app-20140322213226-0044/6 removed: Command exited with code 137
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor added:
app-20140322213226-0044/9 on
worker-20140321161205-ip-172-31-0-126.ec2.internal-50034
(ip-172-31-0-126.ec2.internal:50034) with 8 cores
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Granted executor
ID app-20140322213226-0044/9 on hostPort ip-172-31-0-126.ec2.internal:50034
with 8 cores, 13.5 GB RAM
14/03/22 21:45:47 INFO storage.BlockManagerMaster: Removed 6 successfully in
removeExecutor
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated:
app-20140322213226-0044/9 is now RUNNING
14/03/22 21:45:49 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_6_236 in memory on ec2-54-84-166-37.compute-1.amazonaws.com:56804
(size: 72.5 MB, free: 3.4 GB)
14/03/22 21:45:49 INFO scheduler.TaskSetManager: Starting task 1.0:216 as
TID 729 on executor 8: ec2-54-84-166-37.compute-1.amazonaws.com
(PROCESS_LOCAL)

.... more stuff happens ....

14/03/22 21:52:09 ERROR scheduler.TaskSchedulerImpl: Lost executor 12 on
ip-172-31-8-63.ec2.internal: remote Akka client disassociated
14/03/22 21:52:09 INFO scheduler.TaskSetManager: Re-queueing tasks for 12
from TaskSet 1.0
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 828 (task 1.0:339)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 830 (task 1.0:305)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 824 (task 1.0:302)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 827 (task 1.0:313)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 826 (task 1.0:338)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 829 (task 1.0:311)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 823 (task 1.0:314)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 825 (task 1.0:312)
14/03/22 21:52:09 INFO scheduler.DAGScheduler: Executor lost: 12 (epoch 10)
14/03/22 21:52:09 INFO storage.BlockManagerMasterActor: Trying to remove
executor 12 from BlockManagerMaster.
14/03/22 21:52:09 INFO storage.BlockManagerMaster: Removed 12 successfully
in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 11
disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 11 on
ec2-54-84-151-18.compute-1.amazonaws.com: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 11
from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 837 (task 1.0:331)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 831 (task 1.0:341)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 839 (task 1.0:347)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 836 (task 1.0:284)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 794 (task 1.0:271)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 838 (task 1.0:273)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 841 (task 1.0:296)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 840 (task 1.0:276)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 11)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove
executor 11 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster: Removed 11 successfully
in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 9
disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 9 on
ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 9
from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 812 (task 1.0:324)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 815 (task 1.0:330)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 833 (task 1.0:345)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 809 (task 1.0:317)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 818 (task 1.0:334)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 811 (task 1.0:321)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.0:344)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 834 (task 1.0:346)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 9 (epoch 12)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove
executor 9 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster: Removed 9 successfully in
removeExecutor

At this point it HANGS and requires CTRL + C.



