TaskSetManager stalls for 1 min in the middle of a job

Oleg Mazurov Wed, 14 Dec 2016 14:55:35 -0800

After submitting three tasks at level PROCESS_LOCAL, TaskSetManager moves to
the next locality level (NODE_LOCAL, per the log below) and gets stuck there
for 60 sec. That level is not empty, but it appears to contain only the same
tasks that were already submitted and successfully executed, which leads to a
stall until the corresponding timeout expires. After that, execution
continues at level RACK_LOCAL.
Is this a bug in TaskSetManager, or is it expected behavior? What could or
should be done to avoid the delay?
Here is the log of TaskSetManager messages:


22:32:35,265 (dag-scheduler-event-loop) DEBUG [o.a.s.s.TaskSetManager] -
Epoch for TaskSet 1.0: 0
22:32:35,276 (dag-scheduler-event-loop) DEBUG [o.a.s.s.TaskSetManager] -
Valid locality levels for TaskSet 1.0: PROCESS_LOCAL, NODE_LOCAL,
RACK_LOCAL, ANY
22:32:35,288 (dispatcher-event-loop-20) INFO  [o.a.s.s.TaskSetManager] -
Starting task 1.0 in stage 1.0 (TID 37, localhost, partition 1,
PROCESS_LOCAL, 5724 bytes)
22:32:35,289 (dispatcher-event-loop-20) INFO  [o.a.s.s.TaskSetManager] -
Starting task 8.0 in stage 1.0 (TID 38, localhost, partition 8,
PROCESS_LOCAL, 5727 bytes)
22:32:35,290 (dispatcher-event-loop-20) INFO  [o.a.s.s.TaskSetManager] -
Starting task 24.0 in stage 1.0 (TID 39, localhost, partition 24,
PROCESS_LOCAL, 5723 bytes)
22:32:36,510 (dispatcher-event-loop-9) DEBUG [o.a.s.s.TaskSetManager] - No
tasks for locality level PROCESS_LOCAL, so moving to locality level
NODE_LOCAL
22:32:36,511 (task-result-getter-1) INFO  [o.a.s.s.TaskSetManager] -
Finished task 24.0 in stage 1.0 (TID 39) in 1222 ms on localhost (1/37)
22:32:40,655 (task-result-getter-2) INFO  [o.a.s.s.TaskSetManager] -
Finished task 8.0 in stage 1.0 (TID 38) in 5367 ms on localhost (2/37)
22:32:41,285 (task-result-getter-3) INFO  [o.a.s.s.TaskSetManager] -
Finished task 1.0 in stage 1.0 (TID 37) in 6004 ms on localhost (3/37)
22:33:37,398 (dispatcher-event-loop-18) DEBUG [o.a.s.s.TaskSetManager] -
Moving to RACK_LOCAL after waiting for 60000ms
22:33:37,400 (dispatcher-event-loop-18) INFO  [o.a.s.s.TaskSetManager] -
Starting task 0.0 in stage 1.0 (TID 40, localhost, partition 0, RACK_LOCAL,
5720 bytes)
22:33:37,401 (dispatcher-event-loop-18) INFO  [o.a.s.s.TaskSetManager] -
Starting task 2.0 in stage 1.0 (TID 41, localhost, partition 2, RACK_LOCAL,
5723 bytes)
22:33:37,402 (dispatcher-event-loop-18) INFO  [o.a.s.s.TaskSetManager] -
Starting task 3.0 in stage 1.0 (TID 42, localhost, partition 3, RACK_LOCAL,
5725 bytes)
22:33:45,916 (dispatcher-event-loop-12) INFO  [o.a.s.s.TaskSetManager] -
Starting task 4.0 in stage 1.0 (TID 43, localhost, partition 4, RACK_LOCAL,
5725 bytes)
...
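
To make the timeline above easier to reason about, here is a toy model of
locality fallback with a per-level wait. It is only a sketch under stated
assumptions, not Spark's actual code: the function allowed_level, the WAIT
table and the stale predicate are hypothetical names, and the 60000 ms value
is assumed to be the configured locality wait (spark.locality.wait or one of
its per-level variants such as spark.locality.wait.node). Times are
millisecond offsets from the first log line.

```python
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(index, last_launch_ms, now_ms, wait_ms, has_pending):
    """Toy model: return (level index, timer start) allowed at now_ms.

    index          -- current position in LEVELS
    last_launch_ms -- when the wait timer for the current level started
    wait_ms        -- dict mapping level name to its wait in milliseconds
    has_pending    -- callable(level) -> bool, True if the level's pending
                      list looks non-empty (hypothetical predicate)
    """
    while index < len(LEVELS) - 1:
        level = LEVELS[index]
        if not has_pending(level):
            # Nothing to run at this level: skip it at once, restart timer.
            last_launch_ms = now_ms
            index += 1
        elif now_ms - last_launch_ms >= wait_ms[level]:
            # Waited long enough without a launch: fall back one level.
            last_launch_ms += wait_ms[level]
            index += 1
        else:
            # Level still looks non-empty and its wait has not expired.
            break
    return index, last_launch_ms


# Assumption: every level uses the same 60 s wait seen in the log.
WAIT = {level: 60_000 for level in LEVELS}


def stale(level):
    # Assumption: the NODE_LOCAL list still looks non-empty, although
    # everything in it has already been launched.
    return level != "PROCESS_LOCAL"


idx, last = allowed_level(0, 0, 1_245, WAIT, stale)
print(LEVELS[idx])  # NODE_LOCAL
idx, last = allowed_level(idx, last, 61_245, WAIT, stale)
print(LEVELS[idx])  # RACK_LOCAL
```

In this model the stall disappears if the NODE_LOCAL list is reported as
empty (the level is skipped immediately), and it shrinks if the wait for
that level is lowered.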

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/TaskSetManager-stalls-for-1-min-in-the-middle-of-a-job-tp28211.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
