After submitting three tasks at locality level PROCESS_LOCAL, TaskSetManager moves to the next locality level (NODE_LOCAL, according to the log) and gets stuck there for 60 seconds. That level is not empty, but it appears to contain only the same tasks that were already submitted and successfully executed, so the job stalls until the corresponding timeout expires. After that, execution continues at level RACK_LOCAL. Is this a bug in TaskSetManager, or is it expected behavior? What could or should be done to avoid the delay? Here is the log of TaskSetManager messages:

22:32:35,265 (dag-scheduler-event-loop) DEBUG [o.a.s.s.TaskSetManager] - Epoch for TaskSet 1.0: 0
22:32:35,276 (dag-scheduler-event-loop) DEBUG [o.a.s.s.TaskSetManager] - Valid locality levels for TaskSet 1.0: PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY
22:32:35,288 (dispatcher-event-loop-20) INFO [o.a.s.s.TaskSetManager] - Starting task 1.0 in stage 1.0 (TID 37, localhost, partition 1, PROCESS_LOCAL, 5724 bytes)
22:32:35,289 (dispatcher-event-loop-20) INFO [o.a.s.s.TaskSetManager] - Starting task 8.0 in stage 1.0 (TID 38, localhost, partition 8, PROCESS_LOCAL, 5727 bytes)
22:32:35,290 (dispatcher-event-loop-20) INFO [o.a.s.s.TaskSetManager] - Starting task 24.0 in stage 1.0 (TID 39, localhost, partition 24, PROCESS_LOCAL, 5723 bytes)
22:32:36,510 (dispatcher-event-loop-9) DEBUG [o.a.s.s.TaskSetManager] - No tasks for locality level PROCESS_LOCAL, so moving to locality level NODE_LOCAL
22:32:36,511 (task-result-getter-1) INFO [o.a.s.s.TaskSetManager] - Finished task 24.0 in stage 1.0 (TID 39) in 1222 ms on localhost (1/37)
22:32:40,655 (task-result-getter-2) INFO [o.a.s.s.TaskSetManager] - Finished task 8.0 in stage 1.0 (TID 38) in 5367 ms on localhost (2/37)
22:32:41,285 (task-result-getter-3) INFO [o.a.s.s.TaskSetManager] - Finished task 1.0 in stage 1.0 (TID 37) in 6004 ms on localhost (3/37)
22:33:37,398 (dispatcher-event-loop-18) DEBUG [o.a.s.s.TaskSetManager] - Moving to RACK_LOCAL after waiting for 60000ms
22:33:37,400 (dispatcher-event-loop-18) INFO [o.a.s.s.TaskSetManager] - Starting task 0.0 in stage 1.0 (TID 40, localhost, partition 0, RACK_LOCAL, 5720 bytes)
22:33:37,401 (dispatcher-event-loop-18) INFO [o.a.s.s.TaskSetManager] - Starting task 2.0 in stage 1.0 (TID 41, localhost, partition 2, RACK_LOCAL, 5723 bytes)
22:33:37,402 (dispatcher-event-loop-18) INFO [o.a.s.s.TaskSetManager] - Starting task 3.0 in stage 1.0 (TID 42, localhost, partition 3, RACK_LOCAL, 5725 bytes)
22:33:45,916 (dispatcher-event-loop-12) INFO [o.a.s.s.TaskSetManager] - Starting task 4.0 in stage 1.0 (TID 43, localhost, partition 4, RACK_LOCAL, 5725 bytes)
...

Thanks.
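
To make the behavior I am describing concrete, here is a minimal Python sketch of how I understand the locality-level fallback to work. It is a simplified model written for illustration, not Spark's actual code: the names `LocalityState`, `allowed_level`, and `has_pending` are hypothetical, and the per-level waits are assumed to correspond to the `spark.locality.wait` family of settings (60000 ms is taken from the "after waiting for 60000ms" message in the log).

```python
from dataclasses import dataclass
from typing import Dict, List

LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


@dataclass
class LocalityState:
    """Hypothetical, simplified stand-in for the scheduler's locality bookkeeping."""
    levels: List[str]
    waits_ms: Dict[str, int]   # assumed per-level wait before falling back
    index: int = 0             # position of the current level in `levels`
    last_launch_ms: int = 0    # time of the last launch or level change


def allowed_level(state: LocalityState, now_ms: int,
                  has_pending: Dict[str, bool]) -> str:
    """Return the least-local level at which tasks may currently be launched.

    `has_pending[level]` says whether the scheduler *believes* tasks are still
    pending at that level (which may be stale, as suspected in the question).
    """
    while state.index < len(state.levels) - 1:
        level = state.levels[state.index]
        if not has_pending.get(level, False):
            # Nothing left at this level: move on immediately, restart the clock.
            state.last_launch_ms = now_ms
            state.index += 1
        elif now_ms - state.last_launch_ms >= state.waits_ms[level]:
            # Waited long enough at this level: fall back to the next one.
            state.last_launch_ms += state.waits_ms[level]
            state.index += 1
        else:
            # Still inside the wait window: stay at this level.
            return level
    return state.levels[state.index]


if __name__ == "__main__":
    waits = {lvl: 60000 for lvl in LEVELS}
    state = LocalityState(levels=LEVELS, waits_ms=waits)
    # The scheduler believes NODE_LOCAL still has work, PROCESS_LOCAL does not.
    pending = {"PROCESS_LOCAL": False, "NODE_LOCAL": True, "RACK_LOCAL": True}
    for t in (1222, 30000, 61222):
        print(t, allowed_level(state, t, pending))
```

In this model the job sits at NODE_LOCAL for the full wait whenever that level is believed to be non-empty, even if nothing can actually be launched there, which matches the one-minute gap in the log. If the 60000 ms does come from `spark.locality.wait` or `spark.locality.wait.node` (an assumption about my configuration; the documented default is 3s), a smaller value would shorten the stall in the same way a smaller `waits_ms` entry does here.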