Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/7385#issuecomment-122333250
It looks like the hanging test is DAGSchedulerSuite's "misbehaved
resultHandler should not crash DAGScheduler and SparkContext" test. Here's the
relevant part of the unit test log:
```
15/07/17 09:05:49.333 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite
INFO SparkContext: Starting job: runJob at DAGSchedulerSuite.scala:762
15/07/17 09:05:49.346 dag-scheduler-event-loop INFO DAGScheduler: Got job 0
(runJob at DAGSchedulerSuite.scala:762) with 1 output partitions
(allowLocal=true)
15/07/17 09:05:49.346 dag-scheduler-event-loop INFO DAGScheduler: Final
stage: ResultStage 0(runJob at DAGSchedulerSuite.scala:762)
15/07/17 09:05:49.347 dag-scheduler-event-loop INFO DAGScheduler: Parents
of final stage: List()
15/07/17 09:05:49.347 dag-scheduler-event-loop INFO DAGScheduler: Missing
parents: List()
15/07/17 09:05:49.348 dag-scheduler-event-loop INFO DAGScheduler: Computing
the requested partition locally
15/07/17 09:05:49.353 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite
INFO DAGScheduler: Job 0 failed: runJob at DAGSchedulerSuite.scala:762, took
0.019835 s
15/07/17 09:05:49.362 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite
INFO SparkContext: Starting job: runJob at DAGSchedulerSuite.scala:773
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Got job 1
(runJob at DAGSchedulerSuite.scala:773) with 2 output partitions
(allowLocal=false)
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Final
stage: ResultStage 1(runJob at DAGSchedulerSuite.scala:773)
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Parents
of final stage: List()
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Missing
parents: List()
15/07/17 09:05:49.366 dag-scheduler-event-loop INFO DAGScheduler:
Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at
DAGSchedulerSuite.scala:772), which has no missing parents
15/07/17 09:05:49.452 dag-scheduler-event-loop INFO MemoryStore:
ensureFreeSpace(1048) called with curMem=0, maxMem=1691332116
15/07/17 09:05:49.453 dag-scheduler-event-loop INFO MemoryStore: Block
broadcast_0 stored as values in memory (estimated size 1048.0 B, free 1613.0 MB)
15/07/17 09:05:49.459 dag-scheduler-event-loop INFO MemoryStore:
ensureFreeSpace(769) called with curMem=1048, maxMem=1691332116
15/07/17 09:05:49.459 dag-scheduler-event-loop INFO MemoryStore: Block
broadcast_0_piece0 stored as bytes in memory (estimated size 769.0 B, free
1613.0 MB)
15/07/17 09:05:49.461 sparkDriver-akka.actor.default-dispatcher-3 INFO
BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61512 (size:
769.0 B, free: 1613.0 MB)
15/07/17 09:05:49.463 dag-scheduler-event-loop INFO SparkContext: Created
broadcast 0 from broadcast at DAGScheduler.scala:901
15/07/17 09:05:49.465 dag-scheduler-event-loop INFO DAGScheduler:
Submitting 2 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at
parallelize at DAGSchedulerSuite.scala:772)
15/07/17 09:05:49.466 dag-scheduler-event-loop INFO TaskSchedulerImpl:
Adding task set 1.0 with 2 tasks
15/07/17 09:05:49.488 sparkDriver-akka.actor.default-dispatcher-3 INFO
TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost,
PROCESS_LOCAL, 1369 bytes)
15/07/17 09:05:49.491 Executor task launch worker-0 INFO Executor: Running
task 0.0 in stage 1.0 (TID 0)
15/07/17 09:05:49.504 Executor task launch worker-0 INFO Executor: Finished
task 0.0 in stage 1.0 (TID 0). 810 bytes result sent to driver
15/07/17 09:05:49.506 sparkDriver-akka.actor.default-dispatcher-3 INFO
TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost,
PROCESS_LOCAL, 1426 bytes)
15/07/17 09:05:49.506 Executor task launch worker-0 INFO Executor: Running
task 1.0 in stage 1.0 (TID 1)
15/07/17 09:05:49.507 task-result-getter-0 INFO TaskSetManager: Finished
task 0.0 in stage 1.0 (TID 0) in 30 ms on localhost (1/2)
15/07/17 09:05:49.508 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite
INFO DAGScheduler: Job 1 failed: runJob at DAGSchedulerSuite.scala:773, took
0.145836 s
15/07/17 09:05:49.509 Executor task launch worker-0 INFO Executor: Finished
task 1.0 in stage 1.0 (TID 1). 810 bytes result sent to driver
15/07/17 09:05:49.511 task-result-getter-1 INFO TaskSetManager: Finished
task 1.0 in stage 1.0 (TID 1) in 6 ms on localhost (2/2)
15/07/17 09:05:49.512 dag-scheduler-event-loop INFO DAGScheduler:
ResultStage 1 (runJob at DAGSchedulerSuite.scala:773) finished in 0.041 s
15/07/17 09:05:49.512 task-result-getter-1 INFO TaskSchedulerImpl: Removed
TaskSet 1.0, whose tasks have all completed, from pool
15/07/17 09:05:49.516 dag-scheduler-event-loop ERROR
DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting
down SparkContext
java.lang.IllegalStateException: Promise already completed.
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Promise$class.failure(Promise.scala:107)
at
scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:153)
at org.apache.spark.scheduler.JobWaiter.jobFailed(JobWaiter.scala:86)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1059)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1508)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1472)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1461)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/07/17 09:05:49.516 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite
INFO SparkContext: Starting job: count at DAGSchedulerSuite.scala:783
15/07/17 09:05:49.519 dag-scheduler-event-loop INFO DAGScheduler: Stopping
DAGScheduler
15/07/17 09:08:48.579 sparkDriver-akka.actor.default-dispatcher-4 WARN
HeartbeatReceiver: Removing executor driver with no recent heartbeats: 179561
ms exceeds timeout 120000 ms
15/07/17 09:08:48.580 sparkDriver-akka.actor.default-dispatcher-4 ERROR
TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed
out after 179561 ms
```
It looks like the problem is that `jobFailed` can be called multiple times
when a task that runs in `DAGScheduler.runLocallyWithinThread` fails: we get
one call from `runLocallyWithinThread` itself and a second one inside
`DAGScheduler.handleTaskCompletion`. Since the underlying `Promise` can only
be completed once, the second call throws the `IllegalStateException: Promise
already completed` seen above, which kills the event loop and shuts down the
SparkContext.
I think this is a subtlety that was overlooked in the old code. One approach
is to make `jobFailed` idempotent. Let me try that now...
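As a rough sketch of what an idempotent `jobFailed` could look like (this is
not Spark's actual `JobWaiter`; the class and field names below are
hypothetical), Scala's `Promise.tryFailure` completes the promise only if it
has not already been completed and returns `false` otherwise, instead of
throwing `IllegalStateException`:

```scala
import scala.concurrent.Promise

// Hypothetical stand-in for JobWaiter, for illustration only.
class IdempotentJobWaiter {
  // Promise tracking the job's outcome; completed at most once.
  private val jobPromise = Promise[Unit]()

  // Safe to call more than once: only the first failure is recorded.
  // tryFailure returns false (rather than throwing) if the promise
  // was already completed by an earlier call.
  def jobFailed(exception: Exception): Unit = {
    jobPromise.tryFailure(exception)
  }
}

val waiter = new IdempotentJobWaiter()
waiter.jobFailed(new RuntimeException("task failed"))
waiter.jobFailed(new RuntimeException("duplicate failure")) // no exception
```

With this shape, a duplicate failure notification from
`handleTaskCompletion` would simply be ignored rather than crashing the
`DAGSchedulerEventProcessLoop`.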