[
https://issues.apache.org/jira/browse/SPARK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15476952#comment-15476952
]
Sean Owen commented on SPARK-17468:
-----------------------------------
Doesn't the worker then die? I'm not clear in this case why you'd have workers
still running for any significant period of time.
> Cluster workers crash when the master's network is down for more than one
> WORKER_TIMEOUT_MS
> --------------------------------------------------------------------------------
>
> Key: SPARK-17468
> URL: https://issues.apache.org/jira/browse/SPARK-17468
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Environment: CentOS 6.5, Spark standalone, 15 machines, 15 workers and
> 2 masters; worker, master, and driver run on the same machines
> Reporter: zhangzhiyan
> Priority: Critical
> Labels: Spark, WORKER_TIMEOUT_MS, crush, standalone
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> I'm from China. My production Spark standalone cluster crashed during the 9.9
> sales event; please help me figure out how to solve this problem, thanks.
> The master log is below:
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814124907-10.205.130.37-16590 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814113016-10.205.130.13-57487 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814134926-10.205.130.39-11430 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814131257-10.205.130.38-32160 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814161444-10.205.136.19-14196 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814141654-10.205.130.42-49707 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814115125-10.205.130.14-38381 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814152146-10.205.136.10-24730 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814122817-10.205.130.36-54348 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:57 WARN Master: Removing
> worker-20160814170452-10.205.136.34-9921 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:58 WARN Master: Removing
> worker-20160814154744-10.205.136.12-12399 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:58 WARN Master: Removing
> worker-20160814150355-10.205.130.44-5792 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:58 WARN Master: Removing
> worker-20160814143901-10.205.130.43-2223 because we got no heartbeat in 60
> seconds
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814141654-10.205.130.42-49707. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814115125-10.205.130.14-38381. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814134926-10.205.130.39-11430. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814131257-10.205.130.38-32160. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814150355-10.205.130.44-5792. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814154744-10.205.136.12-12399. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814161444-10.205.136.19-14196. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814113016-10.205.130.13-57487. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814152146-10.205.136.10-24730. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814143901-10.205.130.43-2223. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814122817-10.205.130.36-54348. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker
> worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
> I think the code below may be wrong. When the master's network is down for
> more than WORKER_TIMEOUT_MS, the master removes the worker and executor
> information from its memory. When the workers quickly reconnect to the master,
> their old state has already been erased, so even though they are still running
> the old executors, the master allocates more resources than the workers can
> afford, and that crashes my workers.
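The race described above can be illustrated with a minimal sketch. This is not Spark code; the class and field names (`Worker`, `Master`, `cores_free`, etc.) are made up for illustration, and only core accounting is modeled:

```python
# Hypothetical sketch (NOT Spark code) of the over-allocation race:
# the master forgets a timed-out worker's executors, then treats the
# re-registered worker as idle even though its executors still run.
class Worker:
    def __init__(self, total_cores):
        self.total_cores = total_cores
        self.cores_free = total_cores

class Master:
    def __init__(self):
        self.workers = {}

    def register(self, wid, total_cores):
        # Re-registration creates a fresh record: executors the real
        # worker is still running are no longer accounted for.
        self.workers[wid] = Worker(total_cores)

    def allocate(self, wid, cores):
        w = self.workers[wid]
        w.cores_free -= cores
        return w.cores_free

    def timeout(self, wid):
        # Network outage > WORKER_TIMEOUT_MS: master erases its state.
        del self.workers[wid]

m = Master()
m.register("worker-1", 8)
m.allocate("worker-1", 8)         # worker is now fully loaded
m.timeout("worker-1")             # heartbeats lost; load is forgotten
m.register("worker-1", 8)        # worker re-registers, looks idle again
free = m.allocate("worker-1", 8)  # master schedules 8 more cores
# The physical machine now runs 16 cores' worth of executors on 8 cores.
```

In this toy model the second allocation succeeds (`free` is 0 again), even though the machine is already running the first batch of executors, which matches the oversubscription the reporter describes.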
> So I tried increasing WORKER_TIMEOUT_MS to 3 minutes; is that OK? Can you
> give me some advice?
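For reference, this timeout is exposed as the `spark.worker.timeout` property (in seconds, read by the standalone master), so a 3-minute value could be set roughly like this, assuming the standard conf layout:

```shell
# conf/spark-env.sh on the master (assumed standard conf layout)
SPARK_MASTER_OPTS="-Dspark.worker.timeout=180"
```

This only lengthens the window before the master declares a worker lost; it does not change the re-registration behavior discussed above.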
> Code address: org.apache.spark.deploy.master.Master, line 1023
>   /** Check for, and remove, any timed-out workers */
>   private def timeOutDeadWorkers() {
>     // Copy the workers into an array so we don't modify the hashset while iterating through it
>     val currentTime = System.currentTimeMillis()
>     val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
>     for (worker <- toRemove) {
>       if (worker.state != WorkerState.DEAD) {
>         logWarning("Removing %s because we got no heartbeat in %d seconds".format(
>           worker.id, WORKER_TIMEOUT_MS / 1000))
>         removeWorker(worker)
>       } else {
>         if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
>           workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
>         }
>       }
>     }
>   }
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]