[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-17449.
-------------------------------
Resolution: Fixed
Fix Version/s: 2.1.0
Issue resolved by pull request 15042
[https://github.com/apache/spark/pull/15042]
> Relation between heartbeatInterval and network timeout
> ------------------------------------------------------
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Yang Liang
> Priority: Minor
> Fix For: 2.1.0
>
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 ms
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by several internal
>  * components to convey liveness or execution information for in-progress tasks. It will also
>  * expire the hosts that have not heartbeated for more than spark.network.timeout.
>  */
> private val executorTimeoutMs =
>   sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000
> The relation between spark.network.timeout and
> spark.executor.heartbeatInterval should at least be mentioned in the
> documentation; otherwise the errors above are confusing. Could the
> settings also be validated when they are read?
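
The check suggested at the end of the report could look roughly like the sketch below. It is only an illustration and not Spark code: the object HeartbeatConfigCheck and its methods toMillis and validate are hypothetical names, and the parser assumes an explicit "ms" or "s" suffix, which is narrower than what Spark itself accepts.

```scala
// Hypothetical helper, not part of Spark: checks that the heartbeat interval
// is smaller than the network timeout before an application is launched.
object HeartbeatConfigCheck {
  // Assumption: only "<digits>ms" and "<digits>s" are supported here.
  private val Duration = """(\d+)(ms|s)""".r

  /** Converts a duration string such as "20s" or "120000ms" to milliseconds. */
  def toMillis(value: String): Long = value.trim match {
    case Duration(n, "ms") => n.toLong
    case Duration(n, "s")  => n.toLong * 1000L
    case other =>
      throw new IllegalArgumentException(s"Unsupported duration: $other")
  }

  /** Returns Right(()) when the interval is below the timeout, else a message. */
  def validate(heartbeatInterval: String, networkTimeout: String): Either[String, Unit] = {
    val heartbeatMs = toMillis(heartbeatInterval)
    val timeoutMs = toMillis(networkTimeout)
    if (heartbeatMs < timeoutMs) Right(())
    else Left(
      s"spark.executor.heartbeatInterval ($heartbeatMs ms) must be smaller " +
        s"than spark.network.timeout ($timeoutMs ms)")
  }
}
```

With the values from the second and third runs above, validate("200s", "10s") returns a Left carrying the message. The first run (20s against the 120000 ms timeout shown in its log) passes this check, so a comparison like this would not by itself explain that timeout.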