[ https://issues.apache.org/jira/browse/FLINK-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553247#comment-16553247 ]
Gary Yao edited comment on FLINK-9159 at 7/23/18 6:41 PM:
----------------------------------------------------------

[~till.rohrmann] Find below the config keys that I had a look at and their default values.

||Config Key||Default Value||
|slotmanager.request-timeout|10 m|
|slotmanager.taskmanager-timeout|30 s|
|slot.request.timeout|5 m|
|slot.idle.timeout|50 s|
|taskmanager.registration.timeout|5 m|
|mesos.failover-timeout|10 m|
|resourcemanager.job.timeout|5 m|
|heartbeat.timeout|50 s|
|heartbeat.interval|10 s|

Recommendations:

The value for {{mesos.failover-timeout}} is too low. The value specifies _"amount of time (in seconds) that the master will wait for the scheduler to failover before it tears down the framework by killing all its tasks/executors."_ For production systems, the recommended value is 1 week.

Between {{slotmanager.request-timeout}} and {{slot.request.timeout}}, effectively the minimum of both values will be used. One of them should be removed, or at least both should be set to the same value.

Some of the timeouts, e.g., {{slotmanager.taskmanager-timeout}}, are measured using {{System.currentTimeMillis()}}. If the stars align, e.g., during DST clock changes, this can lead to resources not being freed.

> Sanity check default timeout values
> -----------------------------------
>
>                 Key: FLINK-9159
>                 URL: https://issues.apache.org/jira/browse/FLINK-9159
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.2, 1.6.0
>
>
> Check that the default timeout values for resource release are sanely chosen.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
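
To illustrate the clock concern raised in the comment above, here is a minimal sketch, not Flink's actual code. The names {{IdleTimeoutSketch}} and {{IdleTracker}} are hypothetical. It assumes an idle check of the form "now - lastActivity >= timeout" and shows that a wall clock that is set back can keep the check from firing, while a source derived from {{System.nanoTime()}} is meant for measuring elapsed time and is not affected by wall-clock adjustments.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; class and method names are illustrative, not Flink APIs.
final class IdleTimeoutSketch {

    /** Tracks idleness against a caller-supplied millisecond clock. */
    static final class IdleTracker {
        private final LongSupplier clockMillis;
        private final long timeoutMillis;
        private long lastActivityMillis;

        IdleTracker(LongSupplier clockMillis, long timeoutMillis) {
            this.clockMillis = clockMillis;
            this.timeoutMillis = timeoutMillis;
            this.lastActivityMillis = clockMillis.getAsLong();
        }

        /** Records activity at the current clock reading. */
        void markActive() {
            lastActivityMillis = clockMillis.getAsLong();
        }

        /** True once at least timeoutMillis have passed according to the clock. */
        boolean isTimedOut() {
            return clockMillis.getAsLong() - lastActivityMillis >= timeoutMillis;
        }
    }

    /** Wall clock: follows the system time and can jump when that time is adjusted. */
    static LongSupplier wallClock() {
        return System::currentTimeMillis;
    }

    /** Monotonic clock: only differences between readings are meaningful. */
    static LongSupplier monotonicClock() {
        return () -> System.nanoTime() / 1_000_000L;
    }

    public static void main(String[] args) {
        // Simulated wall clock so the jump is reproducible.
        long[] fakeWallMillis = {10_000_000L};
        IdleTracker tracker = new IdleTracker(() -> fakeWallMillis[0], 30_000L);

        fakeWallMillis[0] += 40_000L;     // 40 s really elapse
        fakeWallMillis[0] -= 3_600_000L;  // then the clock is set back by one hour

        // Elapsed time now looks negative, so the 30 s timeout does not fire.
        System.out.println(tracker.isTimedOut()); // prints "false"
    }
}
```

Passing {{monotonicClock()}} instead of a wall-clock supplier would make the same check depend only on elapsed time. Whether that is the right fix for each of the timeouts listed above would need to be verified against the actual implementation.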