fanrui created FLINK-27396:
------------------------------

             Summary: Reduce the Heartbeat timeout after zookeeper suspended
                 Key: FLINK-27396
                 URL: https://issues.apache.org/jira/browse/FLINK-27396
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.14.0, 1.15.0
            Reporter: fanrui
             Fix For: 1.16.0


After FLINK-10052, Flink tolerates zk connection suspension if 
`high-availability.zookeeper.client.tolerate-suspended-connections` is enabled. 
This feature is very useful: it avoids unnecessary Flink job failovers when 
some zk server nodes crash or when zk undergoes a rolling restart.

Two cases put the zk connection into the SUSPENDED state:
 * The zk server to which the TM/JM is connected is stopped.
 * The TM is cut off by a network partition.

In the first case, we want Flink to tolerate the suspension. In the second 
case, we want the TM to fail fast, because the JM may already have started a 
new TM, and if the partitioned TM does not fail, it may process duplicate data 
(network partitioning is complicated). Currently, however, the TM in the second 
case keeps running until the zk connection is lost 
(`high-availability.zookeeper.client.session-timeout`, default 60s) or the 
heartbeat with the JM times out (`heartbeat.timeout`, default 50s).

Can we reduce `heartbeat.timeout` to a shorter value, e.g. 20s, while zk is 
suspended? If zk is suspended and the heartbeat times out, the zk-lost logic 
would then be executed.
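
To make the proposal concrete, here is a minimal sketch of the decision logic. It is not Flink code: the class, the `ZkState` enum, and both method names are hypothetical and only illustrate the idea. The 50s and 20s values come from the description above.

```java
import java.time.Duration;

public class SuspendedHeartbeatSketch {

    /** Hypothetical stand-in for the zk connection states discussed in this issue. */
    public enum ZkState { CONNECTED, SUSPENDED, LOST }

    /** Default of heartbeat.timeout, as stated in the issue. */
    public static final Duration DEFAULT_HEARTBEAT_TIMEOUT = Duration.ofSeconds(50);

    /** Proposed shorter timeout while zk is suspended (assumed value from the issue). */
    public static final Duration SUSPENDED_HEARTBEAT_TIMEOUT = Duration.ofSeconds(20);

    /** Picks the heartbeat timeout that applies in the given zk connection state. */
    public static Duration effectiveHeartbeatTimeout(ZkState state) {
        return state == ZkState.SUSPENDED
                ? SUSPENDED_HEARTBEAT_TIMEOUT
                : DEFAULT_HEARTBEAT_TIMEOUT;
    }

    /**
     * Returns true if the zk-lost logic should run: either zk already reported
     * LOST, or zk is SUSPENDED and the JM heartbeat has been silent for at
     * least the reduced timeout.
     */
    public static boolean shouldRunZkLostLogic(ZkState state, Duration sinceLastHeartbeat) {
        if (state == ZkState.LOST) {
            return true;
        }
        return state == ZkState.SUSPENDED
                && sinceLastHeartbeat.compareTo(SUSPENDED_HEARTBEAT_TIMEOUT) >= 0;
    }

    public static void main(String[] args) {
        // zk server restart: zk is suspended, but JM heartbeats still arrive.
        System.out.println(shouldRunZkLostLogic(ZkState.SUSPENDED, Duration.ofSeconds(5)));
        // Network partition: zk is suspended and the JM has been silent for 25s.
        System.out.println(shouldRunZkLostLogic(ZkState.SUSPENDED, Duration.ofSeconds(25)));
    }
}
```

With this rule, a TM that only lost its zk server keeps running as long as JM heartbeats arrive, while a partitioned TM would fail after about 20s instead of waiting for the 50s heartbeat timeout or the 60s zk session timeout.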



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
