Question about the behavior of TM when it lost the zookeeper client session in HA mode

Tony Wei Sun, 13 May 2018 20:37:16 -0700

Hi all,

Recently, my Flink job hit a problem that caused it to fail and
restart.


The log is shown in this screenshot



or this

```
2018-05-11 13:21:04,582 WARN
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Client
session timed out, have not heard from server in 61054ms for sessionid
0x3054b165fe2006a
2018-05-11 13:21:04,583 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Client
session timed out, have not heard from server in 61054ms for sessionid
0x3054b165fe2006a, closing socket connection and attempting reconnect
2018-05-11 13:21:04,683 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: SUSPENDED
2018-05-11 13:21:04,686 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
2018-05-11 13:21:04,689 INFO
org.apache.kafka.clients.producer.KafkaProducer               - Closing the
Kafka producer with timeoutMillis = 9223372036854775807 ms.
2018-05-11 13:21:04,694 INFO
org.apache.kafka.clients.producer.KafkaProducer               - Closing the
Kafka producer with timeoutMillis = 9223372036854775807 ms.
2018-05-11 13:21:04,698 INFO  org.apache.flink.runtime.taskmanager.Task
                 - match-rule -> (get-ordinary -> Sink: kafka-sink, get-cd
-> Sink: kafka-sink-cd) (4/32) (65a4044ac963e083f2635fe24e7f2403) switched
from RUNNING to FAILED.
java.lang.Exception: Failed to send data to Kafka: The server disconnected
before a response was received.
```

The logs showed *`org.apache.kafka.clients.producer.KafkaProducer - Closing the
Kafka producer with timeoutMillis = 9223372036854775807 ms.`* This timeout
value is *Long.MAX_VALUE*. This message is logged when something calls
*`producer.close()`*.
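
A quick arithmetic check that the logged value really is the largest signed 64-bit integer (Java's `Long.MAX_VALUE`), and roughly how long that timeout is. The variable names are only for illustration:

```python
# Largest signed 64-bit integer, i.e. Java's Long.MAX_VALUE.
LONG_MAX = 2**63 - 1

# Value taken from the "Closing the Kafka producer" log line.
logged_timeout_ms = 9223372036854775807

# Approximate length of that timeout in years (365-day years).
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365
timeout_years = logged_timeout_ms // MS_PER_YEAR

print(logged_timeout_ms == LONG_MAX)
print(timeout_years)
```

In other words, the close call was effectively asked to wait without a time limit.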

I also saw these log messages:
*`org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn
- Client session timed out, have not heard from server in 61054ms for
sessionid 0x3054b165fe2006a, closing socket connection and attempting
reconnect`*
and *`org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
- Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.`*

I have checked ZooKeeper and Kafka, and there were no errors during that
period.
I was wondering whether the TM stops its tasks when it loses its ZooKeeper
client session in HA mode. Since I couldn't find any documentation or mailing
list thread discussing this, I'm not sure whether this is what caused the
Kafka producer to be closed.
Could someone who knows HA well explain this? Or does anyone know what
happened in my job?

My Flink cluster is version 1.4.0 with 2 masters and 10 slaves. My
ZooKeeper cluster is version 3.4.11 with 3 nodes.
The *`high-availability.zookeeper.client.session-timeout`* is at its default
value: 60000 ms.
The *`maxSessionTimeout`* in zoo.cfg was 40000 ms.
I have already changed the *maxSessionTimeout* to 120000 ms this morning.
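
As a rough illustration of how these two settings interact (a sketch, not ZooKeeper's actual code): the ZooKeeper server negotiates the session timeout by clamping the value requested by the client into its [minSessionTimeout, maxSessionTimeout] range. The function name below is hypothetical, and the 4000 ms minimum assumes the default of 2 × tickTime with tickTime=2000, which is not stated above.

```python
def negotiated_session_timeout(requested_ms, min_ms, max_ms):
    """Clamp the client's requested timeout into the server's allowed range."""
    return max(min_ms, min(requested_ms, max_ms))


# Client side: high-availability.zookeeper.client.session-timeout (default).
requested_ms = 60000

# Server side: assumed minSessionTimeout (2 * tickTime, tickTime=2000).
min_ms = 4000

# maxSessionTimeout in zoo.cfg before and after this morning's change.
before = negotiated_session_timeout(requested_ms, min_ms, 40000)
after = negotiated_session_timeout(requested_ms, min_ms, 120000)

print(before, after)
```

Under these assumptions, the session would have been granted 40000 ms before the change and the full requested 60000 ms after it.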

This problem happened many times over the last weekend and caused my
Kafka log delay to keep growing. Please help me. Thank you very much!

Best Regards,
Tony Wei
