Hi everyone,

I observed the following behavior with Flink 1.0.2 on Hadoop 2.4.1, running a YARN session in HA mode:

2016-05-10 18:39:14,546 INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 52444ms for sessionid 0x2544821cf2f818a, closing socket connection and attempting reconnect
2016-05-10 18:39:14,546 INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 54871ms for sessionid 0x154481fce7881c8, closing socket connection and attempting reconnect
2016-05-10 18:39:14,730 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2016-05-10 18:39:14,872 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2016-05-10 18:39:14,907 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
2016-05-10 18:39:14,943 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#1292460688 was revoked leadership.

I am confused by the timeouts of roughly 50,000 ms, because our flink-conf.yaml states:

> reocvery.zookeeper.client.connection-timeout: 30000
> recovery.zookeeper.client.session-timeout: 120000
> recovery.zookeeper.client.retry-wait: 5000
> recovery.zookeeper.client.max-retry-attempts: 5

Given these settings, I would have expected the ZooKeeper client to time out only after around 120,000 ms. The only setting we have at 50,000 ms is akka.watch.heartbeat.interval. Is that value being used here instead of the ZooKeeper session timeout?

Cheers,

Konstantin

--
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082