[ https://issues.apache.org/jira/browse/FLINK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ricky Burnett updated FLINK-22516: ---------------------------------- Attachment: jm2.log > ResourceManager cannot establish leadership > ------------------------------------------- > > Key: FLINK-22516 > URL: https://issues.apache.org/jira/browse/FLINK-22516 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.9.3 > Environment: 1.9.3 Flink version, on kubernetes. > Reporter: Ricky Burnett > Priority: Major > Attachments: jm1.log, jm2.log, jobmanager_leadership.log > > > We are running Flink clusters with 2 Jobmanagers in HA mode. After a > Zookeeper restart the two JMs begin leadership election end up in state where > they are both trying to start their ResourceManager and until one of them > writes to `leader/<jobid>/resource_manager_lock` and the other Jobmanager's > JobMaster proceeds to execute `notifyOfNewResourceManagerLeader` which > restarts the ResourceManager. This in turn writes to > `leader/<jobid>/resource_manager_lock` which triggers the first JobMaster to > restart it's ResourceManager. We can see this in the logs from the > "ResourceManager leader changed to new address" log, that goes back and forth > between the two JMs and the two IP addresses. This cycle appears to continue > indefinitely with outside interruption. > I've attached combined logs from two JMs in our environment that got into > this state. The logs start with the loss of connection and end with a couple > of cycles of back and forth. The two relevant hosts are > "flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7" and > "flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-mpf9x". > *-tsxb7 appears to be the last host that was granted leadership. > {code:java} > {"thread":"Curator-Framework-0-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobManagerRunner","message":"JobManager > runner for job tenant: ssademo, pipeline: > 828d4aa2-d4d4-457b-995d-feb56d08c1fb, name: integration-test-detection > (33e12948df69077ab3b33316eacbb5e4) was granted leadership with session id > 97992805-9c60-40ba-8260-aaf036694cde at > akka.tcp://flink@100.97.92.73:6123/user/jobmanager_3.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","instant":{"epochSecond":1617129712,"nanoOfSecond":447000000},"contextMap":{},"threadId":152,"threadPriority":5,"source":{"class":"org.apache.flink.runtime.jobmaster.JobManagerRunner","method":"startJobMaster","file":"JobManagerRunner.java","line":313},"service":"streams","time":"2021-03-30T18:41:52.447UTC","hostname":"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7"} > {code} > But *-mpf9x continues to try to wrestle control back. -- This message was sent by Atlassian Jira (v8.3.4#803005)