[ https://issues.apache.org/jira/browse/FLINK-34672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl closed FLINK-34672. --------------------------------- Resolution: Duplicate I forgot that I already worked on this issue. -.- Anyway, there a duplicate FLINK-36451 where the same issue was raised and where I actually provided a fix for 2.0 and 1.20. I'm in the midst of merging the back port for 1.19. Closing this issue as a duplicate. > HA deadlock between JobMasterServiceLeadershipRunner and > DefaultLeaderElectionService > ------------------------------------------------------------------------------------- > > Key: FLINK-34672 > URL: https://issues.apache.org/jira/browse/FLINK-34672 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0 > Reporter: Chesnay Schepler > Assignee: Matthias Pohl > Priority: Major > Labels: pull-request-available > > We recently observed a deadlock in the JM within the HA system. > (see below for the thread dump) > [~mapohl] and I looked a bit into it and there appears to be a race condition > when leadership is revoked while a JobMaster is being started. > It appears to be caused by > {{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} > forwarding futures while holding a lock; depending on whether the forwarded > future is already complete the next stage may or may not run while holding > that same lock. > We haven't determined yet whether we should be holding that lock or not. > {code} > "DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 > daemon prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x00007f531f43d000 > nid=0x19d waiting for monitor entry [0x00007f53084fd000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462) > - waiting to lock <0x00000000f1c0e088> (a java.lang.Object) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x0000000840ddec40.accept(Unknown > Source) > at java.util.HashMap.forEach(java.base@11.0.22/HashMap.java:1337) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x0000000840dcf840.run(Unknown > Source) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549) > - locked <0x00000000f0e3f4d8> (a java.lang.Object) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x0000000840c23040.run(Unknown > Source) > at > java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.22/CompletableFuture.java:1736) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.22/Thread.java:829) > {code} > {code} > "jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms > elapsed=78699.01s tid=0x00007f5321c6e800 nid=0x396 waiting for monitor entry > [0x00007f530567d000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366) > - waiting to lock <0x00000000f0e3f4d8> (a java.lang.Object) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520) > - locked <0x00000000f1c0e088> (a java.lang.Object) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$1320/0x0000000840e1a840.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@11.0.22/CompletableFuture.java:859) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@11.0.22/CompletableFuture.java:837) > at > java.util.concurrent.CompletableFuture.postComplete(java.base@11.0.22/CompletableFuture.java:506) > at > java.util.concurrent.CompletableFuture.complete(java.base@11.0.22/CompletableFuture.java:2079) > at > org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.registerJobMasterServiceFutures(DefaultJobMasterServiceProcess.java:124) > at > org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:114) > at > org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess$$Lambda$1319/0x0000000840e1a440.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@11.0.22/CompletableFuture.java:859) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@11.0.22/CompletableFuture.java:837) > at > java.util.concurrent.CompletableFuture.postComplete(java.base@11.0.22/CompletableFuture.java:506) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(java.base@11.0.22/CompletableFuture.java:1705) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.22/Thread.java:829) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)