[ https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556863#comment-17556863 ]
Matthias Pohl edited comment on FLINK-28078 at 6/21/22 2:36 PM:
----------------------------------------------------------------

{code}
16:17:07,802 [ForkJoinPool-45-worker-25] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Starting
16:17:07,804 [ForkJoinPool-45-worker-25] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Default schema
16:17:07,814 [ForkJoinPool-45-worker-25-EventThread] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
16:17:07,817 [ForkJoinPool-45-worker-25-EventThread] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,826 [ForkJoinPool-45-worker-25-EventThread] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,848 [ForkJoinPool-45-worker-25-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership.
16:17:07,860 [ForkJoinPool-45-worker-25] INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver.
{code}

The test itself usually creates three {{ElectionDriver}} instances and removes them one by one through a for loop.
The logs of the failed test reveal that only two of the three instances have the quorum connection established (i.e. the log message {{Connected to ZooKeeper quorum. Leader election can start.}} is printed only twice). The first iteration picks the first instance, checks its leadership and closes it. The {{anyOf}} call in the next iteration should still succeed because there is still one {{ElectionDriver}} with an established connection. But the resulting {{anyOf}} composite future doesn't complete, i.e. none of the remaining leadership futures completes, which leaves the test stuck in the subsequent {{join}} call.

was (Author: mapohl):
{code}
16:17:07,802 [ForkJoinPool-45-worker-25] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Starting
16:17:07,804 [ForkJoinPool-45-worker-25] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Default schema
16:17:07,814 [ForkJoinPool-45-worker-25-EventThread] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
16:17:07,817 [ForkJoinPool-45-worker-25-EventThread] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start.
16:17:07,826 [ForkJoinPool-45-worker-25-EventThread] INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker [] - New config event received: {}
16:17:07,848 [ForkJoinPool-45-worker-25-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership.
16:17:07,860 [ForkJoinPool-45-worker-25] INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver.
{code}

The test itself usually creates three {{ElectionDriver}} instances and removes them one by one through a for loop. The logs of the failed test reveal that only two out of the three have the quorum connection established (i.e. the log message {{Connected to ZooKeeper quorum. Leader election can start.}} is printed). The first iteration picks the first instance, checks its leadership and closes it. It looks like the second iteration picks the instance for which the quorum connection is still not established. The leadership future could therefore never be completed, which results in the test getting stuck in the {{join}} call.

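For illustration, the hang mechanism described in the edited comment can be reproduced without ZooKeeper. The following is a minimal sketch in plain Java, not the Flink test code: the class {{AnyOfHangSketch}} and the method {{anyCompletesWithin}} are hypothetical names, and the never-completed futures merely stand in for the leadership futures of the remaining drivers. It assumes the test waits via {{CompletableFuture.anyOf(...).join()}}, as the stack trace and the comment suggest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-alone sketch; none of these names exist in Flink.
public class AnyOfHangSketch {

    /**
     * Waits until any of the given futures completes, but gives up after
     * the timeout instead of parking forever like a bare join() would.
     *
     * @return true if the anyOf composite completed within the timeout
     */
    public static boolean anyCompletesWithin(
            List<CompletableFuture<Void>> futures, long timeoutMillis) {
        CompletableFuture<Object> any =
                CompletableFuture.anyOf(futures.toArray(new CompletableFuture<?>[0]));
        try {
            any.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // No future completed: this is the situation in which join() blocks.
            return false;
        } catch (ExecutionException e) {
            // The composite completed, albeit exceptionally.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the leadership futures of the two remaining drivers.
        List<CompletableFuture<Void>> leadershipFutures = new ArrayList<>();
        leadershipFutures.add(new CompletableFuture<>());
        leadershipFutures.add(new CompletableFuture<>());

        // Nothing completes, so the composite never completes either.
        System.out.println(anyCompletesWithin(leadershipFutures, 200)); // false

        // Once a single future completes, the composite completes as well.
        leadershipFutures.get(1).complete(null);
        System.out.println(anyCompletesWithin(leadershipFutures, 200)); // true
    }
}
```

Bounding the wait with {{get(timeout, unit)}} is only one way to turn a silent hang into a fast failure. Whether that is appropriate for this test, or whether the underlying cause should be fixed instead, is a separate question.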
> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28078
> URL: https://issues.apache.org/jira/browse/FLINK-28078
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Major
> Labels: test-stability
>
> [Build #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455] got stuck in {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10 java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 - parking to wait for <0x00000000c2571b80> (a java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 at org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.7#820007)