[ https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558345#comment-17558345 ]
Matthias Pohl commented on FLINK-28078: --------------------------------------- I did a comparison of a successful run and the failed run. I'm getting the feeling that we're running into some race condition on the {{LeaderLatch}} implementation side. Here are the logs of a successful run starting from the log message where the latch zNode is checked for existence after being created: {code:java} 1376 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x1b zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1376 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x1b zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1377 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:createContainer cxid:0x1c zxid:0xa txntype:19 reqpath:n/a 1378 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:createContainer cxid:0x1c zxid:0xa txntype:19 reqpath:n/a 1380 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x1d zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1380 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x1d zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1380 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x1e zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1380 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x1e zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1380 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x1f zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1380 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x1f zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1381 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren cxid:0x20 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1381 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren cxid:0x20 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1385 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:create2 cxid:0x21 zxid:0xb txntype:15 reqpath:n/a 1385 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:create2 cxid:0x21 zxid:0xb txntype:15 reqpath:n/a 1385 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x22 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink 1385 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x22 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink 1386 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x23 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1386 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x23 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1387 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1387 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1387 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x25 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1387 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x25 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1388 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren cxid:0x26 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1388 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren cxid:0x26 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1390 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:create2 cxid:0x27 zxid:0xc txntype:15 reqpath:n/a 1391 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:create2 cxid:0x27 zxid:0xc txntype:15 reqpath:n/a 1391 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x28 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink 1391 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x28 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink 1391 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x29 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1391 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x29 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1392 [ForkJoinPool-1-worker-9-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership. 1392 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x2a zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1392 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x2a zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1393 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:exists cxid:0x2b zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1393 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:exists cxid:0x2b zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1393 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren cxid:0x2c zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1393 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren cxid:0x2c zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1394 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:create2 cxid:0x2d zxid:0xd txntype:15 reqpath:n/a 1395 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:create2 cxid:0x2d zxid:0xd txntype:15 reqpath:n/a 1421 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getData cxid:0x2e zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_2b57761c-31e0-458d-9466-b409010d3d14-latch-0000000000 1421 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getData cxid:0x2e zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_2b57761c-31e0-458d-9466-b409010d3d14-latch-0000000000 1421 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x2f zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1421 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x2f zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1422 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getData cxid:0x30 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_1e836e28-a9d1-4bea-97f3-222034044247-latch-0000000001 1422 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getData cxid:0x30 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_1e836e28-a9d1-4bea-97f3-222034044247-latch-0000000001 1497 [ForkJoinPool-1-worker-9] INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver. 1501 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:checkWatches cxid:0x31 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1501 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:checkWatches cxid:0x31 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1505 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:delete cxid:0x32 zxid:0xe txntype:2 reqpath:n/a 1506 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:delete cxid:0x32 zxid:0xe txntype:2 reqpath:n/a 1506 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x33 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1506 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x33 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1507 [ForkJoinPool-1-worker-9-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership. 1507 [ForkJoinPool-1-worker-9] INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver. 1508 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:checkWatches cxid:0x34 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1508 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:checkWatches cxid:0x34 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1508 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:delete cxid:0x35 zxid:0xf txntype:2 reqpath:n/a 1508 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:delete cxid:0x35 zxid:0xf txntype:2 reqpath:n/a 1509 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x36 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1509 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:getChildren2 cxid:0x36 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 1509 [ForkJoinPool-1-worker-9-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership. 1510 [ForkJoinPool-1-worker-9] INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver. 1510 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:checkWatches cxid:0x37 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1510 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:checkWatches cxid:0x37 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 1510 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:delete cxid:0x38 zxid:0x10 txntype:2 reqpath:n/a 1511 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:delete cxid:0x38 zxid:0x10 txntype:2 reqpath:n/a 1511 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:checkWatches cxid:0x39 zxid:0xfffffffffffffffe txntype:unknown reqpath:/zookeeper/config 1511 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:checkWatches cxid:0x39 zxid:0xfffffffffffffffe txntype:unknown reqpath:/zookeeper/config 1513 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100009ef1b30000 type:closeSession cxid:0x3a zxid:0x11 txntype:-11 reqpath:n/a 1513 [SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100009ef1b30000 type:closeSession cxid:0x3a zxid:0x11 txntype:-11 reqpath:n/a {code} and here are the merged (i.e. I integrated the test-sides logs into the zookeeper-server logs based on the time) logs: {code:java} 16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start. 16:17:07,824 [Curator-ConnectionStateManager-0] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connected to ZooKeeper quorum. Leader election can start. 16:17:07,829 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:exists cxid:0x10 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,829 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:exists cxid:0x10 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,831 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x11 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,832 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getData cxid:0x11 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,837 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x12 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,837 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x12 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,838 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:createContainer cxid:0x13 zxid:0x7 txntype:19 reqpath:n/a 16:17:07,838 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:createContainer cxid:0x13 zxid:0x7 txntype:19 reqpath:n/a 16:17:07,839 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x14 zxid:0x8 txntype:15 reqpath:n/a 16:17:07,839 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x14 zxid:0x8 txntype:15 reqpath:n/a 16:17:07,839 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x15 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,839 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getData cxid:0x15 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,839 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x16 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,839 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x16 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x17 zxid:0x9 txntype:15 reqpath:n/a 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x17 zxid:0x9 txntype:15 reqpath:n/a 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren cxid:0x18 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren cxid:0x18 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x19 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x19 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x1a zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x1a zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x1b zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,846 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x1b zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,848 [ForkJoinPool-45-worker-25-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership. 16:17:07,854 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x1c zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,854 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x1c zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,848 [ForkJoinPool-45-worker-25-EventThread] DEBUG org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - ZooKeeperMultipleComponentLeaderElectionDriver obtained the leadership. 16:17:07,860 [ForkJoinPool-45-worker-25] INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Closing ZooKeeperMultipleComponentLeaderElectionDriver. 16:17:07,860 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x1d zxid:0xa txntype:15 reqpath:n/a 16:17:07,860 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x1d zxid:0xa txntype:15 reqpath:n/a 16:17:07,862 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:checkWatches cxid:0x1e zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,862 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:checkWatches cxid:0x1e zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default 16:17:07,863 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x1f zxid:0xb txntype:2 reqpath:n/a 16:17:07,863 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x1f zxid:0xb txntype:2 reqpath:n/a 16:17:07,863 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x20 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_ddb7327c-1775-4084-9fbd-40e263758697-latch-0000000000 16:17:07,863 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getData cxid:0x20 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_ddb7327c-1775-4084-9fbd-40e263758697-latch-0000000000 16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,864 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a 16:17:07,866 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001 16:17:07,869 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001 16:17:07,871 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x25 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,871 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x25 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch 16:17:07,875 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x26 zxid:0xe txntype:2 reqpath:n/a 16:17:07,875 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x26 zxid:0xe txntype:2 reqpath:n/a 16:17:07,876 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x27 zxid:0xf txntype:15 reqpath:n/a 16:17:07,876 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x27 zxid:0xf txntype:15 reqpath:n/a 16:17:07,876 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x28 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_8008ceb9-e0b9-459b-8fa5-44428fa31e29-latch-0000000002 16:17:07,876 [ SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor [] - sessionid:0x100cf6d9cf60000 type:getData cxid:0x28 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_8008ceb9-e0b9-459b-8fa5-44428fa31e29-latch-0000000002 [...] {code} The LeaderLatch implementation works like that (based on the code): # {{LeaderLatch}} is started pointing to a given zNode path ({{{}/flink/default/latch/{}}} in our case). # The instance will get {{reset()}} initially setting the leadership for this latch to {{false}} and triggering a child creation # Child creation is done "withProtection" (i.e. a random String will be prefixed to the child) in mode {{EPHEMERAL_SEQUENTIAL}} (i.e. the child will be deleted on connection-loss and a monotonically increasing number is added as a suffix) with a callback that triggers {{getChildren}} in case of success and selecting the leader based on the suffix of the child nodes names (lower ID wins). Hence, an order between the children is established based on the ID. # {{getChildren}} callback triggers leadership detection in the local {{LeaderLatch}} instance. ## For the child with the lowest ID, the leadership is triggered in the callback ## If the current {{LeaderLatch}} doesn't correspond to the child with the lowest ID, a watcher is set up watching the direct predecessor (based on the ID-based ordering I mentioned before) for deletion. If the deletion happens, the watcher is triggered (only once, i.e. not repeatedly) and reinitiates the leadership detection through {{getChildren}} (4.) if the {{LeaderLatch}} isn't closed, yet. This is what we observe in the successful test run with the watcher being triggered. There is another code path if the predecessor node doesn't exist in {{getChildren}} (4.). In that case, an entire {{reset()}} (2.) is triggered resulting in the deletion of the current's {{LeaderLatch}}'s zNode and the recreation of the child zNode. It appears that we're ending up in a this code path for the unsuccessful test run. > ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers > runs into timeout > ---------------------------------------------------------------------------------------------------------- > > Key: FLINK-28078 > URL: https://issues.apache.org/jira/browse/FLINK-28078 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.16.0 > Reporter: Matthias Pohl > Assignee: Matthias Pohl > Priority: Major > Labels: test-stability > > [Build > #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455] > got stuck in > {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}} > {code} > "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 > tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000] > May 30 16:36:10 java.lang.Thread.State: WAITING (parking) > May 30 16:36:10 at sun.misc.Unsafe.park(Native Method) > May 30 16:36:10 - parking to wait for <0x00000000c2571b80> (a > java.util.concurrent.CompletableFuture$Signaller) > May 30 16:36:10 at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > May 30 16:36:10 at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) > May 30 16:36:10 at > java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313) > May 30 16:36:10 at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) > May 30 16:36:10 at > java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947) > May 30 16:36:10 at > org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256) > May 30 16:36:10 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > May 30 16:36:10 at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > May 30 16:36:10 at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > May 30 16:36:10 at java.lang.reflect.Method.invoke(Method.java:498) > [...] > {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)