[ https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598256#comment-17598256 ]
Matthias Pohl commented on FLINK-28078:
---------------------------------------

Ok, I tried to reproduce the issue in [build 20220830.14|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=40525&view=results] (I let each of the test jobs run the unit test repeatedly, except for the `core` stage, due to a bug in the if statement :facepalm:). I attached the results of the run to this issue for reproducibility.

In total, there were 6178 test executions across all participating modules (there were failed test runs because the jobs were cancelled after ~4 hours).
{code}
for f in $(ls *zip); do
  job_name="${f%".zip"}"
  unzip -p $f ${job_name}/mvn-1.log | grep -c "Test org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers successfully run."
done | paste -sd+ | bc
{code}

I couldn't find any evidence that any of the test runs ran into the issue when checking the size of {{zookeeper-server-1.log}} for each of the jobs (we should have observed a peak in file size due to the infinite loop on the ZK side). All logs have similar sizes:
{code}
for f in $(ls *zip); do
  job_name="${f%".zip"}"
  echo $job_name
  unzip -p $f ${job_name}/zookeeper-server-1.log | wc --bytes | numfmt --to iec --format "%8.4f"
done

logs-ci-test_ci_connect_1-1661858590
41,0784M
logs-ci-test_ci_connect_2-1661858534
43,2716M
logs-ci-test_ci_core-1661858500
caution: filename not matched:  logs-ci-test_ci_core-1661858500/zookeeper-server-1.log
  0,0000
logs-ci-test_ci_finegrained_resource_management-1661858511
42,9698M
logs-ci-test_ci_misc-1661858514
39,8520M
logs-ci-test_ci_python-1661858538
41,2648M
logs-ci-test_ci_table-1661858510
43,8197M
logs-ci-test_ci_tests-1661858630
40,8045M
{code}

I will proceed with implementing the (dirty, temporary) workaround in the PR. Let's see whether that reduces the likelihood of this test failing again.
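The manual size comparison above could also be automated. The following is only a rough sketch, not something that was run as part of this investigation: {{flag_oversized_logs}} is a hypothetical helper name, and the factor of 3 over the median is an arbitrary assumption about what would count as a "peak" caused by the ZK-side loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Flink CI tooling).
# Reads "<job_name> <size_in_bytes>" lines on stdin and prints every job
# whose log is larger than FACTOR times the median size of all logs.
flag_oversized_logs() {
  local factor="${1:-3}"   # assumed threshold; tune as needed
  sort -k2,2n | awk -v factor="$factor" '
    { name[NR] = $1; size[NR] = $2 }
    END {
      if (NR == 0) exit
      # Input is sorted by size, so the median sits in the middle.
      median = (NR % 2) ? size[(NR + 1) / 2] : (size[NR / 2] + size[NR / 2 + 1]) / 2
      for (i = 1; i <= NR; i++)
        if (size[i] > factor * median) print name[i], size[i]
    }'
}
```

To use it, the loop above would emit one "job_name byte_count" pair per line (for example via {{printf '%s %s\n' "$job_name" "$(unzip -p "$f" "${job_name}/zookeeper-server-1.log" | wc -c)"}}) and pipe the result into {{flag_oversized_logs 3}}. An empty result would match the observation that all logs have similar sizes.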

> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28078
>                 URL: https://issues.apache.org/jira/browse/FLINK-28078
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available, stale-assigned, test-stability
>
> [Build #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455] got stuck in {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10    java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10 	at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10 	- parking to wait for  <0x00000000c2571b80> (a java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10 	at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10 	at org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 30 16:36:10 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10 	at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)