[ 
https://issues.apache.org/jira/browse/FLINK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598256#comment-17598256
 ] 

Matthias Pohl commented on FLINK-28078:
---------------------------------------

Ok, I tried to reproduce the issue in
[build 20220830.14|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=40525&view=results].
I let each of the test jobs run the unit test repeatedly, except for the
`core` stage due to a bug in the if statement :facepalm:. I attached the
results of the run to this issue for reproducibility.

In total, there were 6178 test executions across all participating modules
(some test runs failed because the jobs were cancelled after ~4 hours).
{code}
# Count the successful runs of the test in each job's Maven log, then sum the counts.
for f in *.zip; do
  job_name="${f%.zip}"
  unzip -p "$f" "${job_name}/mvn-1.log" \
    | grep -c "Test org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers successfully run."
done | paste -sd+ - | bc
{code}

I couldn't find any evidence that any of the test runs hit the issue. I
checked the size of {{zookeeper-server-1.log}} for each of the jobs: we
should have observed a spike in file size due to the infinite loop on the ZK
side. All logs have similar sizes:
{code}
# Print the size of each job's ZooKeeper server log in IEC units.
for f in *.zip; do
  job_name="${f%.zip}"
  echo "$job_name"
  unzip -p "$f" "${job_name}/zookeeper-server-1.log" | wc --bytes | numfmt --to iec --format "%8.4f"
done
logs-ci-test_ci_connect_1-1661858590
41,0784M
logs-ci-test_ci_connect_2-1661858534
43,2716M
logs-ci-test_ci_core-1661858500
caution: filename not matched:  logs-ci-test_ci_core-1661858500/zookeeper-server-1.log
  0,0000
logs-ci-test_ci_finegrained_resource_management-1661858511
42,9698M
logs-ci-test_ci_misc-1661858514
39,8520M
logs-ci-test_ci_python-1661858538
41,2648M
logs-ci-test_ci_table-1661858510
43,8197M
logs-ci-test_ci_tests-1661858630
40,8045M
{code}
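
The size comparison above was done by eye. As a rough sketch of how it could be automated (this was not part of the run; the helper name {{flag_outliers}} and the factor of 3 are assumptions chosen for illustration), the following reads "name bytes" pairs and prints every log that is larger than a given multiple of the median size:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original analysis):
# reads "name size_in_bytes" pairs on stdin and prints each entry whose
# size exceeds FACTOR times the median size. FACTOR defaults to 3.
flag_outliers() {
  local factor="${1:-3}"
  # Sort numerically by size so the median can be read by index.
  sort -k2,2n | awk -v factor="$factor" '
    { name[NR] = $1; size[NR] = $2 }
    END {
      if (NR == 0) exit
      # Median of the sorted sizes (lower middle element for even counts).
      median = size[int((NR + 1) / 2)]
      for (i = 1; i <= NR; i++)
        if (size[i] > factor * median)
          print name[i], size[i]
    }'
}

# Possible usage against the attached archives (same layout as above):
# for f in *.zip; do
#   job_name="${f%.zip}"
#   bytes="$(unzip -p "$f" "${job_name}/zookeeper-server-1.log" 2>/dev/null | wc -c)"
#   printf '%s %s\n' "$job_name" "$bytes"
# done | flag_outliers 3
```

With sizes as close together as the ones listed above, this should print nothing; a log inflated by the infinite loop would be expected to show up as a single flagged line. The median is used rather than the mean so that one huge log (or the missing `core` log, which counts as 0 bytes) does not shift the baseline much.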

I will proceed with implementing the (dirty, temporary) workaround in the PR. 
Let's see whether that reduces the likelihood of this test failing again.

> ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers
>  runs into timeout
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28078
>                 URL: https://issues.apache.org/jira/browse/FLINK-28078
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available, stale-assigned, test-stability
>
> [Build 
> #36189|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=36189&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=10455]
>  got stuck in 
> {{ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers}}
> {code}
> "ForkJoinPool-45-worker-25" #525 daemon prio=5 os_prio=0 
> tid=0x00007fc74d9e3800 nid=0x62c8 waiting on condition [0x00007fc6ff2f2000]
> May 30 16:36:10    java.lang.Thread.State: WAITING (parking)
> May 30 16:36:10       at sun.misc.Unsafe.park(Native Method)
> May 30 16:36:10       - parking to wait for  <0x00000000c2571b80> (a 
> java.util.concurrent.CompletableFuture$Signaller)
> May 30 16:36:10       at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 30 16:36:10       at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 30 16:36:10       at 
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> May 30 16:36:10       at 
> org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers(ZooKeeperMultipleComponentLeaderElectionDriverTest.java:256)
> May 30 16:36:10       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> May 30 16:36:10       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 30 16:36:10       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 30 16:36:10       at java.lang.reflect.Method.invoke(Method.java:498)
> [...]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
