Hi Kirk,

I have been using this new tool to analyze the trends of test failures:
https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
and general build failures:
https://ge.apache.org/scans/failures?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin

Regarding the classes of build failure: looking at the last 28 days, I do not observe an increasing trend. The top causes of failure are (link [2]):

1. Failures due to checkstyle (193 builds)
2. Timeout waiting to lock cache. It is currently in-use by another Gradle instance.
3. Compilation failures (116 builds)
4. "Gradle Test Executor" finished with a non-zero exit value. Process 'Gradle Test Executor 180' finished with non-zero exit value 1

#4 is caused by a test failure that crashes the Gradle process. To debug this, I usually go to the complete test output and try to figure out which test 'Gradle Test Executor 180' was running last.

As an example, consider https://ge.apache.org/s/luizhogirob4e. We observe that this fails for PR-14094. Now we need to see the complete system output. To find that:

- Go to the Kafka PR builder at https://ci-builds.apache.org/job/Kafka/job/kafka-pr/view/change-requests/ and find the build page for PR-14094. That page is https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/.
- Find the last failed build at https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/lastFailedBuild/ and observe that we have a failure for "Gradle Test Executor 177".
- Click on "view as plain text" (it takes a long time to load) and find what the Gradle Test Executor was doing.

In this case, it failed with the following error. I strongly believe that it is due to https://github.com/apache/kafka/pull/13572 but unfortunately, this was reverted and never fixed after that. Perhaps you might want to re

Gradle Test Run :core:integrationTest > Gradle Test Executor 177 > ProducerFailureHandlingTest > testTooLargeRecordWithAckZero() STARTED

> Task :clients:integrationTest FAILED
org.gradle.internal.remote.internal.ConnectException: Could not connect to server [bd7b0504-7491-43f8-a716-513adb302c92 port:43321, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
    at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
    at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
    at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
    at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
    at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
    at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1141)
    at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1183)
    at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:98)
    at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
    at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
    ... 5 more

Regarding the classes of test failure: looking at the last 28 days, the following tests are the biggest culprits. If we fix just these two, our CI would be in much better shape (link [1]).

1. https://issues.apache.org/jira/browse/KAFKA-15197 (this test passes only 53% of the time)
2. https://issues.apache.org/jira/browse/KAFKA-15052 (this test passes only 49% of the time)

[1] https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
[2] https://ge.apache.org/scans/failures?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin

--
Divij Vaidya

On Tue, Jul 25, 2023 at 8:09 PM Kirk True <k...@kirktrue.pro> wrote:
>
> Hi all!
>
> I’ve noticed that we’re back in the state where it’s tough to get a clean PR
> Jenkins test run. Spot checking the top ~10 pull request runs show this
> doesn’t appear to be an issue with just my PRs :P
>
> I know we have some chronic flaky tests, but I’ve seen at least two other
> classes of problems:
>
> 1. Jenkins test runners hanging and eventually timing out
> 2. Intra Jenkins-container/pod/VM/machine/turtle communication issues
>
> How do we go about diagnosing test runs that fail in such an opaque fashion?
>
> Thanks!
> Kirk
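
The manual search described at the top of this mail (finding the last test a crashed "Gradle Test Executor" was running) could probably be scripted. What follows is only a rough Python sketch, not something that exists in the Kafka build. It assumes that every test event in the plain-text console output ends in STARTED, PASSED, FAILED or SKIPPED, in the same shape as the "STARTED" line in the excerpt above, and that each executor runs one test at a time. The function name unfinished_tests is hypothetical.

```python
import re
from typing import Dict, Iterable

# Assumed line shape, modelled on the excerpt in this mail:
#   Gradle Test Run :core:integrationTest > Gradle Test Executor 177 > SomeTest > someMethod() STARTED
# The terminal statuses PASSED/FAILED/SKIPPED are an assumption about the log format.
EVENT = re.compile(
    r"Gradle Test Executor (?P<executor>\d+) > (?P<test>.+) "
    r"(?P<status>STARTED|PASSED|FAILED|SKIPPED)\s*$"
)


def unfinished_tests(lines: Iterable[str]) -> Dict[str, str]:
    """Map each executor number to the test it STARTED but never finished."""
    running: Dict[str, str] = {}
    for line in lines:
        match = EVENT.search(line)
        if match is None:
            continue  # not a test event (task output, stack trace, ...)
        executor, test, status = match.group("executor", "test", "status")
        if status == "STARTED":
            # Remember the most recently started test for this executor.
            running[executor] = test
        elif running.get(executor) == test:
            # The same test reported a terminal status, so it did not hang or crash.
            del running[executor]
    return running
```

After saving the "view as plain text" console output to a local file, passing the open file object to unfinished_tests() returns the executors that were still in the middle of a test when the log ended; the entry for the executor named in the failure message is the test to look at first.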