I created this INFRA Jira just now to see if they can help resolve some of the 
intermittent Jenkins build issues:

https://issues.apache.org/jira/browse/INFRA-24874


> On Aug 1, 2023, at 4:04 PM, Kirk True <k...@kirktrue.pro> wrote:
> 
> Hi Divij,
> 
> Thanks for the pointer to Gradle Enterprise! That’s exactly what I was 
> looking for.
> 
> Did we track builds before July 12? I see only tiny blips of failures on the 
> 90-day view.
> 
> Thanks,
> Kirk
> 
>> On Jul 26, 2023, at 2:08 AM, Divij Vaidya <divijvaidy...@gmail.com> wrote:
>> 
>> Hi Kirk,
>> 
>> I have been using this new tool to analyze the trends of test
>> failures: 
>> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
>> and general build failures:
>> https://ge.apache.org/scans/failures?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
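>> 
>> If you want the same views over a different window, only the
>> relativeStartTime query parameter (an ISO-8601 duration) changes. A
>> minimal sketch for building these URLs, using just the parameters
>> visible above (whether the views accept other durations such as P7D
>> or P90D is an assumption on my part):
>> 
>> from urllib.parse import urlencode
>> 
>> def ge_scan_url(view, window="P28D"):
>>     """Build a ge.apache.org scan URL for the Kafka project.
>> 
>>     view is "tests" or "failures"; window is an ISO-8601 duration.
>>     """
>>     params = {
>>         "search.relativeStartTime": window,
>>         "search.rootProjectNames": "kafka",
>>         "search.timeZoneId": "Europe/Berlin",
>>     }
>>     return "https://ge.apache.org/scans/" + view + "?" + urlencode(params)
>> 
>> print(ge_scan_url("tests", "P90D"))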
>> 
>> Regarding the classes of build failure: looking at the last 28 days,
>> I do not observe an increasing trend. The top causes of failure are
>> (link [2]; a rough bucketing sketch follows the list):
>> 1. Checkstyle failures (193 builds)
>> 2. Timeout waiting to lock cache ("It is currently in use by another
>> Gradle instance.")
>> 3. Compilation failures (116 builds)
>> 4. "Gradle Test Executor" finished with a non-zero exit value, e.g.
>> Process 'Gradle Test Executor 180' finished with non-zero exit value 1
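>> 
>> As a rough illustration of how these classes show up in a console
>> log, here is a minimal bucketing sketch. The signature substrings
>> are taken from the messages above, so treat them as approximations
>> rather than exhaustive matches:
>> 
>> # Signature substrings per failure class (approximations taken from
>> # the failure messages listed above).
>> SIGNATURES = [
>>     ("checkstyle", "checkstyle"),
>>     ("gradle-cache-lock", "timeout waiting to lock"),
>>     ("compilation", "compilation failed"),
>>     ("executor-crash", "finished with non-zero exit value"),
>> ]
>> 
>> def classify(console_log):
>>     """Return the failure classes whose signature appears in the log."""
>>     text = console_log.lower()
>>     return [name for name, sig in SIGNATURES if sig in text]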
>> 
>> #4 is caused by a test failure that crashes the Gradle process. To
>> debug this, I usually go to the complete test output and try to
>> figure out which test 'Gradle Test Executor 180' was running last.
>> As an example, consider https://ge.apache.org/s/luizhogirob4e. We
>> observe that this fails for PR-14094. Now we need to see the
>> complete system output. To find that, I go to the Kafka PR builder at
>> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/view/change-requests/
>> and find the build page for PR-14094. That page is
>> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/.
>> Next, find the last failed build at
>> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/lastFailedBuild/,
>> observe that we have a failure for "Gradle Test Executor 177", click
>> on "view as plain text" (it takes a long time to load), and find what
>> the Gradle Test Executor was doing. In this case, it failed with the
>> error below. I strongly believe it is due to
>> https://github.com/apache/kafka/pull/13572, but unfortunately that
>> change was reverted and never fixed afterwards. Perhaps you might
>> want to revisit it.
>> 
>> Gradle Test Run :core:integrationTest > Gradle Test Executor 177 > ProducerFailureHandlingTest > testTooLargeRecordWithAckZero() STARTED
>> 
>>> Task :clients:integrationTest FAILED
>> org.gradle.internal.remote.internal.ConnectException: Could not
>> connect to server [bd7b0504-7491-43f8-a716-513adb302c92 port:43321,
>> addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>> at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>> at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>> at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>> at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>> at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>> at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
>> Caused by: java.net.ConnectException: Connection refused
>> at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>> at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
>> at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1141)
>> at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1183)
>> at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:98)
>> at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>> at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>> ... 5 more
>> 
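>> For repeated triage, those steps can be scripted: Jenkins serves the
>> console as plain text if you append consoleText to a build URL, and
>> the last STARTED line for the crashed executor is usually the test
>> that took the process down. A minimal sketch; the build URL is the
>> example above and the line format is copied from this log, so both
>> may need adjusting:
>> 
>> import re
>> import urllib.request
>> 
>> BUILD = ("https://ci-builds.apache.org/job/Kafka/job/kafka-pr/"
>>          "job/PR-14094/lastFailedBuild/")
>> 
>> def last_started_test(build_url, executor):
>>     # consoleText is Jenkins' plain-text console endpoint (the same
>>     # content as "view as plain text" in the UI).
>>     with urllib.request.urlopen(build_url + "consoleText") as resp:
>>         console = resp.read().decode("utf-8", errors="replace")
>>     # Matches lines like the one in the log above, e.g.
>>     # "... > Gradle Test Executor 177 > ProducerFailureHandlingTest
>>     # > testTooLargeRecordWithAckZero() STARTED".
>>     pattern = re.compile(
>>         rf"Gradle Test Executor {executor} > (\S+ > \S+\(\)) STARTED")
>>     last = None
>>     for line in console.splitlines():
>>         match = pattern.search(line)
>>         if match:
>>             last = match.group(1)
>>     return last
>> 
>> print(last_started_test(BUILD, 177))
>> 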
>> Regarding the classes of test failure: looking at the last 28 days,
>> the following tests are the biggest culprits. If we fixed just these
>> two, our CI would be in much better shape (link [1]):
>> 1. https://issues.apache.org/jira/browse/KAFKA-15197 (this test
>> passes only 53% of the time)
>> 2. https://issues.apache.org/jira/browse/KAFKA-15052 (this test
>> passes only 49% of the time)
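>> (Assuming independent failures, a run that executes both of these
>> passes them together only about 0.53 × 0.49 ≈ 26% of the time, so
>> these two alone can fail most builds.)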
>> 
>> 
>> [1] 
>> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
>> [2] 
>> https://ge.apache.org/scans/failures?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
>> 
>> 
>> --
>> Divij Vaidya
>> 
>> On Tue, Jul 25, 2023 at 8:09 PM Kirk True <k...@kirktrue.pro> wrote:
>>> 
>>> Hi all!
>>> 
>>> I’ve noticed that we’re back in the state where it’s tough to get a clean 
>>> PR Jenkins test run. Spot-checking the top ~10 pull request runs shows this 
>>> doesn’t appear to be an issue with just my PRs :P
>>> 
>>> I know we have some chronic flaky tests, but I’ve seen at least two other 
>>> classes of problems:
>>> 
>>> 1. Jenkins test runners hanging and eventually timing out
>>> 2. Intra-Jenkins container/pod/VM/machine/turtle communication issues
>>> 
>>> How do we go about diagnosing test runs that fail in such an opaque fashion?
>>> 
>>> Thanks!
>>> Kirk
> 
