I just merged both PRs.

Cheers,
David

On Fri, Dec 22, 2023 at 2:38 PM David Jacot <david.ja...@gmail.com> wrote:

> Hey folks,
>
> I believe that my two PRs will fix most of the issues. I have also tweaked
> the configuration of Jenkins to fix the issues relating to cloning the
> repo. There may be other issues but the overall situation should be much
> better when I merge those two.
>
> I will update this thread when I merge them.
>
> Cheers,
> David
>
> On Fri, Dec 22, 2023 at 2:22 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
>
>> Hey folks
>>
>> I think David (dajac) has some fixes lined up to improve CI, such as
>> https://github.com/apache/kafka/pull/15063 and
>> https://github.com/apache/kafka/pull/15062.
>>
>> I have some bandwidth for the next two days to work on fixing the CI. Let
>> me start by taking a look at the list that Sophie shared here.
>>
>> --
>> Divij Vaidya
>>
>>
>>
>> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen <show...@gmail.com> wrote:
>>
>> > Hi Sophie and Philip and all,
>> >
>> > I share the same pain as you.
>> > I've been waiting for a CI build result on a PR for days. Unfortunately,
>> > I can only get one result per day because each run takes 8 hours, and
>> > the results are failures. :(
>> >
>> > I've looked into the 8-hour build timeout issue and would like to propose
>> > setting a global test timeout of 10 minutes using the JUnit 5 default
>> > timeouts feature
>> > <https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts>.
>> > This way, we can fail long-running tests quickly without impacting
>> > other tests.
>> > PR: https://github.com/apache/kafka/pull/15065
>> > I've tested in my local environment and it works as expected.
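>> >
>> > For reference, a default timeout of this kind can be expressed as a
>> > JUnit Platform configuration parameter. This is only a minimal sketch:
>> > the file location below is an assumption, and the linked PR may wire
>> > the setting differently (for example, through the Gradle test task).
>> >
>> > ```properties
>> > # Assumed location: src/test/resources/junit-platform.properties
>> > # Default timeout applied to testable and lifecycle methods that do
>> > # not declare their own @Timeout; "10 m" means ten minutes.
>> > junit.jupiter.execution.timeout.default = 10 m
>> > ```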
>> >
>> > Any feedback is welcome.
>> >
>> > Thanks.
>> > Luke
>> >
>> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee <philip...@gmail.com> wrote:
>> >
>> > > Hey Sophie - I've got 2 in-flight PRs, each with more than 15 retries...
>> > > Namely: https://github.com/apache/kafka/pull/15023 and
>> > > https://github.com/apache/kafka/pull/15035
>> > >
>> > > Justin filed a flaky test report here, though:
>> > > https://issues.apache.org/jira/browse/KAFKA-16045
>> > >
>> > > P
>> > >
>> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
>> > >
>> > > > On a related note, has anyone else had trouble getting even a single
>> > > > run with no build failures lately? I've had multiple pure-docs PRs
>> > > > blocked for days or even weeks because of miscellaneous infra, test,
>> > > > and timeout failures. I know we just had a discussion about whether
>> > > > it's acceptable to ever merge with a failing build, and the consensus
>> > > > (which I agree with) was NO -- but seriously, this is getting
>> > > > ridiculous. The build might be the worst I've ever seen it, and it
>> > > > just makes it really difficult to maintain good will with external
>> > > > contributors.
>> > > >
>> > > > Take for example this small docs PR:
>> > > > https://github.com/apache/kafka/pull/14949
>> > > >
>> > > > It's on its 7th replay, with the first 6 runs all having (at least)
>> > > > one build that failed completely. The issues I saw on this one PR are
>> > > > a good summary of what I've been seeing elsewhere, so here's the
>> > > > briefing:
>> > > >
>> > > > 1. gradle issue:
>> > > >
>> > > > > * What went wrong:
>> > > > > Gradle could not start your build.
>> > > > > > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
>> > > > >    > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
>> > > > >       > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
>> > > > >          > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
>> > > > >             > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.
>> > > > >
>> > > >
>> > > > 2. git issue:
>> > > >
>> > > > > ERROR: Error cloning remote repo 'origin'
>> > > > > hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed
>> > > >
>> > > >
>> > > > 3. storage test calling System.exit (I think)
>> > > >
>> > > > > * What went wrong:
>> > > > > Execution failed for task ':storage:test'.
>> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
>> > > > >   This problem might be caused by incorrect test process configuration.
>> > > >
>> > > >
>> > > > 4.  3/4 builds aborted suddenly for no clear reason
>> > > >
>> > > > 5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a
>> > > > storage test:
>> > > >
>> > > > > Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
>> > > > >
>> > > > > * What went wrong:
>> > > > > Execution failed for task ':storage:test'.
>> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
>> > > > >   This problem might be caused by incorrect test process configuration.
>> > > > >
>> > > >
>> > > > 6.  Unknown issue with a core test:
>> > > >
>> > > > > Unexpected exception thrown.
>> > > > > org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
>> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
>> > > > >   at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
>> > > > >   at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
>> > > > >   at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
>> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>> > > > >   at java.base/java.lang.Thread.run(Thread.java:1583)
>> > > > > Caused by: java.lang.IllegalArgumentException
>> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
>> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
>> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
>> > > > > ... 6 more
>> > > > > org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>> > > > >   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
>> > > > > Caused by: java.net.ConnectException: Connection refused
>> > > > >   at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>> > > > >   at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
>> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
>> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
>> > > > >   at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
>> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>> > > > > ... 5 more
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > > * What went wrong:
>> > > > > Execution failed for task ':core:test'.
>> > > > > > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
>> > > > >   This problem might be caused by incorrect test process configuration.
>> > > >
>> > > >
>> > > > I've seen almost all of the above issues multiple times, so it might
>> > > > be a good list to start with to focus any efforts on improving the
>> > > > build. That said, I'm not sure what we can really do about most of
>> > > > these, and I'm not sure how to narrow down the root cause in the more
>> > > > mysterious cases of aborted builds and the builds that end with
>> > > > "finished with non-zero exit value 1" with no additional context
>> > > > (that I could find).
>> > > >
>> > > > If nothing else, there seems to be something happening in one (or
>> > > > more) of the storage tests, because by far the most common failure
>> > > > I've seen is the one in 3 & 5. Unfortunately, it's not really clear to
>> > > > me how to tell which test is the offending one, so I'm not even sure
>> > > > what to file a ticket for.
>> > > >
>> > > > On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:
>> > > >
>> > > > > The slowness of the CI is definitely causing us a lot of pain. I
>> > > > > wonder if we should move to a dedicated CI infrastructure for
>> > > > > Kafka. Our integration tests are quite heavy and ASF's CI is not
>> > > > > really tuned for them. We could tune it for our needs and this
>> > > > > would also allow external companies to sponsor more workers. I
>> > > > > heard that we have a few cloud providers in the community ;). I
>> > > > > think that we should consider this. What do you think? I already
>> > > > > discussed this with the INFRA team. I could continue if we believe
>> > > > > that it is a way forward.
>> > > > >
>> > > > > Best,
>> > > > > David
>> > > > >
>> > > > > On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski
>> > > > > <stanis...@confluent.io.invalid> wrote:
>> > > > >
>> > > > > > Hey Николай,
>> > > > > >
>> > > > > > Apologies about this - I wasn't aware of this behavior. I have
>> > > > > > made all the gists public.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:
>> > > > > >
>> > > > > > > Hey Stan,
>> > > > > > >
>> > > > > > > Thanks for opening the discussion. I haven't been looking at
>> > > > > > > overall build duration recently, so it's good that you are
>> > > > > > > calling it out.
>> > > > > > >
>> > > > > > > I worry about us over-indexing on this one build, which itself
>> > > > > > > appears to be an outlier. I only see one other build [1] above
>> > > > > > > 6h overall in the last 90 days in this view: [2]
>> > > > > > > And I don't see any overlap of failed tests in these two
>> > > > > > > builds, which makes it less likely that these particular
>> > > > > > > failed tests are the causes of long build times.
>> > > > > > >
>> > > > > > > Separately, I've been investigating build environment
>> > > > > > > slowness, and trying to connect it with test failures [3]. I
>> > > > > > > observed that the CI build environment is 2-20 times slower
>> > > > > > > than my developer machine (M1 Mac).
>> > > > > > > When I simulate a similar slowdown locally, there are tests
>> > > > > > > which become significantly more flaky, often due to
>> > > > > > > hard-coded timeouts.
>> > > > > > > I think that these particularly nasty builds could be
>> > > > > > > explained by long-tail slowdowns causing arbitrary tests to
>> > > > > > > take an excessive time to execute.
>> > > > > > >
>> > > > > > > Rather than trying to find signals in these rare test
>> > > > > > > failures, I think we should find tests that have these sorts
>> > > > > > > of failures more regularly.
>> > > > > > > There are lots of builds in the 5-6h duration bracket, which
>> > > > > > > is certainly unacceptably long. We should look into these
>> > > > > > > builds to find improvements and optimizations.
>> > > > > > >
>> > > > > > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
>> > > > > > > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
>> > > > > > > [3] https://github.com/apache/kafka/pull/15008
>> > > > > > >
>> > > > > > > Thanks for looking into this!
>> > > > > > > Greg
>> > > > > > >
>> > > > > > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:
>> > > > > > > >
>> > > > > > > > Hello, Stanislav.
>> > > > > > > >
>> > > > > > > > Could you please make the gist public?
>> > > > > > > > Private gists are not available to some GitHub users even
>> > > > > > > > if the link is known.
>> > > > > > > >
>> > > > > > > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:
>> > > > > > > > >
>> > > > > > > > > Hey everybody,
>> > > > > > > > > I've heard various complaints that build times in trunk
>> > > > > > > > > are taking too long, some taking as much as 8 hours (the
>> > > > > > > > > timeout) - and this is slowing us down from being able
>> > > > > > > > > to meet the code freeze deadline for 3.7.
>> > > > > > > > >
>> > > > > > > > > I took it upon myself to gather up some data in Gradle
>> > > > > > > > > Enterprise to see if there are any outlier tests that
>> > > > > > > > > are causing this slowness. Turns out there are a few, in
>> > > > > > > > > this particular build -
>> > > > > > > > > https://ge.apache.org/s/un2hv7n6j374k/
>> > > > > > > > > - which took 10 hours and 29 minutes in total.
>> > > > > > > > >
>> > > > > > > > > I have compiled the tests that took a disproportionately
>> > > > > > > > > large amount of time (20m+), alongside their time, error
>> > > > > > > > > message and a link to their full log output here -
>> > > > > > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
>> > > > > > > > >
>> > > > > > > > > It includes failures from core, streams, storage and
>> > > > > > > > > clients.
>> > > > > > > > > Interestingly, some other tests that don't fail also
>> > > > > > > > > take a long time in what is apparently the test harness
>> > > > > > > > > framework. See the gist for more information.
>> > > > > > > > >
>> > > > > > > > > I am starting this thread with the intention of getting
>> > > > > > > > > the discussion started and brainstorming what we can do
>> > > > > > > > > to get the build times back under control.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Best,
>> > > > > > > > > Stanislav
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Best,
>> > > > > > Stanislav
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
