I just merged both PRs.

Cheers,
David
On Fri, Dec 22, 2023 at 14:38, David Jacot <david.ja...@gmail.com> wrote:

> Hey folks,
>
> I believe that my two PRs will fix most of the issues. I have also tweaked
> the configuration of Jenkins to fix the issues relating to cloning the
> repo. There may be other issues but the overall situation should be much
> better when I merge those two.
>
> I will update this thread when I merge them.
>
> Cheers,
> David
>
> On Fri, Dec 22, 2023 at 14:22, Divij Vaidya <divijvaidy...@gmail.com> wrote:
>
>> Hey folks
>>
>> I think David (dajac) has some fixes lined-up to improve CI such as
>> https://github.com/apache/kafka/pull/15063 and
>> https://github.com/apache/kafka/pull/15062.
>>
>> I have some bandwidth for the next two days to work on fixing the CI. Let
>> me start by taking a look at the list that Sophie shared here.
>>
>> --
>> Divij Vaidya
>>
>> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen <show...@gmail.com> wrote:
>>
>> > Hi Sophie and Philip and all,
>> >
>> > I share the same pain as you.
>> > I've been waiting for a CI build result in a PR for days. Unfortunately, I
>> > can only get 1 result each day because it takes 8 hours for each run, and
>> > with failed results. :(
>> >
>> > I've looked into the 8 hour timeout build issue and would like to propose
>> > to set a global test timeout as 10 mins using the junit5 feature
>> > <https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts>.
>> > This way, we can fail those long running tests quickly without impacting
>> > other tests.
>> > PR: https://github.com/apache/kafka/pull/15065
>> > I've tested in my local environment and it works as expected.
>> >
>> > Any feedback is welcome.
>> >
>> > Thanks.
>> > Luke
>> >
>> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee <philip...@gmail.com> wrote:
>> >
>> > > Hey Sophie - I've gotten 2 inflight PRs each with more than 15 retries...
>> > > Namely: https://github.com/apache/kafka/pull/15023 and
>> > > https://github.com/apache/kafka/pull/15035
>> > >
>> > > justin filed a flaky test report here though:
>> > > https://issues.apache.org/jira/browse/KAFKA-16045
>> > >
>> > > P
>> > >
>> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
>> > >
>> > > > On a related note, has anyone else had trouble getting even a single run
>> > > > with no build failures lately? I've had multiple pure-docs PRs blocked for
>> > > > days or even weeks because of miscellaneous infra, test, and timeout
>> > > > failures. I know we just had a discussion about whether it's acceptable to
>> > > > ever merge with a failing build, and the consensus (which I agree with) was
>> > > > NO -- but seriously, this is getting ridiculous. The build might be the
>> > > > worst I've ever seen it, and it just makes it really difficult to maintain
>> > > > good will with external contributors.
>> > > >
>> > > > Take for example this small docs PR:
>> > > > https://github.com/apache/kafka/pull/14949
>> > > >
>> > > > It's on its 7th replay, with the first 6 runs all having (at least) one
>> > > > build that failed completely. The issues I saw on this one PR are a good
>> > > > summary of what I've been seeing elsewhere, so here's the briefing:
>> > > >
>> > > > 1. gradle issue:
>> > > >
>> > > > > * What went wrong:
>> > > > >
>> > > > > Gradle could not start your build.
>> > > > >
>> > > > > > Cannot create service of type BuildSessionActionExecutor using method
>> > > > > LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as
>> > > > > there is a problem with parameter #21 of type FileSystemWatchingInformation.
>> > > > > > Cannot create service of type BuildLifecycleAwareVirtualFileSystem
>> > > > > using method
>> > > > > VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem()
>> > > > > as there is a problem with parameter #7 of type GlobalCacheLocations.
>> > > > > > Cannot create service of type GlobalCacheLocations using method
>> > > > > GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a
>> > > > > problem with parameter #1 of type List<GlobalCache>.
>> > > > > > Could not create service of type FileAccessTimeJournal using
>> > > > > GradleUserHomeScopeServices.createFileAccessTimeJournal().
>> > > > > > Timeout waiting to lock journal cache
>> > > > > (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another
>> > > > > Gradle instance.
>> > > >
>> > > > 2. git issue:
>> > > >
>> > > > > ERROR: Error cloning remote repo 'origin'
>> > > > > hudson.plugins.git.GitException: java.io.IOException: Remote call on
>> > > > > builds43 failed
>> > > >
>> > > > 3. storage test calling System.exit (I think)
>> > > >
>> > > > > * What went wrong:
>> > > > > Execution failed for task ':storage:test'.
>> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
>> > > > >   This problem might be caused by incorrect test process configuration.
>> > > >
>> > > > 4. 3/4 builds aborted suddenly for no clear reason
>> > > >
>> > > > 5. 1 build was aborted, 1 build failed due to a gradle(?)
>> > > > issue with a storage test:
>> > > >
>> > > > > Failed to map supported failure 'org.opentest4j.AssertionFailedError:
>> > > > > Failed to observe commit callback before timeout' with mapper
>> > > > > 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea':
>> > > > > null
>> > > >
>> > > > > * What went wrong:
>> > > > > Execution failed for task ':storage:test'.
>> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
>> > > > >   This problem might be caused by incorrect test process configuration.
>> > > >
>> > > > 6. Unknown issue with a core test:
>> > > >
>> > > > > Unexpected exception thrown.
>> > > > > org.gradle.internal.remote.internal.MessageIOException: Could not read
>> > > > > message from '/127.0.0.1:46952'.
>> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
>> > > > >   at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
>> > > > >   at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
>> > > > >   at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
>> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>> > > > >   at java.base/java.lang.Thread.run(Thread.java:1583)
>> > > > > Caused by: java.lang.IllegalArgumentException
>> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
>> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
>> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
>> > > > >   ... 6 more
>> > > > > org.gradle.internal.remote.internal.ConnectException: Could not connect to
>> > > > > server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289,
>> > > > > addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>> > > > >   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
>> > > > > Caused by: java.net.ConnectException: Connection refused
>> > > > >   at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>> > > > >   at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
>> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
>> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
>> > > > >   at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
>> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>> > > > >   ... 5 more
>> > > >
>> > > > > * What went wrong:
>> > > > > Execution failed for task ':core:test'.
>> > > > > > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
>> > > > >   This problem might be caused by incorrect test process configuration.
>> > > >
>> > > > I've seen almost all of the above issues multiple times, so it might be a
>> > > > good list to start with to focus any efforts on improving the build. That
>> > > > said, I'm not sure what we can really do about most of these, and not sure
>> > > > how to narrow down the root cause in the more mysterious cases of aborted
>> > > > builds and the builds that end with "finished with non-zero exit value 1"
>> > > > with no additional context (that I could find)
>> > > >
>> > > > If nothing else, there seems to be something happening in one (or more) of
>> > > > the storage tests, because by far the most common failure I've seen is that
>> > > > in 3 & 5. Unfortunately it's not really clear to me how to tell which is
>> > > > the offending test, so I'm not even sure what to file a ticket for
>> > > >
>> > > > On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:
>> > > >
>> > > > > The slowness of the CI is definitely causing us a lot of pain. I wonder if
>> > > > > we should move to a dedicated CI infrastructure for Kafka.
>> > > > > Our integration tests are quite heavy and ASF's CI is not really tuned
>> > > > > for them. We could tune it for our needs and this would also allow
>> > > > > external companies to sponsor more workers. I heard that we have a few
>> > > > > cloud providers in the community ;). I think that we should consider
>> > > > > this. What do you think? I already discussed this with the INFRA team.
>> > > > > I could continue if we believe that it is a way forward.
>> > > > >
>> > > > > Best,
>> > > > > David
>> > > > >
>> > > > > On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski
>> > > > > <stanis...@confluent.io.invalid> wrote:
>> > > > >
>> > > > > > Hey Николай,
>> > > > > >
>> > > > > > Apologies about this - I wasn't aware of this behavior. I have made all
>> > > > > > the gists public.
>> > > > > >
>> > > > > > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:
>> > > > > >
>> > > > > > > Hey Stan,
>> > > > > > >
>> > > > > > > Thanks for opening the discussion. I haven't been looking at overall
>> > > > > > > build duration recently, so it's good that you are calling it out.
>> > > > > > >
>> > > > > > > I worry about us over-indexing on this one build, which itself appears
>> > > > > > > to be an outlier. I only see one other build [1] above 6h overall in
>> > > > > > > the last 90 days in this view: [2]
>> > > > > > > And I don't see any overlap of failed tests in these two builds, which
>> > > > > > > makes it less likely that these particular failed tests are the causes
>> > > > > > > of long build times.
>> > > > > > >
>> > > > > > > Separately, I've been investigating build environment slowness, and
>> > > > > > > trying to connect it with test failures [3].
>> > > > > > > I observed that the CI build environment is 2-20 times slower than my
>> > > > > > > developer machine (M1 mac).
>> > > > > > > When I simulate a similar slowdown locally, there are tests which
>> > > > > > > become significantly more flakey, often due to hard-coded timeouts.
>> > > > > > > I think that these particularly nasty builds could be explained by
>> > > > > > > long-tail slowdowns causing arbitrary tests to take an excessive time
>> > > > > > > to execute.
>> > > > > > >
>> > > > > > > Rather than trying to find signals in these rare test failures, I
>> > > > > > > think we should find tests that have these sorts of failures more
>> > > > > > > regularly.
>> > > > > > > There are lots of builds in the 5-6h duration bracket, which is
>> > > > > > > certainly unacceptably long. We should look into these builds to find
>> > > > > > > improvements and optimizations.
>> > > > > > >
>> > > > > > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
>> > > > > > > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
>> > > > > > > [3] https://github.com/apache/kafka/pull/15008
>> > > > > > >
>> > > > > > > Thanks for looking into this!
>> > > > > > > Greg
>> > > > > > >
>> > > > > > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:
>> > > > > > >
>> > > > > > > > Hello, Stanislav.
>> > > > > > > >
>> > > > > > > > Can you, please, make the gist public.
>> > > > > > > > Private gists not available for some GitHub users even if link are
>> > > > > > > > known.
>> > > > > > > >
>> > > > > > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski
>> > > > > > > > <stanis...@confluent.io.INVALID> wrote:
>> > > > > > > >
>> > > > > > > > > Hey everybody,
>> > > > > > > > > I've heard various complaints that build times in trunk are taking
>> > > > > > > > > too long, some taking as much as 8 hours (the timeout) - and this is
>> > > > > > > > > slowing us down from being able to meet the code freeze deadline for
>> > > > > > > > > 3.7.
>> > > > > > > > >
>> > > > > > > > > I took it upon myself to gather up some data in Gradle Enterprise to
>> > > > > > > > > see if there are any outlier tests that are causing this slowness.
>> > > > > > > > > Turns out there are a few, in this particular build -
>> > > > > > > > > https://ge.apache.org/s/un2hv7n6j374k/
>> > > > > > > > > - which took 10 hours and 29 minutes in total.
>> > > > > > > > >
>> > > > > > > > > I have compiled the tests that took a disproportionately large amount
>> > > > > > > > > of time (20m+), alongside their time, error message and a link to
>> > > > > > > > > their full log output here -
>> > > > > > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
>> > > > > > > > >
>> > > > > > > > > It includes failures from core, streams, storage and clients.
>> > > > > > > > > Interestingly, some other tests that don't fail also take a long time
>> > > > > > > > > in what is apparently the test harness framework. See the gist for
>> > > > > > > > > more information.
>> > > > > > > > >
>> > > > > > > > > I am starting this thread with the intention of getting the
>> > > > > > > > > discussion started and brainstorming what we can do to get the build
>> > > > > > > > > times back under control.
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Best,
>> > > > > > > > > Stanislav
>> > > > > >
>> > > > > > --
>> > > > > > Best,
>> > > > > > Stanislav