Addendum: I've opened a PR with what I believe are the changes necessary to enable Remote Build Caching, if you choose to go that route: https://github.com/apache/kafka/pull/15109
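For anyone unfamiliar with the setup, here is a minimal sketch of the "trunk writes, PRs read" arrangement, written in Gradle's Kotlin DSL. It is not the content of the PR above. Kafka's own settings.gradle uses the Groovy DSL. The cache URL and the reliance on the CI and BRANCH_NAME environment variables are assumptions made purely for illustration.

```kotlin
// settings.gradle.kts -- illustrative sketch only, not the change in the PR above.
// Assumptions: the cache URL is a placeholder, and CI / BRANCH_NAME are
// environment variables assumed to be set by the CI server.
val isCi = System.getenv("CI") != null
val isTrunkBuild = isCi && System.getenv("BRANCH_NAME") == "trunk"

buildCache {
    local {
        // Keep the local cache for developer machines; CI agents rely on the remote one.
        isEnabled = !isCi
    }
    remote<HttpBuildCache> {
        url = uri("https://build-cache.example.org/cache/") // hypothetical endpoint
        // Only trunk builds on CI populate the cache; PR and local builds just read from it.
        isPush = isTrunkBuild
    }
}
```

The build cache also has to be switched on for any of this to take effect, either with `--build-cache` on the command line or `org.gradle.caching=true` in gradle.properties.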
On Tue, 2 Jan 2024 at 14:31, Nick Telford <nick.telf...@gmail.com> wrote:

> Hi everyone,
>
> Regarding building a "dependency graph"... Gradle already has this
> information, albeit fairly coarse-grained. You might be able to get some
> considerable improvement by configuring the Gradle Remote Build Cache. It
> looks like it's currently disabled explicitly:
> https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
>
> The trick is to have trunk builds write to the cache, and PR builds only
> read from it. This way, any PR based on trunk should be able to cache not
> only the compilation, but also the tests from dependent modules that
> haven't changed (e.g. for a PR that only touches the connect/streams
> modules).
>
> This would probably be preferable to having to hand-maintain some
> rules/dependency graph in the CI configuration, and it's quite
> straight-forward to configure.
>
> Bonus points if the Remote Build Cache is readable publicly, enabling
> contributors to benefit from it locally.
>
> Regards,
> Nick
>
> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy <lbruts...@confluent.io.invalid>
> wrote:
>
>> Thanks for all the work that has already been done on this in the past
>> days!
>>
>> Have we considered running our test suite with
>> -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as
>> Jenkins build artifacts? This could speed up debugging. Even if we
>> store them only for a day and do it only for trunk, I think it could
>> be worth it. The heap dumps shouldn't contain any secrets, and I
>> checked with the ASF infra team, and they are not concerned about the
>> additional disk usage.
>>
>> Cheers,
>> Lucas
>>
>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya <divijvaidy...@gmail.com>
>> wrote:
>> >
>> > I have started to perform an analysis of the OOM at
>> > https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to
>> > contribute to the investigation.
>> >
>> > --
>> > Divij Vaidya
>> >
>> >
>> >
>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan <jols...@confluent.io.invalid>
>> > wrote:
>> >
>> > > I am still seeing quite a few OOM errors in the builds and I was curious if
>> > > folks had any ideas on how to identify the cause and fix the issue. I was
>> > > looking in gradle enterprise and found some info about memory usage, but
>> > > nothing detailed enough to help figure the issue out.
>> > >
>> > > OOMs sometimes fail the build immediately and in other cases I see it get
>> > > stuck for 8 hours. (See
>> > > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12
>> > > )
>> > >
>> > > I appreciate all the work folks are doing here and I will continue to try
>> > > to help as best as I can.
>> > >
>> > > Justine
>> > >
>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur
>> > > <david.art...@confluent.io.invalid> wrote:
>> > >
>> > > > S2. We’ve looked into this before, and it wasn’t possible at the time with
>> > > > JUnit. We commonly set a timeout on each test class (especially integration
>> > > > tests). It is probably worth looking at this again and seeing if something
>> > > > has changed with JUnit (or our usage of it) that would allow a global
>> > > > timeout.
>> > > >
>> > > >
>> > > > S3. Dedicated infra sounds nice, if we can get it. It would at least remove
>> > > > some variability between the builds, and hopefully eliminate the
>> > > > infra/setup class of failures.
>> > > >
>> > > >
>> > > > S4. Running tests for what has changed sounds nice, but I think it is risky
>> > > > to implement broadly. As Sophie mentioned, there are probably some lines we
>> > > > could draw where we feel confident that only running a subset of tests is
>> > > > safe. As a start, we could probably work towards skipping CI for non-code
>> > > > PRs.
>> > > >
>> > > >
>> > > > ---
>> > > >
>> > > >
>> > > > As an aside, I experimented with build caching and running affected tests a
>> > > > few months ago. I used the opportunity to play with Github Actions, and I
>> > > > quite liked it. Here’s the workflow I used:
>> > > > https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml
>> > > > I was trying to see if we could use a build cache to reduce the compilation
>> > > > time on PRs. A nightly/periodic job would build trunk and populate a Gradle
>> > > > build cache. PR builds would read from that cache which would enable them
>> > > > to only compile changed code. The same idea could be extended to tests, but
>> > > > I didn’t get that far.
>> > > >
>> > > >
>> > > > As for Github Actions, the idea there is that ASF would provide generic
>> > > > Action “runners” that would pick up jobs from the Github Action build queue
>> > > > and run them. It is also possible to self-host runners to expand the build
>> > > > capacity of the project (i.e., other organizations could donate
>> > > > build capacity). The advantage of this is that we would have more control
>> > > > over our build/reports and not be “stuck” with whatever ASF Jenkins offers.
>> > > > The Actions workflows are very customizable and it would let us create our
>> > > > own custom plugins. There is also a substantial marketplace of plugins. I
>> > > > think it’s worth exploring this more, I just haven’t had time lately.
>> > > >
>> > > > On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman <sop...@responsive.dev>
>> > > > wrote:
>> > > >
>> > > > > Regarding:
>> > > > >
>> > > > > S-4. Separate tests ran depending on what module is changed.
>> > > > > > - This makes sense although is tricky to implement successfully, as
>> > > > > > unrelated tests may expose problems in an unrelated change (e.g changing
>> > > > > > core stuff like clients, the server, etc)
>> > > > >
>> > > > >
>> > > > > Imo this avenue could provide a massive improvement to dev productivity
>> > > > > with very little effort or investment, and if we do it right, without even
>> > > > > any risk. We should be able to draft a simple dependency graph between
>> > > > > modules and then skip the tests for anything that is clearly, provably
>> > > > > unrelated and/or upstream of the target changes. This has the potential to
>> > > > > substantially speed up and improve the developer experience in modules at
>> > > > > the end of the dependency graph, which I believe is worth doing even if it
>> > > > > unfortunately would not benefit everyone equally.
>> > > > >
>> > > > > For example, we can save a lot of grief with just a simple set of rules
>> > > > > that are easy to check. I'll throw out a few to start with:
>> > > > >
>> > > > >    1. A pure docs PR (ie that only touches files under the docs/ directory)
>> > > > >    should be allowed to skip the tests of all modules
>> > > > >    2. Connect PRs (that only touch connect/) only need to run the Connect
>> > > > >    tests -- ie they can skip the tests for core, clients, streams, etc
>> > > > >    3. Similarly, Streams PRs should only need to run the Streams tests --
>> > > > >    but again, only if all the changes are contained within streams/
>> > > > >
>> > > > > I'll let others chime in on how or if we can construct some safe rules as
>> > > > > to which modules can or can't be skipped between the core, clients, raft,
>> > > > > storage, etc
>> > > > >
>> > > > > And over time we could in theory build up a literal dependency graph on a
>> > > > > more granular level so that, for example, changes to the core/storage
>> > > > > module are allowed to skip any Streams tests that don't use an embedded
>> > > > > broker, ie all unit tests and TopologyTestDriver-based integration tests.
>> > > > > The danger here would be in making sure this graph is kept up to date as
>> > > > > tests are added and changed, but my point is just that there's a way to
>> > > > > extend the benefit of this tactic to those who work primarily on the core
>> > > > > module as well. Personally, I think we should just start out with the
>> > > > > example ruleset listed above, workshop it a bit since there might be other
>> > > > > obvious rules I left out, and try to implement it.
>> > > > >
>> > > > > Thoughts?
>> > > > >
>> > > > > On Tue, Dec 26, 2023 at 2:25 AM Stanislav Kozlovski
>> > > > > <stanis...@confluent.io.invalid> wrote:
>> > > > >
>> > > > > > Great discussion!
>> > > > > >
>> > > > > >
>> > > > > > Greg, that was a good call out regarding the two long-running builds. I
>> > > > > > missed that 90d view.
>> > > > > >
>> > > > > > My takeaway from that is that our average build time for tests is between
>> > > > > > 3-4 hours. Which in and of itself seems large.
>> > > > > >
>> > > > > > But then reconciling this with Sophie's statement - is it possible that
>> > > > > > these timed-out 8-hour builds don't get captured in that view?
>> > > > > >
>> > > > > > It is weird that people are reporting these things and Gradle Enterprise
>> > > > > > isn't showing them.
>> > > > > >
>> > > > > > ---
>> > > > > >
>> > > > > > > I think that these particularly nasty builds could be explained by
>> > > > > > > long-tail slowdowns causing arbitrary tests to take an excessive time to
>> > > > > > > execute.
>> > > > > >
>> > > > > > I'm not sure I understood that. If the tests have timeouts, where would the
>> > > > > > slowdown come from? Problems in tearing down the test?
>> > > > > >
>> > > > > > ---
>> > > > > >
>> > > > > > David, thanks for the great work in identifying and even fixing those two
>> > > > > > top offenders! And thank you for cherry-picking to 3.7
>> > > > > >
>> > > > > > --
>> > > > > >
>> > > > > > All in all, from this thread I can summarize a few potential solutions:
>> > > > > >
>> > > > > > S-1. Dedicated work identifying and fixing some of the issues (e.g. what
>> > > > > > David did).
>> > > > > > - Should help alleviate the issues as it can be speculated that it's
>> > > > > > frequently 1 or 2 tests causing the majority of issues.
>> > > > > > - With regards to that, KAFKA-16045 seems open for taking if there are any
>> > > > > > volunteers
>> > > > > > - Sophie's list also contains good candidates
>> > > > > >
>> > > > > > S-2. Global 10-minute timeout for tests.
>> > > > > > - Should lay the foundation for a strong catch-all for any misbehaving
>> > > > > > tests. I like this idea since it's guaranteed to save each contributor many
>> > > > > > hours of waiting for an 8hr+ time out build.
>> > > > > > - Luke already has a PR out for this:
>> > > > > > https://github.com/apache/kafka/pull/15065
>> > > > > >
>> > > > > > S-3. Separate infrastructure for our CI
>> > > > > > - This would help with Greg's comment about the developer machine being
>> > > > > > 2-20 times faster than the CI.
>> > > > > > - Requires volunteer funding from external companies. If every contributor
>> > > > > > would bring up the idea with their employer, we may be able to stitch
>> > > > > > something together.
>> > > > > >
>> > > > > > S-4. Separate tests ran depending on what module is changed.
>> > > > > > - This makes sense although is tricky to implement successfully, as
>> > > > > > unrelated tests may expose problems in an unrelated change (e.g changing
>> > > > > > core stuff like clients, the server, etc)
>> > > > > >
>> > > > > > S-5. Greater committer diligence when merging PRs
>> > > > > > - This should always be there. Unfortunately it is a bit of a
>> > > > > > self-perpetuating effect in that when the builds get worse, people are
>> > > > > > incentivized to be less diligent (slowed down while in a rush to merge,
>> > > > > > recency bias of failed builds, etc.)
>> > > > > >
>> > > > > > On Fri, Dec 22, 2023 at 4:16 PM Justine Olshan
>> > > > > > <jols...@confluent.io.invalid> wrote:
>> > > > > >
>> > > > > > > Thanks David! I think this should help a lot!
>> > > > > > >
>> > > > > > > While we should include these improvements, I think it is also good to
>> > > > > > > remind folks that a lot of these issues come from merging on builds that
>> > > > > > > regress the CI.
>> > > > > > > I know I'm not perfect at this (and have merged on flaky and failing
>> > > > > > > tests), but let's all be super careful going forward. There were a few
>> > > > > > > times I retried the build 10+ times and thought it was other issues with
>> > > > > > > the CI but the failed builds were actually due to the changes I wrote/was
>> > > > > > > reviewing.
>> > > > > > > >> > > > > > > We all need to work together on this to ensure the builds stay >> > > > healthy! >> > > > > > > Thanks all for being concerned about our builds! >> > > > > > > >> > > > > > > Justine >> > > > > > > >> > > > > > > On Fri, Dec 22, 2023 at 6:02 AM David Jacot < >> david.ja...@gmail.com >> > > > >> > > > > > wrote: >> > > > > > > >> > > > > > > > I just merged both PRs. >> > > > > > > > >> > > > > > > > Cheers, >> > > > > > > > David >> > > > > > > > >> > > > > > > > Le ven. 22 déc. 2023 à 14:38, David Jacot < >> david.ja...@gmail.com >> > > > >> > > > a >> > > > > > > écrit >> > > > > > > > : >> > > > > > > > >> > > > > > > > > Hey folks, >> > > > > > > > > >> > > > > > > > > I believe that my two PRs will fix most of the issues. I >> have >> > > > also >> > > > > > > > tweaked >> > > > > > > > > the configuration of Jenkins to fix the issues relating to >> > > > cloning >> > > > > > the >> > > > > > > > > repo. There may be other issues but the overall situation >> > > should >> > > > be >> > > > > > > much >> > > > > > > > > better when I merge those two. >> > > > > > > > > >> > > > > > > > > I will update this thread when I merge them. >> > > > > > > > > >> > > > > > > > > Cheers, >> > > > > > > > > David >> > > > > > > > > >> > > > > > > > > Le ven. 22 déc. 2023 à 14:22, Divij Vaidya < >> > > > > divijvaidy...@gmail.com> >> > > > > > a >> > > > > > > > > écrit : >> > > > > > > > > >> > > > > > > > >> Hey folks >> > > > > > > > >> >> > > > > > > > >> I think David (dajac) has some fixes lined-up to improve >> CI >> > > such >> > > > > as >> > > > > > > > >> https://github.com/apache/kafka/pull/15063 and >> > > > > > > > >> https://github.com/apache/kafka/pull/15062. >> > > > > > > > >> >> > > > > > > > >> I have some bandwidth for the next two days to work on >> fixing >> > > > the >> > > > > > CI. >> > > > > > > > Let >> > > > > > > > >> me start by taking a look at the list that Sophie shared >> here. 
>> > > > > > > > >> >> > > > > > > > >> -- >> > > > > > > > >> Divij Vaidya >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > >> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen < >> show...@gmail.com> >> > > > > > wrote: >> > > > > > > > >> >> > > > > > > > >> > Hi Sophie and Philip and all, >> > > > > > > > >> > >> > > > > > > > >> > I share the same pain as you. >> > > > > > > > >> > I've been waiting for a CI build result in a PR for >> days. >> > > > > > > > >> Unfortunately, I >> > > > > > > > >> > can only get 1 result each day because it takes 8 >> hours for >> > > > each >> > > > > > > run, >> > > > > > > > >> and >> > > > > > > > >> > with failed results. :( >> > > > > > > > >> > >> > > > > > > > >> > I've looked into the 8 hour timeout build issue and >> would >> > > like >> > > > > to >> > > > > > > > >> propose >> > > > > > > > >> > to set a global test timeout as 10 mins using the >> junit5 >> > > > feature >> > > > > > > > >> > < >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts >> > > > > > > > >> > > >> > > > > > > > >> > . >> > > > > > > > >> > This way, we can fail those long running tests quickly >> > > without >> > > > > > > > impacting >> > > > > > > > >> > other tests. >> > > > > > > > >> > PR: https://github.com/apache/kafka/pull/15065 >> > > > > > > > >> > I've tested in my local environment and it works as >> > > expected. >> > > > > > > > >> > >> > > > > > > > >> > Any feedback is welcome. >> > > > > > > > >> > >> > > > > > > > >> > Thanks. 
>> > > > > > > > >> > Luke >> > > > > > > > >> > >> > > > > > > > >> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee < >> > > > philip...@gmail.com >> > > > > > >> > > > > > > > wrote: >> > > > > > > > >> > >> > > > > > > > >> > > Hey Sophie - I've gotten 2 inflight PRs each with >> more >> > > than >> > > > 15 >> > > > > > > > >> retries... >> > > > > > > > >> > > Namely: https://github.com/apache/kafka/pull/15023 >> and >> > > > > > > > >> > > https://github.com/apache/kafka/pull/15035 >> > > > > > > > >> > > >> > > > > > > > >> > > justin filed a flaky test report here though: >> > > > > > > > >> > > https://issues.apache.org/jira/browse/KAFKA-16045 >> > > > > > > > >> > > >> > > > > > > > >> > > P >> > > > > > > > >> > > >> > > > > > > > >> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman < >> > > > > > > > >> > sop...@responsive.dev >> > > > > > > > >> > > > >> > > > > > > > >> > > wrote: >> > > > > > > > >> > > >> > > > > > > > >> > > > On a related note, has anyone else had trouble >> getting >> > > > even >> > > > > a >> > > > > > > > single >> > > > > > > > >> > run >> > > > > > > > >> > > > with no build failures lately? I've had multiple >> > > pure-docs >> > > > > PRs >> > > > > > > > >> blocked >> > > > > > > > >> > > for >> > > > > > > > >> > > > days or even weeks because of miscellaneous infra, >> test, >> > > > and >> > > > > > > > timeout >> > > > > > > > >> > > > failures. I know we just had a discussion about >> whether >> > > > it's >> > > > > > > > >> acceptable >> > > > > > > > >> > > to >> > > > > > > > >> > > > ever merge with a failing build, and the consensus >> > > (which >> > > > I >> > > > > > > agree >> > > > > > > > >> with) >> > > > > > > > >> > > was >> > > > > > > > >> > > > NO -- but seriously, this is getting ridiculous. 
>> The >> > > build >> > > > > > might >> > > > > > > > be >> > > > > > > > >> the >> > > > > > > > >> > > > worst I've ever seen it, and it just makes it >> really >> > > > > difficult >> > > > > > > to >> > > > > > > > >> > > maintain >> > > > > > > > >> > > > good will with external contributors. >> > > > > > > > >> > > > >> > > > > > > > >> > > > Take for example this small docs PR: >> > > > > > > > >> > > > https://github.com/apache/kafka/pull/14949 >> > > > > > > > >> > > > >> > > > > > > > >> > > > It's on its 7th replay, with the first 6 runs all >> having >> > > > (at >> > > > > > > > least) >> > > > > > > > >> one >> > > > > > > > >> > > > build that failed completely. The issues I saw on >> this >> > > one >> > > > > PR >> > > > > > > are >> > > > > > > > a >> > > > > > > > >> > good >> > > > > > > > >> > > > summary of what I've been seeing elsewhere, so >> here's >> > > the >> > > > > > > > briefing: >> > > > > > > > >> > > > >> > > > > > > > >> > > > 1. gradle issue: >> > > > > > > > >> > > > >> > > > > > > > >> > > > > * What went wrong: >> > > > > > > > >> > > > > >> > > > > > > > >> > > > > Gradle could not start your build. >> > > > > > > > >> > > > > >> > > > > > > > >> > > > > > Cannot create service of type >> > > > BuildSessionActionExecutor >> > > > > > > using >> > > > > > > > >> > method >> > > > > > > > >> > > > > >> > > > > > > > >> > > >> > > > > > > > >> >> > > > > > > >> > > > > >> > > >> LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() >> > > > > > > > >> > > > as >> > > > > > > > >> > > > > there is a problem with parameter #21 of type >> > > > > > > > >> > > > FileSystemWatchingInformation. 
>> > > > > > > > >> > > > > >> > > > > > > > >> > > > > > Cannot create service of type >> > > > > > > > >> > BuildLifecycleAwareVirtualFileSystem >> > > > > > > > >> > > > > using method >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() >> > > > > > > > >> > > > > as there is a problem with parameter #7 of type >> > > > > > > > >> GlobalCacheLocations. >> > > > > > > > >> > > > > > Cannot create service of type >> > > > GlobalCacheLocations >> > > > > > > using >> > > > > > > > >> > method >> > > > > > > > >> > > > > >> > > GradleUserHomeScopeServices.createGlobalCacheLocations() >> > > > > as >> > > > > > > > there >> > > > > > > > >> is >> > > > > > > > >> > a >> > > > > > > > >> > > > > problem with parameter #1 of type >> List<GlobalCache>. >> > > > > > > > >> > > > > > Could not create service of type >> > > > > > > > FileAccessTimeJournal >> > > > > > > > >> > using >> > > > > > > > >> > > > > >> > > > GradleUserHomeScopeServices.createFileAccessTimeJournal(). >> > > > > > > > >> > > > > > Timeout waiting to lock journal >> cache >> > > > > > > > >> > > > > (/home/jenkins/.gradle/caches/journal-1). It is >> > > > currently >> > > > > in >> > > > > > > use >> > > > > > > > >> by >> > > > > > > > >> > > > another >> > > > > > > > >> > > > > Gradle instance. >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > 2. git issue: >> > > > > > > > >> > > > >> > > > > > > > >> > > > > ERROR: Error cloning remote repo 'origin' >> > > > > > > > >> > > > > hudson.plugins.git.GitException: >> java.io.IOException: >> > > > > Remote >> > > > > > > > call >> > > > > > > > >> on >> > > > > > > > >> > > > > builds43 failed >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > 3. 
storage test calling System.exit (I think) >> > > > > > > > >> > > > >> > > > > > > > >> > > > > * What went wrong: >> > > > > > > > >> > > > > Execution failed for task ':storage:test'. >> > > > > > > > >> > > > > > Process 'Gradle Test Executor 73' finished >> with >> > > > > non-zero >> > > > > > > exit >> > > > > > > > >> > value >> > > > > > > > >> > > 1 >> > > > > > > > >> > > > >> > > > > > > > >> > > > This problem might be caused by incorrect test >> > > process >> > > > > > > > >> > configuration. >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > 4. 3/4 builds aborted suddenly for no clear reason >> > > > > > > > >> > > > >> > > > > > > > >> > > > 5. 1 build was aborted, 1 build failed due to a >> > > gradle(?) >> > > > > > issue >> > > > > > > > >> with a >> > > > > > > > >> > > > storage test: >> > > > > > > > >> > > > >> > > > > > > > >> > > > Failed to map supported failure >> > > > > > > > >> 'org.opentest4j.AssertionFailedError: >> > > > > > > > >> > > > > Failed to observe commit callback before >> timeout' with >> > > > > > mapper >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea >> > > > > > > > >> > > > ': >> > > > > > > > >> > > > > null >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > * What went wrong: >> > > > > > > > >> > > > > Execution failed for task ':storage:test'. >> > > > > > > > >> > > > > > Process 'Gradle Test Executor 73' finished with >> > > > non-zero >> > > > > > > exit >> > > > > > > > >> > value 1 >> > > > > > > > >> > > > > This problem might be caused by incorrect test >> > > process >> > > > > > > > >> > configuration. >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > 6. 
Unknown issue with a core test: >> > > > > > > > >> > > > >> > > > > > > > >> > > > > Unexpected exception thrown. >> > > > > > > > >> > > > > >> > > org.gradle.internal.remote.internal.MessageIOException: >> > > > > > Could >> > > > > > > > not >> > > > > > > > >> > read >> > > > > > > > >> > > > > message from '/127.0.0.1:46952'. >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> 
> > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) >> > > > > > > > >> > > > > at >> java.base/java.lang.Thread.run(Thread.java:1583) >> > > > > > > > >> > > > > Caused by: java.lang.IllegalArgumentException >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81) >> > > > > > > > >> > > > > ... 6 more >> > > > > > > > >> > > > > >> org.gradle.internal.remote.internal.ConnectException: >> > > > > Could >> > > > > > > not >> > > > > > > > >> > connect >> > > > > > > > >> > > > to >> > > > > > > > >> > > > > server [1d62bf97-6a3e-441d-93b6-093617cbbea9 >> > > port:41289, >> > > > > > > > >> addresses:[/ >> > > > > > > > >> > > > > 127.0.0.1]]. Tried addresses: [/127.0.0.1]. 
>> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74) >> > 
> > > > > > >> > > > > Caused by: java.net.ConnectException: Connection >> > > refused >> > > > > > > > >> > > > > at java.base/sun.nio.ch.Net.pollConnect(Native >> > > > Method) >> > > > > > > > >> > > > > at java.base/sun.nio.ch.Net >> > > > > .pollConnectNow(Net.java:682) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > java.base/sun.nio.ch >> > > > > > > > >> > > > >> > > > > > > >> .SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > java.base/sun.nio.ch >> > > > > > > > >> > > > >> > > > > > .SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233) >> > > > > > > > >> > > > > at java.base/sun.nio.ch >> > > > > > > > >> > > .SocketAdaptor.connect(SocketAdaptor.java:102) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81) >> > > > > > > > >> > > > > at >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > >> >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54) >> > > > > > > > >> > > > > ... 5 more >> > > > > > > > >> > > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > > * What went wrong: >> > > > > > > > >> > > > >> > > > > > > > >> > > > Execution failed for task ':core:test'. 
>> > > > > > > > >> > > > >> > > > > > > > >> > > > > Process 'Gradle Test Executor 104' finished with >> > > > non-zero >> > > > > > exit >> > > > > > > > >> value >> > > > > > > > >> > 1 >> > > > > > > > >> > > > >> > > > > > > > >> > > > This problem might be caused by incorrect test >> process >> > > > > > > > >> configuration. >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > I've seen almost all of the above issues multiple >> times, >> > > > so >> > > > > it >> > > > > > > > might >> > > > > > > > >> > be a >> > > > > > > > >> > > > good list to start with to focus any efforts on >> > > improving >> > > > > the >> > > > > > > > build. >> > > > > > > > >> > That >> > > > > > > > >> > > > said, I'm not sure what we can really do about >> most of >> > > > > these, >> > > > > > > and >> > > > > > > > >> not >> > > > > > > > >> > > sure >> > > > > > > > >> > > > how to narrow down the root cause in the more >> mysterious >> > > > > cases >> > > > > > > of >> > > > > > > > >> > aborted >> > > > > > > > >> > > > builds and the builds that end with "finished with >> > > > non-zero >> > > > > > exit >> > > > > > > > >> value >> > > > > > > > >> > 1 >> > > > > > > > >> > > " >> > > > > > > > >> > > > with no additional context (that I could find) >> > > > > > > > >> > > > >> > > > > > > > >> > > > If nothing else, there seems to be something >> happening >> > > in >> > > > > one >> > > > > > > (or >> > > > > > > > >> more) >> > > > > > > > >> > > of >> > > > > > > > >> > > > the storage tests, because by far the most common >> > > failure >> > > > > I've >> > > > > > > > seen >> > > > > > > > >> is >> > > > > > > > >> > > that >> > > > > > > > >> > > > in 3 & 5. 
Unfortunately it's not really clear to me how to tell which is the offending test, so I'm not even sure what to file a ticket for.

On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:

The slowness of the CI is definitely causing us a lot of pain. I wonder if we should move to a dedicated CI infrastructure for Kafka. Our integration tests are quite heavy and ASF's CI is not really tuned for them. We could tune it for our needs and this would also allow external companies to sponsor more workers. I heard that we have a few cloud providers in the community ;). I think that we should consider this. What do you think? I already discussed this with the INFRA team. I could continue if we believe that it is a way forward.

Best,
David

On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski <stanis...@confluent.io.invalid> wrote:

Hey Николай,

Apologies about this - I wasn't aware of this behavior. I have made all the gists public.

On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:

Hey Stan,

Thanks for opening the discussion. I haven't been looking at overall build duration recently, so it's good that you are calling it out.

I worry about us over-indexing on this one build, which itself appears to be an outlier.
I only see one other build [1] above 6h overall in the last 90 days in this view: [2]
And I don't see any overlap of failed tests in these two builds, which makes it less likely that these particular failed tests are the causes of long build times.

Separately, I've been investigating build environment slowness, and trying to connect it with test failures [3]. I observed that the CI build environment is 2-20 times slower than my developer machine (M1 mac).
When I simulate a similar slowdown locally, there are tests which become significantly more flakey, often due to hard-coded timeouts.
I think that these particularly nasty builds could be explained by long-tail slowdowns causing arbitrary tests to take an excessive time to execute.
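
To illustrate the hard-coded timeout problem described above, here is a minimal sketch of one possible mitigation: scaling a test's base timeout by an environment-specific slowdown factor instead of baking in a fixed value. Everything in it is an assumption made for illustration. The class `ScaledTimeout`, its methods, and the system property `test.timeout.factor` are hypothetical and are not existing Kafka or Gradle APIs.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public final class ScaledTimeout {
    // Hypothetical system property; a slow CI worker could set it to e.g. 4.0.
    public static final String FACTOR_PROPERTY = "test.timeout.factor";

    private ScaledTimeout() {}

    /** Multiplies a base timeout by a slowdown factor; factors below 1.0 (or NaN) are treated as 1.0. */
    public static Duration scale(Duration base, double factor) {
        double effective = (factor >= 1.0) ? factor : 1.0;
        return Duration.ofMillis(Math.round(base.toMillis() * effective));
    }

    /** Reads the factor from the hypothetical system property, falling back to 1.0 if unset or malformed. */
    public static double configuredFactor() {
        String raw = System.getProperty(FACTOR_PROPERTY, "1.0");
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return 1.0;
        }
    }

    /** Polls the condition until it holds or the scaled timeout elapses; returns whether it held. */
    public static boolean waitFor(BooleanSupplier condition, Duration baseTimeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + scale(baseTimeout, configuredFactor()).toNanos();
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10); // short poll interval, chosen arbitrarily for the sketch
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // A 15s base timeout on a machine assumed to be 4x slower.
        System.out.println(scale(Duration.ofSeconds(15), 4.0));
        // A condition that is already true returns without waiting.
        System.out.println(waitFor(() -> true, Duration.ofSeconds(1)));
    }
}
```

With something along these lines, a test written against a fast developer machine would keep its base timeout locally, while a CI job known to be several times slower could raise the factor without touching each test. Whether this suits any particular flaky test would still need to be checked case by case.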

Rather than trying to find signals in these rare test failures, I think we should find tests that have these sorts of failures more regularly.
There are lots of builds in the 5-6h duration bracket, which is certainly unacceptably long. We should look into these builds to find improvements and optimizations.

[1] https://ge.apache.org/s/ygh4gbz4uma6i/
[2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
[3] https://github.com/apache/kafka/pull/15008

Thanks for looking into this!
Greg

On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:

Hello, Stanislav.

Can you, please, make the gist public.
Private gists are not available to some GitHub users even if the link is known.

On 19 Dec 2023, at 17:33, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:

Hey everybody,
I've heard various complaints that build times in trunk are taking too long, some taking as much as 8 hours (the timeout) - and this is slowing us down from being able to meet the code freeze deadline for 3.7.

I took it upon myself to gather up some data in Gradle Enterprise to see if there are any outlier tests that are causing this slowness. Turns out there are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/ - which took 10 hours and 29 minutes in total.

I have compiled the tests that took a disproportionately large amount of time (20m+), alongside their time, error message and a link to their full log output here - https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2

It includes failures from core, streams, storage and clients.
Interestingly, some other tests that don't fail also take a long time in what is apparently the test harness framework. See the gist for more information.

I am starting this thread with the intention of getting the discussion started and brainstorming what we can do to get the build times back under control.

--
Best,
Stanislav

--
-David