It looks like some server maintenance shut down Jenkins. When it came back up, the builds had already timed out but could not be stopped.

They all had similar logs:

Cancelling nested steps due to timeout
Cancelling nested steps due to timeout
Body did not finish within grace period; terminating with extreme prejudice
Body did not finish within grace period; terminating with extreme prejudice
Pausing (Preparing for shutdown)
Resuming build at Thu Jan 25 06:56:23 UTC 2024 after Jenkins restart
Resuming build at Thu Jan 25 09:45:03 UTC 2024 after Jenkins restart
Pausing (Preparing for shutdown)
Resuming build at Thu Jan 25 10:37:41 UTC 2024 after Jenkins restart
Timeout expired 7 hr 39 min ago
Timeout expired 7 hr 39 min ago
Cancelling nested steps due to timeout
Cancelling nested steps due to timeout
*02:37:41* Waiting for reconnection of builds41 before proceeding with build
*02:37:41* Waiting for reconnection of builds32 before proceeding with build
Still paused
Body did not finish within grace period; terminating with extreme prejudice
Body did not finish within grace period; terminating with extreme prejudice

I forcibly killed the builds that had been running for over a day to free up resources. I believe the rest are running as expected now.

Justine

On Thu, Jan 25, 2024 at 10:22 AM Justine Olshan <jols...@confluent.io> wrote:

> Hey folks -- I noticed some builds have been running for a day or more. I
> thought we limited builds to 8 hours. Any ideas why this is happening?
>
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/activity/
> I tried to abort the build for PR-15257, and it also still seems to be
> running.
>
> Justine
>
> On Sun, Jan 14, 2024 at 6:25 AM Qichao Chu <qic...@uber.com.invalid>
> wrote:
>
>> Hi Divij and all,
>>
>> Regarding the speeding up of the build & de-flaking tests, LinkedIn has
>> done some great work which we probably can borrow ideas from.
>> In the LinkedIn/Kafka repo, we can see one of their most recent PRs
>> <https://github.com/linkedin/kafka/pull/500/checks> only took < 9 min
>> (unit test) + < 12 min (integration-test) + < 9 min (code check) = < 30 min
>> to finish all the checks:
>>
>> 1. Similar to what David (mumrah) has mentioned/experimented with, the
>>    LinkedIn team used GitHub Actions, which displayed the results in a
>>    cleaner way directly from GitHub.
>> 2. Each top-level package is checked separately to increase the
>>    concurrency. To further boost the speed for integration tests, the
>>    tests inside one package are divided into sub-groups (A-Z) based on
>>    their names (see this job
>>    <https://github.com/linkedin/kafka/actions/runs/7303478151/> for
>>    details).
>> 3. Once the tests are running at a smaller granularity with a decent
>>    runner, heavy integration tests are less likely to be flaky, and flaky
>>    tests are easier to catch.
>>
>>
>> --
>> Qichao
>>
>>
>> On Wed, Jan 10, 2024 at 2:57 PM Divij Vaidya <divijvaidy...@gmail.com>
>> wrote:
>>
>> > Hey folks
>> >
>> > We seem to have a handle on the OOM issues with the multiple fixes
>> > community members made. In
>> > https://issues.apache.org/jira/browse/KAFKA-16052,
>> > you can see the "before" profile in the description and the "after" profile
>> > in the latest comment to see the difference. To prevent future recurrence,
>> > we have an ongoing solution at https://github.com/apache/kafka/pull/15101
>> > and after that we will start another one to get rid of Mockito mocks at
>> > the end of every test suite using a similar extension. Note that this
>> > doesn't solve the flaky test problems in the trunk but it removes the
>> > aspect of build failures due to OOM (one of the many problems).
>> >
>> > To fix the flaky test problem, we probably need to run our tests in a
>> > separate CI environment (like Apache Beam does) instead of sharing the 3
>> > hosts that run our CI with many many other Apache projects.
>> > This assumption is based on the fact that the tests are less flaky when
>> > running on laptops / powerful EC2 machines. One of the avenues to get
>> > funding for these Kafka-only hosts is
>> > https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
>> > . I will start the conversation on this one with AWS & Apache Infra in the
>> > next 1-2 months.
>> >
>> > --
>> > Divij Vaidya
>> >
>> >
>> >
>> > On Tue, Jan 9, 2024 at 9:21 PM Colin McCabe <cmcc...@apache.org> wrote:
>> >
>> > > Sorry, but to put it bluntly, the current build setup isn't good enough
>> > > at partial rebuilds for build caching to make sense. All Kafka devs have
>> > > had the experience of needing to clean the build directory in order to
>> > > get a valid build. The Scala code especially seems to have this issue.
>> > >
>> > > regards,
>> > > Colin
>> > >
>> > >
>> > > On Tue, Jan 2, 2024, at 07:00, Nick Telford wrote:
>> > > > Addendum: I've opened a PR with what I believe are the changes
>> > > > necessary to enable Remote Build Caching, if you choose to go that
>> > > > route: https://github.com/apache/kafka/pull/15109
>> > > >
>> > > > On Tue, 2 Jan 2024 at 14:31, Nick Telford <nick.telf...@gmail.com>
>> > > > wrote:
>> > > >
>> > > >> Hi everyone,
>> > > >>
>> > > >> Regarding building a "dependency graph"... Gradle already has this
>> > > >> information, albeit fairly coarse-grained. You might be able to get
>> > > >> some considerable improvement by configuring the Gradle Remote Build
>> > > >> Cache. It looks like it's currently disabled explicitly:
>> > > >> https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
>> > > >>
>> > > >> The trick is to have trunk builds write to the cache, and PR builds
>> > > >> only read from it.
>> > > >> This way, any PR based on trunk should be able to cache not
>> > > >> only the compilation, but also the tests from dependent modules that
>> > > >> haven't changed (e.g. for a PR that only touches the connect/streams
>> > > >> modules).
>> > > >>
>> > > >> This would probably be preferable to having to hand-maintain some
>> > > >> rules/dependency graph in the CI configuration, and it's quite
>> > > >> straightforward to configure.
>> > > >>
>> > > >> Bonus points if the Remote Build Cache is readable publicly, enabling
>> > > >> contributors to benefit from it locally.
>> > > >>
>> > > >> Regards,
>> > > >> Nick
>> > > >>
>> > > >> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy
>> > > >> <lbruts...@confluent.io.invalid> wrote:
>> > > >>
>> > > >>> Thanks for all the work that has already been done on this in the
>> > > >>> past days!
>> > > >>>
>> > > >>> Have we considered running our test suite with
>> > > >>> -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as
>> > > >>> Jenkins build artifacts? This could speed up debugging. Even if we
>> > > >>> store them only for a day and do it only for trunk, I think it could
>> > > >>> be worth it. The heap dumps shouldn't contain any secrets, and I
>> > > >>> checked with the ASF infra team, and they are not concerned about the
>> > > >>> additional disk usage.
>> > > >>>
>> > > >>> Cheers,
>> > > >>> Lucas
>> > > >>>
>> > > >>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya <divijvaidy...@gmail.com>
>> > > >>> wrote:
>> > > >>> >
>> > > >>> > I have started to perform an analysis of the OOM at
>> > > >>> > https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free
>> > > >>> > to contribute to the investigation.
>> > > >>> > >> > > >>> > -- >> > > >>> > Divij Vaidya >> > > >>> > >> > > >>> > >> > > >>> > >> > > >>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan >> > > >>> <jols...@confluent.io.invalid> >> > > >>> > wrote: >> > > >>> > >> > > >>> > > I am still seeing quite a few OOM errors in the builds and I >> was >> > > >>> curious if >> > > >>> > > folks had any ideas on how to identify the cause and fix the >> > > issue. I >> > > >>> was >> > > >>> > > looking in gradle enterprise and found some info about memory >> > > usage, >> > > >>> but >> > > >>> > > nothing detailed enough to help figure the issue out. >> > > >>> > > >> > > >>> > > OOMs sometimes fail the build immediately and in other cases I >> > see >> > > it >> > > >>> get >> > > >>> > > stuck for 8 hours. (See >> > > >>> > > >> > > >>> > > >> > > >>> >> > > >> > >> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12 >> > > >>> > > ) >> > > >>> > > >> > > >>> > > I appreciate all the work folks are doing here and I will >> > continue >> > > to >> > > >>> try >> > > >>> > > to help as best as I can. >> > > >>> > > >> > > >>> > > Justine >> > > >>> > > >> > > >>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur >> > > >>> > > <david.art...@confluent.io.invalid> wrote: >> > > >>> > > >> > > >>> > > > S2. We’ve looked into this before, and it wasn’t possible at >> > the >> > > >>> time >> > > >>> > > with >> > > >>> > > > JUnit. We commonly set a timeout on each test class >> (especially >> > > >>> > > integration >> > > >>> > > > tests). It is probably worth looking at this again and >> seeing >> > if >> > > >>> > > something >> > > >>> > > > has changed with JUnit (or our usage of it) that would >> allow a >> > > >>> global >> > > >>> > > > timeout. >> > > >>> > > > >> > > >>> > > > >> > > >>> > > > S3. Dedicated infra sounds nice, if we can get it. 
It would >> at >> > > least >> > > >>> > > remove >> > > >>> > > > some variability between the builds, and hopefully eliminate >> > the >> > > >>> > > > infra/setup class of failures. >> > > >>> > > > >> > > >>> > > > >> > > >>> > > > S4. Running tests for what has changed sounds nice, but I >> think >> > > it >> > > >>> is >> > > >>> > > risky >> > > >>> > > > to implement broadly. As Sophie mentioned, there are >> probably >> > > some >> > > >>> lines >> > > >>> > > we >> > > >>> > > > could draw where we feel confident that only running a >> subset >> > of >> > > >>> tests is >> > > >>> > > > safe. As a start, we could probably work towards skipping CI >> > for >> > > >>> non-code >> > > >>> > > > PRs. >> > > >>> > > > >> > > >>> > > > >> > > >>> > > > --- >> > > >>> > > > >> > > >>> > > > >> > > >>> > > > As an aside, I experimented with build caching and running >> > > affected >> > > >>> > > tests a >> > > >>> > > > few months ago. I used the opportunity to play with Github >> > > Actions, >> > > >>> and I >> > > >>> > > > quite liked it. Here’s the workflow I used: >> > > >>> > > > >> > > >>> >> > https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml. >> > > I >> > > >>> > > > was trying to see if we could use a build cache to reduce >> the >> > > >>> compilation >> > > >>> > > > time on PRs. A nightly/periodic job would build trunk and >> > > populate a >> > > >>> > > Gradle >> > > >>> > > > build cache. PR builds would read from that cache which >> would >> > > >>> enable them >> > > >>> > > > to only compile changed code. The same idea could be >> extended >> > to >> > > >>> tests, >> > > >>> > > but >> > > >>> > > > I didn’t get that far. >> > > >>> > > > >> > > >>> > > > >> > > >>> > > > As for Github Actions, the idea there is that ASF would >> provide >> > > >>> generic >> > > >>> > > > Action “runners” that would pick up jobs from the Github >> Action >> > > >>> build >> > > >>> > > queue >> > > >>> > > > and run them. 
It is also possible to self-host runners to >> > expand >> > > the >> > > >>> > > build >> > > >>> > > > capacity of the project (i.e., other organizations could >> donate >> > > >>> > > > build capacity). The advantage of this is that we would have >> > more >> > > >>> control >> > > >>> > > > over our build/reports and not be “stuck” with whatever ASF >> > > Jenkins >> > > >>> > > offers. >> > > >>> > > > The Actions workflows are very customizable and it would >> let us >> > > >>> create >> > > >>> > > our >> > > >>> > > > own custom plugins. There is also a substantial marketplace >> of >> > > >>> plugins. I >> > > >>> > > > think it’s worth exploring this more, I just haven’t had >> time >> > > >>> lately. >> > > >>> > > > >> > > >>> > > > On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman < >> > > >>> > > sop...@responsive.dev >> > > >>> > > > > >> > > >>> > > > wrote: >> > > >>> > > > >> > > >>> > > > > Regarding: >> > > >>> > > > > >> > > >>> > > > > S-4. Separate tests ran depending on what module is >> changed. >> > > >>> > > > > > >> > > >>> > > > > - This makes sense although is tricky to implement >> > > successfully, >> > > >>> as >> > > >>> > > > > > unrelated tests may expose problems in an unrelated >> change >> > > (e.g >> > > >>> > > > changing >> > > >>> > > > > > core stuff like clients, the server, etc) >> > > >>> > > > > >> > > >>> > > > > >> > > >>> > > > > Imo this avenue could provide a massive improvement to dev >> > > >>> productivity >> > > >>> > > > > with very little effort or investment, and if we do it >> right, >> > > >>> without >> > > >>> > > > even >> > > >>> > > > > any risk. We should be able to draft a simple dependency >> > graph >> > > >>> between >> > > >>> > > > > modules and then skip the tests for anything that is >> clearly, >> > > >>> provably >> > > >>> > > > > unrelated and/or upstream of the target changes. 
This has >> the >> > > >>> potential >> > > >>> > > > to >> > > >>> > > > > substantially speed up and improve the developer >> experience >> > in >> > > >>> modules >> > > >>> > > at >> > > >>> > > > > the end of the dependency graph, which I believe is worth >> > doing >> > > >>> even if >> > > >>> > > > it >> > > >>> > > > > unfortunately would not benefit everyone equally. >> > > >>> > > > > >> > > >>> > > > > For example, we can save a lot of grief with just a simple >> > set >> > > of >> > > >>> rules >> > > >>> > > > > that are easy to check. I'll throw out a few to start >> with: >> > > >>> > > > > >> > > >>> > > > > 1. A pure docs PR (ie that only touches files under the >> > > docs/ >> > > >>> > > > directory) >> > > >>> > > > > should be allowed to skip the tests of all modules >> > > >>> > > > > 2. Connect PRs (that only touch connect/) only need to >> run >> > > the >> > > >>> > > Connect >> > > >>> > > > > tests -- ie they can skip the tests for core, clients, >> > > >>> streams, etc >> > > >>> > > > > 3. Similarly, Streams PRs should only need to run the >> > > Streams >> > > >>> tests >> > > >>> > > -- >> > > >>> > > > > but again, only if all the changes are contained within >> > > >>> streams/ >> > > >>> > > > > >> > > >>> > > > > I'll let others chime in on how or if we can construct >> some >> > > safe >> > > >>> rules >> > > >>> > > as >> > > >>> > > > > to which modules can or can't be skipped between the core, >> > > >>> clients, >> > > >>> > > raft, >> > > >>> > > > > storage, etc >> > > >>> > > > > >> > > >>> > > > > And over time we could in theory build up a literal >> > dependency >> > > >>> graph >> > > >>> > > on a >> > > >>> > > > > more granular level so that, for example, changes to the >> > > >>> core/storage >> > > >>> > > > > module are allowed to skip any Streams tests that don't >> use >> > an >> > > >>> embedded >> > > >>> > > > > broker, ie all unit tests and TopologyTestDriver-based >> > > integration >> > > >>> > > tests. 
>> > > >>> > > > > The danger here would be in making sure this graph is >> kept up >> > > to >> > > >>> date >> > > >>> > > as >> > > >>> > > > > tests are added and changed, but my point is just that >> > there's >> > > a >> > > >>> way to >> > > >>> > > > > extend the benefit of this tactic to those who work >> primarily >> > > on >> > > >>> the >> > > >>> > > core >> > > >>> > > > > module as well. Personally, I think we should just start >> out >> > > with >> > > >>> the >> > > >>> > > > > example ruleset listed above, workshop it a bit since >> there >> > > might >> > > >>> be >> > > >>> > > > other >> > > >>> > > > > obvious rules I left out, and try to implement it. >> > > >>> > > > > >> > > >>> > > > > Thoughts? >> > > >>> > > > > >> > > >>> > > > > On Tue, Dec 26, 2023 at 2:25 AM Stanislav Kozlovski >> > > >>> > > > > <stanis...@confluent.io.invalid> wrote: >> > > >>> > > > > >> > > >>> > > > > > Great discussion! >> > > >>> > > > > > >> > > >>> > > > > > >> > > >>> > > > > > Greg, that was a good call out regarding the two >> > long-running >> > > >>> > > builds. I >> > > >>> > > > > > missed that 90d view. >> > > >>> > > > > > >> > > >>> > > > > > My takeaway from that is that our average build time for >> > > tests >> > > >>> is >> > > >>> > > > between >> > > >>> > > > > > 3-4 hours. Which in of itself seems large. >> > > >>> > > > > > >> > > >>> > > > > > But then reconciling this with Sophie's statement - is >> it >> > > >>> possible >> > > >>> > > that >> > > >>> > > > > > these timed-out 8-hour builds don't get captured in that >> > > view? >> > > >>> > > > > > >> > > >>> > > > > > It is weird that people are reporting these things and >> > Gradle >> > > >>> > > > Enterprise >> > > >>> > > > > > isn't showing them. 
>> > > >>> > > > > > >> > > >>> > > > > > --- >> > > >>> > > > > > >> > > >>> > > > > > > I think that these particularly nasty builds could be >> > > >>> explained by >> > > >>> > > > > > long-tail slowdowns causing arbitrary tests to take an >> > > >>> excessive time >> > > >>> > > > to >> > > >>> > > > > > execute. >> > > >>> > > > > > >> > > >>> > > > > > I'm not sure I understood that. If the tests have >> timeouts, >> > > >>> where >> > > >>> > > would >> > > >>> > > > > the >> > > >>> > > > > > slowdown come from? Problems in tearing down the test? >> > > >>> > > > > > >> > > >>> > > > > > --- >> > > >>> > > > > > >> > > >>> > > > > > David, thanks for the great work in identifying and even >> > > fixing >> > > >>> those >> > > >>> > > > two >> > > >>> > > > > > top offenders! And thank you for cherry-picking to 3.7 >> > > >>> > > > > > >> > > >>> > > > > > -- >> > > >>> > > > > > >> > > >>> > > > > > All in all, from this thread I can summarize a few >> > potential >> > > >>> > > solutions: >> > > >>> > > > > > >> > > >>> > > > > > S-1. Dedicated work identifying and fixing some of the >> > issues >> > > >>> (e.g. >> > > >>> > > > what >> > > >>> > > > > > David did). >> > > >>> > > > > > - Should help alleviate the issues as it can be >> speculated >> > > that >> > > >>> it's >> > > >>> > > > > > frequently 1 or 2 tests causing the majority of issues. >> > > >>> > > > > > - With regards to that, KAFKA-16045 seems open for >> taking >> > if >> > > >>> there >> > > >>> > > are >> > > >>> > > > > any >> > > >>> > > > > > volunteers >> > > >>> > > > > > - Sophie's list also contains good candidates >> > > >>> > > > > > >> > > >>> > > > > > S-2. Global 10-minute timeout for tests. >> > > >>> > > > > > - Should lay the foundation for a strong catch-all for >> any >> > > >>> > > misbehaving >> > > >>> > > > > > tests. 
I like this idea since it's guaranteed to save >> each >> > > >>> > > contributor >> > > >>> > > > > many >> > > >>> > > > > > hours of waiting for an 8hr+ time out build. >> > > >>> > > > > > - Luke already has a PR out for this: >> > > >>> > > > > > https://github.com/apache/kafka/pull/15065 >> > > >>> > > > > > >> > > >>> > > > > > S-3. Separate infrastructure for our CI >> > > >>> > > > > > - This would help with Greg's comment about the >> developer >> > > >>> machine >> > > >>> > > being >> > > >>> > > > > > 2-20 times faster than the CI. >> > > >>> > > > > > - Requires volunteer funding from external companies. If >> > > every >> > > >>> > > > > contributor >> > > >>> > > > > > would bring up the idea with their employer, we may be >> able >> > > to >> > > >>> stitch >> > > >>> > > > > > something together. >> > > >>> > > > > > >> > > >>> > > > > > S-4. Separate tests ran depending on what module is >> > changed. >> > > >>> > > > > > - This makes sense although is tricky to implement >> > > >>> successfully, as >> > > >>> > > > > > unrelated tests may expose problems in an unrelated >> change >> > > (e.g >> > > >>> > > > changing >> > > >>> > > > > > core stuff like clients, the server, etc) >> > > >>> > > > > > >> > > >>> > > > > > S-5. Greater committer diligence when merging PRs >> > > >>> > > > > > - This should always be there. Unfortunately it is a bit >> > of a >> > > >>> > > > > > self-perpetuating effect in that when the builds get >> worse, >> > > >>> people >> > > >>> > > are >> > > >>> > > > > > incentivized to be less diligent (slowed down while in a >> > > rush to >> > > >>> > > merge, >> > > >>> > > > > > recency bias of failed builds, etc.) >> > > >>> > > > > > >> > > >>> > > > > > On Fri, Dec 22, 2023 at 4:16 PM Justine Olshan >> > > >>> > > > > > <jols...@confluent.io.invalid> >> > > >>> > > > > > wrote: >> > > >>> > > > > > >> > > >>> > > > > > > Thanks David! I think this should help a lot! 
>> > > >>> > > > > > > >> > > >>> > > > > > > While we should include these improvements, I think >> it is >> > > >>> also good >> > > >>> > > > to >> > > >>> > > > > > > remind folks that a lot of these issues come from >> merging >> > > on >> > > >>> builds >> > > >>> > > > > that >> > > >>> > > > > > > regress the CI. >> > > >>> > > > > > > I know I'm not perfect at this (and have merged on >> flaky >> > > and >> > > >>> > > failing >> > > >>> > > > > > > tests), but let's all be super careful going forward. >> > There >> > > >>> were a >> > > >>> > > > few >> > > >>> > > > > > > times I retried the build 10+ times and thought it was >> > > other >> > > >>> issues >> > > >>> > > > > with >> > > >>> > > > > > > the CI but the failed builds were actually due to the >> > > changes >> > > >>> I >> > > >>> > > > > wrote/was >> > > >>> > > > > > > reviewing. >> > > >>> > > > > > > >> > > >>> > > > > > > We all need to work together on this to ensure the >> builds >> > > stay >> > > >>> > > > healthy! >> > > >>> > > > > > > Thanks all for being concerned about our builds! >> > > >>> > > > > > > >> > > >>> > > > > > > Justine >> > > >>> > > > > > > >> > > >>> > > > > > > On Fri, Dec 22, 2023 at 6:02 AM David Jacot < >> > > >>> david.ja...@gmail.com >> > > >>> > > > >> > > >>> > > > > > wrote: >> > > >>> > > > > > > >> > > >>> > > > > > > > I just merged both PRs. >> > > >>> > > > > > > > >> > > >>> > > > > > > > Cheers, >> > > >>> > > > > > > > David >> > > >>> > > > > > > > >> > > >>> > > > > > > > Le ven. 22 déc. 2023 à 14:38, David Jacot < >> > > >>> david.ja...@gmail.com >> > > >>> > > > >> > > >>> > > > a >> > > >>> > > > > > > écrit >> > > >>> > > > > > > > : >> > > >>> > > > > > > > >> > > >>> > > > > > > > > Hey folks, >> > > >>> > > > > > > > > >> > > >>> > > > > > > > > I believe that my two PRs will fix most of the >> > issues. 
>> > > I >> > > >>> have >> > > >>> > > > also >> > > >>> > > > > > > > tweaked >> > > >>> > > > > > > > > the configuration of Jenkins to fix the issues >> > > relating to >> > > >>> > > > cloning >> > > >>> > > > > > the >> > > >>> > > > > > > > > repo. There may be other issues but the overall >> > > situation >> > > >>> > > should >> > > >>> > > > be >> > > >>> > > > > > > much >> > > >>> > > > > > > > > better when I merge those two. >> > > >>> > > > > > > > > >> > > >>> > > > > > > > > I will update this thread when I merge them. >> > > >>> > > > > > > > > >> > > >>> > > > > > > > > Cheers, >> > > >>> > > > > > > > > David >> > > >>> > > > > > > > > >> > > >>> > > > > > > > > Le ven. 22 déc. 2023 à 14:22, Divij Vaidya < >> > > >>> > > > > divijvaidy...@gmail.com> >> > > >>> > > > > > a >> > > >>> > > > > > > > > écrit : >> > > >>> > > > > > > > > >> > > >>> > > > > > > > >> Hey folks >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> I think David (dajac) has some fixes lined-up to >> > > improve >> > > >>> CI >> > > >>> > > such >> > > >>> > > > > as >> > > >>> > > > > > > > >> https://github.com/apache/kafka/pull/15063 and >> > > >>> > > > > > > > >> https://github.com/apache/kafka/pull/15062. >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> I have some bandwidth for the next two days to >> work >> > on >> > > >>> fixing >> > > >>> > > > the >> > > >>> > > > > > CI. >> > > >>> > > > > > > > Let >> > > >>> > > > > > > > >> me start by taking a look at the list that Sophie >> > > shared >> > > >>> here. 
>> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> -- >> > > >>> > > > > > > > >> Divij Vaidya >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen < >> > > >>> show...@gmail.com> >> > > >>> > > > > > wrote: >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> > Hi Sophie and Philip and all, >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> > I share the same pain as you. >> > > >>> > > > > > > > >> > I've been waiting for a CI build result in a PR >> > for >> > > >>> days. >> > > >>> > > > > > > > >> Unfortunately, I >> > > >>> > > > > > > > >> > can only get 1 result each day because it >> takes 8 >> > > >>> hours for >> > > >>> > > > each >> > > >>> > > > > > > run, >> > > >>> > > > > > > > >> and >> > > >>> > > > > > > > >> > with failed results. :( >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> > I've looked into the 8 hour timeout build issue >> > and >> > > >>> would >> > > >>> > > like >> > > >>> > > > > to >> > > >>> > > > > > > > >> propose >> > > >>> > > > > > > > >> > to set a global test timeout as 10 mins using >> the >> > > >>> junit5 >> > > >>> > > > feature >> > > >>> > > > > > > > >> > < >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> > > >>> > > > > > > >> > > >>> > > > > > >> > > >>> > > > > >> > > >>> > > > >> > > >>> > > >> > > >>> >> > > >> > >> https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> > . >> > > >>> > > > > > > > >> > This way, we can fail those long running tests >> > > quickly >> > > >>> > > without >> > > >>> > > > > > > > impacting >> > > >>> > > > > > > > >> > other tests. 
>> > > >>> > > > > > > > >> > PR: https://github.com/apache/kafka/pull/15065 >> > > >>> > > > > > > > >> > I've tested in my local environment and it >> works >> > as >> > > >>> > > expected. >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> > Any feedback is welcome. >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> > Thanks. >> > > >>> > > > > > > > >> > Luke >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee < >> > > >>> > > > philip...@gmail.com >> > > >>> > > > > > >> > > >>> > > > > > > > wrote: >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> > > Hey Sophie - I've gotten 2 inflight PRs each >> > with >> > > >>> more >> > > >>> > > than >> > > >>> > > > 15 >> > > >>> > > > > > > > >> retries... >> > > >>> > > > > > > > >> > > Namely: >> > > https://github.com/apache/kafka/pull/15023 >> > > >>> and >> > > >>> > > > > > > > >> > > https://github.com/apache/kafka/pull/15035 >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> > > justin filed a flaky test report here though: >> > > >>> > > > > > > > >> > > >> > https://issues.apache.org/jira/browse/KAFKA-16045 >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> > > P >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie >> > > Blee-Goldman < >> > > >>> > > > > > > > >> > sop...@responsive.dev >> > > >>> > > > > > > > >> > > > >> > > >>> > > > > > > > >> > > wrote: >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> > > > On a related note, has anyone else had >> trouble >> > > >>> getting >> > > >>> > > > even >> > > >>> > > > > a >> > > >>> > > > > > > > single >> > > >>> > > > > > > > >> > run >> > > >>> > > > > > > > >> > > > with no build failures lately? 
I've had >> > multiple >> > > >>> > > pure-docs >> > > >>> > > > > PRs >> > > >>> > > > > > > > >> blocked >> > > >>> > > > > > > > >> > > for >> > > >>> > > > > > > > >> > > > days or even weeks because of miscellaneous >> > > infra, >> > > >>> test, >> > > >>> > > > and >> > > >>> > > > > > > > timeout >> > > >>> > > > > > > > >> > > > failures. I know we just had a discussion >> > about >> > > >>> whether >> > > >>> > > > it's >> > > >>> > > > > > > > >> acceptable >> > > >>> > > > > > > > >> > > to >> > > >>> > > > > > > > >> > > > ever merge with a failing build, and the >> > > consensus >> > > >>> > > (which >> > > >>> > > > I >> > > >>> > > > > > > agree >> > > >>> > > > > > > > >> with) >> > > >>> > > > > > > > >> > > was >> > > >>> > > > > > > > >> > > > NO -- but seriously, this is getting >> > ridiculous. >> > > >>> The >> > > >>> > > build >> > > >>> > > > > > might >> > > >>> > > > > > > > be >> > > >>> > > > > > > > >> the >> > > >>> > > > > > > > >> > > > worst I've ever seen it, and it just makes >> it >> > > >>> really >> > > >>> > > > > difficult >> > > >>> > > > > > > to >> > > >>> > > > > > > > >> > > maintain >> > > >>> > > > > > > > >> > > > good will with external contributors. >> > > >>> > > > > > > > >> > > > >> > > >>> > > > > > > > >> > > > Take for example this small docs PR: >> > > >>> > > > > > > > >> > > > https://github.com/apache/kafka/pull/14949 >> > > >>> > > > > > > > >> > > > >> > > >>> > > > > > > > >> > > > It's on its 7th replay, with the first 6 >> runs >> > > all >> > > >>> having >> > > >>> > > > (at >> > > >>> > > > > > > > least) >> > > >>> > > > > > > > >> one >> > > >>> > > > > > > > >> > > > build that failed completely. 
The issues I >> saw >> > > on >> > > >>> this >> > > >>> > > one >> > > >>> > > > > PR >> > > >>> > > > > > > are >> > > >>> > > > > > > > a >> > > >>> > > > > > > > >> > good >> > > >>> > > > > > > > >> > > > summary of what I've been seeing >> elsewhere, so >> > > >>> here's >> > > >>> > > the >> > > >>> > > > > > > > briefing: >> > > >>> > > > > > > > >> > > > >> > > >>> > > > > > > > >> > > > 1. gradle issue: >> > > >>> > > > > > > > >> > > > >> > > >>> > > > > > > > >> > > > > * What went wrong: >> > > >>> > > > > > > > >> > > > > >> > > >>> > > > > > > > >> > > > > Gradle could not start your build. >> > > >>> > > > > > > > >> > > > > >> > > >>> > > > > > > > >> > > > > > Cannot create service of type >> > > >>> > > > BuildSessionActionExecutor >> > > >>> > > > > > > using >> > > >>> > > > > > > > >> > method >> > > >>> > > > > > > > >> > > > > >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> >> > > >>> > > > > > > >> > > >>> > > > > >> > > >>> > > >> > > >>> >> > > >> LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() >> > > >>> > > > > > > > >> > > > as >> > > >>> > > > > > > > >> > > > > there is a problem with parameter #21 of >> > type >> > > >>> > > > > > > > >> > > > FileSystemWatchingInformation. 
>> > > >>> > > > > > > > >> > > > > >> > > >>> > > > > > > > >> > > > > > Cannot create service of type >> > > >>> > > > > > > > >> > BuildLifecycleAwareVirtualFileSystem >> > > >>> > > > > > > > >> > > > > using method >> > > >>> > > > > > > > >> > > > > >> > > >>> > > > > > > > >> > > > >> > > >>> > > > > > > > >> > > >> > > >>> > > > > > > > >> > >> > > >>> > > > > > > > >> >> > > >>> > > > > > > > >> > > >>> > > > > > > >> > > >>> > > > > > >> > > >>> > > > > >> > > >>> > > > >> > > >>> > > >> > > >>> >> > > >> > >> VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() >> > > >>> > > > > > > > >> > > > > as there is a problem with parameter #7 >> of >> > > type >> > > >>> > > > > > > > >> GlobalCacheLocations. >> > > >>> > > > > > > > >> > > > > > Cannot create service of type >> > > >>> > > > GlobalCacheLocations >> > > >>> > > > > > > using >> > > >>> > > > > > > > >> > method >> > > >>> > > > > > > > >> > > > > >> > > >>> > > GradleUserHomeScopeServices.createGlobalCacheLocations() >> > > >>> > > > > as >> > > >>> > > > > > > > there >> > > >>> > > > > > > > >> is >> > > >>> > > > > > > > >> > a >> > > >>> > > > > > > > >> > > > > problem with parameter #1 of type >> > > >>> List<GlobalCache>. >> > > >>> > > > > > > > >> > > > > > Could not create service of >> type >> > > >>> > > > > > > > FileAccessTimeJournal >> > > >>> > > > > > > > >> > using >> > > >>> > > > > > > > >> > > > > >> > > >>> > > > GradleUserHomeScopeServices.createFileAccessTimeJournal(). >> > > >>> > > > > > > > >> > > > > > Timeout waiting to lock >> > journal >> > > >>> cache >> > > >>> > > > > > > > >> > > > > >> (/home/jenkins/.gradle/caches/journal-1). It >> > > is >> > > >>> > > > currently >> > > >>> > > > > in >> > > >>> > > > > > > use >> > > >>> > > > > > > > >> by >> > > >>> > > > > > > > >> > > > another >> > > >>> > > > > > > > >> > > > > Gradle instance. 
2. git issue:

> ERROR: Error cloning remote repo 'origin'
> hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed

3. storage test calling System.exit (I think)

> * What went wrong:
> Execution failed for task ':storage:test'.
> > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
>   This problem might be caused by incorrect test process configuration.

4. 3/4 builds aborted suddenly for no clear reason

5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a storage test:

> Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
>
> * What went wrong:
> Execution failed for task ':storage:test'.
> > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
>   This problem might be caused by incorrect test process configuration.

6. Unknown issue with a core test:

> Unexpected exception thrown.
> org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
>     at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
>     at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
>     at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
>     at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: java.lang.IllegalArgumentException
>     at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
>     at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
>     at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
>     ... 6 more
> org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>     at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>     at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>     at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>     at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>     at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> Caused by: java.net.ConnectException: Connection refused
>     at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>     at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
>     at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
>     at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
>     at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>     ... 5 more
>
> * What went wrong:
> Execution failed for task ':core:test'.
> > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
>   This problem might be caused by incorrect test process configuration.

I've seen almost all of the above issues multiple times, so it might be a good list to start with to focus any efforts on improving the build. That said, I'm not sure what we can really do about most of these, and not sure how to narrow down the root cause in the more mysterious cases of aborted builds and the builds that end with "finished with non-zero exit value 1" with no additional context (that I could find)

If nothing else, there seems to be something happening in one (or more) of the storage tests, because by far the most common failure I've seen is that in 3 & 5. Unfortunately it's not really clear to me how to tell which is the offending test, so I'm not even sure what to file a ticket for

On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:

The slowness of the CI is definitely causing us a lot of pain. I wonder if we should move to a dedicated CI infrastructure for Kafka. Our integration tests are quite heavy and ASF's CI is not really tuned for them. We could tune it for our needs and this would also allow external companies to sponsor more workers. I heard that we have a few cloud providers in the community ;). I think that we should consider this. What do you think? I already discussed this with the INFRA team. I could continue if we believe that it is a way forward.

Best,
David

On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski <stanis...@confluent.io.invalid> wrote:

Hey Николай,

Apologies about this - I wasn't aware of this behavior. I have made all the gists public.

On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:

Hey Stan,

Thanks for opening the discussion.
I haven't been looking at overall build duration recently, so it's good that you are calling it out.

I worry about us over-indexing on this one build, which itself appears to be an outlier. I only see one other build [1] above 6h overall in the last 90 days in this view: [2]
And I don't see any overlap of failed tests in these two builds, which makes it less likely that these particular failed tests are the causes of long build times.

Separately, I've been investigating build environment slowness, and trying to connect it with test failures [3]. I observed that the CI build environment is 2-20 times slower than my developer machine (M1 mac).
When I simulate a similar slowdown locally, there are tests which become significantly more flakey, often due to hard-coded timeouts.
I think that these particularly nasty builds could be explained by long-tail slowdowns causing arbitrary tests to take an excessive time to execute.

Rather than trying to find signals in these rare test failures, I think we should find tests that have these sorts of failures more regularly.
There are lots of builds in the 5-6h duration bracket, which is certainly unacceptably long. We should look into these builds to find improvements and optimizations.

[1] https://ge.apache.org/s/ygh4gbz4uma6i/
[2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
[3] https://github.com/apache/kafka/pull/15008

Thanks for looking into this!
Greg

On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:

Hello, Stanislav.

Can you, please, make the gist public.
Private gists not available for some GitHub users even if link are known.

On Dec 19, 2023, at 17:33, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:

Hey everybody,
I've heard various complaints that build times in trunk are taking too long, some taking as much as 8 hours (the timeout) - and this is slowing us down from being able to meet the code freeze deadline for 3.7.

I took it upon myself to gather up some data in Gradle Enterprise to see if there are any outlier tests that are causing this slowness. Turns out there are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/ - which took 10 hours and 29 minutes in total.

I have compiled the tests that took a disproportionately large amount of time (20m+), alongside their time, error message and a link to their full log output here - https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2

It includes failures from core, streams, storage and clients.
Interestingly, some other tests that don't fail also take a long time in what is apparently the test harness framework. See the gist for more information.

I am starting this thread with the intention of getting the discussion started and brainstorming what we can do to get the build times back under control.

--
Best,
Stanislav

--
Best,
Stanislav

--
Best,
Stanislav

--
-David
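
As an illustration of Greg's point earlier in the thread about hard-coded timeouts: on a CI host that is 2-20 times slower than a developer machine, a fixed wait can expire before the condition it guards becomes true. One common mitigation is to scale every wait by a single multiplier that is set per environment. The following is a minimal Python sketch of that idea under stated assumptions, not code from the Kafka repository; the helper names `scaled_timeout` and `wait_until` and the `TEST_TIMEOUT_MULTIPLIER` environment variable are hypothetical.

```python
import os
import time


def scaled_timeout(base_seconds, env_var="TEST_TIMEOUT_MULTIPLIER"):
    """Scale a base timeout by a multiplier read from the environment.

    The variable name is a hypothetical convention; it defaults to 1.0
    so that local runs keep the base timeout unchanged.
    """
    multiplier = float(os.environ.get(env_var, "1.0"))
    if multiplier <= 0:
        raise ValueError(f"{env_var} must be positive, got {multiplier}")
    return base_seconds * multiplier


def wait_until(condition, base_timeout_seconds, poll_interval_seconds=0.05):
    """Poll `condition` until it returns True or the scaled timeout elapses.

    Returns True if the condition was met, False if the deadline passed.
    """
    deadline = time.monotonic() + scaled_timeout(base_timeout_seconds)
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)
```

With the multiplier set to 20 on a slow runner, a test that passes 15 seconds as its base timeout would wait up to 300 seconds, while local runs keep the 15-second limit. This only hides slowness; it does not shorten the build.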