Hey folks -- I noticed some builds have been running for a day or more. I thought we limited builds to 8 hours. Any ideas why this is happening?
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/activity/

I tried to abort the build for PR-15257, and it also still seems to be running.

Justine

On Sun, Jan 14, 2024 at 6:25 AM Qichao Chu <qic...@uber.com.invalid> wrote:

> Hi Divij and all,
>
> Regarding the speeding up of the build & de-flaking tests, LinkedIn has done some great work which we probably can borrow ideas from.
> In the LinkedIn/Kafka repo, we can see one of their most recent PRs <https://github.com/linkedin/kafka/pull/500/checks> only took < 9 min (unit test) + < 12 min (integration-test) + < 9 min (code check) = < 30 min to finish all the checks:
>
>    1. Similar to what David (mumrah) has mentioned/experimented with, the LinkedIn team used GitHub Actions, which displayed the results in a cleaner way directly from GitHub.
>    2. Each top-level package is checked separately to increase the concurrency. To further boost the speed for integration tests, the tests inside one package are divided into sub-groups (A-Z) based on their names (see this job <https://github.com/linkedin/kafka/actions/runs/7303478151/> for details).
>    3. Once the tests are running at a smaller granularity with a decent runner, heavy integration tests are less likely to be flaky, and flaky tests are easier to catch.
>
> --
> Qichao
>
> On Wed, Jan 10, 2024 at 2:57 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
>
> > Hey folks
> >
> > We seem to have a handle on the OOM issues with the multiple fixes community members made. In https://issues.apache.org/jira/browse/KAFKA-16052, you can see the "before" profile in the description and the "after" profile in the latest comment to see the difference. To prevent future recurrence, we have an ongoing solution at https://github.com/apache/kafka/pull/15101 and after that we will start another one to get rid of mockito mocks at the end of every test suite using a similar extension.
> > Note that this doesn't solve the flaky test problems in the trunk but it removes the aspect of build failures due to OOM (one of the many problems).
> >
> > To fix the flaky test problem, we probably need to run our tests in a separate CI environment (like Apache Beam does) instead of sharing the 3 hosts that run our CI with many many other Apache projects. This assumption is based on the fact that the tests are less flaky when running on laptops / powerful EC2 machines. One of the avenues to get funding for these Kafka-only hosts is https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/ . I will start the conversation on this one with AWS & Apache Infra in the next 1-2 months.
> >
> > --
> > Divij Vaidya
> >
> > On Tue, Jan 9, 2024 at 9:21 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Sorry, but to put it bluntly, the current build setup isn't good enough at partial rebuilds for build caching to make sense. All Kafka devs have had the experience of needing to clean the build directory in order to get a valid build. The Scala code especially seems to have this issue.
> > >
> > > regards,
> > > Colin
> > >
> > > On Tue, Jan 2, 2024, at 07:00, Nick Telford wrote:
> > > > Addendum: I've opened a PR with what I believe are the changes necessary to enable Remote Build Caching, if you choose to go that route: https://github.com/apache/kafka/pull/15109
> > > >
> > > > On Tue, 2 Jan 2024 at 14:31, Nick Telford <nick.telf...@gmail.com> wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> Regarding building a "dependency graph"... Gradle already has this information, albeit fairly coarse-grained. You might be able to get some considerable improvement by configuring the Gradle Remote Build Cache.
> > > >> It looks like it's currently disabled explicitly: https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
> > > >>
> > > >> The trick is to have trunk builds write to the cache, and PR builds only read from it. This way, any PR based on trunk should be able to cache not only the compilation, but also the tests from dependent modules that haven't changed (e.g. for a PR that only touches the connect/streams modules).
> > > >>
> > > >> This would probably be preferable to having to hand-maintain some rules/dependency graph in the CI configuration, and it's quite straight-forward to configure.
> > > >>
> > > >> Bonus points if the Remote Build Cache is readable publicly, enabling contributors to benefit from it locally.
> > > >>
> > > >> Regards,
> > > >> Nick
> > > >>
> > > >> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy <lbruts...@confluent.io.invalid> wrote:
> > > >>
> > > >>> Thanks for all the work that has already been done on this in the past days!
> > > >>>
> > > >>> Have we considered running our test suite with -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as Jenkins build artifacts? This could speed up debugging. Even if we store them only for a day and do it only for trunk, I think it could be worth it. The heap dumps shouldn't contain any secrets, and I checked with the ASF infra team, and they are not concerned about the additional disk usage.
> > > >>>
> > > >>> Cheers,
> > > >>> Lucas
> > > >>>
> > > >>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
> > > >>>
> > > >>> > I have started to perform an analysis of the OOM at https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to contribute to the investigation.
> > > >>> >
> > > >>> > --
> > > >>> > Divij Vaidya
> > > >>> >
> > > >>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan <jols...@confluent.io.invalid> wrote:
> > > >>> >
> > > >>> > > I am still seeing quite a few OOM errors in the builds and I was curious if folks had any ideas on how to identify the cause and fix the issue. I was looking in gradle enterprise and found some info about memory usage, but nothing detailed enough to help figure the issue out.
> > > >>> > >
> > > >>> > > OOMs sometimes fail the build immediately and in other cases I see it get stuck for 8 hours. (See https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12 )
> > > >>> > >
> > > >>> > > I appreciate all the work folks are doing here and I will continue to try to help as best as I can.
> > > >>> > >
> > > >>> > > Justine
> > > >>> > >
> > > >>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur <david.art...@confluent.io.invalid> wrote:
> > > >>> > >
> > > >>> > > > S2. We’ve looked into this before, and it wasn’t possible at the time with JUnit. We commonly set a timeout on each test class (especially integration tests). It is probably worth looking at this again and seeing if something has changed with JUnit (or our usage of it) that would allow a global timeout.
> > > >>> > > >
> > > >>> > > > S3. Dedicated infra sounds nice, if we can get it.
> > > >>> > > > It would at least remove some variability between the builds, and hopefully eliminate the infra/setup class of failures.
> > > >>> > > >
> > > >>> > > > S4. Running tests for what has changed sounds nice, but I think it is risky to implement broadly. As Sophie mentioned, there are probably some lines we could draw where we feel confident that only running a subset of tests is safe. As a start, we could probably work towards skipping CI for non-code PRs.
> > > >>> > > >
> > > >>> > > > ---
> > > >>> > > >
> > > >>> > > > As an aside, I experimented with build caching and running affected tests a few months ago. I used the opportunity to play with Github Actions, and I quite liked it. Here’s the workflow I used: https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml. I was trying to see if we could use a build cache to reduce the compilation time on PRs. A nightly/periodic job would build trunk and populate a Gradle build cache. PR builds would read from that cache which would enable them to only compile changed code. The same idea could be extended to tests, but I didn’t get that far.
> > > >>> > > >
> > > >>> > > > As for Github Actions, the idea there is that ASF would provide generic Action “runners” that would pick up jobs from the Github Action build queue and run them.
> > > >>> > > > It is also possible to self-host runners to expand the build capacity of the project (i.e., other organizations could donate build capacity). The advantage of this is that we would have more control over our build/reports and not be “stuck” with whatever ASF Jenkins offers. The Actions workflows are very customizable and it would let us create our own custom plugins. There is also a substantial marketplace of plugins. I think it’s worth exploring this more, I just haven’t had time lately.
> > > >>> > > >
> > > >>> > > > On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
> > > >>> > > >
> > > >>> > > > > Regarding:
> > > >>> > > > >
> > > >>> > > > > > S-4. Separate tests run depending on what module is changed.
> > > >>> > > > > > - This makes sense although is tricky to implement successfully, as unrelated tests may expose problems in an unrelated change (e.g. changing core stuff like clients, the server, etc)
> > > >>> > > > >
> > > >>> > > > > Imo this avenue could provide a massive improvement to dev productivity with very little effort or investment, and if we do it right, without even any risk. We should be able to draft a simple dependency graph between modules and then skip the tests for anything that is clearly, provably unrelated and/or upstream of the target changes.
> > > >>> > > > > This has the potential to substantially speed up and improve the developer experience in modules at the end of the dependency graph, which I believe is worth doing even if it unfortunately would not benefit everyone equally.
> > > >>> > > > >
> > > >>> > > > > For example, we can save a lot of grief with just a simple set of rules that are easy to check. I'll throw out a few to start with:
> > > >>> > > > >
> > > >>> > > > >    1. A pure docs PR (ie that only touches files under the docs/ directory) should be allowed to skip the tests of all modules
> > > >>> > > > >    2. Connect PRs (that only touch connect/) only need to run the Connect tests -- ie they can skip the tests for core, clients, streams, etc
> > > >>> > > > >    3. Similarly, Streams PRs should only need to run the Streams tests -- but again, only if all the changes are contained within streams/
> > > >>> > > > >
> > > >>> > > > > I'll let others chime in on how or if we can construct some safe rules as to which modules can or can't be skipped between the core, clients, raft, storage, etc
> > > >>> > > > >
> > > >>> > > > > And over time we could in theory build up a literal dependency graph on a more granular level so that, for example, changes to the core/storage module are allowed to skip any Streams tests that don't use an embedded broker, ie all unit tests and TopologyTestDriver-based integration tests.
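
The three example rules in Sophie's message can be sketched as a small path-based selector. This is only an illustration under stated assumptions: the function and constant names are hypothetical, the list of suites is taken from the modules named in this thread rather than from the actual build, and any change set that does not sit entirely under one of the three directories falls back to running everything.

```python
from pathlib import PurePosixPath

# Hypothetical rule table: top-level directory -> test suites to run when a
# PR touches only that directory. Only these three rules come from the thread.
RULES = {
    "docs": set(),           # rule 1: pure docs PRs skip all module tests
    "connect": {"connect"},  # rule 2: Connect-only PRs run the Connect tests
    "streams": {"streams"},  # rule 3: Streams-only PRs run the Streams tests
}

# Assumed suite names, based on the modules mentioned in the discussion.
ALL_SUITES = {"clients", "core", "connect", "streams", "storage", "raft"}


def top_level_dir(path):
    """Return the first directory of a repo-relative path, or '' for root files."""
    parts = PurePosixPath(path).parts
    return parts[0] if len(parts) > 1 else ""


def suites_to_run(changed_files):
    """Apply a rule only when every changed file lives under one known directory."""
    tops = {top_level_dir(p) for p in changed_files}
    if len(tops) == 1:
        (top,) = tops
        if top in RULES:
            return set(RULES[top])
    # Mixed, unknown, or empty change sets: be conservative and run everything.
    return set(ALL_SUITES)
```

For instance, `suites_to_run(["docs/ops.html"])` yields an empty set, while a change set that mixes `docs/` with `core/` falls back to every suite. The conservative fallback is a design choice of this sketch; it keeps the "only if all the changes are contained within" condition from the rules explicit.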
> > > >>> > > > > The danger here would be in making sure this graph is kept up to date as tests are added and changed, but my point is just that there's a way to extend the benefit of this tactic to those who work primarily on the core module as well. Personally, I think we should just start out with the example ruleset listed above, workshop it a bit since there might be other obvious rules I left out, and try to implement it.
> > > >>> > > > >
> > > >>> > > > > Thoughts?
> > > >>> > > > >
> > > >>> > > > > On Tue, Dec 26, 2023 at 2:25 AM Stanislav Kozlovski <stanis...@confluent.io.invalid> wrote:
> > > >>> > > > >
> > > >>> > > > > > Great discussion!
> > > >>> > > > > >
> > > >>> > > > > > Greg, that was a good call out regarding the two long-running builds. I missed that 90d view.
> > > >>> > > > > >
> > > >>> > > > > > My takeaway from that is that our average build time for tests is between 3-4 hours. Which in and of itself seems large.
> > > >>> > > > > >
> > > >>> > > > > > But then reconciling this with Sophie's statement - is it possible that these timed-out 8-hour builds don't get captured in that view?
> > > >>> > > > > >
> > > >>> > > > > > It is weird that people are reporting these things and Gradle Enterprise isn't showing them.
> > > >>> > > > > >
> > > >>> > > > > > ---
> > > >>> > > > > >
> > > >>> > > > > > > I think that these particularly nasty builds could be explained by long-tail slowdowns causing arbitrary tests to take an excessive time to execute.
> > > >>> > > > > >
> > > >>> > > > > > I'm not sure I understood that. If the tests have timeouts, where would the slowdown come from? Problems in tearing down the test?
> > > >>> > > > > >
> > > >>> > > > > > ---
> > > >>> > > > > >
> > > >>> > > > > > David, thanks for the great work in identifying and even fixing those two top offenders! And thank you for cherry-picking to 3.7
> > > >>> > > > > >
> > > >>> > > > > > --
> > > >>> > > > > >
> > > >>> > > > > > All in all, from this thread I can summarize a few potential solutions:
> > > >>> > > > > >
> > > >>> > > > > > S-1. Dedicated work identifying and fixing some of the issues (e.g. what David did).
> > > >>> > > > > > - Should help alleviate the issues as it can be speculated that it's frequently 1 or 2 tests causing the majority of issues.
> > > >>> > > > > > - With regards to that, KAFKA-16045 seems open for taking if there are any volunteers
> > > >>> > > > > > - Sophie's list also contains good candidates
> > > >>> > > > > >
> > > >>> > > > > > S-2. Global 10-minute timeout for tests.
> > > >>> > > > > > - Should lay the foundation for a strong catch-all for any misbehaving tests. I like this idea since it's guaranteed to save each contributor many hours of waiting for an 8hr+ timed-out build.
> > > >>> > > > > > - Luke already has a PR out for this: https://github.com/apache/kafka/pull/15065
> > > >>> > > > > >
> > > >>> > > > > > S-3. Separate infrastructure for our CI
> > > >>> > > > > > - This would help with Greg's comment about the developer machine being 2-20 times faster than the CI.
> > > >>> > > > > > - Requires volunteer funding from external companies. If every contributor would bring up the idea with their employer, we may be able to stitch something together.
> > > >>> > > > > >
> > > >>> > > > > > S-4. Separate tests run depending on what module is changed.
> > > >>> > > > > > - This makes sense although is tricky to implement successfully, as unrelated tests may expose problems in an unrelated change (e.g. changing core stuff like clients, the server, etc)
> > > >>> > > > > >
> > > >>> > > > > > S-5. Greater committer diligence when merging PRs
> > > >>> > > > > > - This should always be there. Unfortunately it is a bit of a self-perpetuating effect in that when the builds get worse, people are incentivized to be less diligent (slowed down while in a rush to merge, recency bias of failed builds, etc.)
> > > >>> > > > > >
> > > >>> > > > > > On Fri, Dec 22, 2023 at 4:16 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
> > > >>> > > > > >
> > > >>> > > > > > > Thanks David! I think this should help a lot!
> > > >>> > > > > > >
> > > >>> > > > > > > While we should include these improvements, I think it is also good to remind folks that a lot of these issues come from merging on builds that regress the CI.
> > > >>> > > > > > > I know I'm not perfect at this (and have merged on flaky and failing tests), but let's all be super careful going forward. There were a few times I retried the build 10+ times and thought it was other issues with the CI but the failed builds were actually due to the changes I wrote/was reviewing.
> > > >>> > > > > > >
> > > >>> > > > > > > We all need to work together on this to ensure the builds stay healthy!
> > > >>> > > > > > > Thanks all for being concerned about our builds!
> > > >>> > > > > > >
> > > >>> > > > > > > Justine
> > > >>> > > > > > >
> > > >>> > > > > > > On Fri, Dec 22, 2023 at 6:02 AM David Jacot <david.ja...@gmail.com> wrote:
> > > >>> > > > > > >
> > > >>> > > > > > > > I just merged both PRs.
> > > >>> > > > > > > >
> > > >>> > > > > > > > Cheers,
> > > >>> > > > > > > > David
> > > >>> > > > > > > >
> > > >>> > > > > > > > On Fri, Dec 22, 2023 at 14:38, David Jacot <david.ja...@gmail.com> wrote:
> > > >>> > > > > > > >
> > > >>> > > > > > > > > Hey folks,
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > I believe that my two PRs will fix most of the issues.
> > > >>> > > > > > > > > I have also tweaked the configuration of Jenkins to fix the issues relating to cloning the repo. There may be other issues but the overall situation should be much better when I merge those two.
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > I will update this thread when I merge them.
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > Cheers,
> > > >>> > > > > > > > > David
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > On Fri, Dec 22, 2023 at 14:22, Divij Vaidya <divijvaidy...@gmail.com> wrote:
> > > >>> > > > > > > > >
> > > >>> > > > > > > > >> Hey folks
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> I think David (dajac) has some fixes lined-up to improve CI such as https://github.com/apache/kafka/pull/15063 and https://github.com/apache/kafka/pull/15062.
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> I have some bandwidth for the next two days to work on fixing the CI. Let me start by taking a look at the list that Sophie shared here.
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> --
> > > >>> > > > > > > > >> Divij Vaidya
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen <show...@gmail.com> wrote:
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> > Hi Sophie and Philip and all,
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > I share the same pain as you.
> > > >>> > > > > > > > >> > I've been waiting for a CI build result in a PR for days. Unfortunately, I can only get 1 result each day because it takes 8 hours for each run, and with failed results. :(
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > I've looked into the 8 hour timeout build issue and would like to propose to set a global test timeout as 10 mins using the junit5 feature <https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts>.
> > > >>> > > > > > > > >> > This way, we can fail those long running tests quickly without impacting other tests.
> > > >>> > > > > > > > >> > PR: https://github.com/apache/kafka/pull/15065
> > > >>> > > > > > > > >> > I've tested in my local environment and it works as expected.
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > Any feedback is welcome.
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > Thanks.
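
For reference, the JUnit 5 default-timeout feature linked in Luke's message is driven by a configuration parameter. A minimal sketch, assuming the parameter is supplied through a `junit-platform.properties` file on the test classpath; the PR itself may wire it differently, for example as a system property passed from Gradle:

```properties
# Applies to every test and lifecycle method that declares no @Timeout of its own.
junit.jupiter.execution.timeout.default = 10 m
```

A method-level or class-level `@Timeout` annotation still overrides this default, so tests that legitimately need longer can opt out individually.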
> > > >>> > > > > > > > >> > Luke
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee <philip...@gmail.com> wrote:
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > > Hey Sophie - I've gotten 2 inflight PRs each with more than 15 retries... Namely: https://github.com/apache/kafka/pull/15023 and https://github.com/apache/kafka/pull/15035
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> > > justin filed a flaky test report here though: https://issues.apache.org/jira/browse/KAFKA-16045
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> > > P
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> > > > On a related note, has anyone else had trouble getting even a single run with no build failures lately? I've had multiple pure-docs PRs blocked for days or even weeks because of miscellaneous infra, test, and timeout failures.
> > > >>> > > > > > > > >> > > > I know we just had a discussion about whether it's acceptable to ever merge with a failing build, and the consensus (which I agree with) was NO -- but seriously, this is getting ridiculous. The build might be the worst I've ever seen it, and it just makes it really difficult to maintain good will with external contributors.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > Take for example this small docs PR: https://github.com/apache/kafka/pull/14949
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > It's on its 7th replay, with the first 6 runs all having (at least) one build that failed completely. The issues I saw on this one PR are a good summary of what I've been seeing elsewhere, so here's the briefing:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 1. gradle issue:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > * What went wrong:
> > > >>> > > > > > > > >> > > > > Gradle could not start your build.
> > > >>> > > > > > > > >> > > > > > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
> > > >>> > > > > > > > >> > > > >    > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
> > > >>> > > > > > > > >> > > > >       > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
> > > >>> > > > > > > > >> > > > >          > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
> > > >>> > > > > > > > >> > > > >             > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 2. git issue:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > ERROR: Error cloning remote repo 'origin'
> > > >>> > > > > > > > >> > > > > hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 3. storage test calling System.exit (I think)
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > * What went wrong:
> > > >>> > > > > > > > >> > > > > Execution failed for task ':storage:test'.
> > > >>> > > > > > > > >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > > >>> > > > > > > > >> > > > >   This problem might be caused by incorrect test process configuration.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 4. 3/4 builds aborted suddenly for no clear reason
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a storage test:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > > > * What went wrong:
> > > >>> > > > > > > > >> > > > > Execution failed for task ':storage:test'.
> > > >>> > > > > > > > >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > > >>> > > > > > > > >> > > > >   This problem might be caused by incorrect test process configuration.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 6. Unknown issue with a core test:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > Unexpected exception thrown.
    org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
        at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
        at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
        at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
        at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
    Caused by: java.lang.IllegalArgumentException
        at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
        at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
        at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
        ... 6 more
    org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
        at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
        at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
        at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
        at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
        at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
        at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
    Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.Net.pollConnect(Native Method)
        at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
        at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
        at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
        at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
        at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
        at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
        ... 5 more

    * What went wrong:
    Execution failed for task ':core:test'.
    > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
      This problem might be caused by incorrect test process configuration.
I've seen almost all of the above issues multiple times, so it might be a good list to start with to focus any efforts on improving the build. That said, I'm not sure what we can really do about most of these, and I'm not sure how to narrow down the root cause in the more mysterious cases of aborted builds and the builds that end with "finished with non-zero exit value 1" with no additional context (that I could find).

If nothing else, there seems to be something happening in one (or more) of the storage tests, because by far the most common failure I've seen is that in 3 & 5. Unfortunately it's not really clear to me how to tell which is the offending test, so I'm not even sure what to file a ticket for.

On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:

The slowness of the CI is definitely causing us a lot of pain. I wonder if we should move to a dedicated CI infrastructure for Kafka. Our integration tests are quite heavy and ASF's CI is not really tuned for them. We could tune it for our needs and this would also allow external companies to sponsor more workers. I heard that we have a few cloud providers in the community ;). I think that we should consider this. What do you think? I already discussed this with the INFRA team. I could continue if we believe that it is a way forward.

Best,
David

On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski <stanis...@confluent.io.invalid> wrote:

Hey Николай,

Apologies about this - I wasn't aware of this behavior. I have made all the gists public.

On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:

Hey Stan,

Thanks for opening the discussion. I haven't been looking at overall build duration recently, so it's good that you are calling it out.
I worry about us over-indexing on this one build, which itself appears to be an outlier. I only see one other build [1] above 6h overall in the last 90 days in this view: [2]
And I don't see any overlap of failed tests in these two builds, which makes it less likely that these particular failed tests are the causes of long build times.

Separately, I've been investigating build environment slowness, and trying to connect it with test failures [3]. I observed that the CI build environment is 2-20 times slower than my developer machine (M1 Mac).
When I simulate a similar slowdown locally, there are tests which become significantly more flaky, often due to hard-coded timeouts.
I think that these particularly nasty builds could be explained by long-tail slowdowns causing arbitrary tests to take an excessive time to execute.

Rather than trying to find signals in these rare test failures, I think we should find tests that have these sorts of failures more regularly.
There are lots of builds in the 5-6h duration bracket, which is certainly unacceptably long. We should look into these builds to find improvements and optimizations.
[1] https://ge.apache.org/s/ygh4gbz4uma6i/
[2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
[3] https://github.com/apache/kafka/pull/15008

Thanks for looking into this!
Greg

On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:

Hello, Stanislav.

Can you please make the gist public? Private gists are not available to some GitHub users even if the link is known.

On Dec 19, 2023, at 17:33, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:

Hey everybody,
I've heard various complaints that build times in trunk are taking too long, some taking as much as 8 hours (the timeout) - and this is slowing us down from being able to meet the code freeze deadline for 3.7.

I took it upon myself to gather up some data in Gradle Enterprise to see if there are any outlier tests that are causing this slowness. Turns out there are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/ - which took 10 hours and 29 minutes in total.
I have compiled the tests that took a disproportionately large amount of time (20m+), alongside their time, error message and a link to their full log output here - https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2

It includes failures from core, streams, storage and clients.
Interestingly, some other tests that don't fail also take a long time in what is apparently the test harness framework. See the gist for more information.
I am starting this thread with the intention of getting the discussion started and brainstorming what we can do to get the build times back under control.

--
Best,
Stanislav

--
Best,
Stanislav

--
Best,
Stanislav

--
-David