I wonder if we've considered adding a Gradle task timeout [0] on unitTest and integrationTest tasks. The timeout applies separately for each subproject and marks the currently running test as SKIPPED on timeout. This helped me find a test which stalls builds [1].
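For illustration, a hedged sketch of what such a change might look like in build.gradle (the task names follow the `unitTest`/`integrationTest` split mentioned above; the two-hour value is an assumption, not a project setting):

```groovy
import java.time.Duration

// Sketch only: give every subproject's unitTest/integrationTest task a hard
// timeout so a single hung test cannot stall the whole build. Gradle aborts
// the task when the timeout elapses and reports it as timed out.
subprojects {
    tasks.matching { it.name == 'unitTest' || it.name == 'integrationTest' }
         .configureEach {
        timeout = Duration.ofHours(2)   // assumed value; tune per task
    }
}
```

`Task.timeout` applies per task, so each subproject's test run gets its own budget rather than sharing one global clock.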
[0] https://docs.gradle.org/8.5/userguide/more_about_tasks.html#sec:task_timeouts
[1] https://issues.apache.org/jira/browse/KAFKA-16219

Best,
Gaurav

On 2024/01/25 21:49:00 Justine Olshan wrote:
> It looks like there was some server maintenance that shut down Jenkins. Upon coming back up, the builds were expired but unable to stop.
>
> They all had similar logs:
>
> Cancelling nested steps due to timeout
> Cancelling nested steps due to timeout
> Body did not finish within grace period; terminating with extreme prejudice
> Body did not finish within grace period; terminating with extreme prejudice
> Pausing (Preparing for shutdown)
> Resuming build at Thu Jan 25 06:56:23 UTC 2024 after Jenkins restart
> Resuming build at Thu Jan 25 09:45:03 UTC 2024 after Jenkins restart
> Pausing (Preparing for shutdown)
> Resuming build at Thu Jan 25 10:37:41 UTC 2024 after Jenkins restart
> Timeout expired 7 hr 39 min ago
> Timeout expired 7 hr 39 min ago
> Cancelling nested steps due to timeout
> Cancelling nested steps due to timeout
> *02:37:41* Waiting for reconnection of builds41 before proceeding with build
> *02:37:41* Waiting for reconnection of builds32 before proceeding with build
> Still paused
> Body did not finish within grace period; terminating with extreme prejudice
> Body did not finish within grace period; terminating with extreme prejudice
>
> I forcibly killed the builds running over one day to free resources. I believe the rest are running as expected now.
>
> Justine
>
> On Thu, Jan 25, 2024 at 10:22 AM Justine Olshan <jo...@confluent.io> wrote:
>
> > Hey folks -- I noticed some builds have been running for a day or more. I thought we limited builds to 8 hours. Any ideas why this is happening?
> >
> > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/activity/
> > I tried to abort the build for PR-15257, and it also still seems to be running.
> >
> > Justine
> >
> > On Sun, Jan 14, 2024 at 6:25 AM Qichao Chu <qi...@uber.com.invalid> wrote:
> >
> >> Hi Divij and all,
> >>
> >> Regarding speeding up the build and de-flaking tests, LinkedIn has done some great work which we can probably borrow ideas from. In the LinkedIn/Kafka repo, one of their most recent PRs <https://github.com/linkedin/kafka/pull/500/checks> took only < 9 min (unit test) + < 12 min (integration test) + < 9 min (code check) = < 30 min to finish all the checks:
> >>
> >> 1. Similar to what David (mumrah) has mentioned/experimented with, the LinkedIn team used GitHub Actions, which displays the results in a cleaner way directly from GitHub.
> >> 2. Each top-level package is checked separately to increase concurrency. To further boost the speed of integration tests, the tests inside one package are divided into sub-groups (A-Z) based on their names (see this job <https://github.com/linkedin/kafka/actions/runs/7303478151/> for details).
> >> 3. Once the tests are running at a smaller granularity with a decent runner, heavy integration tests are less likely to be flaky, and flaky tests are easier to catch.
> >>
> >> --
> >> Qichao
> >>
> >> On Wed, Jan 10, 2024 at 2:57 PM Divij Vaidya <di...@gmail.com> wrote:
> >>
> >> > Hey folks,
> >> >
> >> > We seem to have a handle on the OOM issues thanks to the multiple fixes community members made. In https://issues.apache.org/jira/browse/KAFKA-16052, you can see the "before" profile in the description and the "after" profile in the latest comment to see the difference. To prevent future recurrence, we have an ongoing solution at https://github.com/apache/kafka/pull/15101, and after that we will start another one to get rid of Mockito mocks at the end of every test suite using a similar extension.
> >> > Note that this doesn't solve the flaky test problems in trunk, but it removes the aspect of build failures due to OOM (one of the many problems).
> >> >
> >> > To fix the flaky test problem, we probably need to run our tests in a separate CI environment (like Apache Beam does) instead of sharing the 3 hosts that run our CI with many other Apache projects. This assumption is based on the fact that the tests are less flaky when running on laptops / powerful EC2 machines. One of the avenues to get funding for these Kafka-only hosts is https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/. I will start the conversation on this one with AWS & Apache Infra in the next 1-2 months.
> >> >
> >> > --
> >> > Divij Vaidya
> >> >
> >> > On Tue, Jan 9, 2024 at 9:21 PM Colin McCabe <cm...@apache.org> wrote:
> >> >
> >> > > Sorry, but to put it bluntly, the current build setup isn't good enough at partial rebuilds for build caching to make sense. All Kafka devs have had the experience of needing to clean the build directory in order to get a valid build. The Scala code especially seems to have this issue.
> >> > >
> >> > > regards,
> >> > > Colin
> >> > >
> >> > > On Tue, Jan 2, 2024, at 07:00, Nick Telford wrote:
> >> > > > Addendum: I've opened a PR with what I believe are the changes necessary to enable Remote Build Caching, if you choose to go that route: https://github.com/apache/kafka/pull/15109
> >> > > >
> >> > > > On Tue, 2 Jan 2024 at 14:31, Nick Telford <ni...@gmail.com> wrote:
> >> > > >
> >> > > >> Hi everyone,
> >> > > >>
> >> > > >> Regarding building a "dependency graph"... Gradle already has this information, albeit fairly coarse-grained.
> >> > > >> You might be able to get some considerable improvement by configuring the Gradle Remote Build Cache. It looks like it's currently disabled explicitly: https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
> >> > > >>
> >> > > >> The trick is to have trunk builds write to the cache, and PR builds only read from it. This way, any PR based on trunk should be able to cache not only the compilation, but also the tests from dependent modules that haven't changed (e.g. for a PR that only touches the connect/streams modules).
> >> > > >>
> >> > > >> This would probably be preferable to having to hand-maintain some rules/dependency graph in the CI configuration, and it's quite straightforward to configure.
> >> > > >>
> >> > > >> Bonus points if the Remote Build Cache is readable publicly, enabling contributors to benefit from it locally.
> >> > > >>
> >> > > >> Regards,
> >> > > >> Nick
> >> > > >>
> >> > > >> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy <lbruts...@confluent.io.invalid> wrote:
> >> > > >>
> >> > > >>> Thanks for all the work that has already been done on this in the past days!
> >> > > >>>
> >> > > >>> Have we considered running our test suite with -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as Jenkins build artifacts? This could speed up debugging. Even if we store them only for a day and do it only for trunk, I think it could be worth it. The heap dumps shouldn't contain any secrets, and I checked with the ASF infra team, and they are not concerned about the additional disk usage.
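Nick's trunk-writes / PR-reads split could be expressed in settings.gradle along these lines (a sketch under assumptions: the cache URL is a placeholder, and the branch environment variable name depends on the CI system):

```groovy
// Sketch only, not the project's actual configuration.
buildCache {
    local {
        enabled = true
    }
    remote(HttpBuildCache) {
        url = 'https://cache.example.org/kafka/'   // placeholder endpoint
        // Only trunk builds publish cache entries; PR builds just read them.
        push = System.getenv('BRANCH_NAME') == 'trunk'
    }
}
```

With `push` gated on the branch, a misbehaving PR can never poison the cache that other builds read from.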
> >> > > >>>
> >> > > >>> Cheers,
> >> > > >>> Lucas
> >> > > >>>
> >> > > >>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
> >> > > >>> >
> >> > > >>> > I have started to perform an analysis of the OOM at https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to contribute to the investigation.
> >> > > >>> >
> >> > > >>> > --
> >> > > >>> > Divij Vaidya
> >> > > >>> >
> >> > > >>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan <jo...@confluent.io.invalid> wrote:
> >> > > >>> >
> >> > > >>> > > I am still seeing quite a few OOM errors in the builds and I was curious if folks had any ideas on how to identify the cause and fix the issue. I was looking in Gradle Enterprise and found some info about memory usage, but nothing detailed enough to help figure the issue out.
> >> > > >>> > >
> >> > > >>> > > OOMs sometimes fail the build immediately, and in other cases I see it get stuck for 8 hours. (See https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12)
> >> > > >>> > >
> >> > > >>> > > I appreciate all the work folks are doing here and I will continue to try to help as best as I can.
> >> > > >>> > >
> >> > > >>> > > Justine
> >> > > >>> > >
> >> > > >>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur <da...@confluent.io.invalid> wrote:
> >> > > >>> > >
> >> > > >>> > > > S2. We’ve looked into this before, and it wasn’t possible at the time with JUnit.
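Lucas's -XX:+HeapDumpOnOutOfMemoryError idea above could be wired into the test tasks; a hedged sketch for build.gradle (the dump directory is an assumption):

```groovy
// Sketch only: have every test JVM write a heap dump on OutOfMemoryError,
// into a directory the CI job could then archive as a build artifact.
tasks.withType(Test).configureEach {
    jvmArgs '-XX:+HeapDumpOnOutOfMemoryError',
            "-XX:HeapDumpPath=${layout.buildDirectory.get()}/heap-dumps"
}
```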
> >> > > >>> > > > We commonly set a timeout on each test class (especially integration tests). It is probably worth looking at this again and seeing if something has changed with JUnit (or our usage of it) that would allow a global timeout.
> >> > > >>> > > >
> >> > > >>> > > > S3. Dedicated infra sounds nice, if we can get it. It would at least remove some variability between the builds, and hopefully eliminate the infra/setup class of failures.
> >> > > >>> > > >
> >> > > >>> > > > S4. Running tests for what has changed sounds nice, but I think it is risky to implement broadly. As Sophie mentioned, there are probably some lines we could draw where we feel confident that only running a subset of tests is safe. As a start, we could probably work towards skipping CI for non-code PRs.
> >> > > >>> > > >
> >> > > >>> > > > ---
> >> > > >>> > > >
> >> > > >>> > > > As an aside, I experimented with build caching and running affected tests a few months ago. I used the opportunity to play with GitHub Actions, and I quite liked it. Here’s the workflow I used: https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml. I was trying to see if we could use a build cache to reduce the compilation time on PRs.
> >> > > >>> > > > A nightly/periodic job would build trunk and populate a Gradle build cache. PR builds would read from that cache, which would enable them to only compile changed code. The same idea could be extended to tests, but I didn’t get that far.
> >> > > >>> > > >
> >> > > >>> > > > As for GitHub Actions, the idea there is that ASF would provide generic Action “runners” that would pick up jobs from the GitHub Actions build queue and run them. It is also possible to self-host runners to expand the build capacity of the project (i.e., other organizations could donate build capacity). The advantage of this is that we would have more control over our build/reports and not be “stuck” with whatever ASF Jenkins offers. The Actions workflows are very customizable and would let us create our own custom plugins. There is also a substantial marketplace of plugins. I think it’s worth exploring this more, I just haven’t had time lately.
> >> > > >>> > > >
> >> > > >>> > > > On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
> >> > > >>> > > >
> >> > > >>> > > > > Regarding:
> >> > > >>> > > > >
> >> > > >>> > > > > S-4. Separate tests ran depending on what module is changed.
> >> > > >>> > > > > > - This makes sense although is tricky to implement successfully, as unrelated tests may expose problems in an unrelated change (e.g. changing core stuff like clients, the server, etc)
> >> > > >>> > > > >
> >> > > >>> > > > > Imo this avenue could provide a massive improvement to dev productivity with very little effort or investment, and if we do it right, without even any risk. We should be able to draft a simple dependency graph between modules and then skip the tests for anything that is clearly, provably unrelated and/or upstream of the target changes. This has the potential to substantially speed up and improve the developer experience in modules at the end of the dependency graph, which I believe is worth doing even if it unfortunately would not benefit everyone equally.
> >> > > >>> > > > >
> >> > > >>> > > > > For example, we can save a lot of grief with just a simple set of rules that are easy to check. I'll throw out a few to start with:
> >> > > >>> > > > >
> >> > > >>> > > > > 1. A pure docs PR (i.e. one that only touches files under the docs/ directory) should be allowed to skip the tests of all modules
> >> > > >>> > > > > 2. Connect PRs (that only touch connect/) only need to run the Connect tests -- i.e. they can skip the tests for core, clients, streams, etc
> >> > > >>> > > > > 3.
> >> > > >>> > > > > Similarly, Streams PRs should only need to run the Streams tests -- but again, only if all the changes are contained within streams/
> >> > > >>> > > > >
> >> > > >>> > > > > I'll let others chime in on how or if we can construct some safe rules as to which modules can or can't be skipped between the core, clients, raft, storage, etc
> >> > > >>> > > > >
> >> > > >>> > > > > And over time we could in theory build up a literal dependency graph on a more granular level so that, for example, changes to the co [message truncated...]