I wonder if we've considered adding a Gradle task timeout [0] on unitTest and integrationTest tasks. The timeout applies separately for each subproject and marks the currently running test as SKIPPED on timeout. This helped me find a test which stalls builds [1].
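For illustration, a hedged sketch of what such a change might look like in build.gradle (the task names follow the `unitTest`/`integrationTest` split mentioned above; the two-hour value is an assumption, not a project setting):

```groovy
import java.time.Duration

// Sketch only: give every subproject's unitTest/integrationTest task a hard
// timeout so a single hung test cannot stall the whole build. Gradle aborts
// the task when the timeout elapses and reports it as timed out.
subprojects {
    tasks.matching { it.name == 'unitTest' || it.name == 'integrationTest' }
         .configureEach {
        timeout = Duration.ofHours(2)   // assumed value; tune per task
    }
}
```

`Task.timeout` applies per task, so each subproject's test run gets its own budget rather than sharing one global clock.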
[0] https://docs.gradle.org/8.5/userguide/more_about_tasks.html#sec:task_timeouts
[1] https://issues.apache.org/jira/browse/KAFKA-16219

Best,
Gaurav

On 2024/01/25 21:49:00 Justine Olshan wrote:
> It looks like there was some server maintenance that shut down Jenkins. Upon coming back up, the builds were expired but unable to stop.
>
> They all had similar logs:
>
> Cancelling nested steps due to timeout
> Cancelling nested steps due to timeout
> Body did not finish within grace period; terminating with extreme prejudice
> Body did not finish within grace period; terminating with extreme prejudice
> Pausing (Preparing for shutdown)
> Resuming build at Thu Jan 25 06:56:23 UTC 2024 after Jenkins restart
> Resuming build at Thu Jan 25 09:45:03 UTC 2024 after Jenkins restart
> Pausing (Preparing for shutdown)
> Resuming build at Thu Jan 25 10:37:41 UTC 2024 after Jenkins restart
> Timeout expired 7 hr 39 min ago
> Timeout expired 7 hr 39 min ago
> Cancelling nested steps due to timeout
> Cancelling nested steps due to timeout
> *02:37:41* Waiting for reconnection of builds41 before proceeding with build
> *02:37:41* Waiting for reconnection of builds32 before proceeding with build
> Still paused
> Body did not finish within grace period; terminating with extreme prejudice
> Body did not finish within grace period; terminating with extreme prejudice
>
> I forcibly killed the builds running over one day to free resources. I believe the rest are running as expected now.
>
> Justine
>
> On Thu, Jan 25, 2024 at 10:22 AM Justine Olshan <jo...@confluent.io> wrote:
>
> > Hey folks -- I noticed some builds have been running for a day or more. I thought we limited builds to 8 hours. Any ideas why this is happening?
> >
> > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/activity/
> > I tried to abort the build for PR-15257, and it also still seems to be running.
> >
> > Justine
> >
> > On Sun, Jan 14, 2024 at 6:25 AM Qichao Chu <qi...@uber.com.invalid> wrote:
> >
> >> Hi Divij and all,
> >>
> >> Regarding speeding up the build and de-flaking tests, LinkedIn has done some great work which we can probably borrow ideas from. In the LinkedIn/Kafka repo, one of their most recent PRs <https://github.com/linkedin/kafka/pull/500/checks> took only < 9 min (unit test) + < 12 min (integration test) + < 9 min (code check) = < 30 min to finish all the checks:
> >>
> >> 1. Similar to what David (mumrah) has mentioned/experimented with, the LinkedIn team used GitHub Actions, which displays the results in a cleaner way directly from GitHub.
> >> 2. Each top-level package is checked separately to increase concurrency. To further boost the speed of integration tests, the tests inside one package are divided into sub-groups (A-Z) based on their names (see this job <https://github.com/linkedin/kafka/actions/runs/7303478151/> for details).
> >> 3. Once the tests are running at a smaller granularity with a decent runner, heavy integration tests are less likely to be flaky, and flaky tests are easier to catch.
> >>
> >> --
> >> Qichao
> >>
> >> On Wed, Jan 10, 2024 at 2:57 PM Divij Vaidya <di...@gmail.com> wrote:
> >>
> >> > Hey folks,
> >> >
> >> > We seem to have a handle on the OOM issues thanks to the multiple fixes community members made. In https://issues.apache.org/jira/browse/KAFKA-16052, you can see the "before" profile in the description and the "after" profile in the latest comment to see the difference. To prevent future recurrence, we have an ongoing solution at https://github.com/apache/kafka/pull/15101, and after that we will start another one to get rid of Mockito mocks at the end of every test suite using a similar extension.
> >> > Note that this doesn't solve the flaky test problems in trunk, but it removes the aspect of build failures due to OOM (one of the many problems).
> >> >
> >> > To fix the flaky test problem, we probably need to run our tests in a separate CI environment (like Apache Beam does) instead of sharing the 3 hosts that run our CI with many other Apache projects. This assumption is based on the fact that the tests are less flaky when running on laptops / powerful EC2 machines. One of the avenues to get funding for these Kafka-only hosts is https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/. I will start the conversation on this one with AWS & Apache Infra in the next 1-2 months.
> >> >
> >> > --
> >> > Divij Vaidya
> >> >
> >> > On Tue, Jan 9, 2024 at 9:21 PM Colin McCabe <cm...@apache.org> wrote:
> >> >
> >> > > Sorry, but to put it bluntly, the current build setup isn't good enough at partial rebuilds for build caching to make sense. All Kafka devs have had the experience of needing to clean the build directory in order to get a valid build. The Scala code especially seems to have this issue.
> >> > >
> >> > > regards,
> >> > > Colin
> >> > >
> >> > > On Tue, Jan 2, 2024, at 07:00, Nick Telford wrote:
> >> > > > Addendum: I've opened a PR with what I believe are the changes necessary to enable Remote Build Caching, if you choose to go that route: https://github.com/apache/kafka/pull/15109
> >> > > >
> >> > > > On Tue, 2 Jan 2024 at 14:31, Nick Telford <ni...@gmail.com> wrote:
> >> > > >
> >> > > >> Hi everyone,
> >> > > >>
> >> > > >> Regarding building a "dependency graph"... Gradle already has this information, albeit fairly coarse-grained.
> >> > > >> You might be able to get some considerable improvement by configuring the Gradle Remote Build Cache. It looks like it's currently disabled explicitly: https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
> >> > > >>
> >> > > >> The trick is to have trunk builds write to the cache, and PR builds only read from it. This way, any PR based on trunk should be able to cache not only the compilation, but also the tests from dependent modules that haven't changed (e.g. for a PR that only touches the connect/streams modules).
> >> > > >>
> >> > > >> This would probably be preferable to having to hand-maintain some rules/dependency graph in the CI configuration, and it's quite straightforward to configure.
> >> > > >>
> >> > > >> Bonus points if the Remote Build Cache is readable publicly, enabling contributors to benefit from it locally.
> >> > > >>
> >> > > >> Regards,
> >> > > >> Nick
> >> > > >>
> >> > > >> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy <lbruts...@confluent.io.invalid> wrote:
> >> > > >>
> >> > > >>> Thanks for all the work that has already been done on this in the past days!
> >> > > >>>
> >> > > >>> Have we considered running our test suite with -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as Jenkins build artifacts? This could speed up debugging. Even if we store them only for a day and do it only for trunk, I think it could be worth it. The heap dumps shouldn't contain any secrets, and I checked with the ASF infra team, and they are not concerned about the additional disk usage.
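Nick's trunk-writes / PR-reads split could be expressed in settings.gradle along these lines (a sketch under assumptions: the cache URL is a placeholder, and the branch environment variable name depends on the CI system):

```groovy
// Sketch only, not the project's actual configuration.
buildCache {
    local {
        enabled = true
    }
    remote(HttpBuildCache) {
        url = 'https://cache.example.org/kafka/'   // placeholder endpoint
        // Only trunk builds publish cache entries; PR builds just read them.
        push = System.getenv('BRANCH_NAME') == 'trunk'
    }
}
```

With `push` gated on the branch, a misbehaving PR can never poison the cache that other builds read from.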
> >> > > >>>
> >> > > >>> Cheers,
> >> > > >>> Lucas
> >> > > >>>
> >> > > >>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
> >> > > >>> >
> >> > > >>> > I have started to perform an analysis of the OOM at https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to contribute to the investigation.
> >> > > >>> >
> >> > > >>> > --
> >> > > >>> > Divij Vaidya
> >> > > >>> >
> >> > > >>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan <jo...@confluent.io.invalid> wrote:
> >> > > >>> >
> >> > > >>> > > I am still seeing quite a few OOM errors in the builds and I was curious if folks had any ideas on how to identify the cause and fix the issue. I was looking in Gradle Enterprise and found some info about memory usage, but nothing detailed enough to help figure the issue out.
> >> > > >>> > >
> >> > > >>> > > OOMs sometimes fail the build immediately, and in other cases I see it get stuck for 8 hours. (See https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12)
> >> > > >>> > >
> >> > > >>> > > I appreciate all the work folks are doing here and I will continue to try to help as best as I can.
> >> > > >>> > >
> >> > > >>> > > Justine
> >> > > >>> > >
> >> > > >>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur <da...@confluent.io.invalid> wrote:
> >> > > >>> > >
> >> > > >>> > > > S2. We’ve looked into this before, and it wasn’t possible at the time with JUnit.
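Lucas's -XX:+HeapDumpOnOutOfMemoryError idea above could be wired into the test tasks; a hedged sketch for build.gradle (the dump directory is an assumption):

```groovy
// Sketch only: have every test JVM write a heap dump on OutOfMemoryError,
// into a directory the CI job could then archive as a build artifact.
tasks.withType(Test).configureEach {
    jvmArgs '-XX:+HeapDumpOnOutOfMemoryError',
            "-XX:HeapDumpPath=${layout.buildDirectory.get()}/heap-dumps"
}
```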
> >> > > >>> > > > We commonly set a timeout on each test class (especially integration tests). It is probably worth looking at this again and seeing if something has changed with JUnit (or our usage of it) that would allow a global timeout.
> >> > > >>> > > >
> >> > > >>> > > > S3. Dedicated infra sounds nice, if we can get it. It would at least remove some variability between the builds, and hopefully eliminate the infra/setup class of failures.
> >> > > >>> > > >
> >> > > >>> > > > S4. Running tests for what has changed sounds nice, but I think it is risky to implement broadly. As Sophie mentioned, there are probably some lines we could draw where we feel confident that only running a subset of tests is safe. As a start, we could probably work towards skipping CI for non-code PRs.
> >> > > >>> > > >
> >> > > >>> > > > ---
> >> > > >>> > > >
> >> > > >>> > > > As an aside, I experimented with build caching and running affected tests a few months ago. I used the opportunity to play with GitHub Actions, and I quite liked it. Here’s the workflow I used: https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml. I was trying to see if we could use a build cache to reduce the compilation time on PRs.
> >> > > >>> > > > A nightly/periodic job would build trunk and populate a Gradle build cache. PR builds would read from that cache, which would enable them to only compile changed code. The same idea could be extended to tests, but I didn’t get that far.
> >> > > >>> > > >
> >> > > >>> > > > As for GitHub Actions, the idea there is that ASF would provide generic Action “runners” that would pick up jobs from the GitHub Actions build queue and run them. It is also possible to self-host runners to expand the build capacity of the project (i.e., other organizations could donate build capacity). The advantage of this is that we would have more control over our build/reports and not be “stuck” with whatever ASF Jenkins offers. The Actions workflows are very customizable and would let us create our own custom plugins. There is also a substantial marketplace of plugins. I think it’s worth exploring this more, I just haven’t had time lately.
> >> > > >>> > > >
> >> > > >>> > > > On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
> >> > > >>> > > >
> >> > > >>> > > > > Regarding:
> >> > > >>> > > > >
> >> > > >>> > > > > S-4. Separate tests ran depending on what module is changed.
> >> > > >>> > > > > > - This makes sense although is tricky to implement successfully, as unrelated tests may expose problems in an unrelated change (e.g. changing core stuff like clients, the server, etc)
> >> > > >>> > > > >
> >> > > >>> > > > > Imo this avenue could provide a massive improvement to dev productivity with very little effort or investment, and if we do it right, without even any risk. We should be able to draft a simple dependency graph between modules and then skip the tests for anything that is clearly, provably unrelated and/or upstream of the target changes. This has the potential to substantially speed up and improve the developer experience in modules at the end of the dependency graph, which I believe is worth doing even if it unfortunately would not benefit everyone equally.
> >> > > >>> > > > >
> >> > > >>> > > > > For example, we can save a lot of grief with just a simple set of rules that are easy to check. I'll throw out a few to start with:
> >> > > >>> > > > >
> >> > > >>> > > > > 1. A pure docs PR (i.e. one that only touches files under the docs/ directory) should be allowed to skip the tests of all modules
> >> > > >>> > > > > 2. Connect PRs (that only touch connect/) only need to run the Connect tests -- i.e. they can skip the tests for core, clients, streams, etc
> >> > > >>> > > > > 3.
> >> > > >>> > > > > Similarly, Streams PRs should only need to run the Streams tests -- but again, only if all the changes are contained within streams/
> >> > > >>> > > > >
> >> > > >>> > > > > I'll let others chime in on how or if we can construct some safe rules as to which modules can or can't be skipped between the core, clients, raft, storage, etc
> >> > > >>> > > > >
> >> > > >>> > > > > And over time we could in theory build up a literal dependency graph on a more granular level so that, for example, changes to the co [message truncated...]