Re: Kafka trunk test & build stability

Justine Olshan Thu, 25 Jan 2024 10:22:48 -0800

Hey folks -- I noticed some builds have been running for a day or more. I
thought we limited builds to 8 hours. Any ideas why this is happening?


https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/activity/
I tried to abort the build for PR-15257, and it also still seems to be
running.

Justine

On Sun, Jan 14, 2024 at 6:25 AM Qichao Chu <[email protected]> wrote:

> Hi Divij and all,
>
> Regarding the speeding up of the build & de-flaking tests, LinkedIn has
> done some great work which we probably can borrow ideas from.
> In the LinkedIn/Kafka repo, we can see one of their most recent PRs
> <https://github.com/linkedin/kafka/pull/500/checks> only took < 9 min (unit
> tests) + < 12 min (integration tests) + < 9 min (code check) = < 30 min to
> finish all the checks:
>
>    1. Similar to what David (mumrah) has mentioned/experimented with, the
>    LinkedIn team used GitHub Actions, which displayed the results in a
>    cleaner way directly from GitHub.
>    2. Each top-level package is checked separately to increase the
>    concurrency. To further boost the speed for integration tests, the tests
>    inside one package are divided into sub-groups (A-Z) based on their
>    names (see this job
>    <https://github.com/linkedin/kafka/actions/runs/7303478151/> for
>    details).
>    3. Once the tests are running at a smaller granularity with a decent
>    runner, heavy integration tests are less likely to be flaky, and flaky
>    tests are easier to catch.
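
A rough sketch of the name-based sub-grouping idea in item 2, assuming the split is done on the first letter of the test class's simple name. The function names and the shard count are made up for illustration; the actual LinkedIn workflow may well do this differently.

```python
import string


def shard_for(test_class: str, num_shards: int) -> int:
    """Map a test class name to a shard index using its first letter (A-Z)."""
    # Use the simple class name so the package prefix does not skew the split.
    simple_name = test_class.rsplit(".", 1)[-1]
    position = string.ascii_uppercase.find(simple_name[:1].upper())
    if position < 0:
        # Names that do not start with a letter all go to the first shard.
        return 0
    # Spread the 26 letters evenly over the available shards.
    return position * num_shards // 26


def split_into_shards(test_classes, num_shards):
    """Group test class names into num_shards lists, one per CI job."""
    shards = [[] for _ in range(num_shards)]
    for name in sorted(test_classes):
        shards[shard_for(name, num_shards)].append(name)
    return shards
```

Each CI job would then run only the classes in its own shard. Splitting by letter is simple but can be uneven, so a real setup would probably want to balance by historical test duration instead.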
>
>
> --
> Qichao
>
>
> On Wed, Jan 10, 2024 at 2:57 PM Divij Vaidya <[email protected]>
> wrote:
>
> > Hey folks
> >
> > We seem to have a handle on the OOM issues thanks to the multiple fixes
> > community members made. In
> > https://issues.apache.org/jira/browse/KAFKA-16052,
> > you can see the "before" profile in the description and the "after" profile
> > in the latest comment to see the difference. To prevent future recurrence,
> > we have an ongoing solution at https://github.com/apache/kafka/pull/15101
> > and after that we will start another one to get rid of Mockito mocks at
> > the end of every test suite using a similar extension. Note that this
> > doesn't solve the flaky test problems in trunk, but it removes the
> > aspect of build failures due to OOM (one of the many problems).
> >
> > To fix the flaky test problem, we probably need to run our tests in a
> > separate CI environment (like Apache Beam does) instead of sharing the 3
> > hosts that run our CI with many, many other Apache projects. This assumption
> > is based on the fact that the tests are less flaky when running on laptops
> > / powerful EC2 machines. One of the avenues to get funding for these
> > Kafka-only hosts is
> > https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
> > . I will start the conversation on this one with AWS & Apache Infra in the
> > next 1-2 months.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Tue, Jan 9, 2024 at 9:21 PM Colin McCabe <[email protected]> wrote:
> >
> > > Sorry, but to put it bluntly, the current build setup isn't good enough at
> > > partial rebuilds for build caching to make sense. All Kafka devs have
> > > had the experience of needing to clean the build directory in order to get
> > > a valid build. The Scala code especially seems to have this issue.
> > >
> > > regards,
> > > Colin
> > >
> > >
> > > On Tue, Jan 2, 2024, at 07:00, Nick Telford wrote:
> > > > Addendum: I've opened a PR with what I believe are the changes necessary to
> > > > enable Remote Build Caching, if you choose to go that route:
> > > > https://github.com/apache/kafka/pull/15109
> > > >
> > > > On Tue, 2 Jan 2024 at 14:31, Nick Telford <[email protected]>
> > > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > > >> Regarding building a "dependency graph"... Gradle already has this
> > > > >> information, albeit fairly coarse-grained. You might be able to get some
> > > > >> considerable improvement by configuring the Gradle Remote Build Cache. It
> > > > >> looks like it's currently disabled explicitly:
> > > > >> https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
> > > >>
> > > > >> The trick is to have trunk builds write to the cache, and PR builds only
> > > > >> read from it. This way, any PR based on trunk should be able to pull from
> > > > >> the cache not only the compilation, but also the tests from dependent
> > > > >> modules that haven't changed (e.g. for a PR that only touches the
> > > > >> connect/streams modules).
> > > > >>
> > > > >> This would probably be preferable to having to hand-maintain some
> > > > >> rules/dependency graph in the CI configuration, and it's quite
> > > > >> straightforward to configure.
> > > > >>
> > > > >> Bonus points if the Remote Build Cache is readable publicly, enabling
> > > > >> contributors to benefit from it locally.
> > > >>
> > > >> Regards,
> > > >> Nick
> > > >>
> > > >> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy <[email protected]
> > > .invalid>
> > > >> wrote:
> > > >>
> > > > >>> Thanks for all the work that has already been done on this in the past
> > > > >>> days!
> > > > >>>
> > > > >>> Have we considered running our test suite with
> > > > >>> -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as
> > > > >>> Jenkins build artifacts? This could speed up debugging. Even if we
> > > > >>> store them only for a day and do it only for trunk, I think it could
> > > > >>> be worth it. The heap dumps shouldn't contain any secrets, and I
> > > > >>> checked with the ASF infra team, and they are not concerned about the
> > > > >>> additional disk usage.
> > > >>>
> > > >>> Cheers,
> > > >>> Lucas
> > > >>>
> > > >>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya <
> > [email protected]>
> > > >>> wrote:
> > > >>> >
> > > > >>> > I have started to perform an analysis of the OOM at
> > > > >>> > https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to
> > > > >>> > contribute to the investigation.
> > > >>> >
> > > >>> > --
> > > >>> > Divij Vaidya
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan
> > > >>> <[email protected]>
> > > >>> > wrote:
> > > >>> >
> > > > >>> > > I am still seeing quite a few OOM errors in the builds and I was curious if
> > > > >>> > > folks had any ideas on how to identify the cause and fix the issue. I was
> > > > >>> > > looking in Gradle Enterprise and found some info about memory usage, but
> > > > >>> > > nothing detailed enough to help figure the issue out.
> > > > >>> > >
> > > > >>> > > OOMs sometimes fail the build immediately and in other cases I see it get
> > > > >>> > > stuck for 8 hours. (See
> > > > >>> > > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12
> > > > >>> > > )
> > > > >>> > >
> > > > >>> > > I appreciate all the work folks are doing here and I will continue to try
> > > > >>> > > to help as best as I can.
> > > >>> > >
> > > >>> > > Justine
> > > >>> > >
> > > >>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur
> > > >>> > > <[email protected]> wrote:
> > > >>> > >
> > > > >>> > > > S2. We’ve looked into this before, and it wasn’t possible at the time with
> > > > >>> > > > JUnit. We commonly set a timeout on each test class (especially integration
> > > > >>> > > > tests). It is probably worth looking at this again and seeing if something
> > > > >>> > > > has changed with JUnit (or our usage of it) that would allow a global
> > > > >>> > > > timeout.
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > S3. Dedicated infra sounds nice, if we can get it. It would at least remove
> > > > >>> > > > some variability between the builds, and hopefully eliminate the
> > > > >>> > > > infra/setup class of failures.
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > S4. Running tests for what has changed sounds nice, but I think it is risky
> > > > >>> > > > to implement broadly. As Sophie mentioned, there are probably some lines we
> > > > >>> > > > could draw where we feel confident that only running a subset of tests is
> > > > >>> > > > safe. As a start, we could probably work towards skipping CI for non-code
> > > > >>> > > > PRs.
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > ---
> > > >>> > > >
> > > >>> > > >
> > > > >>> > > > As an aside, I experimented with build caching and running affected tests a
> > > > >>> > > > few months ago. I used the opportunity to play with GitHub Actions, and I
> > > > >>> > > > quite liked it. Here’s the workflow I used:
> > > > >>> > > > https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml.
> > > > >>> > > > I was trying to see if we could use a build cache to reduce the compilation
> > > > >>> > > > time on PRs. A nightly/periodic job would build trunk and populate a Gradle
> > > > >>> > > > build cache. PR builds would read from that cache, which would enable them
> > > > >>> > > > to only compile changed code. The same idea could be extended to tests, but
> > > > >>> > > > I didn’t get that far.
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > As for GitHub Actions, the idea there is that ASF would provide generic
> > > > >>> > > > Action “runners” that would pick up jobs from the GitHub Action build queue
> > > > >>> > > > and run them. It is also possible to self-host runners to expand the build
> > > > >>> > > > capacity of the project (i.e., other organizations could donate
> > > > >>> > > > build capacity). The advantage of this is that we would have more control
> > > > >>> > > > over our build/reports and not be “stuck” with whatever ASF Jenkins offers.
> > > > >>> > > > The Actions workflows are very customizable and it would let us create our
> > > > >>> > > > own custom plugins. There is also a substantial marketplace of plugins. I
> > > > >>> > > > think it’s worth exploring this more; I just haven’t had time lately.
> > > >>> > > >
> > > >>> > > > On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman <
> > > >>> > > [email protected]
> > > >>> > > > >
> > > >>> > > > wrote:
> > > >>> > > >
> > > >>> > > > > Regarding:
> > > >>> > > > >
> > > > >>> > > > > > S-4. Separate tests run depending on what module is changed.
> > > > >>> > > > > >
> > > > >>> > > > > > - This makes sense although it is tricky to implement successfully, as
> > > > >>> > > > > > unrelated tests may expose problems in an unrelated change (e.g.
> > > > >>> > > > > > changing core stuff like clients, the server, etc)
> > > >>> > > > >
> > > >>> > > > >
> > > > >>> > > > > Imo this avenue could provide a massive improvement to dev productivity
> > > > >>> > > > > with very little effort or investment, and if we do it right, without even
> > > > >>> > > > > any risk. We should be able to draft a simple dependency graph between
> > > > >>> > > > > modules and then skip the tests for anything that is clearly, provably
> > > > >>> > > > > unrelated and/or upstream of the target changes. This has the potential to
> > > > >>> > > > > substantially speed up and improve the developer experience in modules at
> > > > >>> > > > > the end of the dependency graph, which I believe is worth doing even if it
> > > > >>> > > > > unfortunately would not benefit everyone equally.
> > > > >>> > > > >
> > > > >>> > > > > For example, we can save a lot of grief with just a simple set of rules
> > > > >>> > > > > that are easy to check. I'll throw out a few to start with:
> > > >>> > > > >
> > > > >>> > > > >    1. A pure docs PR (i.e. one that only touches files under the docs/
> > > > >>> > > > >    directory) should be allowed to skip the tests of all modules
> > > > >>> > > > >    2. Connect PRs (that only touch connect/) only need to run the Connect
> > > > >>> > > > >    tests -- i.e. they can skip the tests for core, clients, streams, etc.
> > > > >>> > > > >    3. Similarly, Streams PRs should only need to run the Streams tests --
> > > > >>> > > > >    but again, only if all the changes are contained within streams/
> > > >>> > > > >
> > > > >>> > > > > I'll let others chime in on how or if we can construct some safe rules as
> > > > >>> > > > > to which modules can or can't be skipped between the core, clients, raft,
> > > > >>> > > > > storage, etc
> > > > >>> > > > >
> > > > >>> > > > > And over time we could in theory build up a literal dependency graph on a
> > > > >>> > > > > more granular level so that, for example, changes to the core/storage
> > > > >>> > > > > module are allowed to skip any Streams tests that don't use an embedded
> > > > >>> > > > > broker, ie all unit tests and TopologyTestDriver-based integration tests.
> > > > >>> > > > > The danger here would be in making sure this graph is kept up to date as
> > > > >>> > > > > tests are added and changed, but my point is just that there's a way to
> > > > >>> > > > > extend the benefit of this tactic to those who work primarily on the core
> > > > >>> > > > > module as well. Personally, I think we should just start out with the
> > > > >>> > > > > example ruleset listed above, workshop it a bit since there might be other
> > > > >>> > > > > obvious rules I left out, and try to implement it.
> > > >>> > > > >
> > > >>> > > > > Thoughts?
> > > >>> > > > >
> > > >>> > > > > On Tue, Dec 26, 2023 at 2:25 AM Stanislav Kozlovski
> > > >>> > > > > <[email protected]> wrote:
> > > >>> > > > >
> > > >>> > > > > > Great discussion!
> > > >>> > > > > >
> > > >>> > > > > >
> > > > >>> > > > > > Greg, that was a good call out regarding the two long-running builds. I
> > > > >>> > > > > > missed that 90d view.
> > > > >>> > > > > >
> > > > >>> > > > > > My takeaway from that is that our average build time for tests is between
> > > > >>> > > > > > 3-4 hours, which in and of itself seems large.
> > > > >>> > > > > >
> > > > >>> > > > > > But then reconciling this with Sophie's statement - is it possible that
> > > > >>> > > > > > these timed-out 8-hour builds don't get captured in that view?
> > > > >>> > > > > >
> > > > >>> > > > > > It is weird that people are reporting these things and Gradle Enterprise
> > > > >>> > > > > > isn't showing them.
> > > >>> > > > > >
> > > >>> > > > > > ---
> > > >>> > > > > >
> > > > >>> > > > > > > I think that these particularly nasty builds could be explained by
> > > > >>> > > > > > > long-tail slowdowns causing arbitrary tests to take an excessive time to
> > > > >>> > > > > > > execute.
> > > > >>> > > > > >
> > > > >>> > > > > > I'm not sure I understood that. If the tests have timeouts, where would the
> > > > >>> > > > > > slowdown come from? Problems in tearing down the test?
> > > >>> > > > > >
> > > >>> > > > > > ---
> > > >>> > > > > >
> > > > >>> > > > > > David, thanks for the great work in identifying and even fixing those two
> > > > >>> > > > > > top offenders! And thank you for cherry-picking to 3.7.
> > > >>> > > > > >
> > > >>> > > > > > --
> > > >>> > > > > >
> > > > >>> > > > > > All in all, from this thread I can summarize a few potential solutions:
> > > > >>> > > > > >
> > > > >>> > > > > > S-1. Dedicated work identifying and fixing some of the issues (e.g. what
> > > > >>> > > > > > David did).
> > > > >>> > > > > > - Should help alleviate the issues as it can be speculated that it's
> > > > >>> > > > > > frequently 1 or 2 tests causing the majority of issues.
> > > > >>> > > > > > - With regards to that, KAFKA-16045 seems open for taking if there are any
> > > > >>> > > > > > volunteers
> > > > >>> > > > > > - Sophie's list also contains good candidates
> > > > >>> > > > > >
> > > > >>> > > > > > S-2. Global 10-minute timeout for tests.
> > > > >>> > > > > > - Should lay the foundation for a strong catch-all for any misbehaving
> > > > >>> > > > > > tests. I like this idea since it's guaranteed to save each contributor many
> > > > >>> > > > > > hours of waiting for an 8hr+ timed-out build.
> > > > >>> > > > > > - Luke already has a PR out for this:
> > > > >>> > > > > > https://github.com/apache/kafka/pull/15065
> > > >>> > > > > >
> > > > >>> > > > > > S-3. Separate infrastructure for our CI
> > > > >>> > > > > > - This would help with Greg's comment about the developer machine being
> > > > >>> > > > > > 2-20 times faster than the CI.
> > > > >>> > > > > > - Requires volunteer funding from external companies. If every contributor
> > > > >>> > > > > > would bring up the idea with their employer, we may be able to stitch
> > > > >>> > > > > > something together.
> > > > >>> > > > > >
> > > > >>> > > > > > S-4. Separate tests run depending on what module is changed.
> > > > >>> > > > > > - This makes sense although it is tricky to implement successfully, as
> > > > >>> > > > > > unrelated tests may expose problems in an unrelated change (e.g. changing
> > > > >>> > > > > > core stuff like clients, the server, etc)
> > > > >>> > > > > >
> > > > >>> > > > > > S-5. Greater committer diligence when merging PRs
> > > > >>> > > > > > - This should always be there. Unfortunately it is a bit of a
> > > > >>> > > > > > self-perpetuating effect in that when the builds get worse, people are
> > > > >>> > > > > > incentivized to be less diligent (slowed down while in a rush to merge,
> > > > >>> > > > > > recency bias of failed builds, etc.)
> > > >>> > > > > >
> > > >>> > > > > > On Fri, Dec 22, 2023 at 4:16 PM Justine Olshan
> > > >>> > > > > > <[email protected]>
> > > >>> > > > > > wrote:
> > > >>> > > > > >
> > > > >>> > > > > > > Thanks David! I think this should help a lot!
> > > > >>> > > > > > >
> > > > >>> > > > > > > While we should include these improvements, I think it is also good to
> > > > >>> > > > > > > remind folks that a lot of these issues come from merging on builds that
> > > > >>> > > > > > > regress the CI.
> > > > >>> > > > > > > I know I'm not perfect at this (and have merged on flaky and failing
> > > > >>> > > > > > > tests), but let's all be super careful going forward. There were a few
> > > > >>> > > > > > > times I retried the build 10+ times and thought it was other issues with
> > > > >>> > > > > > > the CI but the failed builds were actually due to the changes I wrote/was
> > > > >>> > > > > > > reviewing.
> > > > >>> > > > > > >
> > > > >>> > > > > > > We all need to work together on this to ensure the builds stay healthy!
> > > > >>> > > > > > > Thanks all for being concerned about our builds!
> > > >>> > > > > > >
> > > >>> > > > > > > Justine
> > > >>> > > > > > >
> > > >>> > > > > > > On Fri, Dec 22, 2023 at 6:02 AM David Jacot <
> > > >>> [email protected]
> > > >>> > > >
> > > >>> > > > > > wrote:
> > > >>> > > > > > >
> > > >>> > > > > > > > I just merged both PRs.
> > > >>> > > > > > > >
> > > >>> > > > > > > > Cheers,
> > > >>> > > > > > > > David
> > > >>> > > > > > > >
> > > > >>> > > > > > > > On Fri, Dec 22, 2023 at 2:38 PM David Jacot <[email protected]> wrote:
> > > >>> > > > > > > >
> > > > >>> > > > > > > > > Hey folks,
> > > > >>> > > > > > > > >
> > > > >>> > > > > > > > > I believe that my two PRs will fix most of the issues. I have also tweaked
> > > > >>> > > > > > > > > the configuration of Jenkins to fix the issues relating to cloning the
> > > > >>> > > > > > > > > repo. There may be other issues but the overall situation should be much
> > > > >>> > > > > > > > > better when I merge those two.
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > I will update this thread when I merge them.
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > Cheers,
> > > >>> > > > > > > > > David
> > > >>> > > > > > > > >
> > > > >>> > > > > > > > > On Fri, Dec 22, 2023 at 2:22 PM Divij Vaidya <[email protected]> wrote:
> > > >>> > > > > > > > >
> > > > >>> > > > > > > > >> Hey folks
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> I think David (dajac) has some fixes lined up to improve CI such as
> > > > >>> > > > > > > > >> https://github.com/apache/kafka/pull/15063 and
> > > > >>> > > > > > > > >> https://github.com/apache/kafka/pull/15062.
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> I have some bandwidth for the next two days to work on fixing the CI. Let
> > > > >>> > > > > > > > >> me start by taking a look at the list that Sophie shared here.
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> --
> > > >>> > > > > > > > >> Divij Vaidya
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen <
> > > >>> [email protected]>
> > > >>> > > > > > wrote:
> > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > Hi Sophie and Philip and all,
> > > > >>> > > > > > > > >> >
> > > > >>> > > > > > > > >> > I share the same pain as you.
> > > > >>> > > > > > > > >> > I've been waiting for a CI build result in a PR for days. Unfortunately, I
> > > > >>> > > > > > > > >> > can only get 1 result each day because it takes 8 hours for each run, and
> > > > >>> > > > > > > > >> > with failed results. :(
> > > > >>> > > > > > > > >> >
> > > > >>> > > > > > > > >> > I've looked into the 8 hour timeout build issue and would like to propose
> > > > >>> > > > > > > > >> > setting a global test timeout of 10 mins using the junit5 feature
> > > > >>> > > > > > > > >> > <https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts>.
> > > > >>> > > > > > > > >> > This way, we can fail those long running tests quickly without impacting
> > > > >>> > > > > > > > >> > other tests.
> > > > >>> > > > > > > > >> > PR: https://github.com/apache/kafka/pull/15065
> > > > >>> > > > > > > > >> > I've tested in my local environment and it works as expected.
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > Any feedback is welcome.
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > Thanks.
> > > >>> > > > > > > > >> > Luke
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee <
> > > >>> > > > [email protected]
> > > >>> > > > > >
> > > >>> > > > > > > > wrote:
> > > >>> > > > > > > > >> >
> > > > >>> > > > > > > > >> > > Hey Sophie - I've gotten 2 inflight PRs each with more than 15 retries...
> > > > >>> > > > > > > > >> > > Namely: https://github.com/apache/kafka/pull/15023 and
> > > > >>> > > > > > > > >> > > https://github.com/apache/kafka/pull/15035
> > > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > Justine filed a flaky test report here though:
> > > > >>> > > > > > > > >> > > https://issues.apache.org/jira/browse/KAFKA-16045
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> > > P
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie
> > > Blee-Goldman <
> > > >>> > > > > > > > >> > [email protected]
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > wrote:
> > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > > On a related note, has anyone else had trouble getting even a single run
> > > > >>> > > > > > > > >> > > > with no build failures lately? I've had multiple pure-docs PRs blocked for
> > > > >>> > > > > > > > >> > > > days or even weeks because of miscellaneous infra, test, and timeout
> > > > >>> > > > > > > > >> > > > failures. I know we just had a discussion about whether it's acceptable to
> > > > >>> > > > > > > > >> > > > ever merge with a failing build, and the consensus (which I agree with) was
> > > > >>> > > > > > > > >> > > > NO -- but seriously, this is getting ridiculous. The build might be the
> > > > >>> > > > > > > > >> > > > worst I've ever seen it, and it just makes it really difficult to maintain
> > > > >>> > > > > > > > >> > > > good will with external contributors.
> > > > >>> > > > > > > > >> > > >
> > > > >>> > > > > > > > >> > > > Take for example this small docs PR:
> > > > >>> > > > > > > > >> > > > https://github.com/apache/kafka/pull/14949
> > > > >>> > > > > > > > >> > > >
> > > > >>> > > > > > > > >> > > > It's on its 7th replay, with the first 6 runs all having (at least) one
> > > > >>> > > > > > > > >> > > > build that failed completely. The issues I saw on this one PR are a good
> > > > >>> > > > > > > > >> > > > summary of what I've been seeing elsewhere, so here's the briefing:
> > > >>> > > > > > > > >> > > >
> > > > >>> > > > > > > > >> > > > 1. gradle issue:
> > > > >>> > > > > > > > >> > > >
> > > > >>> > > > > > > > >> > > > > * What went wrong:
> > > > >>> > > > > > > > >> > > > >
> > > > >>> > > > > > > > >> > > > > Gradle could not start your build.
> > > > >>> > > > > > > > >> > > > >
> > > > >>> > > > > > > > >> > > > > > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
> > > > >>> > > > > > > > >> > > > >    > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
> > > > >>> > > > > > > > >> > > > >       > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
> > > > >>> > > > > > > > >> > > > >          > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
> > > > >>> > > > > > > > >> > > > >             > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 2. git issue:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > ERROR: Error cloning remote repo 'origin'
> > > >>> > > > > > > > >> > > > > hudson.plugins.git.GitException:
> > > >>> java.io.IOException:
> > > >>> > > > > Remote
> > > >>> > > > > > > > call
> > > >>> > > > > > > > >> on
> > > >>> > > > > > > > >> > > > > builds43 failed
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 3. storage test calling System.exit (I
> think)
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > * What went wrong:
> > > >>> > > > > > > > >> > > > >  Execution failed for task
> ':storage:test'.
> > > >>> > > > > > > > >> > > > >  > Process 'Gradle Test Executor 73'
> > finished
> > > >>> with
> > > >>> > > > > non-zero
> > > >>> > > > > > > exit
> > > >>> > > > > > > > >> > value
> > > >>> > > > > > > > >> > > 1
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >     This problem might be caused by
> incorrect
> > > test
> > > >>> > > process
> > > >>> > > > > > > > >> > configuration.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 4.  3/4 builds aborted suddenly for no clear reason
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a
> > > >>> > > > > > > > >> > > > storage test:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > Failed to map supported failure 'org.opentest4j.AssertionFailedError:
> > > >>> > > > > > > > >> > > > > Failed to observe commit callback before timeout' with mapper
> > > >>> > > > > > > > >> > > > > 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea':
> > > >>> > > > > > > > >> > > > > null
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > * What went wrong:
> > > >>> > > > > > > > >> > > > > Execution failed for task ':storage:test'.
> > > >>> > > > > > > > >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > > >>> > > > > > > > >> > > > >   This problem might be caused by incorrect test process configuration.
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > 6.  Unknown issue with a core test:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > Unexpected exception thrown.
> > > >>> > > > > > > > >> > > > > org.gradle.internal.remote.internal.MessageIOException: Could not read
> > > >>> > > > > > > > >> > > > > message from '/127.0.0.1:46952'.
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
> > > >>> > > > > > > > >> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> > > >>> > > > > > > > >> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> > > >>> > > > > > > > >> > > > >   at java.base/java.lang.Thread.run(Thread.java:1583)
> > > >>> > > > > > > > >> > > > > Caused by: java.lang.IllegalArgumentException
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
> > > >>> > > > > > > > >> > > > > ... 6 more
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > > > org.gradle.internal.remote.internal.ConnectException: Could not connect to
> > > >>> > > > > > > > >> > > > > server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289,
> > > >>> > > > > > > > >> > > > > addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
> > > >>> > > > > > > > >> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
> > > >>> > > > > > > > >> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
> > > >>> > > > > > > > >> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
> > > >>> > > > > > > > >> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> > > >>> > > > > > > > >> > > > > Caused by: java.net.ConnectException: Connection refused
> > > >>> > > > > > > > >> > > > >   at java.base/sun.nio.ch.Net.pollConnect(Native Method)
> > > >>> > > > > > > > >> > > > >   at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
> > > >>> > > > > > > > >> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
> > > >>> > > > > > > > >> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
> > > >>> > > > > > > > >> > > > >   at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
> > > >>> > > > > > > > >> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
> > > >>> > > > > > > > >> > > > > ... 5 more
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > >  * What went wrong:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > Execution failed for task ':core:test'.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >   This problem might be caused by incorrect test process configuration.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > I've seen almost all of the above issues multiple times, so it might be a
> > > >>> > > > > > > > >> > > > good list to start with to focus any efforts on improving the build. That
> > > >>> > > > > > > > >> > > > said, I'm not sure what we can really do about most of these, and not sure
> > > >>> > > > > > > > >> > > > how to narrow down the root cause in the more mysterious cases of aborted
> > > >>> > > > > > > > >> > > > builds and the builds that end with "finished with non-zero exit value 1"
> > > >>> > > > > > > > >> > > > with no additional context (that I could find).
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > If nothing else, there seems to be something happening in one (or more) of
> > > >>> > > > > > > > >> > > > the storage tests, because by far the most common failure I've seen is that
> > > >>> > > > > > > > >> > > > in 3 & 5. Unfortunately it's not really clear to me how to tell which is
> > > >>> > > > > > > > >> > > > the offending test, so I'm not even sure what to file a ticket for.
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > On Tue, Dec 19, 2023 at 11:55 PM David Jacot <[email protected]> wrote:
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > > > > The slowness of the CI is definitely causing us a lot of pain. I wonder if
> > > >>> > > > > > > > >> > > > > we should move to a dedicated CI infrastructure for Kafka. Our integration
> > > >>> > > > > > > > >> > > > > tests are quite heavy and ASF's CI is not really tuned for them. We could
> > > >>> > > > > > > > >> > > > > tune it for our needs and this would also allow external companies to
> > > >>> > > > > > > > >> > > > > sponsor more workers. I heard that we have a few cloud providers in
> > > >>> > > > > > > > >> > > > > the community ;). I think that we should consider this. What do you think?
> > > >>> > > > > > > > >> > > > > I already discussed this with the INFRA team. I could continue if we
> > > >>> > > > > > > > >> > > > > believe that it is a way forward.
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > > > Best,
> > > >>> > > > > > > > >> > > > > David
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > > > On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski <[email protected]> wrote:
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > > > > Hey Николай,
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > > Apologies about this - I wasn't aware of this behavior. I have made all
> > > >>> > > > > > > > >> > > > > > the gists public.
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <[email protected]> wrote:
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > > > Hey Stan,
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > Thanks for opening the discussion. I haven't been looking at overall
> > > >>> > > > > > > > >> > > > > > > build duration recently, so it's good that you are calling it out.
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > I worry about us over-indexing on this one build, which itself appears
> > > >>> > > > > > > > >> > > > > > > to be an outlier. I only see one other build [1] above 6h overall in
> > > >>> > > > > > > > >> > > > > > > the last 90 days in this view: [2]
> > > >>> > > > > > > > >> > > > > > > And I don't see any overlap of failed tests in these two builds, which
> > > >>> > > > > > > > >> > > > > > > makes it less likely that these particular failed tests are the causes
> > > >>> > > > > > > > >> > > > > > > of long build times.
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > Separately, I've been investigating build environment slowness, and
> > > >>> > > > > > > > >> > > > > > > trying to connect it with test failures [3]. I observed that the CI
> > > >>> > > > > > > > >> > > > > > > build environment is 2-20 times slower than my developer machine (M1
> > > >>> > > > > > > > >> > > > > > > mac).
> > > >>> > > > > > > > >> > > > > > > When I simulate a similar slowdown locally, there are tests which
> > > >>> > > > > > > > >> > > > > > > become significantly more flaky, often due to hard-coded timeouts.
> > > >>> > > > > > > > >> > > > > > > I think that these particularly nasty builds could be explained by
> > > >>> > > > > > > > >> > > > > > > long-tail slowdowns causing arbitrary tests to take an excessive time
> > > >>> > > > > > > > >> > > > > > > to execute.
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > Rather than trying to find signals in these rare test failures, I
> > > >>> > > > > > > > >> > > > > > > think we should find tests that have these sorts of failures more
> > > >>> > > > > > > > >> > > > > > > regularly.
> > > >>> > > > > > > > >> > > > > > > There are lots of builds in the 5-6h duration bracket, which is
> > > >>> > > > > > > > >> > > > > > > certainly unacceptably long. We should look into these builds to find
> > > >>> > > > > > > > >> > > > > > > improvements and optimizations.
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> > > >>> > > > > > > > >> > > > > > > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
> > > >>> > > > > > > > >> > > > > > > [3] https://github.com/apache/kafka/pull/15008
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > Thanks for looking into this!
> > > >>> > > > > > > > >> > > > > > > Greg
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <[email protected]> wrote:
> > > >>> > > > > > > > >> > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > Hello, Stanislav.
> > > >>> > > > > > > > >> > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > Can you please make the gist public?
> > > >>> > > > > > > > >> > > > > > > > Private gists are not available to some GitHub users even if the link
> > > >>> > > > > > > > >> > > > > > > > is known.
> > > >>> > > > > > > > >> > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski <[email protected]> wrote:
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > Hey everybody,
> > > >>> > > > > > > > >> > > > > > > > > I've heard various complaints that build times in trunk are taking too
> > > >>> > > > > > > > >> > > > > > > > > long, some taking as much as 8 hours (the timeout) - and this is
> > > >>> > > > > > > > >> > > > > > > > > slowing us down from being able to meet the code freeze deadline for
> > > >>> > > > > > > > >> > > > > > > > > 3.7.
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > I took it upon myself to gather up some data in Gradle Enterprise to
> > > >>> > > > > > > > >> > > > > > > > > see if there are any outlier tests that are causing this slowness.
> > > >>> > > > > > > > >> > > > > > > > > Turns out there are a few, in this particular build -
> > > >>> > > > > > > > >> > > > > > > > > https://ge.apache.org/s/un2hv7n6j374k/
> > > >>> > > > > > > > >> > > > > > > > > - which took 10 hours and 29 minutes in total.
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > I have compiled the tests that took a disproportionately large amount
> > > >>> > > > > > > > >> > > > > > > > > of time (20m+), alongside their time, error message and a link to
> > > >>> > > > > > > > >> > > > > > > > > their full log output here -
> > > >>> > > > > > > > >> > > > > > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > It includes failures from core, streams, storage and clients.
> > > >>> > > > > > > > >> > > > > > > > > Interestingly, some other tests that don't fail also take a long time
> > > >>> > > > > > > > >> > > > > > > > > in what is apparently the test harness framework. See the gist for
> > > >>> > > > > > > > >> > > > > > > > > more information.
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > I am starting this thread with the intention of getting the
> > > >>> > > > > > > > >> > > > > > > > > discussion started and brainstorming what we can do to get the build
> > > >>> > > > > > > > >> > > > > > > > > times back under control.
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > >
> > > >>> > > > > > > > >> > > > > > > > > --
> > > >>> > > > > > > > >> > > > > > > > > Best,
> > > >>> > > > > > > > >> > > > > > > > > Stanislav
> > > >>> > > > > > > > >> > > > > > > >
> > > >>> > > > > > > > >> > > > > > >
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > > > --
> > > >>> > > > > > > > >> > > > > > Best,
> > > >>> > > > > > > > >> > > > > > Stanislav
> > > >>> > > > > > > > >> > > > > >
> > > >>> > > > > > > > >> > > > >
> > > >>> > > > > > > > >> > > >
> > > >>> > > > > > > > >> > >
> > > >>> > > > > > > > >> >
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >
> > > >>> > > > > > > >
> > > >>> > > > > > >
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > --
> > > >>> > > > > > Best,
> > > >>> > > > > > Stanislav
> > > >>> > > > > >
> > > >>> > > > >
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > --
> > > >>> > > > -David
> > > >>> > > >
> > > >>> > >
> > > >>>
> > > >>>
> > >
> >
>
