Hey Stan,

Thanks for opening the discussion. I haven't been looking at overall
build duration recently, so it's good that you are calling it out.

I worry about us over-indexing on this one build, which itself appears
to be an outlier. I only see one other build [1] above 6h overall in
the last 90 days in this view: [2]
And I don't see any overlap of failed tests in these two builds, which
makes it less likely that these particular failed tests are the causes
of long build times.

Separately, I've been investigating build environment slowness, and
trying to connect it with test failures [3]. I observed that the CI
build environment is 2-20 times slower than my developer machine (M1
Mac).
When I simulate a similar slowdown locally, some tests become
significantly flakier, often because of hard-coded timeouts.
I think that these particularly nasty builds could be explained by
long-tail slowdowns causing arbitrary tests to take an excessive
amount of time to execute.
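
To illustrate the hard-coded timeout problem, here is a minimal sketch in
Python (not taken from the Kafka code base, where tests are Java/Scala).
The names scaled_timeout, wait_until, FakeClock and run_case, and the
TEST_TIMEOUT_FACTOR environment variable, are all hypothetical. A fake
clock stands in for real time so the example runs instantly.

```python
import os


def scaled_timeout(base_seconds, factor=None):
    """Scale a timeout by a slowdown factor.

    TEST_TIMEOUT_FACTOR is a hypothetical environment variable that a slow
    CI environment could set; it is an assumption of this sketch.
    """
    if factor is None:
        factor = float(os.environ.get("TEST_TIMEOUT_FACTOR", "1"))
    return base_seconds * factor


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def wait_until(condition, timeout_seconds, clock, poll_seconds=1.0):
    """Poll condition() until it is true or the timeout elapses."""
    deadline = clock.time() + timeout_seconds
    while True:
        if condition():
            return True
        if clock.time() >= deadline:
            return False
        clock.sleep(poll_seconds)


def run_case(work_seconds, timeout_seconds):
    """Simulate waiting for an operation that needs work_seconds to finish."""
    clock = FakeClock()
    return wait_until(lambda: clock.time() >= work_seconds, timeout_seconds, clock)


# An operation that takes 2s on a developer machine, with a 5s timeout.
passes_locally = run_case(work_seconds=2.0, timeout_seconds=5.0)
# The same operation on an environment that is 10x slower takes 20s.
passes_on_slow_ci = run_case(work_seconds=20.0, timeout_seconds=5.0)
# Scaling the timeout by the same factor restores the original margin.
passes_with_scaling = run_case(
    work_seconds=20.0, timeout_seconds=scaled_timeout(5.0, factor=10.0)
)
```

With a fixed 5s timeout the wait succeeds locally but fails once the work is
10x slower, even though nothing is functionally wrong; scaling the timeout
by the slowdown factor is one possible mitigation, not necessarily the
right one for every test.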

Rather than trying to find signals in these rare test failures, I
think we should find tests that have these sorts of failures more
regularly.
There are lots of builds in the 5-6h duration bracket, which is
certainly unacceptably long. We should look into these builds to find
improvements and optimizations.
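
One way to look for tests that fail regularly in that bracket is to count,
per test, how many 5-6h builds it failed in. The sketch below assumes the
build data has already been exported into a list of dicts; the field names
and the sample builds are made up for illustration and are not real Gradle
Enterprise output.

```python
from collections import Counter


def recurring_failures(builds, min_hours=5.0, max_hours=6.0):
    """Count how many builds in the duration bracket each test failed in.

    Returns (test_name, build_count) pairs, most frequent first, with ties
    broken alphabetically so the output is deterministic.
    """
    counts = Counter()
    for build in builds:
        if min_hours <= build["duration_hours"] < max_hours:
            # Use a set so a test is counted at most once per build.
            counts.update(set(build["failed_tests"]))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data, not taken from any real build.
sample_builds = [
    {"duration_hours": 5.4, "failed_tests": ["FooTest.testX", "BarTest.testY"]},
    {"duration_hours": 5.8, "failed_tests": ["FooTest.testX"]},
    {"duration_hours": 10.5, "failed_tests": ["BazTest.testZ"]},
    {"duration_hours": 3.9, "failed_tests": ["BarTest.testY"]},
]

ranking = recurring_failures(sample_builds)
```

Here the 10.5h outlier and the 3.9h build are ignored, so a test that only
fails in a rare extreme build does not dominate the ranking.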

[1] https://ge.apache.org/s/ygh4gbz4uma6i/
[2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
[3] https://github.com/apache/kafka/pull/15008

Thanks for looking into this!
Greg

On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:
>
> Hello, Stanislav.
>
> Could you please make the gist public?
> Private gists are not available to some GitHub users, even if the link is known.
>
> > On Dec 19, 2023, at 17:33, Stanislav Kozlovski
> > <stanis...@confluent.io.INVALID> wrote:
> >
> > Hey everybody,
> > I've heard various complaints that build times in trunk are taking too
> > long, some taking as much as 8 hours (the timeout) - and this is slowing us
> > down from being able to meet the code freeze deadline for 3.7.
> >
> > I took it upon myself to gather up some data in Gradle Enterprise to see if
> > there are any outlier tests that are causing this slowness. Turns out there
> > are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/
> > - which took 10 hours and 29 minutes in total.
> >
> > I have compiled the tests that took a disproportionately large amount of
> > time (20m+), alongside their time, error message and a link to their full
> > log output here -
> > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> >
> > It includes failures from core, streams, storage and clients.
> > Interestingly, some other tests that don't fail also take a long time in
> > what is apparently the test harness framework. See the gist for more
> > information.
> >
> > I am starting this thread with the intention of getting the discussion
> > started and brainstorming what we can do to get the build times back under
> > control.
> >
> >
> > --
> > Best,
> > Stanislav
>
