Hey Stan,

Thanks for opening the discussion. I haven't been looking at overall build duration recently, so it's good that you are calling it out.

I worry about us over-indexing on this one build, which itself appears to be an outlier. I only see one other build [1] above 6h overall in the last 90 days in this view [2], and I don't see any overlap of failed tests between these two builds, which makes it less likely that these particular failed tests are the cause of the long build times.

Separately, I've been investigating build environment slowness and trying to connect it with test failures [3]. I observed that the CI build environment is 2-20 times slower than my developer machine (M1 Mac). When I simulate a similar slowdown locally, some tests become significantly flakier, often due to hard-coded timeouts. I think these particularly nasty builds could be explained by long-tail slowdowns causing arbitrary tests to take an excessive amount of time to execute.

Rather than trying to find signals in these rare test failures, I think we should find tests that have these sorts of failures more regularly. There are lots of builds in the 5-6h duration bracket, which is certainly unacceptably long. We should look into these builds to find improvements and optimizations.

[1] https://ge.apache.org/s/ygh4gbz4uma6i/
[2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
[3] https://github.com/apache/kafka/pull/15008

Thanks for looking into this!
Greg

On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:
>
> Hello, Stanislav.
>
> Could you please make the gist public?
> Private gists are not available to some GitHub users even if the link is known.
>
> > On Dec 19, 2023, at 17:33, Stanislav Kozlovski
> > <stanis...@confluent.io.INVALID> wrote:
> >
> > Hey everybody,
> > I've heard various complaints that build times in trunk are taking too
> > long, some taking as much as 8 hours (the timeout) - and this is slowing us
> > down from being able to meet the code freeze deadline for 3.7.
> >
> > I took it upon myself to gather up some data in Gradle Enterprise to see if
> > there are any outlier tests that are causing this slowness. Turns out there
> > are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/
> > - which took 10 hours and 29 minutes in total.
> >
> > I have compiled the tests that took a disproportionately large amount of
> > time (20m+), alongside their time, error message and a link to their full
> > log output here -
> > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> >
> > It includes failures from core, streams, storage and clients.
> > Interestingly, some other tests that don't fail also take a long time in
> > what is apparently the test harness framework. See the gist for more
> > information.
> >
> > I am starting this thread with the intention of getting the discussion
> > started and brainstorming what we can do to get the build times back under
> > control.
> >
> >
> > --
> > Best,
> > Stanislav