We have continued to make progress on improving our builds over the past weeks. Thanks to everyone who has contributed to this effort so far. Here is an update.
*Jenkins Build Disabled*

We have removed the Jenkinsfile from trunk, which has disabled the Jenkins build. There was much rejoicing throughout the land. The Jenkins build remains in place for all previous release branches, including 3.9. This may change eventually.

*Trunk Build Failures*

The issue causing build timeouts was fixed, and since then we have had zero timeouts. We had only one build failure due to excessive failing tests, back on Sept 13. Last week we had an incident where a test regression was introduced on trunk, which caused a number of failures. Thankfully, it was quickly identified and fixed.

*Thread Dumps*

While debugging the timeout issue mentioned above, we added a feature to our build where thread dumps are taken just before the build times out. This was invaluable in diagnosing the timeout issue since no one could manage to reproduce it locally. No special action is needed for this feature; it happens automatically for both PR and trunk builds.

*Green Status Checks*

With the removal of Jenkins, we should now expect all green status checks on Pull Requests. For those who have never seen such an anomaly, it looks like this:

[image: image.png]

We are at the point now that *every failing build should be investigated* to confirm it is not an issue.

*Performance*

Our trunk build remains around 2 hours long. This has actually decreased slightly over the past few days as we have started removing ZooKeeper-related code and tests. I expect this trend will continue until the ZK code is fully removed.

On PRs, our average build time is 1h40m, which is around 15% faster than trunk. This reduction is thanks to the Gradle Build Cache. According to Develocity reports, we are saving around 1h15m of serial execution time for every PR build (from ~7 hours to ~6 hours).

It was well known that the Kafka build was a major burden on the Jenkins infrastructure.
According to the Infra team, regarding our GitHub Actions usage:

> you're not even showing up as anything but a small blip on GHA

*Better Caching*

A script was committed just today that creates a local branch named "trunk-cached". The HEAD of this branch is the latest commit on trunk that also exists in the build cache. When creating branches for PRs, we can use this local branch as a base ref instead of trunk's HEAD. Similarly, when pulling in new changes from trunk, we can merge in this cached ref. Doing so will increase our cache utilization in GitHub Actions and reduce our build times even further.

Here's an example of updating a feature branch:

$ git checkout my-feature-branch
$ git fetch origin
$ ./committer-tools/update-cache.sh
Local branch 'trunk-cached' successfully updated to 35f55a84fe (from 16 hours ago).
$ git merge trunk-cached
$ git push mumrah my-feature-branch

Of course, if you need commits from trunk's HEAD, you can merge that in instead. It just may increase your cache misses.

*Automated Workflow Run Approvals*

A new label called "ci-approved" was added recently. This label is used to automatically approve pending workflow runs for a PR.

GitHub has a rather inflexible policy for workflow run approvals. The default policy is to require an approval for *every* workflow run from an outside contributor. The other option is to require only a single approval for the *first* workflow run ever, after which that contributor is whitelisted. The latter policy would be easy for an adversarial party to exploit. The former (default) policy creates a lot of toil for committers. The labelling solution allows us to whitelist a PR rather than a contributor.

*Flaky Tests*

91% of our trunk builds had flaky tests since my last email. This number hasn't really changed much since we began this build improvement effort. However, now that the build itself is in good shape, we can really focus on the test suite.
KIP-1090 is being voted on now and will hopefully bring many improvements to this area. Both now and after KIP-1090 has been delivered, we need help from the community to address known flaky tests.

-David A