Work to improve our builds has continued over the past few weeks. Thanks to
everyone who has contributed to this effort so far. Here is an update.

*Jenkins Build Disabled*
We have removed the Jenkinsfile from trunk, which disables the Jenkins
build there. There was much rejoicing throughout the land.

The Jenkins build remains in place for all previous release branches
including 3.9. This may change eventually.

*Trunk Build Failures*
The issue causing build timeouts was fixed and since then we have had zero
timeouts.

We have had only one build failure due to an excessive number of failing
tests, back on Sept 13. Last week, a test regression introduced on trunk
caused a number of failures. Thankfully, it was quickly identified and
fixed.

*Thread Dumps*
While debugging the timeout issue mentioned above, we added a feature to
our build where thread dumps will be taken just before the build times out.
This was invaluable in diagnosing the timeout issue since no one could
manage to reproduce it locally. No special action is needed for this
feature; it happens automatically for both PR and trunk builds.
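
For anyone curious about the general technique, here is a minimal sketch of a watchdog that dumps JVM threads shortly before a deadline. It is not the script used in our build; the function name run_with_thread_dump and its arguments are made up for illustration. It assumes the JDK tools jps and jstack are on the PATH.

```sh
# Hypothetical sketch, not the project's actual script.
# Usage: run_with_thread_dump <seconds-before-dump> <command> [args...]
run_with_thread_dump() {
  delay=$1
  shift
  # Watchdog: once the delay elapses, dump the threads of every local JVM.
  (
    sleep "$delay"
    for jvm_pid in $(jps -q); do
      echo "=== thread dump for JVM $jvm_pid ==="
      # A JVM may have exited since jps listed it, so ignore failures.
      jstack "$jvm_pid" || true
    done
  ) &
  watchdog=$!
  # Run the real command in the foreground and remember its exit status.
  status=0
  "$@" || status=$?
  # The command finished, so the watchdog is no longer needed.
  kill "$watchdog" 2>/dev/null || true
  return "$status"
}
```

The delay would be set a little shorter than the CI job's own timeout, so that the dump lands in the log before the job is killed.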

*Green Status Checks*
With the removal of Jenkins, we should now expect all green status checks
on Pull Requests. For those who have never seen such an anomaly, it looks
like this:

[image: image.png]

We are at the point now that *every failing build should be investigated*
to confirm it is not an issue.

*Performance*
Our trunk build remains around 2 hours long. This has actually decreased
slightly over the past few days as we have started removing ZooKeeper
related code and tests. I expect this trend will continue until the ZK code
is fully removed.

On PRs, our average build time is 1h40m, which is around 15% faster than
trunk. This reduction is thanks to the Gradle Build Cache. According to
Develocity reports, we are saving around 1h15m of serial execution time for
every PR build (from ~7 hours to ~6 hours).

It was well known that the Kafka build was a major burden on the Jenkins
infrastructure. Here is what the Infra team had to say about our GitHub
Actions usage:

> you're not even showing up as anything but a small blip on GHA

*Better Caching*
A script was committed just today that creates a local branch named
"trunk-cached". The HEAD of this branch is the latest commit on trunk that
also exists in the build cache.

When creating branches for PRs, we can use this local branch as a base ref
instead of trunk's HEAD. Similarly, when pulling in new changes from trunk,
we can merge in this cached ref. Doing so will increase our cache
utilization in GitHub Actions and reduce our build times even further.
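
For a brand-new PR branch, the idea above can be sketched as follows. The helper name new_pr_branch and the fallback to origin/trunk are assumptions for illustration; neither is part of committer-tools.

```sh
# Hypothetical helper: start a new branch from 'trunk-cached' when that
# local branch exists, and fall back to origin/trunk otherwise.
new_pr_branch() {
  name=$1
  base=origin/trunk
  if git rev-parse --verify --quiet refs/heads/trunk-cached >/dev/null; then
    base=trunk-cached
  fi
  git checkout -b "$name" "$base"
}
```

Run ./committer-tools/update-cache.sh first so that trunk-cached is fresh, then call something like "new_pr_branch my-new-feature".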

Here's an example of updating a feature branch:

$ git checkout my-feature-branch
$ git fetch origin
$ ./committer-tools/update-cache.sh
Local branch 'trunk-cached' successfully updated to 35f55a84fe (from 16 hours ago).
$ git merge trunk-cached
$ git push mumrah my-feature-branch

If you need commits from trunk's HEAD, you can of course merge that in
instead. It may just increase your cache misses.

*Automated Workflow Run Approvals*
A new label called "ci-approved" was added recently. Applying this label to
a PR automatically approves the pending workflow runs for that PR.

GitHub has a rather inflexible policy for workflow run approvals. The
default policy is to require an approval for *every* workflow run from an
outside contributor. The other option is to require approval only for a
contributor's *first* workflow run ever, after which that contributor is
whitelisted. The latter policy would be easy for an adversarial party to
exploit. The former (default) policy creates a lot of toil for committers.

The labelling solution allows us to whitelist a PR rather than a
contributor.
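
To give a rough idea of what per-PR approval can look like, here is a minimal sketch using the GitHub CLI. It is not our actual workflow; the function name approve_labelled_runs is hypothetical, and it assumes gh is installed, authenticated with sufficient permissions, and recent enough to support the flags shown.

```sh
# Hypothetical sketch, not the project's actual automation.
# Usage: approve_labelled_runs <owner/repo>
approve_labelled_runs() {
  repo=$1
  # Head commit of every open PR carrying the ci-approved label.
  gh pr list --repo "$repo" --label ci-approved \
      --json headRefOid --jq '.[].headRefOid' |
  while read -r sha; do
    # Workflow runs for that commit which are still waiting for approval.
    gh run list --repo "$repo" --commit "$sha" --status action_required \
        --json databaseId --jq '.[].databaseId' |
    while read -r run_id; do
      # REST endpoint that approves a workflow run from a fork PR.
      gh api --method POST "repos/$repo/actions/runs/$run_id/approve" </dev/null
    done
  done
}
```

Matching runs by head commit rather than by branch name avoids approving an unrelated fork that happens to use the same branch name.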

*Flaky Tests*
Since my last email, 91% of our trunk builds have had flaky tests. This
number hasn't really changed much since we began this build improvement effort.
However, now that the build itself is in good shape, we can really focus on
the test suite. KIP-1090 is being voted on now, and will hopefully bring
many improvements to this area.

Both now and after KIP-1090 has been delivered, we need help from the
community to address known flaky tests.

-David A
