Re: [DISCUSS] GitHub CI

David Arthur Wed, 04 Sep 2024 10:02:54 -0700

(I had to re-send this without most of the screenshots)

Now that we've had both builds running for a little while, I thought it
would be good to do a comparison.


Since we don't have much signal from PRs yet, we'll just be looking at
JDK17 trunk builds between August 15 and today.

Jenkins:
https://ge.apache.org/scans/performance?performance.focusedBuild=kvp54miluq6bm&performance.metric=buildTime&performance.offset=68&performance.pageSize=133&search.rootProjectNames=kafka&search.startTimeMax=1725459590692&search.startTimeMin=1723694400000&search.tags=jenkins,trunk,JDK17&search.tasks=test&search.timeZoneId=America%2FNew_York

GitHub:
https://ge.apache.org/scans/performance?performance.metric=buildTime&search.names=Git%20repository%2CCI%20workflow&search.rootProjectNames=kafka&search.startTimeMax=1725459590692&search.startTimeMin=1723694400000&search.tags=trunk%2Cgithub%2CJDK17&search.tasks=test&search.timeZoneId=America%2FNew_York&search.values=https:%2F%2Fgithub.com%2Fapache%2Fkafka%2CCI


Two notes on the above:
1) The GitHub build has a timeout of 3 hours. Any build exceeding this
limit will not publish a build scan, so a lot of "bad" builds are excluded
from the GH data.
2) 158 commits have been made to trunk since Aug 15. Many of these builds
include multiple commits.


If we expand the search of Jenkins builds to look at PR builds (JDK21 in
this case), we can see a lot more variability in the build times:

https://ge.apache.org/scans/performance?performance.offset=186&search.rootProjectNames=kafka&search.startTimeMax=1725459590692&search.startTimeMin=1723694400000&search.tags=jenkins%2CJDK21&search.tasks=test&search.timeZoneId=America%2FNew_York

Interestingly, the Jenkins PR builds have better 5th percentile times than
trunk. In this data ^ the 5th percentile is 1h12m.


It's hard to directly compare these results due to the 3hr timeout set on
the GH build. If we do some hand-wavy analysis, we can try to come up with
an interpretation. The 25th percentile for PR Jenkins builds is 2h23m and
the 50th percentile is 3h59m. Here is the same graph as above with a line
added around the 3hr mark.
[image: image.png]

Interpreting the percentiles: the 3hr mark falls between the 25th
percentile (2h23m) and the 50th percentile (3h59m), so more than 50% but
less than 75% of Jenkins PR builds have build times exceeding 3 hours.
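
To make the percentile reasoning concrete, here is a rough sketch in Python.
It is not part of the original analysis: the function name and the dict
layout are made up for illustration, and the only inputs are the two
Jenkins PR percentiles quoted above, converted to minutes.

```python
def bound_fraction_exceeding(percentiles, threshold_minutes):
    """Bound the fraction of builds slower than a threshold.

    `percentiles` maps a percentile (0-100) to a build time in minutes.
    Returns (lower, upper) bounds on the fraction of builds whose build
    time exceeds `threshold_minutes`.
    """
    lower, upper = 0.0, 1.0
    for pct, minutes in sorted(percentiles.items()):
        if minutes <= threshold_minutes:
            # About pct% of builds finish within `minutes`, which is at or
            # under the threshold, so at most (100 - pct)% can exceed it.
            upper = min(upper, 1 - pct / 100)
        else:
            # This percentile is already over the threshold, so roughly
            # (100 - pct)% of builds exceed the threshold at minimum.
            lower = max(lower, 1 - pct / 100)
    return lower, upper


# Jenkins PR builds: p25 = 2h23m (143 min), p50 = 3h59m (239 min).
jenkins_pr = {25: 143, 50: 239}
print(bound_fraction_exceeding(jenkins_pr, threshold_minutes=180))
```

With only two percentiles the bounds are coarse (50% to 75%); feeding in
more percentiles from the build scan dashboard would tighten them.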

We can look at the "check" build scans for GH to get an idea of how many
"test" build scans failed to be published due to timeouts. For example, the
GH trunk JDK17 build published 63 "check" build scans but only 56 "test"
build scans. The results are:

* GH trunk JDK17 had 11% build timeouts
* GH trunk JDK11 had 22% build timeouts
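
As a sanity check on the JDK17 figure, the arithmetic can be sketched in
Python. This assumes that every "check" build scan without a matching "test"
build scan was lost to a timeout, which is only an approximation. The
function name is hypothetical, and the JDK11 scan counts are not listed in
this email, so only JDK17 is computed.

```python
def timeout_rate(check_scans, test_scans):
    """Estimate the share of builds that timed out.

    Assumption: a build that published a "check" scan but no "test" scan
    hit the timeout before the test scan could be published.
    """
    if check_scans <= 0:
        raise ValueError("check_scans must be positive")
    missing = check_scans - test_scans
    return missing / check_scans


# GH trunk JDK17: 63 "check" scans, 56 "test" scans.
print(f"{timeout_rate(63, 56):.0%}")  # prints "11%"
```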


Overall, it seems that the GitHub build is more stable than Jenkins. In the
best case, Jenkins builds are running between 1h15m and 1h30m, but more
often than not the Jenkins builds are running in excess of 3 or 4 hours.

Next steps I'd like to take:

1) Fully enable the GH workflows for all PRs (not just ones with the "gh-"
branch name prefix)
2) Continue investigating the build cache (
https://issues.apache.org/jira/browse/KAFKA-17479)
3) Prioritize fixes for the worst flaky tests
4) Identify tests which are causing build timeouts

As always, feedback is very welcome.

-David A

On Sun, Aug 25, 2024 at 2:51 PM David Arthur <mum...@gmail.com> wrote:

> Hey folks, I think we have enough in place now to start testing out the
> GitHub Actions CI more broadly. For now, the new CI is opt-in for each PR.
>
> *To enable the new GitHub Actions workflow on your PR, use a branch name
> starting with "gh-"*
>
> Here's the current state of things:
>
> * Each PR, regardless of name, will run the "compile and check" jobs. You
> probably have already noticed these
> * If a PR's branch name starts with "gh-", the JUnit tests will be run
> with GitHub Actions
> * Trunk is already configured to run the new workflow alongside the
> existing Jenkins CI
> * PRs from non-committers must be manually approved before the GitHub
> Actions will run -- this is due to a default ASF Infra policy which we can
> relax if we want
>
> Build scans to ge.apache.org are working as expected on trunk. If a
> committer wants their PR to publish a build scan, they will need to push
> their branch to apache/kafka rather than their fork.
>
> One important note is that the Gradle cache has been enabled with the
> Actions workflows. For now, each trunk build will populate the cache and
> PRs will read from the cache.
>
> Thanks to Chia-Ping Tsai for all the reviews so far!
>
> -David
>
>
> On Thu, Aug 22, 2024 at 3:04 PM David Arthur <mum...@gmail.com> wrote:
>
>> The GitHub public runners (which we are using) only offer Windows, macOS,
>> and Linux (x86_64). It is possible to set up dedicated "self-hosted"
>> runners for a project (or org) which would allow whatever architecture is
>> desired. Looks like someone has done this before for ppc64le
>> https://medium.com/@mayurwaghmode/github-actions-self-hosted-runners-on-ppc64le-architectures-902b8f826557.
>> Personally, I have done this for a Raspberry Pi on a different project.
>> There's a lot of flexibility with self-hosted.
>>
>> There has been some discussion of Infra setting up "self-hosted" runners
>> to supplement the existing GitHub runners. I'm not sure what the concrete
>> plans are, if any.
>>
>> So, to answer your specific question
>>
>> > I'm wondering if we also get access to other architectures via GitHub
>> > actions?
>>
>> Yes, but only if someone sets up a self-hosted runner with that
>> architecture
>>
>> Cheers,
>> David
>>
>> On Thu, Aug 22, 2024 at 5:45 AM Mickael Maison <mickael.mai...@gmail.com>
>> wrote:
>>
>>> Hi David,
>>>
>>> Thanks for taking a look at this. Anything that can improve the
>>> feedback loop and ease of use is very welcome.
>>>
>>> One question I have is about the supported architectures. For example
>>> a while back we voted KIP-942 to add ppc64le to the Jenkins CI. Due to
>>> significant performance issues with the ppc64le environments this is
>>> still not properly enabled yet. See
>>> https://ci-builds.apache.org/job/Kafka/job/Kafka%20PowerPC%20Daily/
>>> and https://issues.apache.org/jira/browse/INFRA-26011 if you are
>>> interested in the details.
>>>
>>> I'm wondering if we also get access to other architectures via GitHub
>>> actions?
>>>
>>> Thanks,
>>> Mickael
>>>
>>> On Fri, Aug 16, 2024 at 6:02 PM David Arthur <mum...@gmail.com> wrote:
>>> >
>>> > Josep,
>>> >
>>> > > By having CI commenting on the PR everyone watching the PR (author and
>>> > > reviewers) will get notified when it's done.
>>> >
>>> > Faster feedback is an immediate improvement I'd like to pursue. Even
>>> > having a separate PR status check for "compile + validate" would save
>>> > the author a trip digging through logs. Doing this with GH Actions is
>>> > pretty straightforward.
>>> >
>>> > David,
>>> >
>>> > 1. I will bring this up with Infra. They probably have some idea of my
>>> > intentions, due to all my questions, but I'll raise it directly.
>>> >
>>> > 2. I can think of two approaches for this. First, we can write a script
>>> > that produces the desired output given the junit XML reports. This can
>>> > then be used to leave a comment on the PR. Another is to add a summary
>>> > block to the workflow run. For example in this workflow:
>>> > https://github.com/mumrah/kafka/actions/runs/10409319037?pr=5 below the
>>> > workflow graph, there are summary sections. These are produced by steps
>>> > of the workflow.
>>> >
>>> > There are also Action plugins that render junit reports in various ways.
>>> >
>>> > ---
>>> >
>>> > Here is a PR that adds the action I've been experimenting with
>>> > https://github.com/apache/kafka/pull/16895. I've restricted it to only
>>> > run on pushes to branches named "gh-" to avoid suddenly overwhelming the
>>> > ASF runner pool. I have split the workflow into two jobs which are
>>> > reported as separate status checks (see
>>> > https://github.com/mumrah/kafka/pull/5 for example).
>>> >
>>> >
>>> >
>>> > On Fri, Aug 16, 2024 at 9:00 AM David Jacot <dja...@confluent.io.invalid>
>>> > wrote:
>>> >
>>> > > Hi David,
>>> > >
>>> > > Thanks for working on this. Overall, I am supportive. I have two
>>> > > questions/comments.
>>> > >
>>> > > 1. I wonder if we should discuss with the infra team in order to ensure
>>> > > that they have enough capacity for us to use the action runners. Our CI
>>> > > is pretty greedy in general. We could also discuss with them whether
>>> > > they could move the capacity that we used in Jenkins to the runners. I
>>> > > think that Kafka was one of the most, if not the most, heavy users of
>>> > > the shared Jenkins infra. I think that they will appreciate the heads
>>> > > up.
>>> > >
>>> > > 2. Would it be possible to improve how failed tests are reported? For
>>> > > instance, the tests in your PR failed with `1448 tests completed, 2
>>> > > failed`. First it is quite hard to see it because the logs are long.
>>> > > Second it is almost impossible to find those two failed tests. In my
>>> > > opinion, we can not use it in the current state to merge pull requests.
>>> > > Do you know if there are ways to improve this?
>>> > >
>>> > > Best,
>>> > > David
>>> > >
>>> > > On Fri, Aug 16, 2024 at 2:44 PM 黃竣陽 <s7133...@gmail.com> wrote:
>>> > >
>>> > > > Hello David,
>>> > > >
>>> > > > I find the Jenkins UI to be quite unfriendly for developers, and the
>>> > > > Apache Jenkins instance is often unreliable. On the other hand, the
>>> > > > new GitHub Actions UI is much more appealing to me. If GitHub Actions
>>> > > > proves to be more stable than Jenkins, I believe it would be a
>>> > > > worthwhile change to switch to GitHub Actions.
>>> > > >
>>> > > > Thank you.
>>> > > >
>>> > > > Best Regards,
>>> > > > Jiunn Yang
>>> > > > > On Aug 16, 2024, at 4:57 PM, Josep Prat <josep.p...@aiven.io.INVALID> wrote:
>>> > > > >
>>> > > > > Hi David,
>>> > > > > One of the enhancements we can have with this change (it's easier to
>>> > > > > do with GH actions) is to write back the result of the CI run as a
>>> > > > > comment on the PR itself. I believe not needing to periodically check
>>> > > > > CI to see if the run finished would be a great win. By having CI
>>> > > > > commenting on the PR everyone watching the PR (author and reviewers)
>>> > > > > will get notified when it's done.
>>> > > >
>>> > > >
>>> > >
>>> >
>>> >
>>> > --
>>> > David Arthur
>>>
>>
>>
>> --
>> David Arthur
>>
>
>
> --
> David Arthur
>


-- 
David Arthur
