Re: Fwd: [CI] What are the troubles projects face with CI and Infra

Mick Semb Wever Mon, 03 Feb 2020 13:38:33 -0800

Nate, I leave it to you to forward whatever you choose to the board@ thread.

> Are there still troubles and what are they?

TL;DR
  the ASF could provide the Cassandra community with an isolated Jenkins
installation, so that we can manage and control the Jenkins master, as well as
ensure all donated hardware for Jenkins agents is dedicated and isolated to us.

The long writeup…

For Cassandra's use of ASF's Jenkins I see the following problems.

** Lack of trust (aka reliability)

The Jenkins agents re-use their workspaces, as opposed to using new containers 
per test run, leading to broken agents, disks, git clones, etc. One broken test 
run, or a broken agent, too easily affects subsequent test executions.

The complexity (and flakiness) around our tests is a real problem. CI on a
project like Cassandra is a beast and the community is very limited in what it
can do; it really needs the help of larger companies. Effort is required in
fixing the broken, the flaky, and the ignored tests. Parallelising the tests
will help by better isolating failures, but tests (and their execution scripts)
also need to be better at cleaning up after themselves, or a more
container-based approach needs to be taken.
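
To illustrate the "clean up after themselves" point, here is a minimal sketch
of running each test command in a throwaway workspace, so nothing on disk
carries over to the next run. The function name run_in_fresh_workspace is
hypothetical and this is not how our Jenkins jobs are set up today. It only
isolates the filesystem; a container per run would also isolate processes,
ports and so on.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_fresh_workspace(command):
    """Run `command` (a list of arguments) in a new, empty directory.

    The directory is deleted when the run finishes, whether the command
    passed or failed, so the next run cannot see any leftovers.
    Returns (exit_code, workspace_path).
    """
    with tempfile.TemporaryDirectory(prefix="ci-run-") as workspace:
        result = subprocess.run(command, cwd=workspace)
        return result.returncode, Path(workspace)


if __name__ == "__main__":
    # Stand-in for a test run that leaves a file behind in its workspace.
    leaky = [sys.executable, "-c", "open('leftover.txt', 'w').write('x')"]
    code, workspace = run_in_fresh_workspace(leaky)
    print("exit code:", code, "workspace still exists:", workspace.exists())
```

The trade-off is that each run pays for a fresh git clone and build, which is
the cost of not trusting whatever the previous run left on the agent.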

Another issue is that other projects sometimes use the agents, and Infra
sometimes edits our build configurations (out of necessity).

** Lack of resources (throughput and response)

Having only 9 agents, none of which can run the large dtests, is a problem. All
9 are from Instaclustr, much kudos! Three companies have recently said they
will donate resources; this is a work in progress.

We have four release branches where we would like to provide per-commit
post-commit testing. Each complete test execution currently takes 24hr+.
Parallelising tests at the moment won't help much as the agents are generally
saturated (with the pipelines doing the top-level parallelisation). Once we get
more hardware in place, it will make sense to look into parallelising the tests
more for the sake of improving throughput.

The throughput of tests will also improve with effort put into
removing/rewriting long-running and inefficient tests. Also, and I think this
is low-hanging fruit, throughput could be improved by using (or taking
inspiration from) Apache Yetus so as to only run tests on what is relevant in
the patch/commit. Ref:
http://yetus.apache.org/documentation/0.11.1/precommit-basic/
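
As a rough sketch of the "only run what is relevant" idea (this is not how
Yetus is implemented, and the path prefixes and suite names below are made up
for illustration rather than taken from our actual build), the changed files
in a patch could be mapped to test suites like so:

```python
# Hypothetical mapping from path prefixes to the test suites they affect.
# Neither the paths nor the suite names reflect the real Cassandra layout.
SUITES_BY_PREFIX = {
    "src/java/": {"unit"},
    "test/unit/": {"unit"},
    "test/distributed/": {"jvm-dtest"},
    "doc/": set(),  # documentation-only changes need no test run
}

ALL_SUITES = {"unit", "jvm-dtest", "dtest"}


def select_suites(changed_files):
    """Return the set of test suites to run for the given changed paths."""
    selected = set()
    for path in changed_files:
        matches = [prefix for prefix in SUITES_BY_PREFIX if path.startswith(prefix)]
        if not matches:
            # Unknown part of the tree: be conservative and run everything.
            return set(ALL_SUITES)
        # The longest matching prefix is the most specific rule.
        selected |= SUITES_BY_PREFIX[max(matches, key=len)]
    return selected


if __name__ == "__main__":
    print(select_suites(["doc/index.rst"]))
    print(select_suites(["src/java/Foo.java", "doc/index.rst"]))
    print(select_suites(["build.xml"]))
```

Falling back to every suite for unrecognised paths keeps the selection safe:
a gap in the mapping costs agent time rather than test coverage.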

** Difficulty in use

Jenkins is clumsy to use compared to the CI systems we use more often today: 
Travis, CircleCI, GH Actions.

One of the complaints has been that only committers can kick off CI for patches 
(ie pre-commit CI runs).  But I don't believe this to be a crucial issue for a 
number of reasons. 

1. Thorough CI testing of a patch only needs to happen during the review
process, in which a committer needs to be involved anyway.
2. We don't have enough Jenkins agents to handle the amount of throughput that
automated branch/patch/pull-request testing would require.
3. Our tests could allow unknown contributors to take ownership of the agent
servers (e.g. via the execution of bash scripts).
4. We have CircleCI working that provides basic testing for work-in-progress 
patches.

Focusing on post-commit CI and having canonical results for our release
branches, I think it then boils down to the stability and throughput of tests,
and the persistence and permanence of results.

The persistence and permanence of results is a bugbear for me. It has been
partially addressed by posting the build results to the builds@ ML. But this
only provides a (pretty raw) summary of the results. I'm keen to take the next
step of posting CI results back to committed Jira tickets (but am waiting on
seeing Jenkins run stable for a while). If we had our own Jenkins master we
could then look into retaining more/all build results. Being able to see the
longer-term trends of test results as well as execution times would, I hope,
add the incentive to get more folk involved.

Looping back to the ASF and what they could do: providing us with an isolated
Jenkins would help us a lot in addressing the stability and usability issues.
Having our own master would simplify the setup, use, and debugging of Jenkins.
It would still require some sunk cost but hopefully we'd end up with something
better tailored to our needs. And isolated agents would help restore
confidence.

regards,
Mick

PS I really want to hear from those who were involved in the past with cassci;
your skills and experience on this topic surpass anything I've got.

On Sun, 2 Feb 2020, at 22:51, Nate McCall wrote:
> Hi folks,
> The board is looking for feedback on CI infrastructure. I'm happy to take
> some (constructive) comments back. (Shuler, Mick and David Capwell
> specifically as folks who've most recently wrestled with this a fair bit).
> 
> Thanks,
> -Nate
> 
> ---------- Forwarded message ---------
> From: Dave Fisher <w...@apache.org>
> Date: Mon, Feb 3, 2020 at 8:58 AM
> Subject: [CI] What are the troubles projects face with CI and Infra
> To: Apache Board <bo...@apache.org>
> 
> 
> Hi -
> 
> It has come to the attention of the board through looking at past board
> reports that some projects are having problems with CI infrastructure.
> 
> Are there still troubles and what are they?
> 
> Regards,
> Dave
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org
