On Mon, Mar 29, 2021 at 03:10:36PM +0100, Stefan Hajnoczi wrote:
> Hi,
> I wanted to follow up with a summary of the CI jobs:
>
> 1. Containers & Containers Layer2 - ~3 minutes/job x 39 jobs
> 2. Builds - ~50 minutes/job x 61 jobs
> 3. Tests - ~12 minutes/job x 20 jobs
> 4. Deploy - 52 minutes x 1 job
>
> The Builds phase consumes the most CI minutes. If we can optimize this
> phase then we'll achieve the biggest impact.
>
> In the short term builds could be disabled. However, in the long term I
> think full build coverage is desirable to prevent merging code that
> breaks certain host OSes/architectures (e.g. stable Linux distros,
> macOS, etc).
The notion of "full build coverage" doesn't really exist in reality. The
number of platforms QEMU is targeting, combined with the number of
features that can be turned on/off in QEMU's configure, means the matrix
for "full build coverage" is far too large to ever contemplate.

So far we've been adding new jobs whenever we hit a situation where a
build problem wasn't previously detected by CI. In theory this is a more
reasonable strategy than striving for full build coverage, as it targets
only places where we've hit real world problems. I think we're seeing,
though, that even this incremental coverage approach is not sustainable
in the real world. Or rather, it is only sustainable if CI resources are
essentially free.

Traditionally the bulk of testing was done in a freeze period leading up
to a release. With GitLab CI we've tried to move to a model where testing
is continuous, such that git master is in a so-called "always ready"
state. This is very good in general, but it comes with significant
hardware resource costs. We've relied on free services for this, and that
is becoming less viable.

I think a challenge with our incremental approach is that we're not
really taking into account the relative importance of the different build
scenarios, and we often don't look at the big picture of what a new job
adds in terms of quality, compared to existing jobs. eg consider that we
have:

  build-system-alpine:
  build-system-ubuntu:
  build-system-debian:
  build-system-fedora:
  build-system-centos:
  build-system-opensuse:

  build-trace-multi-user:
  build-trace-ftrace-system:
  build-trace-ust-system:

I'd question whether we really need any of those 'build-trace' jobs.
Instead, build-system-ubuntu could pass
--enable-trace-backends=log,simple,syslog, build-system-debian could pass
--enable-trace-backends=ust, and build-system-fedora could pass
--enable-trace-backends=ftrace, etc.
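As a rough sketch, that consolidation might look like the following
.gitlab-ci.yml fragment. This is illustrative only; the CONFIGURE_ARGS
variable name is an assumption about how the existing job templates pass
flags through to configure, and may not match the actual templates:

```yaml
# Sketch: fold trace-backend coverage into the existing distro build
# jobs instead of keeping dedicated build-trace-* jobs.
# CONFIGURE_ARGS is an assumed template variable, not verified config.
build-system-ubuntu:
  variables:
    CONFIGURE_ARGS: --enable-trace-backends=log,simple,syslog

build-system-debian:
  variables:
    CONFIGURE_ARGS: --enable-trace-backends=ust

build-system-fedora:
  variables:
    CONFIGURE_ARGS: --enable-trace-backends=ftrace
```

This would drop three whole build jobs while keeping every trace backend
exercised by at least one job.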
Another example: we test builds on centos7 with three different combos of
crypto backend settings. This was to exercise bugs we've seen in old
crypto packages in RHEL-7, but in reality it is probably overkill,
because downstream RHEL-7 only cares about one specific combination.

We don't really have a clearly defined plan for identifying the most
important things in our testing coverage, so we tend to accept anything
without questioning its value add. This feeds back into the idea I've
brought up many times in the past: we need to better define what we aim
to support in QEMU and at what quality level, which will in turn
determine which scenarios we care about testing.

> Traditionally ccache (https://ccache.dev/) was used to detect
> recompilation of the same compiler input files. This is trickier to do
> in GitLab CI since it would be necessary to share and update a cache,
> potentially between untrusted users. Unfortunately this shifts the
> bottleneck from CPU to network in a CI-as-a-Service environment since
> the cached build output needs to be accessed by the linker on the CI
> runner but is stored remotely.

Our docker containers install ccache already, and I could have sworn we
use it in GitLab, but now I'm not so sure. We're only saving the "build/"
directory as an artifact between jobs, and I'm not sure that directory
holds the ccache cache.

> A complementary approach is avoiding compilation altogether when code
> changes do not affect a build target. For example, a change to
> qemu-storage-daemon.c does not require rebuilding the system emulator
> targets. Either the compiler or the build system could produce a
> manifest of source files that went into a build target, and that
> information is what's needed to avoid compiling unchanged targets.

I think we want to be pretty wary of making the CI jobs too complex in
what they do. We want them to accurately reflect the way our developers
and end users build the system in general.
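For persisting the ccache directory, GitLab's "cache:" mechanism (as
opposed to artifacts, which we currently use for build/) is the usual
approach. A sketch, assuming a CCACHE_DIR location and cache key that are
illustrative rather than taken from QEMU's actual config:

```yaml
# Sketch: keep the ccache cache across pipeline runs via GitLab's cache.
# Caches must live under the project dir to be saved; the key scopes the
# cache per job so different build configs don't evict each other.
# All names/values here are assumptions, not QEMU's real CI config.
.ccache_cache:
  variables:
    CCACHE_DIR: $CI_PROJECT_DIR/ccache
    CCACHE_MAXSIZE: 500M
  cache:
    key: "$CI_JOB_NAME"
    paths:
      - ccache/
```

Note that GitLab caches are best-effort and per-runner unless a shared
cache backend is configured, which is where the untrusted-user and
network-bottleneck concerns Stefan raises come in.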
Trying to add clever logic to the CI system to skip building certain
pieces will make the CI system more complex and fragile, which will
increase the burden of keeping CI working reliably.

> Ideally the CI would look at the code changes and only launch jobs that
> were affected. Those jobs would use a C compiler cache to avoid
> rebuilding compiler input that has not changed. Basically, we need
> incremental builds.

If we want to consider "code changes" between CI runs, then we need to
establish a baseline. If we're triggering GitLab jobs on "push" events,
then the baseline is whatever content already exists on the remote
server. eg if you have a branch with a delta of 10 commits on top of
"master", but 8 of those commits already exist in the branch on GitLab,
then the push event baseline is those 8 commits, so it'll only look at
changes in the top 2 commits, rather than the entire 10 commits of the
branch. This is generally *not* what we want for testing, because we
can't assume that the 8 commits which already exist have successfully
passed CI. We've already seen this cause us problems, when we tried to
filter out jobs rebuilding container images so they only ran when a
tests/docker/* file was modified.

If we want to consider code changes with "master" as the baseline, then
we need to trigger CI pipelines from merge requests, because merge
requests have an explicit baseline associated with them. Of course this
means we need to be using merge requests in some way, which is a big can
of worms.

> This is as far as I've gotten with thinking about CI efficiency. Do you
> think these optimizations are worth investigating or should we keep it
> simple and just disable many builds by default?

ccache is a no-brainer, and assuming it isn't already working with our
GitLab jobs, we must fix that asap.

Aside from optimizing CI, we should consider whether there's more we can
do to optimize the build process itself.
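The baseline distinction can be illustrated with plain git. A sketch
using a throwaway repository (the file name is just an example): with an
explicit base branch, the three-dot diff shows the full delta of the
branch, which is what a merge-request pipeline would see; a push-event
pipeline only sees whatever commits were new in that particular push.

```shell
# Sketch: demonstrate the "explicit baseline" view of a branch's changes.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m 'base commit'
base=$(git rev-parse --abbrev-ref HEAD)   # base branch name (master/main)
git checkout -q -b feature
echo 'stub' > qemu-storage-daemon.c       # example changed file
git add qemu-storage-daemon.c
git -c user.name=t -c user.email=t@example.com \
    commit -q -m 'touch storage daemon'
# Three-dot diff: changes on 'feature' since it diverged from the base.
# This is the full branch delta a merge request baseline provides.
git diff --name-only "$base"...feature
```

Running this prints qemu-storage-daemon.c, i.e. the complete set of files
changed on the branch, regardless of how many pushes produced it.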
We've done a lot of work, but there's still plenty of stuff we build
multiple times, once for each target. Perhaps there's scope for cutting
this down in some manner?

I'm unclear how many CI jobs build submodules, but if there's more scope
for using the pre-built distro packages, that will be beneficial for
build time.

Regards,
Daniel
-- 
|: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-         https://fstop138.berrange.com   :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange    :|