On Mon, Mar 29, 2021 at 03:10:36PM +0100, Stefan Hajnoczi wrote:
> Hi,
> I wanted to follow up with a summary of the CI jobs:
>
> 1. Containers & Containers Layer2 - ~3 minutes/job x 39 jobs
> 2. Builds - ~50 minutes/job x 61 jobs
> 3. Tests - ~12 minutes/job x 20 jobs
> 4. Deploy - 52 minutes x 1 job
>
> The Builds phase consumes the most CI minutes. If we can optimize this
> phase then we'll achieve the biggest impact.
>
> In the short term builds could be disabled. However, in the long term I
> think full build coverage is desirable to prevent merging code that
> breaks certain host OSes/architectures (e.g. stable Linux distros,
> macOS, etc).
The notion of "full build coverage" doesn't really exist in reality. The
number of platforms QEMU is targeting, combined with the number of
features that can be turned on/off in QEMU's configure, means the matrix
for "full build coverage" is far too large to ever contemplate.

So far we've been adding new jobs whenever we hit a situation where a
build problem wasn't previously detected by CI. In theory this is a more
reasonable strategy than striving for full build coverage, as it targets
only places where we've hit real world problems. I think we're seeing,
though, that even this incremental coverage approach is not sustainable
in the real world. Or rather, it is only sustainable if CI resources are
essentially free.

Traditionally the bulk of testing was done in a freeze period leading up
to a release. With GitLab CI we've tried to move to a model where testing
is continuous, such that git master is in a so-called "always ready"
state. This is very good in general, but it comes with significant
hardware resource costs. We've relied on free services for this, and that
is becoming less viable.

I think a challenge with our incremental approach is that we're not
really taking into account the relative importance of the different build
scenarios, and we often don't look at the big picture of what a new job
adds in terms of quality, compared to existing jobs. eg consider that we
have:

  build-system-alpine:
  build-system-ubuntu:
  build-system-debian:
  build-system-fedora:
  build-system-centos:
  build-system-opensuse:

  build-trace-multi-user:
  build-trace-ftrace-system:
  build-trace-ust-system:

I'd question whether we really need any of those 'build-trace' jobs.
Instead, build-system-ubuntu could pass
--enable-trace-backends=log,simple,syslog, build-system-debian could pass
--enable-trace-backends=ust, and build-system-fedora could pass
--enable-trace-backends=ftrace, etc.
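As a rough sketch, that consolidation might look like the following
.gitlab-ci.yml fragment. This is illustrative only; the CONFIGURE_ARGS
variable name is an assumption about how the existing job templates pass
flags through to configure, and may not match the actual templates:

```yaml
# Sketch: fold trace-backend coverage into the existing distro build
# jobs instead of keeping dedicated build-trace-* jobs.
# CONFIGURE_ARGS is an assumed template variable, not verified config.
build-system-ubuntu:
  variables:
    CONFIGURE_ARGS: --enable-trace-backends=log,simple,syslog

build-system-debian:
  variables:
    CONFIGURE_ARGS: --enable-trace-backends=ust

build-system-fedora:
  variables:
    CONFIGURE_ARGS: --enable-trace-backends=ftrace
```

This would drop three whole build jobs while keeping every trace backend
exercised by at least one job.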
Another example: we test builds on centos7 with three different combos of
crypto backend settings. This was to exercise bugs we've seen in old
crypto packages in RHEL-7, but in reality it is probably overkill,
because downstream RHEL-7 only cares about one specific combination.

We don't really have a clearly defined plan for identifying the most
important things in our testing coverage, so we tend to accept anything
without questioning its value add. This feeds back into the idea I've
brought up many times in the past: we need to better define what we aim
to support in QEMU and at what quality level, which will in turn
determine which scenarios we care about testing.

> Traditionally ccache (https://ccache.dev/) was used to detect
> recompilation of the same compiler input files. This is trickier to do
> in GitLab CI since it would be necessary to share and update a cache,
> potentially between untrusted users. Unfortunately this shifts the
> bottleneck from CPU to network in a CI-as-a-Service environment since
> the cached build output needs to be accessed by the linker on the CI
> runner but is stored remotely.

Our docker containers install ccache already, and I could have sworn we
use it in GitLab, but now I'm not so sure. We're only saving the "build/"
directory as an artifact between jobs, and I'm not sure that directory
holds the ccache cache.

> A complementary approach is avoiding compilation altogether when code
> changes do not affect a build target. For example, a change to
> qemu-storage-daemon.c does not require rebuilding the system emulator
> targets. Either the compiler or the build system could produce a
> manifest of source files that went into a build target, and that
> information is what's needed to avoid compiling unchanged targets.

I think we want to be pretty wary of making the CI jobs too complex in
what they do. We want them to accurately reflect the way our developers
and end users build the system in general.
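For persisting the ccache directory, GitLab's "cache:" mechanism (as
opposed to artifacts, which we currently use for build/) is the usual
approach. A sketch, assuming a CCACHE_DIR location and cache key that are
illustrative rather than taken from QEMU's actual config:

```yaml
# Sketch: keep the ccache cache across pipeline runs via GitLab's cache.
# Caches must live under the project dir to be saved; the key scopes the
# cache per job so different build configs don't evict each other.
# All names/values here are assumptions, not QEMU's real CI config.
.ccache_cache:
  variables:
    CCACHE_DIR: $CI_PROJECT_DIR/ccache
    CCACHE_MAXSIZE: 500M
  cache:
    key: "$CI_JOB_NAME"
    paths:
      - ccache/
```

Note that GitLab caches are best-effort and per-runner unless a shared
cache backend is configured, which is where the untrusted-user and
network-bottleneck concerns Stefan raises come in.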
Trying to add clever logic to the CI system to skip building certain
pieces will make the CI system more complex and fragile, which will
increase the burden of keeping CI working reliably.

> Ideally the CI would look at the code changes and only launch jobs that
> were affected. Those jobs would use a C compiler cache to avoid
> rebuilding compiler input that has not changed. Basically, we need
> incremental builds.

If we want to consider "code changes" between CI runs, then we need to
establish a baseline. If we're triggering GitLab jobs on "push" events,
then the baseline is whatever content already exists on the remote
server. eg if you have a branch with a delta of 10 commits on top of
"master", but 8 of those commits already exist in the branch on GitLab,
then the push event baseline is those 8 commits, so it'll only look at
changes in the top 2 commits, rather than the entire 10 commits of the
branch. This is generally *not* what we want for testing, because we
can't assume that the 8 commits which already exist have successfully
passed CI. We've already seen this cause us problems, when we tried to
filter out jobs rebuilding container images so they only ran when a
tests/docker/* file was modified.

If we want to consider code changes with "master" as the baseline, then
we need to trigger CI pipelines from merge requests, because merge
requests have an explicit baseline associated with them. Of course this
means we need to be using merge requests in some way, which is a big can
of worms.

> This is as far as I've gotten with thinking about CI efficiency. Do you
> think these optimizations are worth investigating or should we keep it
> simple and just disable many builds by default?

ccache is a no-brainer, and assuming it isn't already working with our
GitLab jobs, we must fix that asap.

Aside from optimizing CI, we should consider whether there's more we can
do to optimize the build process itself.
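The baseline distinction can be illustrated with plain git. A sketch
using a throwaway repository (the file name is just an example): with an
explicit base branch, the three-dot diff shows the full delta of the
branch, which is what a merge-request pipeline would see; a push-event
pipeline only sees whatever commits were new in that particular push.

```shell
# Sketch: demonstrate the "explicit baseline" view of a branch's changes.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m 'base commit'
base=$(git rev-parse --abbrev-ref HEAD)   # base branch name (master/main)
git checkout -q -b feature
echo 'stub' > qemu-storage-daemon.c       # example changed file
git add qemu-storage-daemon.c
git -c user.name=t -c user.email=t@example.com \
    commit -q -m 'touch storage daemon'
# Three-dot diff: changes on 'feature' since it diverged from the base.
# This is the full branch delta a merge request baseline provides.
git diff --name-only "$base"...feature
```

Running this prints qemu-storage-daemon.c, i.e. the complete set of files
changed on the branch, regardless of how many pushes produced it.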
We've done a lot of work, but there's still plenty of stuff we build
multiple times, once for each target. Perhaps there's scope for cutting
this down in some manner?

I'm unclear how many CI jobs build submodules, but if there's more scope
for using the pre-built distro packages, that will be beneficial for
build time.

Regards,
Daniel
-- 
|: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-         https://fstop138.berrange.com   :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange    :|