Wainer dos Santos Moschetta <waine...@redhat.com> writes:
> Hi all,

<snip>

>> Conclusion
>> ==========
>>
>> I think generally the state of QEMU's CI has improved over the last few
>> years but we still have a number of challenges caused by its distributed
>> nature and test stability. We are still re-active to failures rather
>> than having something fast and reliable enough to gate changes going
>> into the code base. This results in fairly long periods when one
>> or more parts of the testing mosaic are stuck on red waiting for fixes
>> to finally get merged back into master.
>>
>> So what do people think? Have I missed anything out? What else can we do
>> to improve the situation?
>>
>> Let the discussion begin ;-)
>
> I want to help improve QEMU CI, and in fact I can commit some time to
> do so. But since I'm new to the community and have made just a few
> contributions, I'm only in a position to try to understand what we
> have in place now.
>
> So allow me to put this in a different perspective. I took some notes
> on the CI workflows we have. They go below along with some comments
> and questions:
>
> ----
> Besides being distributed across CI providers, there are different CI
> workflows being executed at each stage of the development process.
>
> - Developer tests before sending the patch to the mailing-list
>   Each developer has their own recipe.
>   Can be as simple as `make check[-TEST-SUITE]` locally. Or
>   Docker-based `make docker-*` tests.

The make docker-* tests mostly cover building on other distros where
there might be subtle differences. The tests themselves are the same
make check-FOO as before (there is a quick usage sketch at the end of
this reply).

>   It seems not widely used but some may also push to GitHub/GitLab
>   + triggers to the cloud provider.
>
>   What kind of improvements can we make here?
>   Perhaps (somehow) automate the GitHub/GitLab + triggers to cloud
>   provider workflow?

We have a mechanism that can already do that with patchew. But I'm not
sure how much automation can be done for developers given they need to
have accounts on the relevant services. Once that is done however it
really is just a few git pushes.

>   Allowing a failure that happened on a cloud provider to be
>   reproduced locally, when it comes to failures that occur in the
>   later stages of development (see below), seems highly appreciated.

In theory yes, in practice it seems our CI providers are quite good at
producing failures under load. I've run tests that fail on Travis tens
of thousands of times locally without incident. The reproductions I've
done recently have all been on VMs where I've constrained memory and
vCPUs and then very heavily loaded them. It seems like most developers
are blessed with beefy boxes that rarely show up these problems.

What would be more useful is being able to debug the failure that
occurred on the CI system. Either by:

 a) having some sort of access to the failed system

The original Travis setup didn't really support that but I think there
may be options now. I haven't really looked into the other CI setups
yet. They may be better off. Certainly if we can augment CI with our
own runners they are easier to give developers access to.

 b) uploading the failure artefacts *somewhere*

Quite a lot of these failures should be dumping core. Maybe if we can
upload the core, associated binary, config.log and commit id to
something we can then do a bit more post-mortem on what went wrong.

 c) dumping more information in the CI logs

An alternative to uploading would be some sort of clean-up script
which could at least dump backtraces of cores in the logs.
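To illustrate (c), here is a minimal sketch of what such a clean-up
step could do. Nothing like this exists in the tree yet, and where the
cores end up depends on the provider's core_pattern, so treat the
paths and hook as assumptions:

  #!/bin/sh
  # Hypothetical post-failure hook: find any cores left behind by the
  # failed build/test run and dump their backtraces into the CI log.
  BUILD_DIR="${BUILD_DIR:-$PWD}"

  find "$BUILD_DIR" -name 'core*' -type f | while read -r core; do
      # Recent file(1) reports the crashing executable as "execfn: '...'";
      # if we can't extract it we just note the core and move on.
      exe=$(file "$core" | sed -n "s/.*execfn: '\([^']*\)'.*/\1/p")
      echo "=== backtrace for $core (binary: ${exe:-unknown}) ==="
      if [ -n "$exe" ] && [ -e "$exe" ]; then
          gdb --batch -ex "thread apply all bt" "$exe" "$core"
      fi
  done

The same loop could just as easily tar up the core, binary and
config.log for option (b) if we had somewhere to upload them to.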
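And for completeness, the usage sketch promised above: the docker
tests are just make targets run from the build tree. Roughly speaking
(the exact image and test names change over time, so check what `make
docker` lists rather than trusting these):

  make docker                      # list the available images and tests
  make docker-test-quick@centos7   # quick subset of make check in a centos7 image
  make docker-test-mingw@fedora    # mingw cross-build using the fedora image

Each target runs the corresponding build/check recipe inside the named
image, which is why the results line up with a local make check.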
>
> - Developer sends a patch to the mailing-list
>   Patchew pushes the patch to GitHub, runs tests (checkpatch, asan,
>   docker-clang@ubuntu, docker-mingw@fedora).
>   Reports to the ML on failure. Shouldn't it send an email on success
>   as well so that it creates awareness of CI?

Patchew has been a little inconsistent of late with its notifications.
Maybe a simple email with a "Just so you know patchew has run all its
tests on this and it's fine" wouldn't be considered too noisy?

> - Maintainer tests their branch before the pull-request
>   Like developers, it seems each one sits on their own recipe that
>   may (or may not) trigger on a CI provider.

Usually the same set of normal checks plus any particular hand-crafted
tests that might be appropriate for the patches included. For example
for all of Emilio's scaling patches I ran a lot of stress tests by
hand. They are only semi-automated because it's not something I'd do
for most branches.

> - Maintainer sends a pull-request to the mailing-list
>   Again patchew kicks in. It seems it runs the same tests. Am I
>   right? It also sends the email to the mailing-list only on failure.

Yes - although generally a PR is a collection of patches so it's
technically a new tree state to test.

> - Peter runs tests for each PR
>   IIUC not integrated with any CI provider yet.
>   Likely here we have the most complete scenario in terms of coverage
>   (several hosts, targets, build configs, etc).
>   Maybe the area that needs the most care.

Peter does catch stuff the CI tests don't so I don't think we are
ready to replace him with a robot just yet ;-) However he currently
has access to a wider range of architectures than just about anybody
at the moment.

> - Post-merge on GitHub branches (master and stable-*)
>   Ubuntu x86_64 and MacOS at Travis
>   Reports success/failure to qemu's IRC channel
>   Cross compilers (Debian docker image) at Shippable
>   FreeBSD at Cirrus
>   Debian x86_64 at GitLab
>
> Is the LAVA setup still in use?

Yes although it needs some work. It basically runs the RISU tests for
AArch64 although it does have the ability to run on a bunch of
interesting platforms - probably more relevant to testing KVM type
stuff as it can replace rootfs and kernels.

> Shouldn't we (and are we able to?) use a single CI provider for the
> sake of maintainability? If so it seems GitLab CI is a strong
> candidate.

I would certainly like a more unified view of the state of a given
branch but the distributed nature does have some benefits in terms of
scaling and redundancy. GitLab has some promise but given how much of
a pain building an arm64 runner has been it's not quite there yet.

-- 
Alex Bennée