This is a longish mail, so [tldr] I propose that we require subsystem maintainers to run GitLab CI pipeline /before/ sending pull requests to qemu-devel. This will
* reduce the burden on the person merging pull requests by making it less likely merge tests will fail, thus avoiding v2/v3/etc which add to volume of pending pulls. * reduce the burden on CI maintainers by avoiding breakage of our public CI post merge. * reduce the burden on subsystem maintainers by catching problems more quickly, avoiding delays in their pulls being merged Overall it will distribute the testing burden more equitably across our community, instead of today where the testing burden is focused on a small number of people. [/tldr] Now the longer back-story / justification... When sending pull requests to qemu-devel for merge into master, there are no formally defined expectations around the level of testing that has been performed by the maintainer. It is up to the subsystem maintainer's judgement as to the amount of testing that is appropriate. This is positive for the individual subsystem maintainer when considering sending one pull request in isolation, as the bar is quite low. IMHO this is actually a negative thing for the QEMU project as a whole. * Pull requests that fail testing at time of applying to master waste the time of the gate keeper (Peter), because they are detecting problems that the subsystem maintainers could often have detected themselves. * Pull requests that fail testing at time of applying to master waste the time of the subsystem maintainer too, as it is a very slow / inefficient feedback loop between sending the PULL, getting the test failure results back, fixing the problem, and sending a new PULL. * Pull requests that pass testing at time of applying to master, but none the less break our public CI (GitLab, Travis, Cirrus) waste the time of our CI maintainers. Mostly the problem here is that we are not using our public CI as part of the gating tests. Overall the big problem is that testing and monitoring of test results is centralized onto too small a set of people (the person merging to git master, and the people babysitting the public CI systems). This is not a scalable approach and ultimately either limits the velocity of everything sent for merge into master, and/or leads to burnout of people dealing with failing CI. The testing burden needs to be more equitably distributed across the QEMU community. The big reason that we've not placed specific testing pre-requisites for the sending of pull requests is that, historically, the test environment is only accessible to the person doing the merge to master, not the subsystem maintainers, nor individual contributors. Maintainers were only expected to have access to the specific OS used on their dev machine. With our increasing use of automated CI platforms and provision of standardized docker container & VM recipes as build environments, the situation has already changed significantly and will continue to do so. It is trivial for any contributor to get comprehensive testing cross many platform combinations (build + tests across all Linux distros on x86 with many different build config scenarios, build Windows via mingw32/64, build Debian on non-x86 via cross compilers) using just GitLab CI pipelines. If adding Cirrus CI, and Travis CI, even more combinations are added (macOS, FreeBSD, native Windows, and some native non-x86). In the not too distant future, the current CI setup for gating merges to master, will be replaced by a system built on GitLab CI, so we'll get much better alignment between our post-merge CI and our gating CI, preventing more post merge CI failures and reducing burden on our CI maintainers. The high frequency with which post-merge CI jobs break today, show that subsystem maintainers are mostly only testing a narrow set of scenarios before sending their pull requests. When gating merges to master is done by GitLab CI, the number of configuration scenarios and platforms we'll be testing will increase significantly. This is going to increase the frequency with which pull requests are rejected due to failing tests. We will have more v2, v3, pull requests posted which will increase the burden on the person handling them. This is not going to be sustainable and will limit our velocity unless we change our testing expectations. We need to distribute the burden of testing more broadly, by introducing a requirement for all pull requests to pass GitLab CI *before* being sent as a PULL to the list. This might sound like a scary burden, but don't be afraid, as it is much easier to achieve than one might realize. Testing with GitLab CI merely requires pushing the queued patches to a personal fork of the QEMU repository hosted on gitlab.com. Any time patches are pushed to a fork they trigger a pipeline whose result are presented at a URL https://gitlab.com/YOUR-USER-NAME/qemu/pipelines There is a "scripts/ci/gitlab-pipeline-status" script that can be used to monitor results from the terminal if web browsers are not your ideal UI. IOW, in terms of workflow for a subsystem maintainer, the current process is: 1. Push to an arbitrary public git repo host 2. Send patches to qemu-devel With a requirement to pass GitLab CI before sending a pull, the workflow would be more like 1. Push patches to a personal gitlab.com QEMU fork 2. Wait for CI results to arrive (typically 45-60 mins) 3. If CI failed (a) fix problems, go to (1) else if CI passed (b) Send patches to qemu-devel How will we verify that a pull request has a successful pipeline run before merging it ? We could require that cover letter always point to a gitlab.com repo, instead of arbitrary public git repo host. Or we could just require the cover letter have a pointer to the GitLab pipeline results page that matches the git commit ref. The former would be simpler, as we could use scripts/ci/gitlab-pipeline-status to validate it. If this is done by all subsystem maintainers, then the number of PULL requests that fail to pass gating tests should be significantly reduced. It won't be zero because the gating tests will still run more jobs than are available via the public runners. Also git master can move between the time the maintainer runs CI and when their PULL is actually applied to master, which can trigger test failures. Pulls should have a quicker turn around time if we have less v2/v3/v4s clogging up the queue of pending merges. Earlier I mentioned that we have three CI systems (GitLab, Cirrus, Travis). It is possible for maintainers to send to all three and wait for results. This has a larger upfront setup burden in terms account creation and configuration, as well as a larger burden in monitoring results. It is better if we start off simple and only focus on GitLab CI, because that runs the vast majority of our jobs, and is also what will soon be used as a gating test for master. IOW, if subsystem mintainers want to test on Cirrus/Travis, great, but lets consider that strictly optional. If we want to take this a step further, then subsystem maintainers may wish to have individual contributors to show that they have run CI tests before queuing patches in a staging tree. This is probably unreasonable for small patches especially from occassional contributors, but for large or complex patch series, or regular contributors it is likely a net win to set expectation that they should be testing via GitLab CI. IOW, we should update our contributing guidelines to encourage, but not mandate, patches be sent through GitLab CI before submission for review. This should catch many of the silly mistakes contributors make in their code, and reduce the burden on reviewers, and reduce problems hit later by maintainers or gating CI. Finally, some people may have noticed that GitLab is planning to reduce the number of free CI minutes they provide to users. This has the potential to be a significant problem to running QEMU GitLab CI tests. Many open source projects have given them feedback about the negative impact it would have and their response has been positive about wanting to find a mutually acceptable solution to support OSS projects. We don't know the outcome yet, but I'm somewhat positive there will be a viable solution. Worst case we can go back to current situation with minimal testing of pull requests, but I'm hopeful it won't come to that. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|