On 5/16/21 3:31 AM, Amos Jeffries wrote:
> On 4/05/21 2:29 am, Alex Rousskov wrote:
>> On 5/3/21 12:41 AM, Francesco Chemolli wrote:
>>> - we want our QA environment to match what users will use. For this
>>> reason, it is not sensible that we just stop upgrading our QA nodes,
>>
>> I see flaws in reasoning, but I do agree with the conclusion -- yes, we
>> should upgrade QA nodes. Nobody has proposed a ban on upgrades AFAICT!
>>
>> The principles I have proposed allow upgrades that do not violate key
>> invariants. For example, if a proposed upgrade would break master, then
>> master has to be changed _before_ that upgrade actually happens, not
>> after. Upgrades must not break master.
>
> So ... a node is added/upgraded. It runs and builds master fine. Then
> added to the matrices some of the PRs start failing.
It is easy to misunderstand what is going on because there is no good
visualization of the complex PR-master-Jenkins_nodes-Jenkins_failures
relationships. Several kinds of PR test failures are possible. I will
describe the two most relevant to your email:

* PR test failures due to problems introduced by PRs should be welcomed
at any time. CI improvements are allowed to find new bugs in open PRs.
Such findings, even when discovered at the "last minute", should be seen
as an overall positive event or progress -- our CI was able to identify
a problem before it got officially accepted! I do not recall anybody
complaining about such failures recently.

* PR test failures due to the existing master code are not welcomed.
They represent a CI failure. In these cases, if the latest master code
is tested with the same test after the problematic CI change, then that
master test will fail. Nothing a PR can do in this situation can fix
this kind of failure because it is not PR changes that are causing the
failure -- CI changes broke the master branch, not just the PR! Failures
of this kind are the responsibility of CI administrators, and PR authors
should complain about them, especially when there are no signs that CI
administrators are aware of and working on addressing the problem.

A good example of a failure of the second kind is a
-Wrange-loop-construct error in a PR that does not touch any range
loops (Jenkins conveniently deleted the actual failed test, but my
GitHub comment and PR contents may be enough to restore what happened):
https://github.com/squid-cache/squid/pull/806#issuecomment-827924821

[I am skipping the exaggerations (about other people's exaggerations)
that were, AFAICT, driven by mis-classifying failures of the second
kind as regular PR failures of the first kind.]

>> What this means in terms of sysadmin steps for doing upgrades is up to
>> you. You are doing the hard work here, so you can optimize it the way
>> that works best for _you_.
>> If really necessary, I would not even object
>> to trial upgrades (that may break master for an hour or two) as long as
>> you monitor the results and undo the breaking changes quickly and
>> proactively (without relying on my pleas to fix Jenkins to detect
>> breakages). I do not know what is feasible and what the best options
>> are, but, again, it is up to _you_ how to optimize this (while observing
>> the invariants).

> Uhm. Respectfully, from my perspective the above paragraph conflicts
> directly with actions taken.

My paragraph is not really about any taken actions, so there cannot be
any such conflict AFAICT.

> From what I can tell kinkie (as sysadmin) *has* been making a new node
> and testing it first. Not just against master but the main branches and
> most active PRs before adding it for the *post-merge* matrix testing
> snapshot production.

The above summary does not match the (second kind of) PR test failures
that I have observed and asked Francesco to address. Those failures
were triggered by the latest master code, not PR changes. No PR changes
would have fixed those test failures. In fact, an "empty" PR that
introduces no code changes at all would have failed as well. See above
for one recent example.

> But still threads like this one with complaints appear.

This squid-dev thread did not contain any complaints AFAICT. Francesco
is trying to get consensus on how to handle CI upgrades/changes. He is
doing the right thing, and nobody is complaining about it.

> I understand there is some specific pain you have encountered to trigger
> the complaint. Can we get down to documenting as exactly as possible
> what the particular pain was?

Well, I apparently do not know what you call a "complaint", so I cannot
answer this exact question, but if you are looking for a recent PR test
failure that triggered this squid-dev thread, then please see the
GitHub link above.
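As an aside, the distinction between the two failure kinds boils down to
one mechanical check: re-run the same test against the latest master
code on the same node. If master fails too, the PR is innocent. A
minimal sketch of that triage logic (the helper name and exit-code
convention are mine, for illustration only -- this is not actual Squid
CI tooling):

```shell
#!/bin/sh
# Triage a PR test failure by comparing it with a rerun of the very
# same test against the latest master code on the same CI node.
#
# $1 = exit code of the PR test run (0 means the test passed)
# $2 = exit code of the same test run against latest master
classify_failure() {
    pr_result=$1
    master_result=$2
    if [ "$pr_result" -eq 0 ]; then
        echo "no failure"
    elif [ "$master_result" -ne 0 ]; then
        # Even an "empty" PR would fail this test: the CI change broke
        # master, so CI admins must fix or revert the CI change.
        echo "second kind: CI broke master"
    else
        # Master passes, so the PR changes introduced the problem and
        # the PR author must fix it.
        echo "first kind: PR bug"
    fi
}
```

For example, the -Wrange-loop-construct failure linked above would
classify as "second kind", because rebuilding untouched master on the
upgraded node produces the same error.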
> Test designs that do not fit into our merge and release process sequence
> have proven time and again to be broken and painful to Alex when they
> operate as-designed. For the rest of us it is this constant re-build of
> automation which is the painful part.

While I am not sure what you are talking about specifically, I think
that any design that causes pain should be changed. Your phrasing
suggests some dissatisfaction with that natural (IMO) expectation, so I
am probably missing something. Please rephrase if this is important.

> B. PR submission testing
>    - which OS for master (5-pr-test) ?
>    - which OS for beta (5-pr-test) ?
>    - which OS for stable (5-pr-test) ?
>
> Are all of those sets the same identical OS+compilers? no.
> Why are they forced to be the same matrix test?

I do not understand the question. Are you asking why Jenkins uses the
same 5-pr-test configuration for all three branches (master, beta,
_and_ stable)? I do not know the answer.

> IIRC, policy forced on sysadmin with previous pain complaints.

Complaints, even legitimate ones, should not shape a policy. Goals and
principles should do that. I remember one possibly related discussion
where we were trying to reduce Jenkins/PR wait times by changing which
tests are run at what PR merging stages, but that is probably a
different issue because your question appears to be about a single
merging stage.

> C. merge testing
>    - which OS for master (5-pr-auto) ?
>    - which OS for beta (5-pr-auto) ?
>    - which OS for stable (5-pr-auto) ?
>    NP: maintainer does manual override on beta/stable merges.
>
> Are all of those sets the same identical OS+compilers? no.
> Why are they forced to be the same matrix test? Anubis

This is too cryptic for me to understand, but Anubis does not force any
tests on anybody -- it simply checks that the required tests have
passed. I am not aware of any Anubis bugs in this area, but please
correct me if I am wrong.

> D.
> pre-release testing (snapshots + formal)
>    - which OS for master (trunk-matrix) ?
>    - which OS for beta (5-matrix) ?
>    - which OS for stable (4-matrix) ?
>
> Are all of those sets the same identical OS+compilers? no.
> Are we forcing them to use the same matrix test? no.
> Are we getting painful experiences from this? maybe.
> Most loud complaints have been about "breaking master" which is the
> most volatile branch testing on the most volatile OS.

FWIW, I think you misunderstood what those "complaints" were about. I
do not know how that relates to the above questions/answers though.

> FTR: the reason all those matrices have '5-' prefix is because several
> redesigns ago the system was that master/trunk had a matrix which the
> sysadmin added nodes to as OS upgraded. During branching vN the
> maintainer would clone/freeze that matrix into an N-foo which would be
> used to test the code against OS+compilers which the code in the vN
> branch was designed to build on.

I think the above description implies that some time ago we were (more)
careful about (not) adding new nodes when testing stable branches. We
did not want a CI change to break a stable branch. That sounds like the
right principle to me (and it should apply to beta and master as well).
How that specific principle is accomplished is not important (to me),
so CI admins should propose whatever technique they think is best.

> Can we have the people claiming pain specify exactly what the pain is
> coming from, and let the sysadmin/developer(s) with specialized
> knowledge of the automation in that area decide how best to fix it?

We can, and that is exactly what is going on in this thread AFAICT.
This particular thread was caused by CI changes breaking master, and
Francesco was discussing how to avoid such breakages in the future.
There are other goals/principles to observe, of course, and it is
possible that Francesco is proposing more changes to optimize something
else as well, but that is something only he can clarify (if needed).

AFAICT, Francesco and I are on the same page regarding not breaking
master anymore -- he graciously agreed to prevent such breakages in the
future, and I am very thankful that he did. Based on your comments
discussing several cases where such master breakage is, in your
opinion, OK, you currently disagree with that principle. I do not know
why.

>> I believe we should focus on the first two tiers for our merge workflow,
>> but then expect devs to fix any breakages in the third and fourth tiers
>> if caused by their PR,

>> The rules I have in mind use two natural tiers:
>>
>> * If a PR cannot pass a required CI test, that PR has to change before
>> it can be merged.
>>
>> * If a PR cannot pass an optional CI test, it is up to PR author and
>> reviewers to decide what to do next.

> That is already the case. Already well documented and understood.
> I see no need to change anything based on those criteria.

I hope so! I stated those rules in direct response to, what seemed to
me like, a far more complex 4-tier proposal by Francesco. I did not
state them to change some Project policy. I stated them in hope to
understand (and/or simplify) his proposal.

>> These are very simple rules that do not require developer knowledge of
>> any complex test node tiers that we might define/use internally.

> This is the first I've heard about dev having to have such knowledge.

Perhaps you interpreted the email I was replying to differently than I
did. I hope the above paragraph clarifies the situation.

>> Needless to say, the rules assume that the tests themselves are correct.
>> If not, the broken tests need to be fixed (by the Squid Project) before
>> the first bullet/rule above can be meaningfully applied (the second one
>> is flexible enough to allow PR author and reviewers to ignore optional
>> test failures).

> There is a hidden assumption here too. About the test being applied
> correctly.

For me, test application (i.e., what tests are applied to what
software) correctness is a part of the overall test correctness (which
naturally has many components). From that point of view, there is no
hidden assumption here.

> I posit that is the real bug we need to sort out. We could keep on
> "correcting" the node sets (aka tests) back and forward between being
> suitable for master or suitable for release branches. That just shuffles
> the pain from one end of the system to the other.
> Make Anubis and Jenkins use different matrix for each branch at the B
> and C process stages above. Only then will discussion of what nodes to
> add to what test/matrix actually make progress.

Anubis is pretty much a red herring here AFAICT -- IIRC, Anubis just
checks that enough required tests have passed. The test results are
supplied by Jenkins and Semaphore CI.

If you think that each target branch should have its own set of Jenkins
tests for a staged commit, then I see no problem with that per se. You
or Francesco should configure GitHub and Jenkins accordingly. I
recommend grouping all those tests under one name (that single name can
be unique to each target branch), so that Anubis can be told how many
tests to expect using a single number (rather than many branch-specific
settings). If that grouping becomes a serious problem, we can change
Anubis to support a different number of tests for different target
branches.

> The principle ("invariant" in Alex terminology?) with nodes is that they
> represent the OS environment a typical developer can be assumed to be
> running on that OS version+compiler combination.
FWIW, that principle or invariant (I often use those two words
interchangeably) feels a bit unfortunate to me: Given scarce resources,
our CI tests should not target Squid developers. They should target
typical Squid _deployments_ instead. Fortunately, the difference is
often minor.

> Distros release security updates to their "stable" versions. Therefore
> to stay true to the goal we require constant small upgrades as an
> ongoing part of sysadmin maintenance.

Yes, I think we are all in agreement that nodes should be upgraded.
Deployments often do not upgrade their OSes as often as distros do,
even for "security" reasons, but we have to pick the lesser of two
evils and upgrade as often as we can (correctly). The only possible
disagreement that I know of is about _prerequisites_ for an upgrade, as
revisited below.

> Adding new nodes with next distro release versions is a manual process
> not related to keeping existing nodes up to date (which is automated?).

Sure. And new nodes should not be added if that addition breaks master.

> From time to time distros break their own ability to compile things.
> It does not indicate "broken master" nor "broken CI" in any way.

Distro changes cannot indicate "broken master" or "broken CI" _until_
we actually apply them to our CI nodes. If upgrading a node would break
or broke master (as already defined many times), then that upgrade
should not be done or should be undone (until master is changed to
allow the upgrade to proceed). Keeping master tests successful is a key
upgrade precondition IMO.

>> There are many ways to break CI and detect those breakages, of course,
>> but if master cannot pass required tests after a CI change, then the
>> change broke CI.

> I have yet to see the code in master be corrupted by CI changes in such
> a way that it could not build on peoples development machines.
FWIW, I do not see how the above assertion (or anything about people's
development machines) is relevant to this discussion about Project CI
and the official merge process.

> What we do have going on is network timeouts, DNS resolution, CPU wait
> timeouts, and rarely _automated_ CI upgrades all causing short-term
> failure to pass a test.

Intermittent failures (DNS and such) are out of scope here. CI
upgrades, automated or not, should not break master. If they
accidentally do, they should be reverted. If reversal is not possible,
they should not be attempted in the first place.

> A PR fixing newly highlighted bugs gets around the latter. Any pain (eg
> master blocked for 2 days waiting on the fix PR to merge) is a normal
> problem with that QA process and should not be attributed to the CI change.

I do not understand why a problem caused by the CI change should not be
attributed to that CI change.

> We do need to stop the practice of just dropping
> support for any OS where attempting to build finds existing bugs in
> master (aka "breaks master, sky falling"). More focus on fixing those
> bugs to increase portability and grow the Squid community beyond the
> subset of RHEL and Ubuntu users.

If somebody can add support for more environments (and adding that
support does not make things worse for Squid), then they should
certainly add that support! I see no relationship with this discussion
-- nobody here wants to prohibit such useful activities, and the "do
not break master" invariant can be upheld while adding such support.


Alex.

_______________________________________________
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev