This is a great point, and I still have no idea what caused the Linux 32/64 machines to change on July 30th. It appeared to be a gradual rollout, which suggests a machine-level change that was picked up on reboot or something similar. For running Talos tests we pin to a specific revision of the talos repository; this avoids any issues with pulling from tip. As a side note, we are really close to landing Talos in-tree, which will remove one level of complexity in understanding this.
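For illustration only, here is a minimal sketch of what pinning to a revision can look like on the harness side, assuming a small JSON manifest that names the repo and revision (the manifest layout, field names, and function are my own illustration, not the actual in-tree code):

    # Illustrative sketch only: clone the talos repo at the revision named in a
    # manifest, so runs never silently pick up new commits from tip.
    import json
    import subprocess

    def checkout_pinned_talos(manifest_path="talos.json", dest="talos"):
        with open(manifest_path) as f:
            manifest = json.load(f)
        repo = manifest["global"]["talos_repo"]          # assumed field name
        revision = manifest["global"]["talos_revision"]  # assumed field name
        # "hg clone -u REV" updates the working copy to that exact revision.
        subprocess.check_call(["hg", "clone", "-u", revision, repo, dest])
        return revision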
Regarding the regression cpearce mentioned: after receiving the alert (which, in this case, arrived about a day after the push), we did retriggers on that revision and the previous revision to rule out any infrastructure changes. This is equivalent to doing a try push at that point in time. Sadly, the numbers only solidified the fact that we had a regression. Yet a couple of days later, pushing to try and doing retriggers showed a completely different set of numbers. Historically this happens about twice a year. Given the stated policy, I am fairly certain this would have been backed out (even with a try push) and we would have relanded. We back out patches many times a day for unittest or build failures, and sometimes incorrectly, when we mistake the root cause or it is not clear.

We have been trying to follow practices similar to those of the main tree sheriffs, and this 48 hour policy gets us a lot closer. I do think it is valid for us to push to try to verify that the backout fixes the regression. The danger is when we have to wait 20 hours for try results (pushing the total towards 72 hours) and other patches that depend on the patch in question land in the meantime. This is why I am skeptical about adding a try push if we already have enough data on the main tree. I guess if we cannot trust what is on the tree, including retriggered jobs, to show a trend, then we would not be able to trust try either. Do chime in if I am missing something beyond the once-or-twice-a-year case where a core infrastructure issue gives us false data.

Thanks for bringing up your concerns so far; I look forward to making future regression bugs more reliable and trustworthy!

On Wed, Aug 19, 2015 at 1:44 AM, L. David Baron <dba...@dbaron.org> wrote:

> On Wednesday 2015-08-19 10:43 +1200, Chris Pearce wrote:
> > We recently had a false positive Talos regression on our team, which
> > turned out to be caused by a change to the test machine coinciding with
> > our push. This took up a bunch of energy and time away from our team,
> > which we really can't afford.
>
> I thought we'd collectively learned this lesson a number of times in
> the past, but it seems to keep happening. Machine configuration
> changes need to either happen in the repository or happen as part of
> a tree closure in which all runs complete, the configuration change
> is made, a dummy changeset is pushed to all trees, and the trees
> reopened.
>
> I think this is in violation of
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures
> (see the first bullet point).
>
> -David
>
> --
> 𝄞 L. David Baron http://dbaron.org/ 𝄂
> 𝄢 Mozilla https://www.mozilla.org/ 𝄂
> Before I built a wall I'd ask to know
> What I was walling in or walling out,
> And to whom I was like to give offense.
> - Robert Frost, Mending Wall (1914)