This is a great point, and I still have no idea what caused the Linux
32/64 machines to change on July 30th.  It appeared to be a gradual
rollout (which suggests a machine-level issue that was picked up on
reboot or something similar).  For running Talos tests we pin to a
specific revision of the talos repository, which avoids any issues with
pulling from tip.  As a side note, we are really close to landing Talos
in-tree, which will remove one level of complexity in understanding
this.
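
For what it's worth, the "pin" is just a fixed changeset in the talos
repository rather than tip.  A minimal sketch of the idea is below; the
repository URL matches the public talos repo, but the revision hash, the
function name, and the way the pin is actually stored in our harness
configuration are illustrative, not our exact setup:

    # Sketch: check out talos at a pinned revision instead of tip, so
    # later pushes to the talos repo cannot silently change what a test
    # run uses.
    import subprocess

    TALOS_REPO = "https://hg.mozilla.org/build/talos"
    TALOS_REVISION = "0123456789ab"  # placeholder hash, not a real pin

    def checkout_pinned_talos(dest="talos"):
        # Clone without updating the working copy, then update to the
        # pinned revision explicitly.
        subprocess.check_call(["hg", "clone", "--noupdate", TALOS_REPO, dest])
        subprocess.check_call(["hg", "update", "--rev", TALOS_REVISION],
                              cwd=dest)

    if __name__ == "__main__":
        checkout_pinned_talos()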

Regarding the regression cpearce mentioned: after receiving the alert
(which in this case arrived about a day after the push), we did
retriggers on the suspect revision and the previous revision to rule out
any infrastructure changes.  This would be equivalent to doing a try
push at that point in time.  Sadly, the numbers only solidified the
conclusion that we had a regression.  Yet a couple of days later,
pushing to try and doing retriggers showed a completely different set of
numbers; historically this happens about twice a year.  Given the stated
policy, I am fairly certain this patch would have been backed out (even
with a try push) and we would have relanded it.  We back out patches
many times a day for unittest or build failures, and sometimes
incorrectly if we mistake the root cause or it is not clear.  We have
been trying to follow practices similar to those of the main tree
sheriffs, and this 48-hour policy gets us a lot closer.
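
To make the retrigger comparison concrete, here is a minimal sketch of
the kind of check involved, assuming we already have the per-job summary
numbers from the retriggers on each revision.  The numbers, the use of
medians, and the 2% cutoff are made up for illustration; the real
alerting analysis is more involved than this:

    # Sketch: compare retriggered Talos results on the suspect push
    # against its parent push.  Data and threshold are illustrative only.
    from statistics import median

    parent_rev  = [245.1, 243.8, 246.0, 244.5, 245.3]  # parent push retriggers
    suspect_rev = [252.9, 254.2, 251.7, 253.5, 252.4]  # suspect push retriggers

    def percent_change(before, after):
        base = median(before)
        return (median(after) - base) / base * 100.0

    change = percent_change(parent_rev, suspect_rev)
    THRESHOLD = 2.0  # made-up cutoff for "looks like a real regression"

    if change > THRESHOLD:
        print("looks like a regression: %.2f%% slower" % change)
    else:
        print("within noise: %.2f%% change" % change)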

I do think it is valid for us to push to try to verify that the backout
fixes the regression.  The danger is that we have to wait 20 hours for
try results (now approaching 72 hours), and in the meantime other
patches that depend on the one in question land.  This is why I am
skeptical about adding a try push if we already have enough data on the
main tree.  I guess if we cannot trust what is on the tree, including
retriggered jobs, to show a trend, then we would not be able to trust
try either.  Do chime in if I am missing something beyond the
once-or-twice-a-year case where a core infrastructure issue gives us
false data.

Thanks for bringing up your concerns so far; I look forward to making
future regression bugs more reliable and trustworthy!


On Wed, Aug 19, 2015 at 1:44 AM, L. David Baron <dba...@dbaron.org> wrote:

> On Wednesday 2015-08-19 10:43 +1200, Chris Pearce wrote:
> > We recently had a false positive Talos regression on our team, which
> turned
> > out to be caused by a change to the test machine coinciding with our
> push.
> > This took up a bunch of energy and time away from our team, which we
> really
> > can't afford.
>
> I thought we'd collectively learned this lesson a number of times in
> the past, but it seems to keep happening.  Machine configuration
> changes need to either happen in the repository or happen as part of
> a tree closure in which all runs complete, the configuration change
> is made, a dummy changeset is pushed to all trees, and the trees
> reopened.
>
> I think this is in violation of
>
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures
> (see the first bullet point).
>
> -David
>
> --
> 𝄞   L. David Baron                         http://dbaron.org/   𝄂
> 𝄢   Mozilla                          https://www.mozilla.org/   𝄂
>              Before I built a wall I'd ask to know
>              What I was walling in or walling out,
>              And to whom I was like to give offense.
>                - Robert Frost, Mending Wall (1914)
>
