Our pools of test slaves are often at or over capacity, which increases
job coalescing and test wait times. This, in turn,
can lead to longer tree closures caused by test bustage, and can cause
try runs to be very slow to complete.

One of the easiest ways to mitigate this is to run tests less often.
To assess the impact of doing this, we will be performing an experiment
the week of August 25, in which we will run debug tests on
mozilla-inbound on most desktop platforms every other run, instead of
every run as we do now. Debug tests on linux64 will continue to run
every time. Non-desktop platforms and trees other than mozilla-inbound
will not be affected.

This approach is based on the premise that the number of debug-only
platform-specific failures on desktop is low enough to be manageable,
and that the extra burden this imposes on the sheriffs will be small
enough compared to the improvement in test slave metrics to justify the
cost.

While this experiment is in progress, we will be monitoring job
coalescing and test wait times, as well as impacts on sheriffs and
developers. If the experiment prevents sheriffs from doing their job
effectively, it can be terminated early.

We intend to use the data we collect during the experiment to inform
decisions about the additional tooling we would need to make this or a
similar plan permanent at some point in the future, as well as to
validate the premise on which this experiment is based.

After the experiment concludes, a follow-up post will discuss our
findings. If you have any concerns, feel free to reach out to me.

Jonathan
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform