Reminder on Try usage and infrastructure resources

Stuart Philp Thu, 14 Sep 2017 08:36:21 -0700

Hello all,

As we near 57, the Firefox CI group felt it was important to send out a
reminder regarding infrastructure usage when you push.


*tl;dr* There is a real cost (both time and $) to using the 'all' flags in
pushes. They are there if you need them, but please remember to think about
what platforms and test suites you need to execute before you push, and
limit the scope of execution if you can.

A bit of background: our build and test infrastructure is a mix of physical
hardware and AWS cloud instances. AWS scales dynamically to our load, but
our physical hardware is limited. Occasionally you might see wait times and
queues build up; this is typically due to our hardware being overwhelmed.
When it gets really bad, we sometimes have to close the trees to allow the
machines to catch up. Obviously, that's not good for anyone. Specifically,
over the last few weeks we have seen a few long backlogs on our OSX
machines, one of which required a tree closure. We never want to have to
close trees; it's a last resort, especially this close to beta.

Because of the physical hardware limitation, this is particularly
concerning for performance tests and tests that run on OSX (OSX builds are
now cross-compiled on Linux and not really affected). If you don't need to
run perf or OSX tests, please consider excluding them from your pushes.
ahal sent mail a few weeks ago about the new fuzzy matching tool
<https://ahal.ca/blog/2017/mach-try-fuzzy/>, which can help you figure out
what to select.

To give you an idea of scale, we average 1000 pushes per week on
integration branches (excluding try). Our desktop tests alone (excluding
numbers for android, build jobs, and a handful of others) use roughly 900
machine hours per push, or about 900k machine hours per week combined.
Including try and those other configurations, you can roughly double these
numbers. Needless to say, that's a lot of machine time, so any savings we
can get really add up.
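
For anyone who wants to play with the numbers, here is a rough sketch of
the arithmetic above. The two input figures come from this mail; the
function names and the 10% trimming scenario are made up purely for
illustration and are not measurements from our infrastructure.

```python
# Figures quoted in this mail.
PUSHES_PER_WEEK = 1000         # integration branches, excluding try
MACHINE_HOURS_PER_PUSH = 900   # desktop tests only


def weekly_machine_hours(pushes_per_week, hours_per_push):
    """Total machine hours consumed per week."""
    return pushes_per_week * hours_per_push


def hours_saved_per_week(percent_trimmed,
                         pushes_per_week=PUSHES_PER_WEEK,
                         hours_per_push=MACHINE_HOURS_PER_PUSH):
    """Machine hours saved per week if every push ran percent_trimmed
    percent less work. The percentage is a hypothetical input."""
    total = weekly_machine_hours(pushes_per_week, hours_per_push)
    return total * percent_trimmed // 100


desktop_total = weekly_machine_hours(PUSHES_PER_WEEK, MACHINE_HOURS_PER_PUSH)

# "Roughly double" once try and the other configurations are included.
estimated_total = 2 * desktop_total

print("desktop tests, integration branches:", desktop_total)
print("rough total including try and others:", estimated_total)
print("saved by a hypothetical 10% trim:", hours_saved_per_week(10))
```

Even a small per-push reduction multiplies across every push in the week,
which is why limiting the scope of each push matters.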

We are continuously monitoring our capacity requirements for today and for
the future (new platforms, updated OSes, new experiments, new tests, etc).
But it's a dynamic problem, and sometimes things pile up. While we accept
that today, it's a problem we want to further limit in the future. There
are a lot of interesting things we're working on here, such as selective
test execution, intermittent reduction strategies, smarter tooling, and
smarter infrastructure allocation, that will hopefully go a long way toward
reducing these issues. We'll continue to update everyone here as we make
those improvements.

In the meantime, just a reminder to be diligent about which platforms and
test suites you are running.

If you have any questions, feel free to reach out.

Thanks!
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
