tl;dr: We recently installed system monitoring software on our buildbot masters, build-not-test slaves, and various other RelEng machines. IT wants to continue this rollout by deploying monitoring software onto RelEng production test machines, which raises a concern about possible impact on performance numbers. If you see any production impact, please let us know.
======

IT has asked us to deploy monitoring tools onto all build, unittest and performance testing machines. These tools gather system-level statistics about CPU, memory, disk utilization, etc., so that IT can monitor the efficiency of production jobs run on these systems.

This monitoring software has already been installed on buildbot masters, linux+mac builders, and some misc other servers. As those changes were zero-risk to production, we didn't need to forewarn these newsgroups. However, installing this software on production win32/64 builders and win/mac/linux performance testers carries a small-but-non-zero risk that the act of running these tools will change the timing results in performance test jobs. Hence this advance notice.

Exact timing of this rollout is waiting on some unrelated win64 toolchain builder fixes to finish being deployed into production. We all agreed that adding these monitoring tools *at the same time* as the Windows toolchain upgrade would unnecessarily complicate problem detection. Once everything is ready for final deploy, another post will be sent to the newsgroup (and sheriffs), to help with any possible after-the-fact regression range hunting.

If there is any wobble in performance results because of these changes, I've been told we can tolerate minor disruption to performance results for a week or so without impacting releases. Currently, this experiment is slated to run for 2 weeks, but obviously, if this monitoring introduces larger disruption, we will disable it asap. Sheriffs and RelEng buildduty will be monitoring closely, but as always, if you see anything weird, please make sure they know asap.

No downtime is required, as our systems will pick up these changes between test runs as machines reboot.

The curious can follow along in bug#920626 (deploy collectd to RelEng mac+linux test systems) and bug#920629 (deploy graphite client to RelEng Windows build and test systems).

If you have any questions or concerns, please let me know.

John.
_______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform