Hi kats,

(cross-posting to dev-b2g)

tl;dr: we think all is OK again; details below. To avoid this happening again this week, we're changing tryserver to reduce the number of Android tests run on tegras by default. If you specifically want tegra testing on tryserver, you will need to state that when pushing to try.

Yes, load on try was unusually heavy today (the b2g workweek is in full force). All other platforms, including our pool of panda Android4 test boards, were handling this heavy load just fine. However, our pool of tegra boards is small (boards are hard to get), had a higher-than-usual percentage offline, and was not able to keep up. We were distracted by an unrelated 64-bit Windows problem, but that's no excuse; we should have detected this earlier. After this is all calm, we'll do a postmortem.

1) As of earlier this afternoon, 976 of the 1078 pending tegra jobs were from try. This was no single "abuse of try"; it was simply an accumulation of a lot of pushes to try spread across the day.

2) We manually repoked every one of the offline tegras, and most are now working correctly again. As of now, our tegra pool is back to a much healthier size, as good as we can hope for with this pool size, and is quickly chewing through the remaining backlog. At this time, we are down to 240 pending jobs, and dropping fast.
http://builddata.pub.build.mozilla.org/reports/pending/pending.html
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915457 "Triage tegras with no completed jobs within last 24 hours"

3) We cancelled all pending tegra test jobs on try that had been waiting over 6 hours (the longest was 10 hours). Note: we did not cancel panda test jobs, and we did not cancel tegra test jobs on other branches. If we cancelled a tegra test job on try that you still need run, please let us know in #releng, and we'll sort it out.
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915481 "Stop try tegra jobs pending for >6 hours"

4) We are changing the default when you push to try. Until now, by default Try has generated Android builds and then run *all* unittests+talos for Android2.2 (tegras) *and* Android4 (pandas).
We are now testing a change to the default as follows:

4a) Android builds: no change, still built by default.
4b) Android panda tests: no change, unittests and talos still run by default.
4c) Android tegra tests: unittests would not be run by default, but talos would still be run by default. Because of details in how TryServer works, talos has to stay enabled on tegras in order to keep scheduling talos on pandas also.

Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915465 "Pushes to try should not trigger tegra test jobs by default".

Again, this only changes the default: anyone who wants Android2.2-specific unittests run on tegras on try can still get that by specifying it explicitly when pushing to try.

Note: this change is for the default on try *only*. There is no change to what any other (non-try) branches do for Android testing on tegras; those remain as-is.

As mhommey and joel discussed earlier in this thread, changing the try default to test on pandas, but not also run all tests on tegras, does introduce a slim-but-non-zero risk of missing a problem that only a tegra would have caught with that try push. Note that we are still running tegra testing on non-try branches, as usual, so even if a problem like this is missed on try, it will be caught the first time it lands on any other branch that has Android coverage.

After this workweek, we can revisit whether this default setting needs to be reverted.

Let me know if you want any further info, ok?

John.

======

On 9/11/13 2:31 PM, Kartikaya Gupta wrote:
> Earlier today the backlog on Android build jobs was on the order of
> 1300. It seems to be coming down a little now but for a while there I
> was worried it was going to grow unboundedly. Try jobs from over 10
> hours ago still have pending jobs - as I'm sure you all know, having a
> 10-hour turnaround on try pushes is something of a productivity killer.
>
> I brought this up in #releng and one of the proposed solutions was to
> try to tweak the prioritization of jobs between Try and Inbound a little
> bit. I personally do like that Inbound jobs are prioritized above Try,
> but perhaps they don't need to be prioritized quite so much. However,
> changing this will affect a number of people, so it was suggested I
> bring the discussion here to get other people's comments.
>
> So, anybody have thoughts on a good way to solve this problem?
>
> Cheers,
> kats
> _______________________________________________
> dev-platform mailing list
> dev-platform@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform