On Tue, Dec 3, 2013 at 5:02 PM, Nicholas Nethercote <n.netherc...@gmail.com> wrote:
>
> I'm working on getting Valgrind (Linux64-only) runs visible on TBPL.
>
> In order to understand what needs to be done, I looked at the "Requirements
> for being shown in the default TBPL view", from
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy.
>
> 3) Runs on mozilla-central and all trees that merge into it
> 4) Scheduled on every push
Chris Atlee kindly implemented the required buildbot configuration changes, so the Valgrind runs now occur on every push to m-c, m-i, fx-team, etc. This is great.

Unfortunately, on December 5th, just before the changes were enabled, something else happened that caused the Valgrind jobs to start timing out all the time. This is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=948145.

The Valgrind-on-TBPL jobs do an --enable-valgrind build (using the mozconfig at browser/config/mozconfigs/linux64/valgrind) and then run scripts/valgrind/valgrind.sh (from the tools/ repo), which invokes the |make pgo-profile-run| target under Valgrind, which in turn runs build/pgo/profileserver.py. At some point during the |make pgo-profile-run| target, we get this:

  command timed out: 1200 seconds without output, attempting to kill

Because the Valgrind invocation is defined in valgrind.sh, which lives in the tools/ repo rather than in mozilla-central/, I've been able to do some interesting experiments -- I can alter the Valgrind invocation in valgrind.sh and re-trigger old jobs. Here's what I have learned.

- It's not Valgrind that's causing the problem. If I change valgrind.sh so that it doesn't invoke Valgrind at all, but instead runs the test natively, I still get the timeouts. Here's sample output:

> + make pgo-profile-run
> /builds/slave/m-cen-l64-valgrind-00000000000/objdir/_virtualenv/bin/python ../src/build/pgo/profileserver.py
>
> command timed out: 1200 seconds without output, attempting to kill

- Something changed on Dec 5. In the 3am job on Dec 5 on m-c, if I re-run the test under Valgrind it succeeds; if I re-run it natively, it times out. I've done numerous repeats and the behaviour is reliable.

- In the 3am job on Dec 6 on m-c (the next available run), it times out consistently when I re-run it under Valgrind *or* natively. So something was fishy but it wasn't manifesting under Valgrind runs... and then something changed on Dec 5 that caused it to manifest all the time. The regression range is http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1401e4b394ad&tochange=42b2a2adda8f

- The timeout sometimes happens very early -- when running under Valgrind, it sometimes happens before Valgrind's start-up message is even printed. But sometimes it happens later.

- When a job times out, the output produced by Valgrind is usually chopped off in the middle of a line. In fact, it's often chopped off in the middle of a line that would have been produced by a single write() system call. So it feels like the test harness is somehow abruptly stopping its recording of the output.

- The runs that time out take about 20 minutes longer to complete than those that don't. So the timeout detection appears to be working correctly -- something is genuinely hanging for 20 minutes, AFAICT.

I'd love to hear any ideas people might have about this, esp. relevant changes that landed on Dec 5. I have access to a Linux build slave, but I'm not yet sure what I'll do with it.

Nick
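P.S. For anyone who wants to poke at this themselves, below is a minimal sketch of the kind of toggle I mean when I talk about switching valgrind.sh between its usual Valgrind run and a native run. It is not the actual script from the tools/ repo: the OBJDIR default, the RUN_NATIVELY variable, and the WRAPPER hook are all made up for illustration, and the real script plumbs Valgrind through to profileserver.py differently.

  #!/bin/sh
  # Hypothetical sketch only -- not the real tools/scripts/valgrind/valgrind.sh.
  set -e

  # Assumed objdir location; the real jobs use a buildbot-provided path.
  OBJDIR=${OBJDIR:-objdir}
  cd "$OBJDIR"

  if [ -n "$RUN_NATIVELY" ]; then
    # The experiment: run |make pgo-profile-run| natively, with Valgrind
    # completely out of the picture.  This variant still hits the
    # 1200-seconds-without-output timeout on post-Dec-5 builds.
    make pgo-profile-run
  else
    # The normal job: the same target, with the browser wrapped by Valgrind.
    # WRAPPER is a made-up hook; the real wrapping goes through the build
    # system and profileserver.py.
    WRAPPER="valgrind --error-exitcode=1" make pgo-profile-run
  fi

The only point of the sketch is that the two branches differ solely in whether Valgrind is involved, which is why the native-run timeouts let Valgrind off the hook.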