On Tue, Dec 3, 2013 at 5:02 PM, Nicholas Nethercote
<n.netherc...@gmail.com> wrote:
>
> I'm working on getting Valgrind (Linux64-only) runs visible on TBPL.
>
> In order to understand what needs to be done, I looked at the "Requirements
> for being shown in the default TBPL view", from
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy.
>
> 3) Runs on mozilla-central and all trees that merge into it
> 4) Scheduled on every push

Chris Atlee kindly implemented the required buildbot configuration changes, so
the Valgrind runs now occur on every push to m-c, m-i, fx-team, etc.  This is
great.

Unfortunately, on December 5th, just before the changes were enabled, something
else happened that caused the Valgrind jobs to start timing out all the time.
This is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=948145.

The Valgrind-on-TBPL jobs do an --enable-valgrind build (using the mozconfig
at browser/config/mozconfigs/linux64/valgrind) and then run
scripts/valgrind/valgrind.sh (from the tools/ repo), which invokes the
|make pgo-profile-run| target under Valgrind, which in turn runs
build/pgo/profileserver.py.
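
For anyone unfamiliar with the setup, the steps the job performs look roughly
like this.  The exact flags, paths, and variable names below are my
paraphrase, not the real script -- the authoritative logic lives in
valgrind.sh in the tools/ repo and in the mozconfig:

  # Build with Valgrind support, using the in-tree mozconfig.
  export MOZCONFIG=browser/config/mozconfigs/linux64/valgrind
  make -f client.mk build

  # Run the profiling step under Valgrind.  Something like --trace-children=yes
  # is needed so that the Firefox process started by profileserver.py is
  # traced, not just make itself.
  cd objdir
  valgrind --error-exitcode=1 --trace-children=yes make pgo-profile-run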

At some point during the |make pgo-profile-run| target, we get this:

  command timed out: 1200 seconds without output, attempting to kill

Because the Valgrind invocation is defined in the valgrind.sh script, which
lives in the tools/ repo rather than in mozilla-central/, I've been able to do
some interesting experiments -- I can alter the Valgrind invocation in
valgrind.sh and re-trigger old jobs.  Here's what I have learned.

- It's not Valgrind that's causing the problem.  If I change valgrind.sh so
  that it doesn't invoke Valgrind at all and instead runs the test natively
  (there's a sketch of this change after the list), I still get the timeouts.
  Here's some sample output:

  > + make pgo-profile-run
  > /builds/slave/m-cen-l64-valgrind-00000000000/objdir/_virtualenv/bin/python ../src/build/pgo/profileserver.py
  >
  > command timed out: 1200 seconds without output, attempting to kill

- Something changed on Dec 5.  In the 3am job on Dec 5 on m-c, if I re-run the
  test under Valgrind it succeeds.  If I re-run it natively, it times out.
  I've done numerous repeats and the behaviour is reliable.

- In the 3am job on Dec 6 on m-c (the next available run), it times out
  consistently when I re-run it under Valgrind *or* natively.  So something
  was already fishy, but it wasn't manifesting in the Valgrind runs... and
  then something changed on Dec 5 that caused it to manifest all the time.
  The regression range is
  http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1401e4b394ad&tochange=42b2a2adda8f

- The timeout sometimes happens very early -- when running under Valgrind, it
  sometimes happens before Valgrind's start-up message is even printed.  But
  sometimes it happens later.

- When a job times out, the output produced by Valgrind is usually chopped off
  in the middle of a line.  In fact, it's often chopped off in the middle of
  a line that would have been produced by a single write() system call.  So it
  feels like the test harness is somehow abruptly stopping its recording of the
  output.

- The runs that time out take about 20 minutes longer to complete than those
  that don't, which matches the 1200-second no-output limit.  So the
  timeout detection appears to be working correctly -- something genuinely
  produces no output for about 20 minutes, AFAICT.

I'd love to hear any ideas people might have about this, esp. relevant changes
that landed on Dec 5.  I have access to a Linux build slave but I'm not yet
sure what I'll do with it.

Nick
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
