On 2013-08-14 16:10, Matthew Treinish wrote:
On Wed, Aug 14, 2013 at 11:05:35AM -0500, Ben Nemec wrote:
On 2013-08-13 16:39, Clark Boylan wrote:
>On Tue, Aug 13, 2013 at 1:25 PM, Matthew Treinish
><mtrein...@kortar.org> wrote:
>>
>>Hi everyone,
>>
>>So for the past month or so I've been working on getting tempest
>>to work stably
>>with testr in parallel. As part of this you may have noticed the
>>testr-full
>>jobs that get run on the zuul check queue. I was using that job
>>to debug some
>>of the more obvious race conditions and stability issues with
>>running tempest
>>in parallel. After a bunch of fixes to tempest and finding some
>>real bugs in
>>some of the projects things seem to have smoothed out.
>>
>>So I pushed the testr-full run to the gate queue earlier today.
>>I'll be keeping
>>track of the success rate of this job vs the serial job and use
>>this as the
>>determining factor before we push this live to be the default
>>for all tempest
>>runs. So assuming that the success rate matches up well enough
>>with the serial job
>>on the gate queue then I will push out the change that will
>>migrate all the
>>voting jobs to run in parallel hopefully either Friday afternoon
>>or early next
>>week. Also, if anyone has an opinion on what threshold is
>>good enough
>>for this, I'd welcome the input. For example, do we want
>>to ensure
>>a >= 1:1 match for job success? Or would something like 90% as
>>stable as the
>>serial job be good enough considering the speed advantage. (The
>>parallel runs
>>take about half as much time as a full serial run, the parallel
>>job normally
>>finishes in ~25-30min) Since this affects almost every project I
>>don't want to
>>define this threshold without input from everyone.
>>
>>After there is some more data for the gate queue's parallel job
>>I'll have some
>>pretty graphite graphs that I can share comparing the success
>>trends between
>>the parallel and serial jobs.
>>
>>So at this point we're in the home stretch and I'm asking for
>>everyone's help
>>in getting this merged. So, if everyone who is reviewing and
>>pushing commits
>>could watch the results from these non-voting jobs and if things
>>fail on the
>>parallel job but not the serial job please investigate the
>>failure and open a
>>bug if necessary. If it turns out to be a bug in tempest please
>>link it against
>>this blueprint:
>>
>>https://blueprints.launchpad.net/tempest/+spec/speed-up-tempest
>>
>>so that I'll give it the attention it deserves. I'd hate to get
>>this close to
>>getting this merged and have a bit of racy code get merged at
>>the last second
>>and block us for another week or two.
>>
>>I feel that we need to get this in before the H3 rush starts up
>>as it will help
>>everyone get through the extra review load faster.
>>
>Getting this in before the H3 rush would be very helpful. When we made
>the switch with Nova's unittests we fixed as many of the test bugs
>that we could find, merged the change to switch the test runner, then
>treated all failures as very high priority bugs that received
>immediate attention. Getting this in before H3 will give everyone a
>little more time to debug any potential new issues exposed by Jenkins
>or people running the tests locally.
>
>I think we should be bold here and merge this as soon as we have good
>numbers that indicate the trend is for these tests to pass. Graphite
>can give us the pass to fail ratios over time; as long as these trends
>are similar for both the old nosetest jobs and the new testr job I say
>we go for it. (Disclaimer: most of the projects I work on are not
>affected by the tempest jobs; however, I am often called upon to help
>sort out issues in the gate).
I'm inclined to agree. It's not as if we don't have transient
failures now, and if we're looking at a 50% speedup in
recheck/verify times then as long as the new version isn't
significantly less stable it should be a net improvement.
Of course, without hard numbers we're kind of discussing in a vacuum
here.
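
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It assumes each run passes independently with a fixed probability, so the expected number of attempts until a pass is 1/p. The durations and success rates below are made-up placeholders, not measurements from the gate, and the model ignores the cost a gate reset imposes on other changes in the queue.

```python
def expected_minutes_to_pass(duration_min, success_rate):
    """Expected wall-clock minutes until the first passing run.

    Assumes independent runs with a constant success probability, so the
    number of attempts is geometric with mean 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return duration_min / success_rate


# Hypothetical inputs: serial takes about twice as long as parallel,
# and parallel is assumed to be only 90% as stable as serial.
serial = expected_minutes_to_pass(duration_min=55, success_rate=0.95)
parallel = expected_minutes_to_pass(duration_min=28, success_rate=0.95 * 0.90)

print("serial:   %.1f min" % serial)
print("parallel: %.1f min" % parallel)
```

Under this simplified model the parallel job comes out ahead whenever its success rate relative to serial is higher than the ratio of the run times (roughly 0.5 with these placeholder durations).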
I also would like to get this in sooner rather than later and fix the
bugs as
they come in. But, I'm wary of doing this because there isn't a proven
success
history yet. No one likes gate resets, and I've only been running it on
the
gate queue for a day now.
So here is the graphite graph that I'm using to watch parallel vs
serial in the
gate queue:
https://tinyurl.com/pdfz93l
Okay, so what are the y-axis units on this? Because just guessing I
would say that it's percentage of failing runs, in which case it looks
like we're already within the 95% as accurate range (it never dips below
-.05). Am I reading it right?
On that graph the blue and yellow show the number of jobs that
succeeded
grouped together in per hour buckets. (yellow being parallel and blue
serial)
Then the red line is showing failures, a horizontal bar means that
there is no
difference in the number of failures between serial and parallel. When
it dips
negative it is showing a failure in parallel that wasn't on a
serial run
at the same time. When it goes positive it is showing a failure on serial
that
doesn't occur on parallel at the same time. But, because the serial
runs take
longer the failures happen at an offset. So if the plot shows parallel
fails
followed closely by a serial failure then that is probably on the same
commit
and not a parallel specific issue.
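
The red line described above could be reproduced outside of graphite with something like the following Python sketch. The input format (a list of (hour, job, passed) tuples) and the function name are made up for illustration; this is not what the graphite query does internally.

```python
from collections import Counter


def failure_delta_by_hour(results):
    """Return {hour: serial_failures - parallel_failures}.

    `results` is an iterable of (hour, job, passed) tuples where job is
    "serial" or "parallel". Negative values mean parallel failed more
    often than serial in that hour; positive values mean the opposite.
    """
    failures = Counter()
    hours = set()
    for hour, job, passed in results:
        hours.add(hour)
        if not passed:
            failures[(hour, job)] += 1
    return {
        hour: failures[(hour, "serial")] - failures[(hour, "parallel")]
        for hour in sorted(hours)
    }


# Made-up sample data, not real gate results.
sample = [
    (14, "serial", True), (14, "parallel", False),
    (15, "serial", False), (15, "parallel", True),
    (16, "serial", True), (16, "parallel", True),
]
print(failure_delta_by_hour(sample))
```

In the sample, a -1 in one hour followed by a +1 in the next is the offset pattern: the same commit probably failed both jobs, with the serial failure landing in a later bucket because that run takes longer.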
Based on the results so far it looks like there is probably still a
race or 2
which would cause gate resets more than once in a day if we move to
parallel.
But it's getting closer. What does everyone think?
My only concern is that in the time it takes me to track these down and get the
fixes
merged, something new will pop up. For example, the last time it got
almost this
close a nova patch got merged that broke almost all the aggregates
tests in
tempest when running in parallel, which prevented any run from
succeeding.
Thanks,
Matt Treinish
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev