TLDR: We have the capacity to do this. One scale job can be absorbed
into our existing test infrastructure with minimal impact.
On 04/19/2017 07:50 AM, Flavio Percoco wrote:
On 18/04/17 14:28 -0400, Emilien Macchi wrote:
On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
<[email protected]> wrote:
Because CI jobs tend to max out about 5 nodes there's a whole class of
minor bugs that make it into releases.
What happens is that they never show up in small clouds, then when
they do show up in larger testing clouds the people deploying those
simply work around the issue and get on with what they were supposed to
be testing. These workarounds do get documented/BZ'd but since they
don't block anyone and only show up in large environments they become
hard for developers to fix.
So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.
These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented upstream.
I'd like to propose giving these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday or some other time where there's
idle execution capability to exploit. The goal being to make these
sorts of issues more visible and hopefully get better at fixing them.
Wait no, I know some folks at 3am on a Saturday night who use TripleO
CI (ok that was a joke).
Jokes aside, it really depends on the TZ and when you schedule it. 3:00
UTC on a Sunday is 13:00 that same Sunday in Sydney :) Saturdays might
work better, but remember that some countries work on Sundays.
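For reference, the UTC-to-Sydney arithmetic above can be double-checked with Python's standard zoneinfo module (the date below is illustrative, picked from around the time of this thread):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# 03:00 UTC on Sunday 2017-04-23
utc_time = datetime(2017, 4, 23, 3, 0, tzinfo=ZoneInfo("UTC"))
sydney_time = utc_time.astimezone(ZoneInfo("Australia/Sydney"))

# Sydney is UTC+10 (AEST) in late April, so this is still Sunday afternoon
print(sydney_time.strftime("%A %H:%M"))  # Sunday 13:00
```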
With the exception of the brief period where the ovb jobs were running
at full capacity 24 hours a day, there has always been a lull in
activity during early morning UTC. Yes, there are people working during
that time, but generally far fewer and the load on TripleO CI is at its
lowest point. Honestly I'd be okay running this scale job every night,
not just on the weekend. A week of changes is a lot to sift through if
a scaling issue creeps into one of the many, many projects that affect
such things in TripleO.
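For illustration, a nightly periodic trigger along these lines could be expressed as a Jenkins Job Builder fragment. This is a hypothetical sketch only: the job name, node count, and deployment script are made up, not the actual TripleO CI configuration.

```yaml
# Hypothetical JJB sketch of a nightly OVB scale job.
- job:
    name: periodic-tripleo-ci-ovb-scale
    triggers:
      # 'H' spreads the start minute within the hour; fires nightly ~03:00 UTC
      - timed: 'H 3 * * *'
    builders:
      - shell: |
          # e.g. deploy 1 controller + 30 compute OVB nodes
          # (run-scale-deployment.sh is a placeholder script name)
          ./run-scale-deployment.sh
```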
Also, I should note that we're not currently being constrained by
absolute hardware limits in rh1. The reason I haven't scaled our
concurrent jobs higher is that there is already performance degradation
when we have a full 70 jobs running at once. This type of scale job
would require a lot of theoretical resources, but those 30 compute nodes
are mostly going to be sitting there idle while the controller(s) get
deployed, so in reality their impact on the infrastructure is going to
be less than if we just added more concurrent jobs that used 30
additional nodes. And we do have the memory/cpu/disk to spare in rh1 to
spin up more VMs.
We could also take advantage of heterogeneous OVB environments so that
the compute nodes are only 3 GB VMs instead of the 8 GB they are now.
That would further reduce the impact of this sort of job. It would
require some tweaks to how the testenvs are created, but that shouldn't
be a problem.
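A heterogeneous setup of that sort might be sketched as an OVB role environment file like the following. This is a hypothetical fragment: the parameter names follow the style of OVB's env.yaml, but the flavor name and exact keys should be verified against the actual templates.

```yaml
# Hypothetical OVB role file giving compute testenv nodes a smaller
# flavor than the controllers. Names here are illustrative.
parameter_defaults:
  # 30 compute nodes on a 3 GB flavor instead of the default 8 GB one
  node_count: 30
  baremetal_flavor: baremetal-3gb
```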
To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues and I think we should try to
come up with a solution.
Yes, this proposal is really cool. An alternative would be to run this
periodic scenario outside TripleO CI and send the results via email.
But that is something we need to discuss with the RDO Cloud people, to
see whether we would have the resources to run it on a weekly basis.
Thanks for bringing this up, it's crucial for us to have this kind of
feedback. Now let's take action.
+1
Flavio
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev