On 24 August 2016 at 19:11, James Slagle <james.sla...@gmail.com> wrote:
> The latest recurring problem that is failing a lot of the nonha ssl
> jobs in tripleo-ci is:
>
> https://bugs.launchpad.net/tripleo/+bug/1616144
> tripleo-ci: nonha jobs failing with Unable to establish connection to
> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
>
> This error happens while polling for events from the overcloud stack
> by tripleoclient.
>
> I can reproduce this error very easily locally by deploying with an
> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
> and then I hit this error almost every time.
>
> The stack keeps deploying but the client has died, so the job fails.
> My investigation so far has only pointed out that it's the swap
> allocation that is delaying things enough to cause the failure.
>
> We do not see this error in the ha job even though it deploys more
> nodes. As of now, my only suspect is that it's the overhead of the
> initial SSL connections causing the error.
>
> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
> although much more swap is used due to the increased number of default
> workers for each API service.
>
> However, I suggest we just raise the undercloud specs in our jobs to
> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
> default specs used by infra in all of their devstack single and
> multinode jobs spawned on all their other cloud providers. Our own
> multinode job for the undercloud/overcloud and undercloud only job are
> running on instances of these sizes.
>
> Yes, this is just sidestepping the problem by throwing more resources
> at it. The reality is that we do not prioritize working on optimizing
> for speed/performance/resources. We prioritize feature work that
> indirectly (or maybe it's directly?) makes everything slower,
> especially at this point in the development cycle.

Yup, I couldn't agree with this more; it is exactly what happens. And as
long as everybody remains driven by particular features, it's going to be
the case. Ideally we'd have somebody whose driving force is simply to
take what we have at any particular point in time, profile certain pain
points, and make improvements where they can be made, tune things, etc.

>
> We should therefore expect to have to continue to provide more and
> more resources to our CI jobs until we prioritize optimizing them to
> run with less.
>
> Let me know if there is any disagreement on making these changes. If
> there isn't, I'll apply them in the next day or so. If there are any
> other ideas on how to address this particular bug for some immediate
> short term relief, please let me know.

Not disagreeing, but just a reminder to double check quotas and
over-commit ratios (for vCPU) so things will still fit where they should
be. Also, it's worth noting that the act of increasing the number of
vCPUs available to the undercloud will not only increase the memory
requirements of the undercloud (we know this happens), but the extra
services, even if unused, may cause additional CPU usage on the host, so
this is worth monitoring.

>
> --
> -- James Slagle
> --
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev