On Wed, Aug 24, 2016 at 02:11:32PM -0400, James Slagle wrote:
> The latest recurring problem that is failing a lot of the nonha ssl
> jobs in tripleo-ci is:
>
> https://bugs.launchpad.net/tripleo/+bug/1616144
> tripleo-ci: nonha jobs failing with Unable to establish connection to
> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
>
> This error happens while polling for events from the overcloud stack
> by tripleoclient.
>
> I can reproduce this error very easily locally by deploying with an
> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
> and then I hit this error almost every time.
>
> The stack keeps deploying but the client has died, so the job fails.
> My investigation so far has only pointed out that it's the swap
> allocation that is delaying things enough to cause the failure.
>
> We do not see this error in the ha job even though it deploys more
> nodes. As of now, my only suspect is that it's the overhead of the
> initial SSL connections causing the error.
>
> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
> although much more swap is used due to the increased number of default
> workers for each API service.
>
> However, I suggest we just raise the undercloud specs in our jobs to
> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
> default specs used by infra in all of their devstack single and
> multinode jobs spawned on all their other cloud providers. Our own
> multinode job for the undercloud/overcloud and undercloud only job are
> running on instances of these sizes.
>

Close: our current flavors are 8 vCPU, 8GB RAM, 80GB HDD [1]. I'd recommend doing that for the undercloud just to be consistent.

[1] http://docs.openstack.org/infra/system-config/contribute-cloud.html
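If anyone wants to double-check locally whether the bigger flavor actually keeps the undercloud out of swap, a rough sketch like the following (plain stdlib Python reading /proc/meminfo, nothing tripleo-specific; names and intervals are just illustrative) can be left running during a deploy:

# Illustrative sketch only: report available memory and swap usage
# on the undercloud every 30 seconds while a deploy is running.
import time

FIELDS = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")

def meminfo_mb():
    """Return the selected /proc/meminfo fields, converted from kB to MB."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0]) // 1024
    return values

if __name__ == "__main__":
    while True:
        v = meminfo_mb()
        print("MemAvailable=%dMB SwapUsed=%dMB"
              % (v.get("MemAvailable", 0),
                 v.get("SwapTotal", 0) - v.get("SwapFree", 0)))
        time.sleep(30)

Watching MemAvailable drop and swap climb right before the client dies would also help confirm the theory that it's the swap-induced slowdown, rather than the SSL handshakes themselves, that pushes the connection over the edge.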
> Yes, this is just sidestepping the problem by throwing more resources
> at it. The reality is that we do not prioritize working on optimizing
> for speed/performance/resources. We prioritize feature work that
> indirectly (or maybe it's directly?) makes everything slower,
> especially at this point in the development cycle.
>
> We should therefore expect to have to continue to provide more and
> more resources to our CI jobs until we prioritize optimizing them to
> run with less.

I actually believe this problem highlights how large tripleo-ci has grown, and that it is in need of a refactor. While we won't solve this problem today, I do think tripleo-ci is too monolithic. I believe there has been some discussion about breaking jobs into different scenarios, but I haven't had a chance to read up on that. I'm hoping in Barcelona we can have a topic on CI pipelines and how to better optimize our runs.

> Let me know if there is any disagreement on making these changes. If
> there isn't, I'll apply them in the next day or so. If there are any
> other ideas on how to address this particular bug for some immediate
> short term relief, please let me know.
>
> --
> -- James Slagle
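On the "other ideas for short term relief" point: one option (purely a sketch, not how tripleoclient behaves today) would be to make the event polling tolerate a few transient connection failures instead of dying while the stack keeps deploying. Roughly, against the Heat events endpoint, with the tenant/stack IDs, token, and CA path left as placeholders:

# Illustrative only: poll Heat stack events, retrying transient
# connection failures instead of letting the client die while the
# stack is still deploying. Endpoint, token and CA path are placeholders.
import time

import requests

STACK_URL = "https://192.0.2.2:13004/v1/<tenant_id>/stacks/overcloud/<stack_id>"
TOKEN = "<keystone token>"                  # placeholder; issued by keystone
CA_BUNDLE = "<path to undercloud CA cert>"  # placeholder

def poll_events(max_retries=5, delay=10):
    failures = 0
    # A real loop would stop once the stack reaches a terminal state.
    while True:
        try:
            resp = requests.get(STACK_URL + "/events",
                                headers={"X-Auth-Token": TOKEN},
                                verify=CA_BUNDLE, timeout=30)
            resp.raise_for_status()
            failures = 0  # reset the counter after any successful poll
            for event in resp.json().get("events", []):
                print(event.get("resource_name"), event.get("resource_status"))
        except (requests.ConnectionError, requests.Timeout) as exc:
            failures += 1
            if failures > max_retries:
                raise
            print("transient failure (%s), retrying in %ss" % (exc, delay))
        time.sleep(delay)

if __name__ == "__main__":
    poll_events()

Whether that kind of retry belongs in tripleoclient itself or in a wrapper on the CI side is debatable, but it would at least keep a single dropped connection from failing an otherwise healthy deploy.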