On Wed, Aug 24, 2016 at 02:11:32PM -0400, James Slagle wrote:
> The latest recurring problem that is failing a lot of the nonha ssl
> jobs in tripleo-ci is:
> 
> https://bugs.launchpad.net/tripleo/+bug/1616144
> tripleo-ci: nonha jobs failing with Unable to establish connection to
> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
> 
> This error happens while polling for events from the overcloud stack
> by tripleoclient.
> 
> I can reproduce this error very easily locally by deploying with an
> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
> and then I hit this error almost every time.
> 
> The stack keeps deploying but the client has died, so the job fails.
> My investigation so far has only pointed out that it's the swap
> allocation that is delaying things enough to cause the failure.
> 
> We do not see this error in the ha job even though it deploys more
> nodes. As of now, my only suspect is that it's the overhead of the
> initial SSL connections causing the error.
> 
> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
> although much more swap is used due to the increased number of default
> workers for each API service.
> 
> However, I suggest we just raise the undercloud specs in our jobs to
> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
> default specs used by infra in all of their devstack single and
> multinode jobs spawned on all their other cloud providers. Our own
> multinode job for the undercloud/overcloud and undercloud only job are
> running on instances of these sizes.
> 
Close, our current flavors are 8vCPU, 8GB RAM, 80GB HDD. I'd recommend doing
that for the undercloud just to be consistent.

[1] http://docs.openstack.org/infra/system-config/contribute-cloud.html

> Yes, this is just sidestepping the problem by throwing more resources
> at it. The reality is that we do not prioritize working on optimizing
> for speed/performance/resources. We prioritize feature work that
> indirectly (or maybe it's directly?) makes everything slower,
> especially at this point in the development cycle.
> 
> We should therefore expect to have to continue to provide more and
> more resources to our CI jobs until we prioritize optimizing them to
> run with less.
> 
I actually believe these problem highlights how large tripleo-ci has grown, and
in need of a refactor. While we won't solve this problem today, I do think
tripleo-ci is to monolithic today. I believe there is some discussion on
breaking jobs into different scenarios, but I haven't had a chance to read up on
that.

I'm hoping in Barcelona we can have a topic on CI pipelines and how better to
optimize our runs.

> Let me know if there is any disagreement on making these changes. If
> there isn't, I'll apply them in the next day or so. If there are any
> other ideas on how to address this particular bug for some immediate
> short term relief, please let me know.
> 
> -- 
> -- James Slagle
> --
> 
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to