On 29/05/14 19:52, Clint Byrum wrote:
Multiple Stacks
===============

We could break the stack up between controllers and compute nodes. The
controller stack will be less likely to fail because it will probably be
3 nodes for a reasonably sized cloud. The compute nodes would then live in their
own stack of (n) nodes. We could further break that up into chunks of
compute nodes, which would further mitigate failure. If a small chunk of
compute nodes fails, we can just migrate off of them. One challenge here
is that compute nodes need to know about all of the other compute nodes
to support live migration. We would have to do a second stack update after
creation to share data between all of these stacks to make this work.

Pros: * Exists today

Cons: * Complicates host awareness
      * Still vulnerable to stack failure (just reduces probability and
        impact).

Separating the controllers and compute nodes is something you should do anyway (although moving to autoscaling, which will be even better when it is possible, would actually have the same effect). Splitting the compute nodes into smaller groups would certainly reduce the cost of failure. If we were to use an OS::Heat::Stack resource that calls python-heatclient instead of creating a nested stack in the same engine, then these child stacks would get split across a multi-engine deployment automagically. There's a possible implementation already at https://review.openstack.org/53313
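
To make the chunking idea concrete, here is a rough Python sketch of splitting the compute nodes into groups and working out which hosts each group still has to be told about in the second stack update. The helper names (chunk_nodes, peers_for_chunk) and the node naming scheme are invented for illustration; none of this is Heat or python-heatclient API.

```python
def chunk_nodes(node_names, chunk_size):
    """Split node names into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [node_names[i:i + chunk_size]
            for i in range(0, len(node_names), chunk_size)]


def peers_for_chunk(chunks, index):
    """Hosts that live outside chunk `index`.

    These are the hosts the stack for that chunk cannot know about at
    create time, so they would have to be passed in by a later update.
    """
    return [name
            for i, chunk in enumerate(chunks) if i != index
            for name in chunk]


# Hypothetical naming scheme: compute-0 .. compute-9, four nodes per stack.
nodes = ["compute-%d" % n for n in range(10)]
chunks = chunk_nodes(nodes, 4)
last_chunk_peers = peers_for_chunk(chunks, len(chunks) - 1)
```

Each entry in chunks would become its own stack, and the result of peers_for_chunk is the data the second update has to share so that live migration can work across stacks.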

update-failure-recovery
=======================

This is a blueprint I believe Zane is working on to land in Juno. It will
allow us to retry a failed create or update action. Combined with the
separate controller/compute node strategy, this may be our best option,
but it is unclear whether that code will be available soon or not. The
chunking is definitely required, because with 500 compute nodes, if
node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
cancelled, which makes the impact of a transient failure quite extreme.
Also without chunking, we'll suffer from some of the performance
problems we've seen where a single engine process will have to do all of
the work to bring up a stack.

Pros: * Uses blessed strategy

Cons: * Implementation is not complete
      * Still suffers from heavy impact of failure
      * Requires chunking to be feasible

I've already started working on this and I'm expecting to have this ready some time between the j-1 and j-2 milestones.

I think these two strategies combined could probably get you a long way in the short term, though obviously they are not a replacement for the convergence strategy in the long term.
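
For a back-of-the-envelope feel for why chunking matters even with retry, the sketch below estimates the chance that a stack hits at least one node failure. It assumes node failures are independent, and the per-node failure rate is a made-up figure purely for illustration, not a measurement.

```python
def stack_failure_probability(nodes, per_node_failure):
    """Chance that at least one of `nodes` fails, assuming independence."""
    if not 0.0 <= per_node_failure <= 1.0:
        raise ValueError("per_node_failure must be between 0 and 1")
    return 1.0 - (1.0 - per_node_failure) ** nodes


# Hypothetical figure: each node has a 0.2% chance of a transient failure.
PER_NODE_FAILURE = 0.002

single_stack = stack_failure_probability(500, PER_NODE_FAILURE)
one_chunk = stack_failure_probability(50, PER_NODE_FAILURE)
```

Under those assumptions a single 500-node stack fails somewhere roughly 63% of the time, while any one 50-node chunk fails a little under 10% of the time, and a failure there only cancels work within that chunk.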


BTW You missed off another strategy that we have discussed in the past, and which I think Steve Baker might(?) be working on: retrying failed calls at the client level.
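
I don't know what form that work would take, but as a generic illustration of client-level retry, a minimal sketch might look like the following. The function name retry_call, its parameters, and the flaky_create stand-in are all hypothetical; this is plain Python, not anything in Heat or the OpenStack clients.

```python
import time


def retry_call(func, attempts=3, delay=1.0, backoff=2.0,
               retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with growing delays.

    The last failure is re-raised once `attempts` calls have been made.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff


# Stand-in for a client call that fails transiently twice, then succeeds.
calls = {"count": 0}


def flaky_create():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("transient failure")
    return "CREATE_COMPLETE"


waits = []  # record the delays instead of really sleeping
result = retry_call(flaky_create, attempts=5, delay=0.5,
                    retry_on=(IOError,), sleep=waits.append)
```

Restricting retry_on to errors known to be transient matters here, since blindly retrying a call that failed for a permanent reason just delays the real failure.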

cheers,
Zane.

