Re: [openstack-dev] [heat] Convergence: Detecting and handling worker failures

ZengYingzhe Wed, 30 Sep 2015 02:54:44 -0700

Hi Anant,
For the second option, if the leader engine fails, how is a new leader
election triggered?
Best Regards,
Yingzhe Zeng


> To: openstack-dev@lists.openstack.org
> From: anant.pa...@hpe.com
> Date: Wed, 30 Sep 2015 12:40:52 +0530
> Subject: [openstack-dev] [heat] Convergence: Detecting and handling worker 
> failures
> 
> Hi,
> 
> One of the remaining items in convergence is detecting and handling engine
> (engine worker) failures. Here are my thoughts.
> 
> Background: Since the work is distributed among heat engines, heat needs
> some means to detect an engine failure, pick up the tasks from the failed
> engine, and either re-distribute them or run them again.
> 
> One simple way is to poll the DB to detect liveness by checking the
> table populated by heat-manage. Each engine records its presence
> periodically by updating its timestamp to the current time. All engines
> will have a periodic task that checks the DB for the liveness of the other
> engines. Each engine checks the timestamps updated by the other engines,
> and if it finds one that is older than the timestamp update period, it
> detects a failure. When this happens, the remaining engines, as and when
> they detect the failure, will try to acquire the lock for the in-progress
> resources that were handled by the engine that died. They will then run
> the tasks to completion.
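
A minimal sketch of the polling approach described above, using plain
dictionaries in place of the DB tables. The function names, the `grace`
multiplier, and the dictionary layout are assumptions made for illustration,
not heat code.

```python
from datetime import datetime, timedelta


def find_dead_engines(heartbeats, now, report_interval, grace=2):
    """Return ids of engines whose last heartbeat is too old.

    heartbeats maps engine id -> datetime of its last timestamp update.
    grace is a hypothetical multiplier that tolerates one late update.
    """
    cutoff = now - timedelta(seconds=report_interval * grace)
    return sorted(eid for eid, ts in heartbeats.items() if ts < cutoff)


def take_over(resource_locks, dead_engine, new_engine):
    """Reassign locks held by dead_engine to new_engine.

    resource_locks maps resource id -> id of the engine holding its lock.
    A real implementation would do this as an atomic DB update so that
    only one surviving engine wins each lock.
    """
    taken = []
    for res_id, owner in resource_locks.items():
        if owner == dead_engine:
            resource_locks[res_id] = new_engine
            taken.append(res_id)
    return sorted(taken)


if __name__ == "__main__":
    now = datetime(2015, 9, 30, 12, 0, 0)
    heartbeats = {
        "engine-a": now - timedelta(seconds=10),
        "engine-b": now - timedelta(seconds=200),
    }
    locks = {"res-1": "engine-b", "res-2": "engine-a"}
    for dead in find_dead_engines(heartbeats, now, report_interval=60):
        print(dead, take_over(locks, dead, "engine-a"))
```

Every surviving engine would run the same check, so the lock takeover has to
be safe when several engines attempt it at once.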
> 
> Another option is to use a coordination library like the community-owned
> tooz (http://docs.openstack.org/developer/tooz/), which supports
> distributed locking and leader election. We would use it to elect a leader
> among the heat engines; the leader would be responsible for running the
> periodic tasks that check the state of each engine and for distributing
> tasks to other engines when one fails. The advantage, IMHO, will be
> simplified heat code. Also, we can move the timeout task to the leader,
> which will run the timeout for all the stacks and send a signal to abort
> the operation when a timeout happens. The downside: an external resource
> such as ZooKeeper or memcached is needed for leader election.
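
A rough sketch of how an engine might take part in a tooz election. The
group name, the `EngineState` class, and the loop parameters are
hypothetical; the coordinator calls are the ones I believe tooz provides,
so please check them against the driver you pick. My understanding, to be
verified per driver, is that once a failed leader stops heartbeating, one of
the remaining members calling run_watchers() is elected in its place.

```python
import time

GROUP = b"heat-engines"  # hypothetical group name


class EngineState:
    """Tracks whether this engine currently believes it is the leader."""

    def __init__(self):
        self.is_leader = False

    def on_elected(self, event):
        # Callback fired by tooz when this member wins the election.
        self.is_leader = True


def run_engine(backend_url, member_id, state, iterations=10, interval=1.0):
    # Imported lazily so the sketch loads without tooz installed.
    from tooz import coordination

    coordinator = coordination.get_coordinator(backend_url, member_id)
    coordinator.start()
    try:
        try:
            coordinator.create_group(GROUP).get()
        except coordination.GroupAlreadyExist:
            pass  # another engine created the group first
        coordinator.join_group(GROUP).get()
        coordinator.watch_elected_as_leader(GROUP, state.on_elected)
        for _ in range(iterations):
            coordinator.heartbeat()     # keep this member alive
            coordinator.run_watchers()  # take part in the election
            if state.is_leader:
                pass  # leader-only work (liveness checks, timeouts) goes here
            time.sleep(interval)
    finally:
        coordinator.stop()
```

Each engine would call something like
run_engine("zookeeper://127.0.0.1:2181", b"engine-a", EngineState()),
where the URL and member id are placeholders.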
> 
> In the long run, IMO, using a library like tooz will be useful for heat.
> A lot of the boilerplate needed for locking and running centralized tasks
> (such as timeout) will not be needed in heat. Given that we are moving
> towards distribution of tasks and horizontal scaling is preferred, it
> will be advantageous to use such a library.
> 
> Please share your thoughts.
> 
> - Anant
> 
> 
> 
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
