On 30-Sep-15 14:59, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2015-09-30 00:10:52 -0700:
>> Hi,
>>
>> One of the remaining items in convergence is detecting and handling
>> engine (the engine worker) failures. Here are my thoughts.
>>
>> Background: Since the work is distributed among the heat engines, heat
>> needs some means to detect a failure, pick up the tasks from the
>> failed engine, and re-distribute them or run them again.
>>
>> One simple way is to poll the DB to detect liveness, using the table
>> populated by heat-manage. Each engine records its presence
>> periodically by updating the current timestamp. All the engines have
>> a periodic task that checks the DB for the liveness of the other
>> engines: each engine looks at the timestamps updated by its peers,
>> and if it finds one older than the update period, it has detected a
>> failure. When this happens, the surviving engines, as and when they
>> detect the failure, try to acquire the locks for the in-progress
>> resources that were handled by the engine that died, and then run
>> those tasks to completion.
>>
>> Another option is to use a coordination library like the
>> community-owned tooz (http://docs.openstack.org/developer/tooz/),
>> which supports distributed locking and leader election. We would use
>> it to elect a leader among the heat engines, and the leader would be
>> responsible for running the periodic task that checks the state of
>> each engine and for redistributing the tasks to the other engines
>> when one fails. The advantage, IMHO, would be simpler heat code. We
>> could also move the timeout task to the leader, which would run the
>> timeouts for all the stacks and send a signal to abort the operation
>> when a timeout happens. The downside: an external resource like
>> Zookeeper/memcached etc. is needed for leader election.
>>
>
> It's becoming increasingly clear that OpenStack services in general need
> to look at distributed locking primitives. There's a whole spec for that
> right now:
>
> https://review.openstack.org/#/c/209661/
>
> I suggest joining that conversation, and embracing a DLM as the way to
> do this.
>
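To make the first option a bit more concrete, here is a rough sketch of
the liveness check (illustrative only: the engine_heartbeat table and its
columns are made up, and real code would go through heat's DB API):

    # Find engines whose heartbeat timestamp has gone stale. Each
    # surviving engine would then try to take over the in-progress
    # resources locked by the dead engines.
    import datetime

    from sqlalchemy import text

    HEARTBEAT_INTERVAL = 30   # seconds between timestamp updates
    MISSED_BEATS = 3          # missed updates before declaring an engine dead

    def find_dead_engines(session, now=None):
        """Return ids of engines whose timestamp is older than the threshold."""
        now = now or datetime.datetime.utcnow()
        cutoff = now - datetime.timedelta(
            seconds=HEARTBEAT_INTERVAL * MISSED_BEATS)
        rows = session.execute(
            text("SELECT engine_id FROM engine_heartbeat"
                 " WHERE updated_at < :cutoff"),
            {"cutoff": cutoff})
        return [row.engine_id for row in rows]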
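And a similarly rough sketch of the tooz-based leader election, following
the pattern from the tooz documentation (the backend URL, group name and
member id are made up; error handling is omitted):

    # Illustrative only: elect one heat engine as leader; the leader
    # would run the liveness/timeout periodic tasks described above.
    import time
    import uuid

    from tooz import coordination

    coordinator = coordination.get_coordinator(
        'zookeeper://127.0.0.1:2181', uuid.uuid4().hex.encode())
    coordinator.start()

    group = b'heat-engines'
    try:
        coordinator.create_group(group).get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(group).get()

    def on_elected_leader(event):
        # Run the periodic engine-liveness checks and stack timeouts here,
        # and redistribute work when an engine is found dead.
        pass

    coordinator.watch_elected_as_leader(group, on_elected_leader)

    while True:
        coordinator.heartbeat()
        coordinator.run_watchers()
        time.sleep(1)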
Thanks Clint for pointing to this.

> Also, the leader election should be per-stack, and the leader selection
> should be heavily weighted based on a consistent hash algorithm so that
> you get even distribution of stacks to workers. You can look at how
> Ironic breaks up all of the nodes that way. They're using a similar lock
> to the one Heat uses now, so the two projects can collaborate nicely on
> a real solution.
>

Within each stack, the resources are distributed among the heat engines,
so the work is already evenly distributed at the resource level. I need
to investigate the per-stack, hash-ring based approach further; a rough
sketch of the idea is below. Thoughts are welcome.
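For reference, here is a minimal consistent hash ring of the kind Clint
describes for mapping stacks to engines (purely illustrative; Ironic's
real hash ring adds partition/replica handling and rebalancing on
membership changes):

    # Map each stack id onto one of the live engines via a hash ring, so
    # that adding or removing an engine only moves a fraction of the stacks.
    import bisect
    import hashlib

    class HashRing(object):
        def __init__(self, engines, replicas=64):
            self._ring = {}
            for engine in engines:
                for r in range(replicas):
                    self._ring[self._hash('%s-%d' % (engine, r))] = engine
            self._keys = sorted(self._ring)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

        def get_engine(self, stack_id):
            """Return the engine responsible for the given stack."""
            idx = bisect.bisect(self._keys, self._hash(stack_id)) % len(self._keys)
            return self._ring[self._keys[idx]]

    ring = HashRing(['engine-1', 'engine-2', 'engine-3'])
    print(ring.get_engine('stack-uuid-1234'))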