On 30-Sep-15 14:59, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2015-09-30 00:10:52 -0700:
>> Hi,
>>
>> One of the remaining items in convergence is detecting and handling
>> engine (the engine worker) failures. Here are my thoughts.
>>
>> Background: Since the work is distributed among the heat engines, heat
>> needs some means to detect a failure, pick up the tasks from the
>> failed engine, and re-distribute them or run them again.
>>
>> One simple way is to poll the DB to detect liveness, using the table
>> populated by heat-manage. Each engine records its presence
>> periodically by updating the current timestamp. All the engines have
>> a periodic task that checks the DB for the liveness of the other
>> engines: each engine looks at the timestamps updated by its peers,
>> and if it finds one older than the update period, it has detected a
>> failure. When this happens, the surviving engines, as and when they
>> detect the failure, try to acquire the locks for the in-progress
>> resources that were handled by the engine that died, and then run
>> those tasks to completion.
>>
>> Another option is to use a coordination library like the
>> community-owned tooz (http://docs.openstack.org/developer/tooz/),
>> which supports distributed locking and leader election. We would use
>> it to elect a leader among the heat engines, and the leader would be
>> responsible for running the periodic task that checks the state of
>> each engine and for redistributing the tasks to the other engines
>> when one fails. The advantage, IMHO, would be simpler heat code. We
>> could also move the timeout task to the leader, which would run the
>> timeouts for all the stacks and send a signal to abort the operation
>> when a timeout happens. The downside: an external resource like
>> Zookeeper/memcached etc. is needed for leader election.
>>
>
> It's becoming increasingly clear that OpenStack services in general need
> to look at distributed locking primitives. There's a whole spec for that
> right now:
>
> https://review.openstack.org/#/c/209661/
>
> I suggest joining that conversation, and embracing a DLM as the way to
> do this.
>
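To make the first option a bit more concrete, here is a rough sketch of
the liveness check (illustrative only: the engine_heartbeat table and its
columns are made up, and real code would go through heat's DB API):

    # Find engines whose heartbeat timestamp has gone stale. Each
    # surviving engine would then try to take over the in-progress
    # resources locked by the dead engines.
    import datetime

    from sqlalchemy import text

    HEARTBEAT_INTERVAL = 30   # seconds between timestamp updates
    MISSED_BEATS = 3          # missed updates before declaring an engine dead

    def find_dead_engines(session, now=None):
        """Return ids of engines whose timestamp is older than the threshold."""
        now = now or datetime.datetime.utcnow()
        cutoff = now - datetime.timedelta(
            seconds=HEARTBEAT_INTERVAL * MISSED_BEATS)
        rows = session.execute(
            text("SELECT engine_id FROM engine_heartbeat"
                 " WHERE updated_at < :cutoff"),
            {"cutoff": cutoff})
        return [row.engine_id for row in rows]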
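And a similarly rough sketch of the tooz-based leader election, following
the pattern from the tooz documentation (the backend URL, group name and
member id are made up; error handling is omitted):

    # Illustrative only: elect one heat engine as leader; the leader
    # would run the liveness/timeout periodic tasks described above.
    import time
    import uuid

    from tooz import coordination

    coordinator = coordination.get_coordinator(
        'zookeeper://127.0.0.1:2181', uuid.uuid4().hex.encode())
    coordinator.start()

    group = b'heat-engines'
    try:
        coordinator.create_group(group).get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(group).get()

    def on_elected_leader(event):
        # Run the periodic engine-liveness checks and stack timeouts here,
        # and redistribute work when an engine is found dead.
        pass

    coordinator.watch_elected_as_leader(group, on_elected_leader)

    while True:
        coordinator.heartbeat()
        coordinator.run_watchers()
        time.sleep(1)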
Thanks Clint for pointing to this.

> Also, the leader election should be per-stack, and the leader selection
> should be heavily weighted based on a consistent hash algorithm so that
> you get even distribution of stacks to workers. You can look at how
> Ironic breaks up all of the nodes that way. They're using a similar lock
> to the one Heat uses now, so the two projects can collaborate nicely on
> a real solution.
>

Within each stack, the resources are distributed among the heat engines,
so the work is already evenly distributed at the resource level. I need
to investigate the per-stack, hash-ring based approach further; a rough
sketch of the idea is below. Thoughts are welcome.
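For reference, here is a minimal consistent hash ring of the kind Clint
describes for mapping stacks to engines (purely illustrative; Ironic's
real hash ring adds partition/replica handling and rebalancing on
membership changes):

    # Map each stack id onto one of the live engines via a hash ring, so
    # that adding or removing an engine only moves a fraction of the stacks.
    import bisect
    import hashlib

    class HashRing(object):
        def __init__(self, engines, replicas=64):
            self._ring = {}
            for engine in engines:
                for r in range(replicas):
                    self._ring[self._hash('%s-%d' % (engine, r))] = engine
            self._keys = sorted(self._ring)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

        def get_engine(self, stack_id):
            """Return the engine responsible for the given stack."""
            idx = bisect.bisect(self._keys, self._hash(stack_id)) % len(self._keys)
            return self._ring[self._keys[idx]]

    ring = HashRing(['engine-1', 'engine-2', 'engine-3'])
    print(ring.get_engine('stack-uuid-1234'))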