Hi all,

There are cases where a communication failure between the different nova services can leave the system in a bad state.
For example, when "shelving" a VM, nova-api sets the VM's task_state to "shelving" and sends an RPC to nova-compute, which shelves the VM and resets its task_state in the DB. But if for some reason nova-compute didn't get the message (e.g. the RPC service was down, there's a bug in the RPC service, nova-compute was down, or there was a temporary network malfunction), the VM is now stuck in "shelving", and the user can't perform any operation on it. The same problem applies to other scenarios in the system that involve communication between different services.

From nova-api's point of view, all it does is send a message over RPC; it neither checks that the message was received nor waits for a reply or an acknowledgement from the receiver.

Of course, to work around this, a user can run "reset-state" on the VM and try the action again, but this is error-prone and doesn't scale.

Possible solutions might be:

- nova-api should receive an acknowledgement from nova-compute. It is unclear to me why it uses a non-reply mechanism today; probably to free the worker as fast as it can.
- Change the task_state mechanism so that this kind of stuck state cannot stay in the DB. nova-compute could be the one that writes the task_state to the DB; that alone is not enough, of course, but maybe there's another way?
- nova-api could start a timer for the action to complete. If the shelving operation hasn't completed within X seconds, nova-api cleans up the state by itself and either rolls back or tries again.

What do you think about the problem and the solutions?

Thanks,
Shoham Peller
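
P.S. To make the timer idea a bit more concrete, here is a rough Python sketch. It is not nova code: Instance, begin_task, expire_stuck_tasks and the task_started_at field are all made-up names for illustration, a plain list stands in for the DB, and the 60-second timeout is arbitrary. It only shows the cleanup half (reset the state); a retry would need more thought.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    # Hypothetical stand-in for a row in the instances table.
    uuid: str
    task_state: Optional[str] = None
    # Assumed extra field: when the current task_state was written (seconds).
    task_started_at: Optional[float] = None


def begin_task(inst: Instance, task_state: str, now: float) -> None:
    """What the API side would do just before sending the one-way RPC."""
    inst.task_state = task_state
    inst.task_started_at = now


def expire_stuck_tasks(instances: List[Instance], now: float,
                       timeout: float) -> List[str]:
    """Clear task_state on instances whose task has been pending too long.

    Returns the uuids that were reset, so a caller could log or retry them.
    """
    expired = []
    for inst in instances:
        if inst.task_state is None or inst.task_started_at is None:
            continue  # nothing in flight for this instance
        if now - inst.task_started_at > timeout:
            inst.task_state = None
            inst.task_started_at = None
            expired.append(inst.uuid)
    return expired


vm_a = Instance("vm-a")
vm_b = Instance("vm-b")
begin_task(vm_a, "shelving", now=0.0)   # message lost, nobody resets it
begin_task(vm_b, "shelving", now=50.0)  # still within the timeout
expired = expire_stuck_tasks([vm_a, vm_b], now=70.0, timeout=60.0)
print(expired)  # ['vm-a']
```

The obvious weakness is picking X: too short and a slow but healthy shelve gets reset underneath nova-compute, too long and the user still waits. That is why an acknowledgement from nova-compute may be the cleaner fix.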