Hi all,

In some cases, a communication failure between the different nova services
can leave the system in a bad state.

For example, when "shelving" a VM, nova-api sets the VM's task_state to
"shelving" and sends an RPC to nova-compute, which shelves the VM and resets
its task_state in the DB.
But if, for some reason, nova-compute doesn't get the message (e.g. the RPC
service was down, there's a bug in the RPC service, nova-compute was down, or
there was a temporary network malfunction), the VM stays stuck in
"shelving", and the user can't perform any operation on it.
The same problem applies to a couple of other scenarios in the system that
involve communication between different services.

From nova-api's point of view, all it does is send a message through RPC;
it neither checks that the message was received nor waits for a reply or an
acknowledgement from the receiver.
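
To make the failure mode concrete, here is a minimal, self-contained Python
sketch. It does not use nova or any real RPC library: FakeBus, make_compute,
api_shelve and the dict-based db are hypothetical stand-ins for the RPC
layer, nova-compute, nova-api and the database. It only shows that a
fire-and-forget send leaves task_state set when the message is dropped.

```python
# Toy model only: FakeBus, make_compute, api_shelve and db are hypothetical
# stand-ins, not nova or real RPC code.

class FakeBus:
    """Delivers messages to a handler, or silently drops them."""

    def __init__(self, handler, deliver=True):
        self.handler = handler
        self.deliver = deliver

    def cast(self, msg):
        # Fire-and-forget: the sender learns nothing about delivery.
        if self.deliver:
            self.handler(msg)


def make_compute(db):
    def handle(msg):
        # "nova-compute" shelves the VM and clears the task_state.
        vm = db[msg["vm_id"]]
        vm["vm_state"] = "shelved"
        vm["task_state"] = None
    return handle


def api_shelve(db, bus, vm_id):
    # "nova-api" marks the VM as busy, sends the request and returns.
    db[vm_id]["task_state"] = "shelving"
    bus.cast({"method": "shelve", "vm_id": vm_id})


def run(deliver):
    db = {"vm-1": {"vm_state": "active", "task_state": None}}
    bus = FakeBus(make_compute(db), deliver=deliver)
    api_shelve(db, bus, "vm-1")
    return db["vm-1"]


if __name__ == "__main__":
    print(run(deliver=True))   # message delivered
    print(run(deliver=False))  # message lost
```

With deliver=False the VM record keeps task_state "shelving" and nothing in
this model ever clears it, which is the stuck state described above.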

Of course, a user can work around this by running "reset-state" on the VM
and trying the action again, but this is error-prone and doesn't scale.

Possible solutions might be:

   - nova-api should receive an acknowledgement from nova-compute. It is
   unclear to me why it uses a non-reply mechanism today - probably to free
   the worker as fast as it can.
   - Change the task_state mechanism so that this kind of stuck state cannot
   persist in the DB. nova-compute could be the one that writes the
   task_state to the DB, but that alone is not enough. Maybe there's
   another way?
   - nova-api could start a timer for the action to complete. If the
   shelving operation hasn't completed in X seconds, it will clean up the
   state by itself and roll back or try again.
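
The timer idea in the last item could look roughly like the sketch below.
This is only an illustration under assumptions: the record layout, the
task_started_at field, TASK_TIMEOUT, start_task and sweep_stuck_tasks are
all hypothetical names, not existing nova code, and the current time is
passed in explicitly to keep the example deterministic.

```python
# Toy model only: the record layout, TASK_TIMEOUT, start_task and
# sweep_stuck_tasks are hypothetical, not existing nova code.

TASK_TIMEOUT = 60.0  # the "X seconds" from the proposal; arbitrary value


def start_task(vm, task, now):
    # Record when the task was requested so it can be timed out later.
    vm["task_state"] = task
    vm["task_started_at"] = now


def sweep_stuck_tasks(vms, now, timeout=TASK_TIMEOUT):
    """Clear task_state on VMs whose task has been pending too long.

    Returns the ids of the VMs that were reset, so a caller could
    retry the action or notify the user.
    """
    reset = []
    for vm_id, vm in vms.items():
        if vm["task_state"] is None:
            continue
        if now - vm["task_started_at"] > timeout:
            vm["task_state"] = None
            vm["task_started_at"] = None
            reset.append(vm_id)
    return reset
```

Note that a timeout alone cannot tell a lost message from a shelve that is
still running but slow, so a real implementation would need some way to
check with nova-compute before rolling back.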

What do you think about the problem and the solutions?

Thanks,
Shoham Peller
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
