Hello,

Is this the right place to report this issue? If not, please redirect me.

While experimenting with (and demonstrating) our new cluster yesterday, we
stumbled on a caveat in our LibvirtQemu resource agent (derived from
VirtualDomain). Since the same caveat exists in the VirtualDomain resource
agent, I thought I had better report it. Please see the patch below (for
LibvirtQemu), whose comments should explain where the problem lies.

--- LibvirtQemu.orig    2014-08-22 09:39:21.997201000 +0200
+++ LibvirtQemu    2014-08-22 09:50:32.440969000 +0200
@@ -154,11 +154,10 @@
   local virsh_output
   local domain_name
 
-  # Note: passing in the domain name from outside the script is
-  # intended for testing and debugging purposes only. Don't do this
-  # in production, instead let the script figure out the domain name
-  # from the config file. You have been warned.
-  if [ -z "${DOMAIN_NAME}" ]; then
+  # NOTE: Re-defining an already defined domain is dangerous! It shall be done only
+  # if we can reasonably assume the configuration file hasn't changed since the last
+  # time the domain has been defined.
+  if [ -z "${DOMAIN_NAME}" ] || [ "${OCF_RESKEY_config}" -ot "${STATEFILE}" ]; then
     # Spin until we have a domain name
     while true; do
       virsh_output="$(virsh ${VIRSH_OPTIONS} define ${OCF_RESKEY_config})"
@@ -170,7 +169,7 @@
     echo "${domain_name}" > "${STATEFILE}"
     ocf_log info "Domain name '${domain_name}' saved to state file '${STATEFILE}'."
   else
-    ocf_log warn "Domain name '${DOMAIN_NAME}' already defined; overriding configuration file '${OCF_RESKEY_config}' (this should NOT ne done in production!)."
+    ocf_log warn "Domain name '${DOMAIN_NAME}' already defined; overriding by newer configuration file will NOT be done!"
   fi
 }
 
@@ -205,12 +204,12 @@
         ;;
       ''|'no state')
         # Empty string may be returned when virsh does not
-        # receive a reply from libvirtd.
+        # receive a reply from libvirtd or after the domain has
+        # been undefined.
         # "no state" may occur when the domain is currently
         # being migrated (on the migration target only), or
         # whenever virsh can't reliably obtain the domain
         # state.
-        status='no state'
         if [ "${__OCF_ACTION}" == 'stop' ] && [ ${try} -ge 3 ]; then
           # During the stop operation, we want to bail out
           # quickly, so as to be able to force-stop (destroy)
@@ -224,6 +223,17 @@
           ocf_log info "Domain '${DOMAIN_NAME}' currently has no state; retrying."
           sleep 1
         fi
+        if [ "${status}" == '' ] && [ $(( ${try} % 10 )) -eq 0 ]; then
+          # Could it be that libvirtd is running healthily but the domain
+          # has been undefined? In that case, let's attempt to re-define it.
+          # If libvirtd IS running, it can not hurt (given the safeguards in
+          # LibvirtQemu_Define). If libvirtd is NOT running, then something is
+          # definitely wrong (and the monitor operation will time-out in
+          # LibvirtQemu_Define the same way as it would here).
+          ocf_log warn "Has domain '${DOMAIN_NAME}' been undefined? attempting to re-define it."
+          LibvirtQemu_Define
+        fi
+        status='no state'
         ;;
       *)
         # any other output is unexpected.
@@ -487,6 +497,11 @@
 
 # Define the domain on startup, and re-define whenever someone deleted
 # the state file, or touched the config.
+# WARNING: There is a caveat here! When the resource is stopped, the state file
+# is deleted ONLY on the node where it was running. In case the domain is then
+# undefined (from libvirtd), on all nodes, we will end-up with a state file but no
+# domain definition on those nodes that were not running the resource. The monitor
+# operation MUST handle that situation, should the resource be restarted.
 if [ ! -e "${STATEFILE}" ] || [ "${OCF_RESKEY_config}" -nt "${STATEFILE}" ]; then
   LibvirtQemu_Define
 fi

One could ask "why undefine a libvirt domain and then restart it?". The answer
is two-fold: 1. experience has shown us that we should undefine a decommissioned
domain from libvirt to prevent a potential UUID conflict when defining a new
domain (which is likely in our setup, since UUIDs are built from the domain IP
address); 2. the "demo effect" (or potentially legitimate reasons), where one
would "decommission" a domain and restart it right afterwards ( :-/ ).
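
To make the failure mode concrete, here is a minimal sketch that exercises the
two file-age checks from the patch in isolation, without libvirt. The helper
names (startup_would_define, define_is_allowed) and the temporary file names
are assumptions made up for this illustration; they are not part of
LibvirtQemu or VirtualDomain.

```shell
#!/bin/bash
# Hypothetical helper mirroring the startup guard at the end of the patch:
# succeeds when the agent would call the define function on startup.
startup_would_define() {
  local config="$1" statefile="$2"
  [ ! -e "${statefile}" ] || [ "${config}" -nt "${statefile}" ]
}

# Hypothetical helper mirroring the patched check inside the define function:
# succeeds when no domain name is known yet, or when the configuration file
# is older than the state file.
define_is_allowed() {
  local domain_name="$1" config="$2" statefile="$3"
  [ -z "${domain_name}" ] || [ "${config}" -ot "${statefile}" ]
}

# Simulate a node that was NOT running the resource: its state file survived
# the stop, and the configuration file has not been touched since.
workdir="$(mktemp -d)"
config="${workdir}/domain.xml"
statefile="${workdir}/domain.state"
touch -t 202001010000 "${config}"      # older
touch -t 202001020000 "${statefile}"   # newer

if startup_would_define "${config}" "${statefile}"; then
  echo "startup: define"
else
  echo "startup: skip define (stale state file hides the missing domain)"
fi

if define_is_allowed "demo-domain" "${config}" "${statefile}"; then
  echo "monitor: re-define allowed"
else
  echo "monitor: re-define refused"
fi

rm -rf "${workdir}"
```

With these timestamps the startup guard skips the define, which is exactly
the situation the monitor operation then has to recover from, while the
patched check still allows the re-define because the configuration file is
older than the state file.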

PS: we now also make sure to delete the VirtualDomain/LibvirtQemu state file
when undefining the domain. But it is best to have multiple safeguards as far
as this caveat is concerned (hence the patch above).
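
For what it is worth, the clean-up mentioned in this PS could look roughly
like the sketch below. It is not our actual tooling: the function name
undefine_and_forget and the VIRSH override variable are hypothetical, and the
state file path is passed in as an argument rather than guessed.

```shell
#!/bin/bash
# Hypothetical wrapper: undefine a domain and drop the matching state file.
# VIRSH is an assumption of this sketch; it lets the command be overridden
# (e.g. for a dry run) and defaults to the real virsh.
VIRSH="${VIRSH:-virsh}"

undefine_and_forget() {
  local domain="$1" statefile="$2"
  # Remove the state file only if libvirt accepted the undefine, so that the
  # state file and the domain definition never disagree because of us.
  if ${VIRSH} undefine "${domain}"; then
    rm -f "${statefile}"
  else
    echo "undefine of '${domain}' failed; keeping '${statefile}'" >&2
    return 1
  fi
}
```

Since the state file is local to each node, something like this would have
to run on every node where the domain gets undefined, not only on the node
that was last running the resource.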

Hope it helps,

Cédric

-- 

Cédric Dufour @ Idiap Research Institute
