Hi,

On Fri, Aug 22, 2014 at 10:23:29AM +0200, Cédric Dufour - Idiap Research Institute wrote:
> Hello,
>
> Is this the right place to report this issue? (please redirect me if not)

Yes. That said, bugs, issues, and fixes are nowadays mostly handled at
github.com/ClusterLabs/resource-agents, and reports there certainly get
more visibility.

> As we were experiencing/demonstrating our new cluster yesterday, we stumbled
> on a caveat in our LibvirtQemu resource agent (derived from VirtualDomain).
> Since the caveat is the same in the VirtualDomain resource agent; I thought I
> better report it. Please see the patch below (for LibvirtQemu), which
> comments should allow you to understand where the problem lies.

Perhaps I missed something, but may I ask why you decided to create a new
RA instead of improving the existing one? Was there anything in
VirtualDomain that made it unsuitable for your use case?

> --- LibvirtQemu.orig 2014-08-22 09:39:21.997201000 +0200
> +++ LibvirtQemu 2014-08-22 09:50:32.440969000 +0200
> @@ -154,11 +154,10 @@
>      local virsh_output
>      local domain_name
>
> -    # Note: passing in the domain name from outside the script is
> -    # intended for testing and debugging purposes only. Don't do this
> -    # in production, instead let the script figure out the domain name
> -    # from the config file. You have been warned.
> -    if [ -z "${DOMAIN_NAME}" ]; then
> +    # NOTE: Re-defining an already defined domain is dangerous! It shall be done only
> +    # if we can reasonably assume the configuration file hasn't changed since the last
> +    # time the domain has been defined.
> +    if [ -z "${DOMAIN_NAME}" ] || [ "${OCF_RESKEY_config}" -ot "${STATEFILE}" ]; then
>          # Spin until we have a domain name
>          while true; do
>              virsh_output="$(virsh ${VIRSH_OPTIONS} define ${OCF_RESKEY_config})"
> @@ -170,7 +169,7 @@
>          echo "${domain_name}" > "${STATEFILE}"
>          ocf_log info "Domain name '${domain_name}' saved to state file '${STATEFILE}'."
>      else
> -        ocf_log warn "Domain name '${DOMAIN_NAME}' already defined; overriding configuration file '${OCF_RESKEY_config}' (this should NOT ne done in production!)."
> +        ocf_log warn "Domain name '${DOMAIN_NAME}' already defined; overriding by newer configuration file will NOT be done!"
>      fi
>  }

Under which circumstances did you run into these issues? There were some
recent additions which enable saving the changes back to the configuration
file. Would that help?

Cheers,

Dejan

> @@ -205,12 +204,12 @@
>              ;;
>          ''|'no state')
>              # Empty string may be returned when virsh does not
> -            # receive a reply from libvirtd.
> +            # receive a reply from libvirtd or after the domain has
> +            # been undefined.
>              # "no state" may occur when the domain is currently
>              # being migrated (on the migration target only), or
>              # whenever virsh can't reliably obtain the domain
>              # state.
> -            status='no state'
>              if [ "${__OCF_ACTION}" == 'stop' ] && [ ${try} -ge 3 ]; then
>                  # During the stop operation, we want to bail out
>                  # quickly, so as to be able to force-stop (destroy)
> @@ -224,6 +223,17 @@
>                  ocf_log info "Domain '${DOMAIN_NAME}' currently has no state; retrying."
>                  sleep 1
>              fi
> +            if [ "${status}" == '' ] && [ $(( ${try} % 10 )) -eq 0 ]; then
> +                # Could it be that libvirtd is running healthily but the domain
> +                # has been undefined? In that case, let's attempt to re-define it.
> +                # If libvirtd IS running, it can not hurt (given the safeguards in
> +                # LibvirtQemu_Define). If libvirtd is NOT running, then something is
> +                # definitely wrong (and the monitor operation will time-out in
> +                # LibvirtQemu_Define the same way as it would here).
> +                ocf_log warn "Has domain '${DOMAIN_NAME}' been undefined? attempting to re-define it."
> +                LibvirtQemu_Define
> +            fi
> +            status='no state'
>              ;;
>          *)
>              # any other output is unexpected.
> @@ -487,6 +497,11 @@
>
>  # Define the domain on startup, and re-define whenever someone deleted
>  # the state file, or touched the config.
> +# WARNING: There is a caveat here! When the resource is stopped, the state file
> +# is deleted ONLY on the node where it was running. In case the domain is then
> +# undefined (from libvirtd), on all nodes, we will end-up with a state file but no
> +# domain definition on those nodes that were not running the resource. The monitor
> +# operation MUST handle that situation, should the resource be restarted.
>  if [ ! -e "${STATEFILE}" ] || [ "${OCF_RESKEY_config}" -nt "${STATEFILE}" ]; then
>      LibvirtQemu_Define
>  fi
>
> One could ask "why undefine a libvirt domain and then restart it?". The
> answer is two-fold: 1. experience showed us that we shall undefine a
> decommissioned domain from libvirt to prevent potential UUID conflict when
> defining a new domain (which is likely in our setup, since UUID are build
> from the domain IP address); 2. the "demo-effect" (or potential legitimate
> reasons), where one would "decommission" a domain and restart it right
> afterwards ( :-/ ).
>
> PS: we now also make sure to delete the VirtualDomain/LibvirtQemu state file
> when undefining the domain. But best have multiple safe guards as far as this
> caveat is concerned (thus the patch above).
>
> Hope it helps,
>
> Cédric
>
> --
>
> Cédric Dufour @ Idiap Research Institute
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
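
For reference, the startup test quoted in the patch (state file missing, or
config newer than the state file) can be tried in isolation. The following is
only a minimal sketch: needs_define is a hypothetical helper name, not a
function of VirtualDomain or LibvirtQemu, and the throwaway files stand in for
OCF_RESKEY_config and STATEFILE.

```shell
#!/bin/bash
# Sketch only: needs_define is a hypothetical helper, not a function of the
# VirtualDomain or LibvirtQemu resource agents.

# Succeeds (returns 0) when the domain should be (re-)defined from the
# configuration file, mirroring the startup test quoted in the patch.
needs_define() {
    local config="$1" statefile="$2"

    # No state file: nothing records a previous definition on this node.
    if [ ! -e "${statefile}" ]; then
        return 0
    fi
    # Configuration modified after the state file was written.
    if [ "${config}" -nt "${statefile}" ]; then
        return 0
    fi
    return 1
}

# Demonstration with throwaway files and explicit modification times.
demo_dir="$(mktemp -d)"
config="${demo_dir}/domain.xml"
statefile="${demo_dir}/domain.state"

touch -t 201408220900 "${config}"
needs_define "${config}" "${statefile}" && echo "no state file: define"

touch -t 201408221000 "${statefile}"
needs_define "${config}" "${statefile}" || echo "state file newer: skip"

touch -t 201408221100 "${config}"
needs_define "${config}" "${statefile}" && echo "config newer: define"

rm -rf "${demo_dir}"
```

The middle case is the one the caveat is about: a state file that is newer
than the config makes the test skip the define step, whether or not libvirtd
still knows the domain, so the check alone cannot notice an undefined domain.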