Hi Andrew,
Thanks, that sounds good. I am using the Ubuntu HA ppa, so I will wait for a 1.1.7 package to become available.

Andrew

----- Original Message -----
From: "Andrew Beekhof" <and...@beekhof.net>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Sent: Thursday, March 29, 2012 1:08:21 AM
Subject: Re: [Pacemaker] VirtualDomain Shutdown Timeout

On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> Hello,
>
> I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and
> Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so
> there is no shared storage, no live-migration):
> primitive p_vm ocf:heartbeat:VirtualDomain \
>         params config="/vmstore/config/vm.xml" \
>         meta allow-migrate="false" \
>         op start interval="0" timeout="180s" \
>         op stop interval="0" timeout="120s" \
>         op monitor interval="10" timeout="30"
>
> I would expect the following events to happen on failover on the "from" node
> (the migration source) if the VM hangs while shutting down:
> 1. VirtualDomain issues "virsh shutdown vm" to gracefully shutdown the VM
> 2. pacemaker waits 120 seconds for the timeout specified in the "op stop"
> timeout
> 3. VirtualDomain waits a bit less than 120 seconds to see if it will
> gracefully shutdown. Once it gets to almost 120 seconds, it issues "virsh
> destroy vm" to hard stop the VM.
> 4. pacemaker wakes up from the 120 second timeout and sees that the VM has
> stopped and proceeds with the failover
>
> However, I observed that VirtualDomain seems to be using the timeout from
> the "op start" line, 180 seconds, yet pacemaker uses the 120 second timeout.
> Thus, the VM is still running after the pacemaker timeout is reached and so
> the node is STONITHed. Here is the relevant section of code from
> /usr/lib/ocf/resource.d/heartbeat/VirtualDomain:
> VirtualDomain_Stop() {
>     local i
>     local status
>     local shutdown_timeout
>     local out ex
>
>     VirtualDomain_Status
>     status=$?
>
>     case $status in
>         $OCF_SUCCESS)
>             if ! ocf_is_true $OCF_RESKEY_force_stop; then
>                 # Issue a graceful shutdown request
>                 ocf_log info "Issuing graceful shutdown request for domain ${DOMAIN_NAME}."
>                 virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME}
>                 # The "shutdown_timeout" we use here is the operation
>                 # timeout specified in the CIB, minus 5 seconds
>                 shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))
>                 # Loop on status until we reach $shutdown_timeout
>                 while [ $NOW -lt $shutdown_timeout ]; do
>
> Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the
> "op stop ..." line?

It should, however there was a bug in 1.1.6 where this wasn't the case.
The relevant patch is:
https://github.com/beekhof/pacemaker/commit/fcfe6fe

Or you could try 1.1.7

>
> How can I optimize my pacemaker configuration so that the VM will attempt to
> gracefully shutdown and then at worst case destroy the VM before the
> pacemaker timeout is reached? Moreover, is there anything I can do inside of
> the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown
> process?
>
> Thanks,
>
> Andrew
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
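
For anyone following the thread, the deadline arithmetic discussed above can be sketched roughly as below. This is only an illustration and not the actual VirtualDomain agent code: the function names graceful_window and stop_with_deadline are made up for this example, and the three hook arguments are hypothetical stand-ins for the agent's status check, "virsh shutdown" and "virsh destroy". It assumes the operation timeout arrives in milliseconds, as the "/1000" in the quoted agent code suggests, and reuses the same 5 second safety margin.

```shell
# Sketch only. All names here are hypothetical; nothing below is taken
# from the real resource agent apart from the "timeout minus 5 seconds" idea.

# graceful_window TIMEOUT_MS [MARGIN_S]
# Print the number of seconds to wait for a graceful shutdown (never below 0).
graceful_window() {
    local timeout_ms="$1"
    local margin_s="${2:-5}"
    local window=$(( $timeout_ms / 1000 - $margin_s ))
    [ "$window" -lt 0 ] && window=0
    echo "$window"
}

# stop_with_deadline TIMEOUT_MS IS_RUNNING REQUEST_SHUTDOWN FORCE_OFF [POLL_S]
# The three hooks are names of commands or functions supplied by the caller.
stop_with_deadline() {
    local timeout_ms="$1" is_running="$2" request_shutdown="$3" force_off="$4"
    local poll_s="${5:-1}"
    local deadline=$(( $(date +%s) + $(graceful_window "$timeout_ms") ))

    "$is_running" || return 0            # already stopped, nothing to do
    "$request_shutdown"                  # ask the guest to shut down cleanly
    while [ "$(date +%s)" -lt "$deadline" ]; do
        "$is_running" || return 0        # guest went down within the window
        sleep "$poll_s"
    done
    "$force_off"                         # window used up: hard stop
}
```

With the 120s stop timeout, "graceful_window 120000" gives a 115 second window, so the hard stop happens before the cluster's own timeout expires. If the function is handed the 180s start timeout instead, "graceful_window 180000" gives 175 seconds, which is longer than the 120 seconds Pacemaker waits, and that matches the behaviour described in the original report.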