I woke up this morning and discovered that one of my clusters had failed over during the night. Everything was working fine, but I wanted to know what happened. From reading the logs, it looks to like the primary node ftp02 gave up its resources to the secondary node ftp01, which had become overloaded and failed to report in within the deadtime. (ftp01 was busy taking a backup and was experiencing high iowait, which I think explains why it failed to send heartbeats soon enough.) Why would apparently healthy node ftp02 give its resources to the overloaded ftp01? Here is the pertinent section of the log from ftp02, which was primary before the failover: heartbeat[5391]: 2011/03/03_04:03:20 info: Link ftp01.mycharts.md:/dev/ttyS0 dead. heartbeat[5391]: 2011/03/03_04:03:20 info: Link ftp01.mycharts.md:bond0 dead. ipfail[5775]: 2011/03/03_04:03:20 info: Link Status update: Link ftp01.mycharts.md//dev/ttyS0 now has status dead heartbeat[5391]: 2011/03/03_04:03:20 info: Link ftp01.mycharts.md:eth2 dead. ipfail[5775]: 2011/03/03_04:03:20 debug: Found ping node 192.168.10.100! ipfail[5775]: 2011/03/03_04:03:20 info: Asking other side for ping node count. ipfail[5775]: 2011/03/03_04:03:20 debug: Message [num_ping] sent. ipfail[5775]: 2011/03/03_04:03:20 info: Checking remote count of ping nodes. ipfail[5775]: 2011/03/03_04:03:20 info: Link Status update: Link ftp01.mycharts.md/bond0 now has status dead ipfail[5775]: 2011/03/03_04:03:21 debug: Found ping node 192.168.10.100! heartbeat[5391]: 2011/03/03_04:03:21 WARN: node ftp01.mycharts.md: is dead heartbeat[5391]: 2011/03/03_04:03:21 info: Dead node ftp01.mycharts.md gave up resources. heartbeat[5391]: 2011/03/03_04:03:21 debug: hb_rsc_recover_dead_resources: other now stable ipfail[5775]: 2011/03/03_04:03:21 info: Asking other side for ping node count. ipfail[5775]: 2011/03/03_04:03:21 debug: Message [num_ping] sent. ipfail[5775]: 2011/03/03_04:03:21 info: Checking remote count of ping nodes. ipfail[5775]: 2011/03/03_04:03:21 info: Link Status update: Link ftp01.mycharts.md/eth2 now has status dead ipfail[5775]: 2011/03/03_04:03:21 debug: Found ping node 192.168.10.100! ipfail[5775]: 2011/03/03_04:03:21 info: Asking other side for ping node count. ipfail[5775]: 2011/03/03_04:03:21 debug: Message [num_ping] sent. ipfail[5775]: 2011/03/03_04:03:21 info: Checking remote count of ping nodes. ipfail[5775]: 2011/03/03_04:03:21 info: Status update: Node ftp01.mycharts.md now has status dead ipfail[5775]: 2011/03/03_04:03:22 debug: Found ping node 192.168.10.100! ipfail[5775]: 2011/03/03_04:03:22 info: NS: We are still alive! heartbeat[5391]: 2011/03/03_04:03:38 CRIT: Cluster node ftp01.mycharts.md returning after partition. heartbeat[5391]: 2011/03/03_04:03:38 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain <http://linux-ha.org/SplitBrain> heartbeat[5391]: 2011/03/03_04:03:38 WARN: Deadtime value may be too small. heartbeat[5391]: 2011/03/03_04:03:38 info: See FAQ for information on tuning deadtime. heartbeat[5391]: 2011/03/03_04:03:38 info: URL: http://linux-ha.org/FAQ#heavy_load <http://linux-ha.org/FAQ#heavy_load> heartbeat[5391]: 2011/03/03_04:03:38 info: Link ftp01.mycharts.md:/dev/ttyS0 up. heartbeat[5391]: 2011/03/03_04:03:38 WARN: Late heartbeat: Node ftp01.mycharts.md: interval 49030 ms heartbeat[5391]: 2011/03/03_04:03:38 info: Status update for node ftp01.mycharts.md: status active ipfail[5775]: 2011/03/03_04:03:38 info: Link Status update: Link ftp01.mycharts.md//dev/ttyS0 now has status up heartbeat[5391]: 2011/03/03_04:03:38 debug: Status seqno: 568925 msgtime: 1299153818 heartbeat[5391]: 2011/03/03_04:03:38 debug: StartNextRemoteRscReq() - calling hook heartbeat[5391]: 2011/03/03_04:03:38 debug: notify_world: invoking harc: OLD status: active heartbeat[5391]: 2011/03/03_04:03:38 debug: Process [status] started pid 16973 heartbeat[5391]: 2011/03/03_04:03:38 debug: Starting notify process [status] ipfail[5775]: 2011/03/03_04:03:38 info: Status update: Node ftp01.mycharts.md now has status active heartbeat[5391]: 2011/03/03_04:03:38 info: Link ftp01.mycharts.md:bond0 up. ipfail[5775]: 2011/03/03_04:03:38 info: Link Status update: Link ftp01.mycharts.md/bond0 now has status up heartbeat[5391]: 2011/03/03_04:03:38 info: Link ftp01.mycharts.md:eth2 up. heartbeat[16973]: 2011/03/03_04:03:38 debug: notify_world: setting SIGCHLD Handler to SIG_DFL ipfail[5775]: 2011/03/03_04:03:38 info: Link Status update: Link ftp01.mycharts.md/eth2 now has status up heartbeat[16973]: 2011/03/03_04:03:38 debug: notify_world: Running harc status harc[16973]: 2011/03/03_04:03:38 info: Running /etc/ha.d/rc.d/status status heartbeat[5391]: 2011/03/03_04:03:38 info: Managed status process 16973 exited with return code 0. heartbeat[5391]: 2011/03/03_04:03:38 debug: RscMgmtProc 'status' exited code 0 heartbeat[5391]: 2011/03/03_04:03:40 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5391]: 2011/03/03_04:03:40 info: hb_giveup_resources(): current status: active heartbeat[5391]: 2011/03/03_04:03:40 debug: Sending hold resources msg: none, stable=0 # shutdown heartbeat[5391]: 2011/03/03_04:03:40 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 5 heartbeat[5391]: 2011/03/03_04:03:40 info: Heartbeat shutdown in progress. (5391) heartbeat[5391]: 2011/03/03_04:03:40 debug: Process [hb_giveup_resources] started pid 17023 heartbeat[17023]: 2011/03/03_04:03:40 info: Giving up all HA resources. ResourceManager[17036]: 2011/03/03_04:03:40 info: Releasing resource group: ftp01.mycharts.md drbddisk::ha_ftp Filesystem::/dev/drbd0::/ha_ftp::ext3::noatime IPaddr2::192.168.10.203/24/bond0 pure-ftpd nfs ResourceManager[17036]: 2011/03/03_04:03:40 info: Running /etc/init.d/nfs stop ResourceManager[17036]: 2011/03/03_04:03:40 debug: Starting /etc/init.d/nfs stop Shutting down NFS mountd: [ OK ] Shutting down NFS daemon: [ OK ] Shutting down NFS quotas: [ OK ] Shutting down NFS services: [ OK ] ResourceManager[17036]: 2011/03/03_04:03:41 debug: /etc/init.d/nfs stop done. RC=0 ResourceManager[17036]: 2011/03/03_04:03:41 info: Running /etc/init.d/pure-ftpd stop ResourceManager[17036]: 2011/03/03_04:03:41 debug: Starting /etc/init.d/pure-ftpd stop Stopping pure-ftpd: [ OK ] ResourceManager[17036]: 2011/03/03_04:03:41 debug: /etc/init.d/pure-ftpd stop done. RC=0 ResourceManager[17036]: 2011/03/03_04:03:41 info: Running /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 stop ResourceManager[17036]: 2011/03/03_04:03:41 debug: Starting /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 stop IPaddr2[17196]: 2011/03/03_04:03:41 INFO: ip -f inet addr delete 192.168.10.203/24 dev bond0 IPaddr2[17196]: 2011/03/03_04:03:41 INFO: ip -o -f inet addr show bond0 IPaddr2[17167]: 2011/03/03_04:03:41 INFO: Success INFO: Success ResourceManager[17036]: 2011/03/03_04:03:41 debug: /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 stop done. RC=0 ResourceManager[17036]: 2011/03/03_04:03:41 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime stop ResourceManager[17036]: 2011/03/03_04:03:41 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime stop Filesystem[17277]: 2011/03/03_04:03:41 INFO: Running stop for /dev/drbd0 on /ha_ftp Filesystem[17277]: 2011/03/03_04:03:41 INFO: Trying to unmount /ha_ftp umount: /ha_ftp: device is busy umount: /ha_ftp: device is busy Filesystem[17277]: 2011/03/03_04:03:41 ERROR: Couldn't unmount /ha_ftp; trying cleanup with SIGTERM /ha_ftp: 8796c Filesystem[17277]: 2011/03/03_04:03:42 INFO: Some processes on /ha_ftp were signalled ipfail[5775]: 2011/03/03_04:03:45 info: No giveup timer to abort. ipfail[5775]: 2011/03/03_04:03:47 info: No giveup timer to abort. ipfail[5775]: 2011/03/03_04:03:48 info: No giveup timer to abort. Filesystem[17277]: 2011/03/03_04:03:54 INFO: unmounted /ha_ftp successfully Filesystem[17266]: 2011/03/03_04:03:54 INFO: Success INFO: Success ResourceManager[17036]: 2011/03/03_04:03:54 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime stop done. RC=0 ResourceManager[17036]: 2011/03/03_04:03:54 info: Running /etc/ha.d/resource.d/drbddisk ha_ftp stop ResourceManager[17036]: 2011/03/03_04:03:54 debug: Starting /etc/ha.d/resource.d/drbddisk ha_ftp stop ResourceManager[17036]: 2011/03/03_04:03:54 debug: /etc/ha.d/resource.d/drbddisk ha_ftp stop done. RC=0 heartbeat[17023]: 2011/03/03_04:03:54 info: All HA resources relinquished. heartbeat[17023]: 2011/03/03_04:03:54 debug: Sending T_SHUTDONE. heartbeat[17023]: 2011/03/03_04:03:54 info: FIFO message [type shutdone] written rc=27 heartbeat[5391]: 2011/03/03_04:03:54 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 5 heartbeat[5391]: 2011/03/03_04:03:54 debug: Received T_SHUTDONE from us. heartbeat[5391]: 2011/03/03_04:03:54 debug: Calling hb_mcp_final_shutdown in a second. heartbeat[5391]: 2011/03/03_04:03:54 debug: RscMgmtProc 'hb_giveup_resources' exited code 0 heartbeat[5391]: 2011/03/03_04:03:55 debug: hb_mcp_final_shutdown() phase 0 heartbeat[5391]: 2011/03/03_04:03:55 debug: Shutting down client /usr/lib/heartbeat/ipfail heartbeat[5391]: 2011/03/03_04:03:55 info: killing /usr/lib/heartbeat/ipfail process group 5775 with signal 15 heartbeat[5391]: 2011/03/03_04:03:55 info: other_holds_resources: 3 heartbeat[5391]: 2011/03/03_04:03:55 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 5 heartbeat[5391]: 2011/03/03_04:03:56 debug: G_remove_client(pid=5775, reason='died' gsource=0x9e4b348) { heartbeat[5391]: 2011/03/03_04:03:56 debug: api_remove_client_int: removing pid [5775] reason: died heartbeat[5391]: 2011/03/03_04:03:56 debug: api_send_client: client 5775 died heartbeat[5391]: 2011/03/03_04:03:56 debug: }/*G_remove_client;*/ heartbeat[5391]: 2011/03/03_04:03:56 debug: Final client "/usr/lib/heartbeat/ipfail" died. heartbeat[5391]: 2011/03/03_04:03:56 debug: hb_mcp_final_shutdown() phase 1 heartbeat[5391]: 2011/03/03_04:03:57 debug: hb_mcp_final_shutdown() phase 2 heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5500 with signal 15 heartbeat[5497]: 2011/03/03_04:03:57 debug: Process 5497 processing SIGTERM heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE process 5501 with signal 15 heartbeat[5497]: 2011/03/03_04:03:57 debug: Exiting from pid 5497 [rc=15] heartbeat[5503]: 2011/03/03_04:03:57 debug: Process 5503 processing SIGTERM heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5502 with signal 15 heartbeat[5503]: 2011/03/03_04:03:57 debug: Exiting from pid 5503 [rc=15] heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE process 5503 with signal 15 heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5504 with signal 15 heartbeat[5501]: 2011/03/03_04:03:57 debug: Process 5501 processing SIGTERM heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE process 5494 with signal 15 heartbeat[5504]: 2011/03/03_04:03:57 debug: Process 5504 processing SIGTERM heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5495 with signal 15 heartbeat[5501]: 2011/03/03_04:03:57 debug: Exiting from pid 5501 [rc=15] heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBFIFO process 5419 with signal 15 heartbeat[5419]: 2011/03/03_04:03:57 debug: Process 5419 processing SIGTERM heartbeat[5502]: 2011/03/03_04:03:57 debug: Process 5502 processing SIGTERM heartbeat[5502]: 2011/03/03_04:03:57 debug: Exiting from pid 5502 [rc=15] heartbeat[5500]: 2011/03/03_04:03:57 debug: Process 5500 processing SIGTERM heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE process 5497 with signal 15 heartbeat[5494]: 2011/03/03_04:03:57 debug: Process 5494 processing SIGTERM heartbeat[5494]: 2011/03/03_04:03:57 debug: Calling localdie() function heartbeat[5495]: 2011/03/03_04:03:57 debug: Process 5495 processing SIGTERM heartbeat[5495]: 2011/03/03_04:03:57 debug: Exiting from pid 5495 [rc=15] heartbeat[5499]: 2011/03/03_04:03:57 debug: Process 5499 processing SIGTERM heartbeat[5498]: 2011/03/03_04:03:57 debug: Process 5498 processing SIGTERM heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBREAD process 5498 with signal 15 heartbeat[5494]: 2011/03/03_04:03:57 debug: glib: serial_localdie: Flushing tty heartbeat[5499]: 2011/03/03_04:03:57 debug: Exiting from pid 5499 [rc=15] heartbeat[5494]: 2011/03/03_04:03:57 debug: Exiting from pid 5494 [rc=15] heartbeat[5498]: 2011/03/03_04:03:57 debug: Exiting from pid 5498 [rc=15] heartbeat[5391]: 2011/03/03_04:03:57 info: killing HBWRITE process 5499 with signal 15 heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5494 exited. 11 remaining heartbeat[5419]: 2011/03/03_04:03:57 debug: Exiting from pid 5419 [rc=15] heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5495 exited. 10 remaining heartbeat[5504]: 2011/03/03_04:03:57 debug: Exiting from pid 5504 [rc=15] heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5497 exited. 9 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5498 exited. 8 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5499 exited. 7 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5501 exited. 6 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5502 exited. 5 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5503 exited. 4 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5419 exited. 3 remaining heartbeat[5500]: 2011/03/03_04:03:57 debug: Exiting from pid 5500 [rc=15] heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5504 exited. 2 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: Core process 5500 exited. 1 remaining heartbeat[5391]: 2011/03/03_04:03:57 info: ftp02.mycharts.md Heartbeat shutdown complete. heartbeat[5391]: 2011/03/03_04:03:57 info: Heartbeat restart triggered. heartbeat[5391]: 2011/03/03_04:03:57 info: Restarting heartbeat. heartbeat[5391]: 2011/03/03_04:03:57 info: Performing heartbeat restart exec. heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=hacluster, gid=haclient uid=hacluster heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=hacluster, gid=<null> heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=<null>, gid=haclient heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=root, gid=<null> heartbeat[5391]: 2011/03/03_04:04:28 debug: uid=<null>, gid=haclient heartbeat[5391]: 2011/03/03_04:04:28 debug: Beginning authentication parsing heartbeat[5391]: 2011/03/03_04:04:28 debug: 16 max authentication methods heartbeat[5391]: 2011/03/03_04:04:28 debug: Keyfile opened heartbeat[5391]: 2011/03/03_04:04:28 debug: Keyfile perms OK heartbeat[5391]: 2011/03/03_04:04:28 debug: 16 max authentication methods heartbeat[5391]: 2011/03/03_04:04:28 debug: Found authentication method [sha1] heartbeat[5391]: 2011/03/03_04:04:28 info: AUTH: i=2: key = 0x1412cf48, auth=0x2b4e4e66a610, authname=sha1 heartbeat[5391]: 2011/03/03_04:04:28 debug: Outbound signing method is 2 heartbeat[5391]: 2011/03/03_04:04:28 debug: Authentication parsing complete [1] heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(cluster,linux-ha) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(hopfudge,1) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(hbgenmethod,file) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(realtime,true) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(msgfmt,classic) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(conn_logd_time,60) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(log_badpack,true) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(syslogmsgfmt,false) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(coredumps,true) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(crm,false) heartbeat[5391]: 2011/03/03_04:04:28 info: Version 2 support: false heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(autojoin,none) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(uuidfrom,file) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(compression,zlib) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(compression_threshold,2) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(max_rexmit_delay,250) heartbeat[5391]: 2011/03/03_04:04:28 debug: Setting max_rexmit_delay to 250 ms heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(record_config_changes,on) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(record_pengine_inputs,on) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(enable_config_writes,on) heartbeat[5391]: 2011/03/03_04:04:28 debug: add_option(memreserve,6500) heartbeat[5391]: 2011/03/03_04:04:28 WARN: Logging daemon is disabled --enabling logging daemon is recommended heartbeat[5391]: 2011/03/03_04:04:28 info: ************************** heartbeat[5391]: 2011/03/03_04:04:28 info: Configuration validated. Starting heartbeat 2.1.3 heartbeat[5391]: 2011/03/03_04:04:28 debug: HA configuration OK. Heartbeat starting. heartbeat[18456]: 2011/03/03_04:04:28 info: heartbeat: version 2.1.3 heartbeat[18456]: 2011/03/03_04:04:28 info: Heartbeat generation: 1248828643 heartbeat[18456]: 2011/03/03_04:04:28 debug: uuid is:4ae573d0-2ccd-4f5b-b769-ddcb47da58c9 heartbeat[18456]: 2011/03/03_04:04:28 debug: FIFO process pid: 18458 heartbeat[18456]: 2011/03/03_04:04:28 debug: opening serial /dev/ttyS0 (serial ring) heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: tty setup on /dev/ttyS0 complete. heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: Baud rate set to: 0xe heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_iflag = 0xb2 heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_oflag = 0x4 heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_cflag = 0x80000cbe heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: ti.c_lflag = 0x8a30 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: Starting serial heartbeat on tty /dev/ttyS0 (19200 baud) heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18466 heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: 18467 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair wchan socket 9 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair rchan socket 11 heartbeat[18456]: 2011/03/03_04:04:28 debug: opening bcast bond0 (UDP/IP broadcast) heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: SO_BINDTODEVICE(r) set for device bond0 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast heartbeat started on port 698 (698) interface bond0 heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18468 heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: 18469 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast heartbeat closed on port 698 interface bond0 - Status: 1 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair wchan socket 12 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair rchan socket 14 heartbeat[18456]: 2011/03/03_04:04:28 debug: opening bcast eth2 (UDP/IP broadcast) heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: SO_BINDTODEVICE(r) set for device eth2 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast heartbeat started on port 698 (698) interface eth2 heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18470 heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: 18471 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast heartbeat closed on port 698 interface eth2 - Status: 1 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair wchan socket 13 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair rchan socket 16 heartbeat[18456]: 2011/03/03_04:04:28 debug: opening bcast eth3 (UDP/IP broadcast) heartbeat[18456]: 2011/03/03_04:04:28 debug: glib: SO_BINDTODEVICE(r) set for device eth3 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast heartbeat started on port 698 (698) interface eth3 heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18472 heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: 18473 heartbeat[18456]: 2011/03/03_04:04:28 info: glib: UDP Broadcast heartbeat closed on port 698 interface eth3 - Status: 1 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair wchan socket 15 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair rchan socket 18 heartbeat[18456]: 2011/03/03_04:04:28 debug: opening ping 192.168.10.100 (ping membership) heartbeat[18456]: 2011/03/03_04:04:28 info: glib: ping heartbeat started. heartbeat[18456]: 2011/03/03_04:04:28 debug: write process pid: 18474 heartbeat[18456]: 2011/03/03_04:04:28 debug: read child process pid: 18475 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair wchan socket 10 heartbeat[18456]: 2011/03/03_04:04:28 debug: make_io_childpair: CREATED childpair rchan socket 19 heartbeat[18456]: 2011/03/03_04:04:28 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[18456]: 2011/03/03_04:04:28 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[18456]: 2011/03/03_04:04:28 info: G_main_add_SignalHandler: Added signal handler for signal 17 heartbeat[18456]: 2011/03/03_04:04:28 debug: Limiting CPU: 42 CPU seconds every 60000 milliseconds heartbeat[18456]: 2011/03/03_04:04:28 debug: pid 18456 locked in memory. heartbeat[18456]: 2011/03/03_04:04:28 debug: Waiting for child processes to start heartbeat[18456]: 2011/03/03_04:04:28 info: Local status now set to: 'up' heartbeat[18456]: 2011/03/03_04:04:28 debug: All your child process are belong to us heartbeat[18456]: 2011/03/03_04:04:28 debug: Starting local status message @ 2000 ms intervals heartbeat[18456]: 2011/03/03_04:04:28 debug: Forking temp process write_hostcachedata heartbeat[18456]: 2011/03/03_04:04:28 info: Managed write_hostcachedata process 18476 exited with return code 0. heartbeat[18458]: 2011/03/03_04:04:29 debug: pid 18458 locked in memory. heartbeat[18458]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds heartbeat[18466]: 2011/03/03_04:04:29 debug: pid 18466 locked in memory. heartbeat[18466]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds heartbeat[18467]: 2011/03/03_04:04:29 debug: pid 18467 locked in memory. heartbeat[18467]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds heartbeat[18469]: 2011/03/03_04:04:29 debug: pid 18469 locked in memory. heartbeat[18468]: 2011/03/03_04:04:29 debug: pid 18468 locked in memory. heartbeat[18469]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds heartbeat[18468]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds heartbeat[18471]: 2011/03/03_04:04:29 debug: pid 18471 locked in memory. heartbeat[18471]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds heartbeat[18472]: 2011/03/03_04:04:29 debug: pid 18472 locked in memory. heartbeat[18472]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds heartbeat[18470]: 2011/03/03_04:04:29 debug: pid 18470 locked in memory. heartbeat[18470]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds heartbeat[18475]: 2011/03/03_04:04:29 debug: pid 18475 locked in memory. heartbeat[18475]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds heartbeat[18473]: 2011/03/03_04:04:29 debug: pid 18473 locked in memory. heartbeat[18473]: 2011/03/03_04:04:29 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds heartbeat[18474]: 2011/03/03_04:04:29 debug: pid 18474 locked in memory. heartbeat[18474]: 2011/03/03_04:04:29 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds heartbeat[18456]: 2011/03/03_04:04:29 info: Link 192.168.10.100:192.168.10.100 up. heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: status heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: ask_resources heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: ip-request heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: ip-request-resp heartbeat[18456]: 2011/03/03_04:04:29 debug: CreateInitialFilter: hb_takeover heartbeat[18456]: 2011/03/03_04:04:29 info: Status update for node 192.168.10.100: status ping heartbeat[18456]: 2011/03/03_04:04:29 debug: Status seqno: 0 msgtime: 1299153868 heartbeat[18456]: 2011/03/03_04:04:29 info: Link ftp02.mycharts.md:bond0 up. heartbeat[18456]: 2011/03/03_04:04:29 info: Link ftp02.mycharts.md:eth2 up. heartbeat[18456]: 2011/03/03_04:04:29 info: Link ftp02.mycharts.md:eth3 up. heartbeat[18456]: 2011/03/03_04:04:30 WARN: string2msg_ll: node [ftp01.mycharts.md] failed authentication heartbeat[18456]: 2011/03/03_04:04:30 info: Link ftp01.mycharts.md:bond0 up. heartbeat[18456]: 2011/03/03_04:04:30 debug: sending reqnodes msg to node ftp01.mycharts.md heartbeat[18456]: 2011/03/03_04:04:30 info: Link ftp01.mycharts.md:eth2 up. heartbeat[18456]: 2011/03/03_04:04:30 info: Link ftp01.mycharts.md:/dev/ttyS0 up. heartbeat[18456]: 2011/03/03_04:04:30 debug: Forking temp process write_hostcachedata heartbeat[18456]: 2011/03/03_04:04:30 info: Managed write_hostcachedata process 18512 exited with return code 0. heartbeat[18456]: 2011/03/03_04:04:30 info: Status update for node ftp01.mycharts.md: status active heartbeat[18456]: 2011/03/03_04:04:30 debug: Status seqno: 568961 msgtime: 1299153870 heartbeat[18456]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq() - calling hook heartbeat[18456]: 2011/03/03_04:04:30 debug: notify_world: invoking harc: OLD status: up heartbeat[18456]: 2011/03/03_04:04:30 debug: Process [status] started pid 18540 heartbeat[18456]: 2011/03/03_04:04:30 debug: Starting notify process [status] heartbeat[18456]: 2011/03/03_04:04:30 debug: Get a repnodes msg from ftp01.mycharts.md heartbeat[18456]: 2011/03/03_04:04:30 debug: nodelist received:ftp01.mycharts.md ftp02.mycharts.md heartbeat[18456]: 2011/03/03_04:04:30 info: Comm_now_up(): updating status to active heartbeat[18456]: 2011/03/03_04:04:30 info: Local status now set to: 'active' heartbeat[18456]: 2011/03/03_04:04:30 debug: Sending local starting msg: resourcestate = 0 heartbeat[18456]: 2011/03/03_04:04:30 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 0 heartbeat[18456]: 2011/03/03_04:04:30 info: Starting child client "/usr/lib/heartbeat/ipfail" (498,496) heartbeat[18456]: 2011/03/03_04:04:30 debug: Forking temp process write_hostcachedata heartbeat[18541]: 2011/03/03_04:04:30 info: Starting "/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 18541) heartbeat[18456]: 2011/03/03_04:04:30 debug: Forking temp process write_delcachedata heartbeat[18540]: 2011/03/03_04:04:30 debug: notify_world: setting SIGCHLD Handler to SIG_DFL heartbeat[18540]: 2011/03/03_04:04:30 debug: notify_world: Running harc status heartbeat[18456]: 2011/03/03_04:04:30 info: Managed write_hostcachedata process 18542 exited with return code 0. heartbeat[18456]: 2011/03/03_04:04:30 info: Managed write_delcachedata process 18543 exited with return code 0. ipfail[18541]: 2011/03/03_04:04:30 debug: PID=18541 ipfail[18541]: 2011/03/03_04:04:30 debug: Signing in with heartbeat heartbeat[18456]: 2011/03/03_04:04:30 debug: APIregistration_dispatch() { heartbeat[18456]: 2011/03/03_04:04:30 debug: process_registerevent() { heartbeat[18456]: 2011/03/03_04:04:30 debug: client->gsource = 0x14151c28 heartbeat[18456]: 2011/03/03_04:04:30 debug: }/*process_registerevent*/; heartbeat[18456]: 2011/03/03_04:04:30 debug: }/*APIregistration_dispatch*/; heartbeat[18456]: 2011/03/03_04:04:30 debug: Checking client authorization for client ipfail (498:496) heartbeat[18456]: 2011/03/03_04:04:30 debug: create_seq_snapshot_table:no missing packets found for node ftp01.mycharts.md heartbeat[18456]: 2011/03/03_04:04:30 debug: create_seq_snapshot_table:no missing packets found for node ftp02.mycharts.md heartbeat[18456]: 2011/03/03_04:04:30 debug: Signing on API client 18541 (ipfail) harc[18540]: 2011/03/03_04:04:30 info: Running /etc/ha.d/rc.d/status status heartbeat[18456]: 2011/03/03_04:04:30 info: Managed status process 18540 exited with return code 0. heartbeat[18456]: 2011/03/03_04:04:30 debug: RscMgmtProc 'status' exited code 0 heartbeat[18456]: 2011/03/03_04:04:30 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 1 ipfail[18541]: 2011/03/03_04:04:30 debug: [We are ftp02.mycharts.md] heartbeat[18456]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0)) heartbeat[18456]: 2011/03/03_04:04:31 info: remote resource transition completed. heartbeat[18456]: 2011/03/03_04:04:31 debug: Sending hold resources msg: none, stable=0 # <none> heartbeat[18456]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0)) heartbeat[18456]: 2011/03/03_04:04:31 info: STATE 1 => 3 heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 3 heartbeat[18456]: 2011/03/03_04:04:31 debug: Calling PerformAutoFailback() heartbeat[18456]: 2011/03/03_04:04:31 info: other_holds_resources: 3 heartbeat[18456]: 2011/03/03_04:04:31 info: remote resource transition completed. heartbeat[18456]: 2011/03/03_04:04:31 info: Local Resource acquisition completed. (none) heartbeat[18456]: 2011/03/03_04:04:31 debug: Sending hold resources msg: none, stable=1 # <none> heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 3 heartbeat[18456]: 2011/03/03_04:04:31 debug: Calling PerformAutoFailback() heartbeat[18456]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (0)) heartbeat[18456]: 2011/03/03_04:04:31 info: Initial resource acquisition complete (T_RESOURCES(them)) heartbeat[18456]: 2011/03/03_04:04:31 info: STATE 3 => 4 heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 ipfail[18541]: 2011/03/03_04:04:31 debug: auto_failback -> 0 (off) ipfail[18541]: 2011/03/03_04:04:31 debug: Setting message filter mode heartbeat[18456]: 2011/03/03_04:04:31 info: other_holds_resources: 3 heartbeat[18456]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 ipfail[18541]: 2011/03/03_04:04:31 debug: Starting node walk ipfail[18541]: 2011/03/03_04:04:32 debug: Cluster node: 192.168.10.100: status: ping ipfail[18541]: 2011/03/03_04:04:32 debug: Cluster node: ftp02.mycharts.md: status: active ipfail[18541]: 2011/03/03_04:04:33 debug: Cluster node: ftp01.mycharts.md: status: active ipfail[18541]: 2011/03/03_04:04:33 debug: [They are ftp01.mycharts.md] ipfail[18541]: 2011/03/03_04:04:33 debug: Setting message signal ipfail[18541]: 2011/03/03_04:04:33 debug: Waiting for messages... ipfail[18541]: 2011/03/03_04:04:34 debug: Other side is now stable. ipfail[18541]: 2011/03/03_04:04:34 debug: Other side is now stable. ipfail[18541]: 2011/03/03_04:04:34 debug: Got asked for num_ping. ipfail[18541]: 2011/03/03_04:04:35 debug: Found ping node 192.168.10.100! ipfail[18541]: 2011/03/03_04:04:36 info: Ping node count is balanced. ipfail[18541]: 2011/03/03_04:04:36 debug: Abort message sent. heartbeat[18456]: 2011/03/03_04:14:29 info: Daily informational memory statistics ////////////////////////////////////////////////////////////// Now here is the ha-debug from server ftp01, which was secondary before the failover: heartbeat[5280]: 2011/03/03_04:03:38 WARN: Gmain_timeout_dispatch: Dispatch function for memory stats took too long to execute: 48880 ms (> 100 ms) (GSource: 0xb29108) heartbeat[5280]: 2011/03/03_04:03:38 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 47980 ms (> 1010 ms) before being called (GSource: 0xb28398) heartbeat[5280]: 2011/03/03_04:03:40 info: Gmain_timeout_dispatch: started at 542452156 should have started at 542447358 heartbeat[5280]: 2011/03/03_04:03:41 WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 2480 ms (> 50 ms) (GSource: 0xb28398) heartbeat[5280]: 2011/03/03_04:03:44 WARN: Late heartbeat: Node ftp02.mycharts.md: interval 53380 ms heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 3280 ms (> 50 ms) (GSource: 0xb1c388) heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch function for read child was delayed 3290 ms (> 500 ms) before being called (GSource: 0xb1ee98) heartbeat[5280]: 2011/03/03_04:03:44 info: G_CH_dispatch_int: started at 542452778 should have started at 542452449 heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch function for read child was delayed 3290 ms (> 500 ms) before being called (GSource: 0xb1f7d8) heartbeat[5280]: 2011/03/03_04:03:44 info: G_CH_dispatch_int: started at 542452778 should have started at 542452449 heartbeat[5280]: 2011/03/03_04:03:44 WARN: G_CH_dispatch_int: Dispatch function for read child was delayed 3280 ms (> 500 ms) before being called (GSource: 0xb20528) heartbeat[5280]: 2011/03/03_04:03:44 info: G_CH_dispatch_int: started at 542452778 should have started at 542452450 heartbeat[5280]: 2011/03/03_04:03:44 WARN: Late heartbeat: Node 192.168.10.100: interval 56190 ms heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1740 ms (> 1010 ms) before being called (GSource: 0xb28398) heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: started at 542452778 should have started at 542452604 ipfail[5456]: 2011/03/03_04:03:44 debug: Got asked for num_ping. heartbeat[5280]: 2011/03/03_04:03:44 debug: process_resources(2): other now unstable heartbeat[5280]: 2011/03/03_04:03:44 info: other_holds_resources: 0 heartbeat[5280]: 2011/03/03_04:03:44 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 54200 ms (> 1010 ms) before being called (GSource: 0xb29698) heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: started at 542452779 should have started at 542447359 heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 55170 ms (> 30000 ms) before being called (GSource: 0xb29818) heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: started at 542452779 should have started at 542447262 ipfail[5456]: 2011/03/03_04:03:44 debug: Found ping node 192.168.10.100! heartbeat[5280]: 2011/03/03_04:03:44 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 52220 ms (> 5000 ms) before being called (GSource: 0xb29518) heartbeat[5280]: 2011/03/03_04:03:44 info: Gmain_timeout_dispatch: started at 542452779 should have started at 542447557 ipfail[5456]: 2011/03/03_04:03:44 info: Ping node count is balanced. ipfail[5456]: 2011/03/03_04:03:44 debug: Abort message sent. ipfail[5456]: 2011/03/03_04:03:45 debug: Got asked for num_ping. ipfail[5456]: 2011/03/03_04:03:45 debug: Found ping node 192.168.10.100! ipfail[5456]: 2011/03/03_04:03:46 info: Ping node count is balanced. ipfail[5456]: 2011/03/03_04:03:46 debug: Abort message sent. ipfail[5456]: 2011/03/03_04:03:46 debug: Got asked for num_ping. ipfail[5456]: 2011/03/03_04:03:47 debug: Found ping node 192.168.10.100! ipfail[5456]: 2011/03/03_04:03:48 info: Ping node count is balanced. ipfail[5456]: 2011/03/03_04:03:48 debug: Abort message sent. ipfail[5456]: 2011/03/03_04:03:48 debug: Other side is unstable. heartbeat[5280]: 2011/03/03_04:03:54 debug: process_resources(4): other now stable - T_SHUTDONE heartbeat[5280]: 2011/03/03_04:03:54 debug: Process [go_standby] started pid 1142 heartbeat[5280]: 2011/03/03_04:03:54 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: 0, going_standby: 3, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:03:54 info: Received shutdown notice from 'ftp02.mycharts.md'. heartbeat[5280]: 2011/03/03_04:03:54 info: Resources being acquired from ftp02.mycharts.md. heartbeat[5280]: 2011/03/03_04:03:54 debug: StartNextRemoteRscReq(): child count 1 heartbeat[5280]: 2011/03/03_04:03:54 debug: takeover_from_node: other now stable heartbeat[5280]: 2011/03/03_04:03:54 debug: Process [req_our_resources(ask)] started pid 1143 heartbeat[1142]: 2011/03/03_04:03:54 info: acquire all HA resources (standby). heartbeat[1142]: 2011/03/03_04:03:54 info: go_standby: who: 2 resource set: all heartbeat[1142]: 2011/03/03_04:03:54 info: go_standby: (query/action): (allkeys/takegroup) heartbeat[1143]: 2011/03/03_04:03:54 debug: req_our_resources(/usr/share/heartbeat/ResourceManager listkeys ftp01.mycharts.md) heartbeat[1143]: 2011/03/03_04:03:54 debug: req_our_resources(): running [/usr/share/heartbeat/req_resource drbddisk::ha_ftp] ResourceManager[1169]: 2011/03/03_04:03:55 info: Acquiring resource group: ftp01.mycharts.md drbddisk::ha_ftp Filesystem::/dev/drbd0::/ha_ftp::ext3::noatime IPaddr2::192.168.10.203/24/bond0 pure-ftpd nfs req_resource[1168]: 2011/03/03_04:03:55 debug: in /usr/share/heartbeat/req_resource drbddisk::ha_ftp req_resource[1168]: 2011/03/03_04:03:55 debug: dont_ask: nice_failback: yes heartbeat[1143]: 2011/03/03_04:03:55 info: 1 local resources from [/usr/share/heartbeat/ResourceManager listkeys ftp01.mycharts.md] heartbeat[1143]: 2011/03/03_04:03:55 info: Local Resource acquisition completed. heartbeat[1143]: 2011/03/03_04:03:55 debug: Sending hold resources msg: all, stable=1 # req_our_resources() heartbeat[1143]: 2011/03/03_04:03:55 info: FIFO message [type resource] written rc=79 heartbeat[5280]: 2011/03/03_04:03:55 debug: StartNextRemoteRscReq(): child count 2 heartbeat[5280]: 2011/03/03_04:03:55 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1)) heartbeat[5280]: 2011/03/03_04:03:55 debug: hb_rsc_isstable: ResourceMgmt_child_count: 2, other_is_stable: 1, takeover_in_progress: 1, going_standby: 3, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:03:55 info: Managed req_our_resources(ask) process 1143 exited with return code 0. heartbeat[5280]: 2011/03/03_04:03:55 debug: RscMgmtProc 'req_our_resources(ask)' exited code 0 heartbeat[5280]: 2011/03/03_04:03:55 debug: StartNextRemoteRscReq(): child count 1 ResourceManager[1169]: 2011/03/03_04:03:55 info: Running /etc/ha.d/resource.d/drbddisk ha_ftp start ResourceManager[1169]: 2011/03/03_04:03:55 debug: Starting /etc/ha.d/resource.d/drbddisk ha_ftp start ResourceManager[1169]: 2011/03/03_04:03:55 debug: /etc/ha.d/resource.d/drbddisk ha_ftp start done. RC=0 Filesystem[1280]: 2011/03/03_04:03:56 INFO: Resource is stopped ResourceManager[1169]: 2011/03/03_04:03:56 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime start ResourceManager[1169]: 2011/03/03_04:03:56 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime start Filesystem[1366]: 2011/03/03_04:03:56 INFO: Running start for /dev/drbd0 on /ha_ftp Filesystem[1355]: 2011/03/03_04:03:57 INFO: Success INFO: Success ResourceManager[1169]: 2011/03/03_04:03:57 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /ha_ftp ext3 noatime start done. RC=0 IPaddr2[1440]: 2011/03/03_04:03:57 INFO: Resource is stopped ResourceManager[1169]: 2011/03/03_04:03:58 info: Running /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 start ResourceManager[1169]: 2011/03/03_04:03:58 debug: Starting /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 start IPaddr2[1552]: 2011/03/03_04:03:58 INFO: ip -f inet addr add 192.168.10.203/24 brd 192.168.10.255 dev bond0 IPaddr2[1552]: 2011/03/03_04:03:58 INFO: ip link set bond0 up IPaddr2[1552]: 2011/03/03_04:03:58 INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.10.203 bond0 192.168.10.203 auto not_used not_used IPaddr2[1523]: 2011/03/03_04:03:58 INFO: Success INFO: Success ResourceManager[1169]: 2011/03/03_04:03:58 debug: /etc/ha.d/resource.d/IPaddr2 192.168.10.203/24/bond0 start done. RC=0 ResourceManager[1169]: 2011/03/03_04:03:58 info: Running /etc/init.d/pure-ftpd start ResourceManager[1169]: 2011/03/03_04:03:58 debug: Starting /etc/init.d/pure-ftpd start Starting pure-ftpd: [ OK ] ResourceManager[1169]: 2011/03/03_04:03:58 debug: /etc/init.d/pure-ftpd start done. RC=0 ResourceManager[1169]: 2011/03/03_04:03:58 info: Running /etc/init.d/nfs start ResourceManager[1169]: 2011/03/03_04:03:58 debug: Starting /etc/init.d/nfs start Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] ResourceManager[1169]: 2011/03/03_04:03:59 debug: /etc/init.d/nfs start done. RC=0 heartbeat[1142]: 2011/03/03_04:03:59 info: all HA resource acquisition completed (standby). heartbeat[1142]: 2011/03/03_04:03:59 debug: Sending standby [done] msg heartbeat[1142]: 2011/03/03_04:03:59 info: FIFO message [type ask_resources] written rc=47 heartbeat[5280]: 2011/03/03_04:03:59 debug: Received standby message done from ftp01.mycharts.md in state 3 heartbeat[5280]: 2011/03/03_04:03:59 info: Standby resource acquisition done [all]. heartbeat[5280]: 2011/03/03_04:03:59 debug: Sending hold resources msg: all, stable=1 # <none> heartbeat[5280]: 2011/03/03_04:03:59 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1)) heartbeat[5280]: 2011/03/03_04:03:59 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: 1, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:03:59 info: New standby state: 0 heartbeat[5280]: 2011/03/03_04:03:59 info: Managed go_standby process 1142 exited with return code 0. heartbeat[5280]: 2011/03/03_04:03:59 debug: RscMgmtProc 'go_standby' exited code 0 heartbeat[5280]: 2011/03/03_04:03:59 debug: StartNextRemoteRscReq() - calling hook heartbeat[5280]: 2011/03/03_04:03:59 debug: notify_world: invoking harc: OLD status: active heartbeat[5280]: 2011/03/03_04:03:59 debug: Process [status] started pid 1802 heartbeat[5280]: 2011/03/03_04:03:59 debug: Starting notify process [status] heartbeat[1802]: 2011/03/03_04:03:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL heartbeat[1802]: 2011/03/03_04:03:59 debug: notify_world: Running harc status harc[1802]: 2011/03/03_04:03:59 info: Running /etc/ha.d/rc.d/status status mach_down[1818]: 2011/03/03_04:03:59 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired mach_down[1818]: 2011/03/03_04:03:59 info: mach_down takeover complete for node ftp02.mycharts.md. heartbeat[5280]: 2011/03/03_04:03:59 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1)) heartbeat[5280]: 2011/03/03_04:03:59 info: mach_down takeover complete. heartbeat[5280]: 2011/03/03_04:03:59 debug: process_resources(3): other now stable heartbeat[5280]: 2011/03/03_04:03:59 info: AnnounceTakeover(local 1, foreign 1, reason 'mach_down' (1)) heartbeat[5280]: 2011/03/03_04:03:59 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:03:59 info: Managed status process 1802 exited with return code 0. heartbeat[5280]: 2011/03/03_04:03:59 debug: RscMgmtProc 'status' exited code 0 heartbeat[5280]: 2011/03/03_04:03:59 debug: StartNextRemoteRscReq() - calling hook heartbeat[5280]: 2011/03/03_04:03:59 debug: notify_world: invoking harc: OLD status: active heartbeat[5280]: 2011/03/03_04:03:59 debug: Process [ip-request-resp] started pid 1852 heartbeat[5280]: 2011/03/03_04:03:59 debug: Starting notify process [ip-request-resp] heartbeat[1852]: 2011/03/03_04:03:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL heartbeat[1852]: 2011/03/03_04:03:59 debug: notify_world: Running harc ip-request-resp harc[1852]: 2011/03/03_04:03:59 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp ip-request-resp[1852]: 2011/03/03_04:03:59 received ip-request-resp drbddisk::ha_ftp OK yes ResourceManager[1873]: 2011/03/03_04:03:59 info: Acquiring resource group: ftp01.mycharts.md drbddisk::ha_ftp Filesystem::/dev/drbd0::/ha_ftp::ext3::noatime IPaddr2::192.168.10.203/24/bond0 pure-ftpd nfs Filesystem[1915]: 2011/03/03_04:04:00 INFO: Running OK IPaddr2[1975]: 2011/03/03_04:04:00 INFO: Running OK heartbeat[5280]: 2011/03/03_04:04:00 info: Managed ip-request-resp process 1852 exited with return code 0. heartbeat[5280]: 2011/03/03_04:04:00 debug: RscMgmtProc 'ip-request-resp' exited code 0 heartbeat[5280]: 2011/03/03_04:04:00 info: AnnounceTakeover(local 1, foreign 1, reason 'ip-request-resp' (1)) heartbeat[5340]: 2011/03/03_04:04:25 WARN: glib: TTY write timeout on [/dev/ttyS0] (no connection or bad cable? [see documentation]) heartbeat[5340]: 2011/03/03_04:04:25 info: glib: See http://linux-ha.org/FAQ#TTYtimeout <http://linux-ha.org/FAQ#TTYtimeout> for details heartbeat[5280]: 2011/03/03_04:04:26 info: Link ftp02.mycharts.md:bond0 dead. heartbeat[5280]: 2011/03/03_04:04:26 info: Link ftp02.mycharts.md:eth2 dead. ipfail[5456]: 2011/03/03_04:04:26 info: Link Status update: Link ftp02.mycharts.md/bond0 now has status dead ipfail[5456]: 2011/03/03_04:04:27 debug: Found ping node 192.168.10.100! heartbeat[5280]: 2011/03/03_04:04:27 info: Link ftp02.mycharts.md:/dev/ttyS0 dead. ipfail[5456]: 2011/03/03_04:04:27 info: Asking other side for ping node count. ipfail[5456]: 2011/03/03_04:04:27 debug: Message [num_ping] sent. ipfail[5456]: 2011/03/03_04:04:27 info: Checking remote count of ping nodes. ipfail[5456]: 2011/03/03_04:04:27 info: Link Status update: Link ftp02.mycharts.md/eth2 now has status dead ipfail[5456]: 2011/03/03_04:04:28 debug: Found ping node 192.168.10.100! ipfail[5456]: 2011/03/03_04:04:28 info: Asking other side for ping node count. ipfail[5456]: 2011/03/03_04:04:28 debug: Message [num_ping] sent. ipfail[5456]: 2011/03/03_04:04:28 info: Checking remote count of ping nodes. ipfail[5456]: 2011/03/03_04:04:28 info: Link Status update: Link ftp02.mycharts.md//dev/ttyS0 now has status dead ipfail[5456]: 2011/03/03_04:04:29 debug: Found ping node 192.168.10.100! ipfail[5456]: 2011/03/03_04:04:29 info: Asking other side for ping node count. ipfail[5456]: 2011/03/03_04:04:29 debug: Message [num_ping] sent. ipfail[5456]: 2011/03/03_04:04:29 info: Checking remote count of ping nodes. heartbeat[5280]: 2011/03/03_04:04:30 info: Heartbeat restart on node ftp02.mycharts.md heartbeat[5280]: 2011/03/03_04:04:30 info: Link ftp02.mycharts.md:/dev/ttyS0 up. heartbeat[5280]: 2011/03/03_04:04:30 info: Status update for node ftp02.mycharts.md: status init heartbeat[5280]: 2011/03/03_04:04:30 debug: Status seqno: 1 msgtime: 1299153868 ipfail[5456]: 2011/03/03_04:04:30 info: Link Status update: Link ftp02.mycharts.md//dev/ttyS0 now has status up heartbeat[5280]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq() - calling hook heartbeat[5280]: 2011/03/03_04:04:30 debug: notify_world: invoking harc: OLD status: active heartbeat[5280]: 2011/03/03_04:04:30 debug: Process [status] started pid 2976 heartbeat[5280]: 2011/03/03_04:04:30 debug: Starting notify process [status] ipfail[5456]: 2011/03/03_04:04:30 info: Status update: Node ftp02.mycharts.md now has status init heartbeat[5280]: 2011/03/03_04:04:30 info: Link ftp02.mycharts.md:bond0 up. ipfail[5456]: 2011/03/03_04:04:30 info: Link Status update: Link ftp02.mycharts.md/bond0 now has status up heartbeat[5280]: 2011/03/03_04:04:30 info: Link ftp02.mycharts.md:eth2 up. ipfail[5456]: 2011/03/03_04:04:30 info: Link Status update: Link ftp02.mycharts.md/eth2 now has status up heartbeat[5280]: 2011/03/03_04:04:30 info: Status update for node ftp02.mycharts.md: status up heartbeat[5280]: 2011/03/03_04:04:30 debug: Status seqno: 2 msgtime: 1299153868 heartbeat[5280]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq(): child count 1 ipfail[5456]: 2011/03/03_04:04:30 info: Status update: Node ftp02.mycharts.md now has status up heartbeat[2976]: 2011/03/03_04:04:30 debug: notify_world: setting SIGCHLD Handler to SIG_DFL heartbeat[2976]: 2011/03/03_04:04:30 debug: notify_world: Running harc status harc[2976]: 2011/03/03_04:04:30 info: Running /etc/ha.d/rc.d/status status heartbeat[5280]: 2011/03/03_04:04:30 info: Managed status process 2976 exited with return code 0. heartbeat[5280]: 2011/03/03_04:04:30 debug: RscMgmtProc 'status' exited code 0 heartbeat[5280]: 2011/03/03_04:04:30 debug: StartNextRemoteRscReq() - calling hook heartbeat[5280]: 2011/03/03_04:04:30 debug: notify_world: invoking harc: OLD status: active heartbeat[5280]: 2011/03/03_04:04:30 debug: Process [status] started pid 2992 heartbeat[5280]: 2011/03/03_04:04:30 debug: Starting notify process [status] heartbeat[2992]: 2011/03/03_04:04:30 debug: notify_world: setting SIGCHLD Handler to SIG_DFL heartbeat[2992]: 2011/03/03_04:04:30 debug: notify_world: Running harc status harc[2992]: 2011/03/03_04:04:30 info: Running /etc/ha.d/rc.d/status status heartbeat[5280]: 2011/03/03_04:04:30 info: Managed status process 2992 exited with return code 0. heartbeat[5280]: 2011/03/03_04:04:30 debug: RscMgmtProc 'status' exited code 0 heartbeat[5280]: 2011/03/03_04:04:30 debug: Get a reqnodes message from ftp02.mycharts.md heartbeat[5280]: 2011/03/03_04:04:30 debug: get_delnodelist: delnodelist= heartbeat[5280]: 2011/03/03_04:04:31 info: Status update for node ftp02.mycharts.md: status active heartbeat[5280]: 2011/03/03_04:04:31 debug: Status seqno: 6 msgtime: 1299153870 heartbeat[5280]: 2011/03/03_04:04:31 debug: StartNextRemoteRscReq() - calling hook heartbeat[5280]: 2011/03/03_04:04:31 debug: notify_world: invoking harc: OLD status: active heartbeat[5280]: 2011/03/03_04:04:31 debug: Process [status] started pid 3046 heartbeat[5280]: 2011/03/03_04:04:31 debug: Starting notify process [status] ipfail[5456]: 2011/03/03_04:04:31 info: Status update: Node ftp02.mycharts.md now has status active heartbeat[5280]: 2011/03/03_04:04:31 debug: process_resources: other now unstable heartbeat[5280]: 2011/03/03_04:04:31 debug: Sending hold resources msg: all, stable=1 # <none> heartbeat[5280]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1)) heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[3046]: 2011/03/03_04:04:31 debug: notify_world: setting SIGCHLD Handler to SIG_DFL heartbeat[3046]: 2011/03/03_04:04:31 debug: notify_world: Running harc status ipfail[5456]: 2011/03/03_04:04:31 debug: Got join message from another ipfail client. (ftp02.mycharts.md) harc[3046]: 2011/03/03_04:04:31 info: Running /etc/ha.d/rc.d/status status heartbeat[5280]: 2011/03/03_04:04:31 info: Managed status process 3046 exited with return code 0. heartbeat[5280]: 2011/03/03_04:04:31 debug: RscMgmtProc 'status' exited code 0 ipfail[5456]: 2011/03/03_04:04:31 debug: Found ping node 192.168.10.100! heartbeat[5280]: 2011/03/03_04:04:31 debug: process_resources(2): other now unstable heartbeat[5280]: 2011/03/03_04:04:31 info: other_holds_resources: 0 heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:04:31 info: remote resource transition completed. heartbeat[5280]: 2011/03/03_04:04:31 debug: Sending hold resources msg: all, stable=1 # <none> heartbeat[5280]: 2011/03/03_04:04:31 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1)) heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 heartbeat[5280]: 2011/03/03_04:04:31 debug: Calling PerformAutoFailback() heartbeat[5280]: 2011/03/03_04:04:31 info: other_holds_resources: 0 heartbeat[5280]: 2011/03/03_04:04:31 debug: hb_rsc_isstable: ResourceMgmt_child_count: 0, other_is_stable: 1, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 4 ipfail[5456]: 2011/03/03_04:04:31 info: Asking other side for ping node count. ipfail[5456]: 2011/03/03_04:04:31 debug: Message [num_ping] sent. ipfail[5456]: 2011/03/03_04:04:31 debug: Other side is unstable. ipfail[5456]: 2011/03/03_04:04:31 debug: Other side is now stable. ipfail[5456]: 2011/03/03_04:04:37 info: No giveup timer to abort. heartbeat[5280]: 2011/03/03_04:12:49 info: Daily informational memory statistics
-- Eric Robinson Disclaimer - March 3, 2011 This email and any files transmitted with it are confidential and intended solely for General Linux-HA mailing list. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physicians' Managed Care or Physician Select Management. Warning: Although Physicians' Managed Care or Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. This disclaimer was added by Policy Patrol: http://www.policypatrol.com/ _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
