Hi,

On 4 February 2011 23:09, Ryan Thomson <r...@pet.ubc.ca> wrote:
> Hello list,
>
> I've got a question about the behaviour of pacemaker (with heartbeat)
> when the partition hosting /var becomes full. Hopefully I can explain the
> situation clearly.
>
> We are running a two-node cluster with pacemaker 1.0.9 and heartbeat 3.0.3
> on CentOS 5 x86_64. STONITH is configured with IPMI. We run in an
> active/passive configuration.
>
> On Wednesday night our active node (resonance) experienced a severe kernel
> soft lockup issue. The soft lockup caused the services running on this node
> to become inaccessible to the clients. Some of the TCP ports accepted telnet
> connections and the node responded to pings, but none of the clients were
> able to access the actual services, including SSH. The first soft lockup
> occurred around 4:30PM.
>
> Earlier that day (in the wee hours of the morning), /var became full on the
> passive node (mricenter), causing pengine to experience problems writing to
> /var:
>
> Feb 2 00:15:36 mricenter pengine: [23556]: ERROR: write_xml_file: bzWriteClose() failed: -6
>
> This was not noticed as our monitoring was inadequate.
>
> Once the soft lockup occurred on the active node and /var on the passive
> node was full, both heartbeat and pacemaker apparently continued operating
> as if everything was normal with the cluster. The logs on the passive node
> did not indicate any loss of heartbeat communication, and showed that the
> resources controlled by pacemaker were running and presumably returning
> success from their "monitor" operations:
>
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-mricenter (stonith:external/ipmi): Started resonance.fakedomain.com
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-resonance (stonith:external/ipmi): Started mricenter.fakedomain.com
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: clone_print: Clone Set: ping-clone
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: short_print: Started: [ mricenter.fakedomain.com resonance.fakedomain.com ]
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: group_print: Resource Group: DRBD
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: DRBD-Disk (heartbeat:drbddisk): Started resonance.fakedomain.com
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: DRBD-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-HOME
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Home-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Home-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-DATA
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Data-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Workgroup-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Mrcntr-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-DATABASE
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Database-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Database-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-CHH
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Chh-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Chh-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: group_print: Resource Group: NFS
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: NFSLock (lsb:nfslock): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: NFS-Daemon (lsb:nfs): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: Virtual-IP (ocf::heartbeat:IPaddr2): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: Samba-Daemon (lsb:smb): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: SMmonitor-Daemon (lsb:SMmonitor): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: Tina-Backup-Agent (lsb:tina.tina_ha): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: CUPS-Daemon (lsb:cups): Started resonance.fakedomain.com
> Feb 2 22:11:34 mricenter pengine: [23556]: notice: native_print: Failover-Email-Alert (ocf::heartbeat:MailTo): Started resonance.fakedomain.com
>
> However, only the pacemaker/heartbeat logs on the passive node continued as
> normal. On the active, soft-locked-up node, the pacemaker log output
> abruptly stopped once the soft lockup condition occurred. We did, however,
> get this repeating message from heartbeat in the logs:
>
> Feb 2 17:45:46 resonance heartbeat: [8129]: ERROR: 36 messages dropped on a non-blocking channel (send queue maximum length 64)
>
> My question is this: would /var being full on the passive node have played
> a role in the cluster's inability to fail over during the soft lockup
> condition on the active node? Or perhaps we hit a condition in which our
> configuration of pacemaker was unable to detect this type of failure? I'm
> basically trying to figure out whether /var being full on the passive node
> played a role in the lack of failover, or whether our configuration is
> inadequate at detecting the type of failure we experienced.
I'd say absolutely yes. /var being full probably stopped cluster traffic or,
at the least, stopped changes to the cib from being accepted (from memory, cib
changes are written to temp files in /var/lib/heartbeat/crm/...). It can
certainly stop ssh sessions from being established.

> Thoughts?

Just for the list (since I'm sure you've done this or similar already): I'd
suggest you use SNMP monitoring and add an SNMP trap for /var being 95% full.
A useful addition is to mount /var/log on a different disk/partition/logical
volume from /var; that way, even if your logs fill up, the system should still
continue to function for a while.

> --
> Ryan Thomson, Systems Administrator, UBC-PET
> UBC Hospital, Koerner Pavilion
> Room G358, 2211 Wesbrook Mall
> Vancouver, BC V6T 2B5
>
> Daytime Tel: 604.822.7605
> Evening Tel: 778.319.4505
> Pager: 604.205.4349 / 6042054...@msg.telus.com
> Email: r...@pet.ubc.ca

--
Best Regards,

Brett Delle Grazie
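A minimal sketch of the kind of 95%-full check suggested above, done as a cron
job rather than via an SNMP trap. The watched paths, the threshold, and the
reliance on cron mailing any output to root are assumptions for illustration,
not anything from this thread:

#!/usr/bin/env python
# Hypothetical cron job: warn when a watched filesystem crosses a usage
# threshold. Cron mails any output it produces, so a non-empty run is the alert.
import os
import sys

WATCHED = ["/var", "/var/log"]  # assumption: the filesystems worth watching
THRESHOLD = 95                  # percent used, matching the 95% suggestion

def percent_used(path):
    st = os.statvfs(path)
    # Roughly the same "Use%" figure df reports: blocks in use vs. total blocks.
    return 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks

def main():
    rc = 0
    for path in WATCHED:
        pct = percent_used(path)
        if pct >= THRESHOLD:
            print("WARNING: %s is %.1f%% full" % (path, pct))
            rc = 1
    return rc

if __name__ == "__main__":
    sys.exit(main())

Run every few minutes from cron on both nodes, a check like this should flag a
filling /var well before pengine starts failing to write its files.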