Hi,

On 4 February 2011 23:09, Ryan Thomson <r...@pet.ubc.ca> wrote:
> Hello list,
>
> I've got a question about the behaviour of pacemaker (with heartbeat)
> when the partition hosting /var becomes full. Hopefully I can explain the
> situation clearly.
>
> We are running a two-node cluster with pacemaker 1.0.9 and heartbeat 3.0.3
> on CentOS 5 x86_64. STONITH is configured with IPMI. We run in an
> active/passive configuration.
>
> On Wednesday night our active node (resonance) experienced a severe kernel
> soft lockup issue. The soft lockup caused the services running on this node
> to become inaccessible to the clients. Some of the TCP ports accepted telnet
> connections and the node responded to pings, but none of the clients were
> able to access the actual services, including SSH. The first soft lockup
> occurred around 4:30PM.
>
> Earlier that day (in the wee hours of the morning), /var became full on the
> passive node (mricenter), causing pengine to experience problems writing to
> /var:
>
> Feb 2 00:15:36 mricenter pengine: [23556]: ERROR: write_xml_file: bzWriteClose() failed: -6
>
> This was not noticed as our monitoring was inadequate.
>
> Once the soft lockup occurred on the active node and /var on the passive
> node was full, both heartbeat and pacemaker apparently continued operating
> as if everything was normal with the cluster. The logs on the passive node
> did not indicate any loss of heartbeat communication, and showed that the
> resources controlled by pacemaker were running and presumably returning
> success from their "monitor" operations:
>
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-mricenter (stonith:external/ipmi): Started resonance.fakedomain.com
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-resonance (stonith:external/ipmi): Started mricenter.fakedomain.com
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: clone_print: Clone Set: ping-clone
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: short_print: Started: [ mricenter.fakedomain.com resonance.fakedomain.com ]
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: group_print: Resource Group: DRBD
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: DRBD-Disk (heartbeat:drbddisk): Started resonance.fakedomain.com
> Feb 2 22:11:30 mricenter pengine: [23556]: notice: native_print: DRBD-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-HOME
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Home-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Home-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-DATA
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Data-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Workgroup-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:31 mricenter pengine: [23556]: notice: native_print: Mrcntr-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-DATABASE
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Database-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Database-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: group_print: Resource Group: LUN-CHH
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Chh-LVM (ocf::heartbeat:LVM): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: Chh-Filesystem (ocf::heartbeat:Filesystem): Started resonance.fakedomain.com
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: group_print: Resource Group: NFS
> Feb 2 22:11:32 mricenter pengine: [23556]: notice: native_print: NFSLock (lsb:nfslock): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: NFS-Daemon (lsb:nfs): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: Virtual-IP (ocf::heartbeat:IPaddr2): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: Samba-Daemon (lsb:smb): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: SMmonitor-Daemon (lsb:SMmonitor): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: Tina-Backup-Agent (lsb:tina.tina_ha): Started resonance.fakedomain.com
> Feb 2 22:11:33 mricenter pengine: [23556]: notice: native_print: CUPS-Daemon (lsb:cups): Started resonance.fakedomain.com
> Feb 2 22:11:34 mricenter pengine: [23556]: notice: native_print: Failover-Email-Alert (ocf::heartbeat:MailTo): Started resonance.fakedomain.com
>
> However, only the pacemaker/heartbeat logs on the passive node continued as
> normal. On the active, soft-locked-up node, the pacemaker log output
> abruptly stopped once the soft lockup condition occurred. We did, however,
> get this repeating message from heartbeat in the logs:
>
> Feb 2 17:45:46 resonance heartbeat: [8129]: ERROR: 36 messages dropped on a non-blocking channel (send queue maximum length 64)
>
> My question is this: would /var being full on the passive node have played
> a role in the cluster's inability to fail over during the soft lockup
> condition on the active node? Or perhaps we hit a condition in which our
> configuration of pacemaker was unable to detect this type of failure? I'm
> basically trying to figure out whether /var being full on the passive node
> played a role in the lack of failover, or whether our configuration is
> inadequate at detecting the type of failure we experienced.
I'd say absolutely yes. /var being full probably stopped cluster traffic or,
at the least, stopped changes to the cib from being accepted (from memory, cib
changes are written to temp files in /var/lib/heartbeat/crm/...). It can
certainly stop ssh sessions from being established.

> Thoughts?

Just for the list (since I'm sure you've done this or similar already): I'd
suggest you use SNMP monitoring and add an SNMP trap for /var being 95% full.
A useful addition is to mount /var/log on a different disk/partition/logical
volume from /var; that way, even if your logs fill up, the system should still
continue to function for a while.

> --
> Ryan Thomson, Systems Administrator, UBC-PET
> UBC Hospital, Koerner Pavilion
> Room G358, 2211 Wesbrook Mall
> Vancouver, BC V6T 2B5
>
> Daytime Tel: 604.822.7605
> Evening Tel: 778.319.4505
> Pager: 604.205.4349 / 6042054...@msg.telus.com
> Email: r...@pet.ubc.ca

--
Best Regards,

Brett Delle Grazie
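A minimal sketch of the kind of 95%-full check suggested above, done as a cron
job rather than via an SNMP trap. The watched paths, the threshold, and the
reliance on cron mailing any output to root are assumptions for illustration,
not anything from this thread:

#!/usr/bin/env python
# Hypothetical cron job: warn when a watched filesystem crosses a usage
# threshold. Cron mails any output it produces, so a non-empty run is the alert.
import os
import sys

WATCHED = ["/var", "/var/log"]  # assumption: the filesystems worth watching
THRESHOLD = 95                  # percent used, matching the 95% suggestion

def percent_used(path):
    st = os.statvfs(path)
    # Roughly the same "Use%" figure df reports: blocks in use vs. total blocks.
    return 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks

def main():
    rc = 0
    for path in WATCHED:
        pct = percent_used(path)
        if pct >= THRESHOLD:
            print("WARNING: %s is %.1f%% full" % (path, pct))
            rc = 1
    return rc

if __name__ == "__main__":
    sys.exit(main())

Run every few minutes from cron on both nodes, a check like this should flag a
filling /var well before pengine starts failing to write its files.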