On Tue, 07/03/2012 06:27 PM, Damiano Scaramuzza <cese...@daimonlab.it> wrote:
> Hi all, my first post in this ML.
> I used Heartbeat in 2008 for a big project, and now I'm back with
> Pacemaker for a smaller one.
> 
> I have two nodes with DRBD/cLVM/OCFS2/KVM virtual machines, all on
> Debian wheezy using testing (quite stable) packages.
> I configured STONITH with the meatware plugin and some colocation rules
> (if needed I can post the CIB file).
> If I gracefully stop one of the two nodes, everything works well (the
> VM resources migrate to the other node, DRBD fences, and all
> colocation/start-stop orders are fulfilled).
> 
> Bad things happen when I force a reset of one of the two nodes with
> echo b > /proc/sysrq-trigger.
> 
> Scenario 1) The cluster software hangs completely: crm_mon reports 2
> nodes online, but the other node has rebooted and sits there without
> corosync/pacemaker running. No STONITH message at all.
> 
> Scenario 2) Sometimes I see the meatware STONITH message; I call
> meatclient, and the cluster hangs.
> Scenario 3) Meatware message appears; I call meatclient; crm_mon
> reports the node "unclean", but I see some resources stopped and some
> running or Master.
> 
> Using the full configuration with OCFS2 (but I tested GFS2 too) I see
> these messages in syslog:
> 
> kernel: [ 2277.229622] INFO: task virsh:11370 blocked for more than 120
> seconds.
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229626] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229629] virsh           D
> ffff88041fc53540     0 11370  11368 0x00000000
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229635]  ffff88040b50ce60
> 0000000000000082 0000000000000000 ffff88040f235610
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229642]  0000000000013540
> ffff8803e1953fd8 ffff8803e1953fd8 ffff88040b50ce60
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229648]  0000000000000246
> 0000000181349294 ffff8803f5ca2690 ffff8803f5ca2000
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229655] Call Trace:
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229673]  [<ffffffffa06da2d9>] ?
> ocfs2_wait_for_recovery+0xa2/0xbc [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229679]  [<ffffffff8105f51b>] ?
> add_wait_queue+0x3c/0x3c
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229696]  [<ffffffffa06c8896>] ?
> ocfs2_inode_lock_full_nested+0xeb/0x925 [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229714]  [<ffffffffa06cdd2a>] ?
> ocfs2_permission+0x2b/0xe1 [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229721]  [<ffffffff811019e9>] ?
> unlazy_walk+0x100/0x132
> 
> 
> So, to simplify and rule OCFS2 out of the hang, I tried DRBD/cLVM only,
> but when resetting one node with the same echo b I see the cluster
> hang, with these messages in syslog:
> 
<SNIP>
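For anyone wanting to reproduce the quoted test: it boils down to two commands on the node being killed. A minimal sketch, assuming sysrq is not already enabled on that host:

```shell
# Enable all sysrq functions (distro defaults vary; some already allow this)
echo 1 > /proc/sys/kernel/sysrq

# "b" = immediate reboot: no sync, no unmount, no clean cluster shutdown.
# This simulates a hard node crash, which is exactly what should trigger
# STONITH on the surviving node.
echo b > /proc/sysrq-trigger
```

On the surviving node, `crm_mon -1` and syslog should then show the peer as lost and a fencing operation being initiated; if no STONITH message ever appears (scenario 1 above), the problem is upstream of the filesystem layer.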

Interesting.. I ran into this same problem last night. I'm running on Debian 
squeeze using debs from squeeze-backports on XCP (open-source XenServer). When I 
force-shutdown a node, the remaining node hangs with a similar message. I 
originally thought it was an OCFS2 problem, but it may be something different. 
Either way, I'm going to try my configuration on CentOS, but that is having 
its own unique challenges lol. 
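One note on the meatware scenarios in the quoted post: the meatware plugin only logs an operator prompt and then blocks until a human acknowledges the fence. A sketch of the manual acknowledgement step, with "hvlinux01" standing in for whichever node was reset (the name is just an example):

```shell
# On the surviving node, watch syslog for the meatware operator prompt
grep -i meatware /var/log/syslog | tail

# Once you have verified the peer really is down (powered off or rebooted),
# acknowledge the fence so Pacemaker can continue recovery:
meatclient -c hvlinux01
```

Until meatclient is run, DLM-based services (cLVM, OCFS2/GFS2) stay blocked waiting for fencing to complete, which would explain at least some of the hangs described above.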



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
