Hi!

OK, I manually added a "sync" before those reboots. Maybe I'll find some trace
after the next unexplained reboot.

Regards,
Ulrich


>>> Lars Ellenberg <[email protected]> wrote on 19.12.2011 at 21:40 in
message <[email protected]>:
> On Fri, Dec 16, 2011 at 01:31:32PM +0100, Ulrich Windl wrote:
> > Hi!
> > 
> > I have some trouble with OCFS2 on top of DRBD that seems to be timing-related:
> > OCFS2 is working on the DRBD device when DRBD itself wants to change something, it seems:
> > 
> > ...
> > Dec 16 11:39:58 h06 kernel: [  122.426174] block drbd0: role( Secondary -> Primary )
> > Dec 16 11:39:58 h06 multipathd: drbd0: update path write_protect to '0' (uevent)
> > Dec 16 11:40:29 h06 ocfs2_controld: start_mount: uuid "FD32E504527742CEA7DA6DB272D5D7B2", device "/dev/drbd_r0", service "ocfs2"
> > ...
> > Dec 16 11:40:29 h06 kernel: [  152.837615] block drbd0: peer( Secondary -> Primary )
> > Dec 16 11:40:29 h06 ocfs2_hb_ctl[19177]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -P -d /dev/drbd_r0
> > Dec 16 11:43:50 h06 kernel: [  354.559240] block drbd0: State change failed: Device is held open by someone
> > Dec 16 11:43:50 h06 kernel: [  354.559244] block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
> > Dec 16 11:43:50 h06 kernel: [  354.559246] block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
> > Dec 16 11:43:50 h06 drbd[28754]: [28786]: ERROR: r0: Called drbdadm -c /etc/drbd.conf secondary r0
> 
> The resource agent was told to demote.
> That fails, as DRBD is still/already in use (by ocfs2 or other).
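
[Editor's note: the "Device is held open by someone" state above can usually be confirmed with `fuser -v /dev/drbd0` or `lsof /dev/drbd0`. Where neither tool is installed, a minimal POSIX-shell sketch that walks /proc does the same job; the device path is only an example, substitute your DRBD minor:]

```shell
# Sketch: list PIDs that hold a given file or device open, by reading
# each process's fd symlinks under /proc. Assumes Linux with /proc mounted.
holders() {
    target=$1
    for fdlink in /proc/[0-9]*/fd/*; do
        # Unreadable fd dirs leave the glob pattern literal; skip those.
        [ -e "$fdlink" ] || continue
        if [ "$(readlink "$fdlink" 2>/dev/null)" = "$target" ]; then
            pid=${fdlink#/proc/}
            pid=${pid%%/*}
            printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
        fi
    done
}
# Example: holders /dev/drbd0
```

[A nonempty result here at demote time would explain the failed `drbdadm secondary r0` exactly as described above.]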
> 
> > Dec 16 11:43:50 h06 drbd[28754]: [28789]: ERROR: r0: Exit code 11
> > 
> > A little bit later, DRBD did its own fencing (the machine rebooted)
> 
> I very much doubt that.  At least, from the above log excerpt,
> I cannot imagine a scenario that would trigger any of the handlers cited below,
> unless you throw multiple failures into the mix.
> 
> But you apparently get a "demote failure", and possibly then a "stop
> failure" as well, which may trigger a stonith event.
> 
> Or maybe IO is blocked for "too long" so OCFS2 decides to self-fence.
> 
> Guess you have to improve your logging.
> 
> > Is there a way to let the cluster do the fencing instead of writing to
> > /proc/sysrq-trigger? Those handlers are used:
> >         handlers {
> >                 pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> >                 pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> >                 local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
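
[Editor's note: a sketch of DRBD's documented answer to the question above, assuming DRBD 8.3+ under a Pacemaker cluster: the `fence-peer` handler delegates fencing to the cluster's stonith devices instead of self-rebooting via sysrq. The resource name `r0` is taken from the logs above; check drbd.conf(5) for your DRBD version:]

```
resource r0 {
        disk {
                fencing resource-and-stonith;
        }
        handlers {
                fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
}
```

[With `resource-and-stonith`, DRBD freezes I/O on a replication-link loss and calls the fence-peer handler, which places a constraint in the Pacemaker CIB; the cluster then fences through its configured stonith resources.]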

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems