Hi!

I had an unexplained failure of the stonith monitor for SBD. When examining 
the syslog, I got the impression that the RA configuration data got corrupted, 
causing an RA failure.

What I see in syslog is like this:

lrmd: [9798]: info: perform_op:2950: operation monitor[88] with pid 12147 on 
prm_stonith_sbd:1 for client 9801, its parameters: 
CRM_meta_record_pending=[true] CRM_meta_clone=[1] CRM_meta_clone_node_max=[1] 
CRM_meta_clone_max=[2] CRM_meta_notify=[false] 
sbd_device=[/dev/disk/by-id/dm-name-Cl1_SBD-E1;/dev/disk/by-id/dm-name-Cl1_SBD-CRM_meta_globally_unique=[false]
 crm_feature_set=[3.0.6] CRM_meta_name=[monitor] CRM_meta_interval=[180000] 
CRM_meta_timeout=[90000]  for rsc is already running.

If you look closely at the "sbd_device" parameter, you'll see that it got 
merged with "CRM_meta_globally_unique=[false]", while the device name was 
truncated. Maybe there is some static buffer that overflowed.
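
To check other log lines for the same kind of damage, a small script can flag any "name=[value]" parameter whose closing bracket is missing before the next parameter starts. This is only a sketch: the function name find_unterminated_params is made up for illustration, it assumes the wrapped syslog line has been joined into one string, and it would misfire on a value that legitimately contains "name=[".

```python
import re

# Start of a parameter as printed by lrmd: an identifier followed by "=[".
OPEN_RE = re.compile(r"([A-Za-z_]\w*)=\[")


def find_unterminated_params(text):
    """Return names of parameters that have no "]" before the next "name=["."""
    opens = list(OPEN_RE.finditer(text))
    suspects = []
    for cur, nxt in zip(opens, opens[1:] + [None]):
        # The value region runs up to the next parameter start (or end of text).
        end = nxt.start() if nxt else len(text)
        if "]" not in text[cur.end():end]:
            suspects.append(cur.group(1))
    return suspects


if __name__ == "__main__":
    line = (
        "CRM_meta_clone_max=[2] CRM_meta_notify=[false] "
        "sbd_device=[/dev/disk/by-id/dm-name-Cl1_SBD-E1;"
        "/dev/disk/by-id/dm-name-Cl1_SBD-CRM_meta_globally_unique=[false] "
        "crm_feature_set=[3.0.6] CRM_meta_name=[monitor]"
    )
    print(find_unterminated_params(line))
```

Run against the parameters quoted above, this reports only sbd_device as unterminated.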

I saw many ".. is already running" messages, and then
lrmd: [9798]: WARN: prm_stonith_sbd:1:monitor process (PID 12147) timed out 
(try 1).  Killing with signal SIGTERM (15).

and later:

lrmd: [9798]: WARN: prm_stonith_sbd:1:monitor process (PID 12147) timed out 
(try 2).  Killing with signal SIGKILL (9).

In "cibadmin -Q" the configuration looks OK, but I only checked that two 
resets later (the stonith problem caused fencing).

Even after a hard reset the situation seems to be the same!

WARN: prm_stonith_sbd:1:monitor process (PID 12072) timed out (try 2).  Killing 
with signal SIGKILL (9).
lrmd: [9799]: WARN: perform_ra_op: the operation monitor[89] on 
prm_stonith_sbd:1 for client 9802 stayed in operation list for 96500 ms (longer 
than 10000 ms)

I then discovered something worse: stonithd crashed:
crmd: [9801]: info: process_lrm_event: LRM operation 
prm_stonith_sbd:1_monitor_180000 (call=89, status=1, cib-update=0, 
confirmed=true) Cancelled
stonith-ng: [9797]: WARN: free_device: Removal of device 'prm_stonith_sbd:1' 
purged operation monitor
kernel: [  323.648355] show_signal_msg: 30 callbacks suppressed
kernel: [  323.648361] stonithd[9797]: segfault at 0 ip 00007f70528afb94 sp 
00007fffaf06a410 error 4 in libcrmcommon.so.2.0.0[7f70528a4000+2d000]
lrm-stonith: [14098]: ERROR: stonith_send_command: STONITH disconnected: 3
lrm-stonith: [14098]: WARN: map_ra_retvalue: Mapped the invalid return code -10.
lrmd: [9798]: info: operation stop[90] on prm_stonith_sbd:1 for client 9801: 
pid 14098 exited with return code 1
crmd: [9801]: info: process_lrm_event: LRM operation prm_stonith_sbd:1_stop_0 
(call=90, rc=1, cib-update=145, confirmed=true) unknown error
[...]

It happened again (after another hard reset):
kernel: [  300.400783] stonithd[9798]: segfault at 0 ip 00007f8e32a18b94 sp 
00007fffa5c954f0 error 4 in libcrmcommon.so.2.0.0[7f8e32a0d000+2d000]
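
The two segfault lines can be compared by subtracting the mapping base from the instruction pointer. "segfault at 0" with "error 4" is a user-mode read of address 0, i.e. a NULL pointer dereference. The sketch below assumes the wrapped kernel lines have been joined into one string each; the helper name fault_offset is made up for illustration.

```python
import re

# Fields of the kernel's "segfault at ..." message, all values hexadecimal
# except the pid and the error code.
SEGV_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, offset of ip inside the mapping) or None."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    offset = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return m.group("obj"), hex(offset)


if __name__ == "__main__":
    first = (
        "kernel: [  323.648361] stonithd[9797]: segfault at 0 ip "
        "00007f70528afb94 sp 00007fffaf06a410 error 4 in "
        "libcrmcommon.so.2.0.0[7f70528a4000+2d000]"
    )
    second = (
        "kernel: [  300.400783] stonithd[9798]: segfault at 0 ip "
        "00007f8e32a18b94 sp 00007fffa5c954f0 error 4 in "
        "libcrmcommon.so.2.0.0[7f8e32a0d000+2d000]"
    )
    print(fault_offset(first))
    print(fault_offset(second))
```

Both crashes come out at offset 0xbb94 inside libcrmcommon.so.2.0.0, so it looks like the same instruction faults each time. Whether that offset can be handed directly to a tool like addr2line depends on how the library is mapped, which I have not verified.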

Regards,
Ulrich

