Hi! I had an unexplained failure of the stonith monitor for SBD. When examining the syslog, I got the impression that the RA configuration data got corrupted, causing an RA failure.
What I see in syslog is like this:

lrmd: [9798]: info: perform_op:2950: operation monitor[88] with pid 12147 on prm_stonith_sbd:1 for client 9801, its parameters: CRM_meta_record_pending=[true] CRM_meta_clone=[1] CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[2] CRM_meta_notify=[false] sbd_device=[/dev/disk/by-id/dm-name-Cl1_SBD-E1;/dev/disk/by-id/dm-name-Cl1_SBD-CRM_meta_globally_unique=[false] crm_feature_set=[3.0.6] CRM_meta_name=[monitor] CRM_meta_interval=[180000] CRM_meta_timeout=[90000] for rsc is already running.

If you look closely at the "sbd_device" parameter, you'll see that it got merged with "CRM_meta_globally_unique=[false]", while the device name was truncated. Maybe there is some static buffer that overflowed.

I saw many ".. is already running" messages, and then

lrmd: [9798]: WARN: prm_stonith_sbd:1:monitor process (PID 12147) timed out (try 1). Killing with signal SIGTERM (15).

and later:

lrmd: [9798]: WARN: prm_stonith_sbd:1:monitor process (PID 12147) timed out (try 2). Killing with signal SIGKILL (9).

In "cibadmin -Q" the configuration looks OK, but that's two resets later (the stonith problem caused fencing).

Even after a hard reset the situation seems to be the same!

WARN: prm_stonith_sbd:1:monitor process (PID 12072) timed out (try 2). Killing with signal SIGKILL (9).
lrmd: [9799]: WARN: perform_ra_op: the operation monitor[89] on prm_stonith_sbd:1 for client 9802 stayed in operation list for 96500 ms (longer than 10000 ms)

I discovered more bad things: stonithd crashed:

crmd: [9801]: info: process_lrm_event: LRM operation prm_stonith_sbd:1_monitor_180000 (call=89, status=1, cib-update=0, confirmed=true) Cancelled
stonith-ng: [9797]: WARN: free_device: Removal of device 'prm_stonith_sbd:1' purged operation monitor
kernel: [ 323.648355] show_signal_msg: 30 callbacks suppressed
kernel: [ 323.648361] stonithd[9797]: segfault at 0 ip 00007f70528afb94 sp 00007fffaf06a410 error 4 in libcrmcommon.so.2.0.0[7f70528a4000+2d000]
lrm-stonith: [14098]: ERROR: stonith_send_command: STONITH disconnected: 3
lrm-stonith: [14098]: WARN: map_ra_retvalue: Mapped the invalid return code -10.
lrmd: [9798]: info: operation stop[90] on prm_stonith_sbd:1 for client 9801: pid 14098 exited with return code 1
crmd: [9801]: info: process_lrm_event: LRM operation prm_stonith_sbd:1_stop_0 (call=90, rc=1, cib-update=145, confirmed=true) unknown error
[...]

It happened again (after another hard reset):

kernel: [ 300.400783] stonithd[9798]: segfault at 0 ip 00007f8e32a18b94 sp 00007fffa5c954f0 error 4 in libcrmcommon.so.2.0.0[7f8e32a0d000+2d000]

Regards,
Ulrich
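
P.S. In case someone wants to scan their own syslog for the same kind of merged parameter, here is a minimal sketch. It is not part of lrmd or any cluster tool; the function name find_merged_params is made up for illustration, and it assumes the parameters are always logged as name=[value] pairs separated by spaces, as in the perform_op line quoted above. It flags any value that itself contains the start of another "name=[" pair, which is what happened to sbd_device.

```python
import re

# One "name=[value]" pair as printed in the perform_op log line.
# The value runs up to the first "]", so a value that swallowed the
# start of the next parameter will still contain "=[".
PARAM_RE = re.compile(r"(\S+?)=\[([^\]]*)\]")


def find_merged_params(log_line):
    """Return (name, value) pairs whose value contains another "name=[" start.

    Hypothetical helper for illustration only; assumes the lrmd
    "its parameters:" format shown in the log excerpt.
    """
    suspicious = []
    for name, value in PARAM_RE.findall(log_line):
        if "=[" in value:
            suspicious.append((name, value))
    return suspicious


if __name__ == "__main__":
    # Excerpt of the corrupted line from the report, copied as logged.
    line = (
        "CRM_meta_notify=[false] "
        "sbd_device=[/dev/disk/by-id/dm-name-Cl1_SBD-E1;"
        "/dev/disk/by-id/dm-name-Cl1_SBD-CRM_meta_globally_unique=[false] "
        "crm_feature_set=[3.0.6]"
    )
    for name, value in find_merged_params(line):
        print("suspicious parameter:", name, "=", value)
```

Running this over each lrmd line containing "its parameters:" would at least show whether the corruption appears on every monitor call or only some of them.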
