On 29 Oct 2010 14:43, Dejan Muhamedagic wrote:
>> stonith -t rcd_serial -p "test /dev/ttyS0 rts 2000" test
>> ** (process:21181): DEBUG: rcd_serial_set_config:called
>> Alarm clock
>> ==> RESET WORKS!
>>
>> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0"
>> dtr\|rts="rts" msduration="2000" -S
>> ** (process:28054): DEBUG: rcd_serial_set_config:called
>> stonith: rcd_serial device OK.
>>
>> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0"
>> dtr\|rts="rts" msduration="2000" -l
>> ** (process:27543): DEBUG: rcd_serial_set_config:called
>> node1 node2
>>
>> stonith -t rcd_serial hostlist='node1 node2' ttydev="/dev/ttyS0"
>> dtr\|rts="rts" msduration="2000" -T reset node2
>> ** (process:29624): DEBUG: rcd_serial_set_config:called
>> ** (process:29624): CRITICAL **: rcd_serial_reset_req: host 'node2' not
>> in hostlist.
>
> And this message never appears in the logs?

Not in /var/log/messages.

>> ==> RESET FAILED
>>
>> stonith -t rcd_serial hostlist='node1, node2' ttydev="/dev/ttyS0"
>> dtr\|rts="rts" msduration="2000" -T reset node2
>> ** (process:26929): DEBUG: rcd_serial_set_config:called
>> ** (process:26929): CRITICAL **: rcd_serial_reset_req: host 'node2' not
>> in hostlist.
>> ==> RESET FAILED (note: the hostlist is comma-separated here)
>>
>> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0"
>> dtr\|rts="rts" msduration="2000" -T reset "node1 node2"
>> ==> RESET WORKS, BUT the argument <<reset "node1 node2">> is nonsense...
>> ==> There seems to be a problem with parsing the host list!
>
> It turns out that the hostlist can contain just one node. That
> makes sense since you can reach only one host over the serial
> cable. The plugin also makes no effort to tell the user if the
> hostlist looks meaningful, i.e. it considers "node1 node2" as a
> node name (as you've shown above).
>
> So, you'll need to configure two stonith resources, one per node.
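Indeed, with a single node in the hostlist the same manual test should then succeed (a sketch, reusing the parameters from the tests above):

stonith -t rcd_serial hostlist="node2" ttydev="/dev/ttyS0" \
    dtr\|rts="rts" msduration="2000" -T reset node2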
Very good idea! That brought me a bit forward:

Now I used the patched rcd_serial.so, with dtr_rts instead of dtr|rts, and the following config:

primitive stonith1 stonith:rcd_serial \
    params hostlist="node2" ttydev="/dev/ttyS0" dtr_rts="rts" msduration="2000" \
    op monitor interval="60s"
primitive stonith2 stonith:rcd_serial \
    params hostlist="node1" ttydev="/dev/ttyS0" dtr_rts="rts" msduration="2000" \
    op monitor interval="60s"
location stonith1-loc stonith1 \
    rule $id="stonith1-loc-id" -inf: #uname eq node2
location stonith2-loc stonith2 \
    rule $id="stonith2-loc-id" -inf: #uname eq node1

Then I ran 'kill -9 <corosync_pid>' on node2, and stonith on node1 really initiated a reboot of node2!

BUT in /var/log/messages on node1, stonith-ng thinks that the operation failed:

Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could not parse (0 2): ** (process:12139): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could not parse (3 19): (process:12139): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could not parse (0 0):
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could not parse (0 2): ** (process:12141): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could not parse (3 19): (process:12141): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could not parse (0 0):
Oct 29 16:06:55 node1 pengine: [31454]: WARN: process_pe_message: Transition 29: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-10.bz2
Oct 29 16:06:55 node1 stonith: rcd_serial device not accessible.
Oct 29 16:06:55 node1 stonith-ng: [31449]: notice: log_operation: Operation 'monitor' [12143] for device 'stonith2' returned: 1
Oct 29 16:06:55 node1 crmd: [31455]: WARN: status_from_rc: Action 118 (stonith2_monitor_60000) on node1 failed (target: 0 vs. rc: 1): Error
Oct 29 16:06:55 node1 crmd: [31455]: WARN: update_failcount: Updating failcount for stonith2 on node1 after failed monitor: rc=1 (update=value++, time=1288361215)
Oct 29 16:06:57 node1 kernel: [23312.814010] r8169 0000:02:00.0: eth0: link down
Oct 29 16:06:57 node1 stonith-ng: [31449]: ERROR: log_operation: Operation 'reboot' [12142] for host 'node2' with device 'stonith1' returned: 1 (call 0 from (null))

The state remained unclean:

# crm_mon
Node node2: UNCLEAN (offline)
Online: [ node1 ]

That caused multiple reboots of node2 until I deactivated stonith; the message "Operation 'reboot' ... returned: 1" was repeated each time. After that, the state became clean.

So we have taken a big step forward, but we are not at the finish yet...
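The parse_host_line warnings match the DEBUG banner the plugin prints before its host list (the same banner seen in the manual -l run above), so stonith-ng apparently tries to parse that banner as host names. While investigating, the fencing loop can be stopped and the failcounts cleared from the crm shell (a sketch, assuming the resource names above):

# check which stream the DEBUG banner uses; if it survives on stdout,
# it is mixed into the host list that stonith-ng reads
stonith -t rcd_serial hostlist="node1" ttydev="/dev/ttyS0" \
    dtr_rts="rts" msduration="2000" -l 2>/dev/null

# disable fencing cluster-wide while debugging, then clear the
# failed-monitor history of both stonith resources
crm configure property stonith-enabled=false
crm resource cleanup stonith1
crm resource cleanup stonith2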
Thank you,
Eberhard