On Sat, Dec 15, 2012 at 8:45 AM, <laurent+pacema...@u-picardie.fr> wrote:
> Andrew Beekhof <and...@beekhof.net> writes:
>
>> On Wed, Dec 12, 2012 at 11:51 AM, <laurent+pacema...@u-picardie.fr> wrote:
>>>
>>> Hi,
>>>
>>> I've just observed something weird.
>>> A node is running a stonith resource for which gethosts gives an empty
>>> node list. The result of stonith_admin -l does include it in the
>>> device list !
>>>
>>> result of "stonith_admin -l elasticsearch-05" run from
>>> elasticsearch-06 :
>>> stonith-xen-peatbull
>>> stonith-xen-eddu
>>> 2 devices found
>>>
>>> stonith-xen-peatbull is a correct fencing device
>>> stonith-xen-eddu is a fencing device with an empty hostlist
>>>
>>> running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
>>> doesn't return any host, and it does exit with 0 (is that correct to
>>> return 0 with an empty host list ?)
>>>
>>> logs :
>>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]: notice:
>>> stonith_device_register: Added 'stonith-cluster-xen' to the device list (6
>>> active devices)
>>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice:
>>> attrd_trigger_update: Sending flush op to all hosts for: probe_complete
>>> (true)
>>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice:
>>> attrd_perform_update: Sent update 5: probe_complete=true
>>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice:
>>> stonith_device_register: Added 'stonith-xen-eddu' to the device list (6
>>> active devices)
>>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice:
>>> stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6
>>> active devices)
>>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info:
>>> external/my-xen0-ha device OK.
>>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]: notice: process_lrm_event:
>>> LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27,
>>> confirmed=true) ok
>>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd:
>>> '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd:
>>> '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0
>>> device OK.
>>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event:
>>> LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28,
>>> confirmed=true) ok
>>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0
>>> device OK.
>>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event:
>>> LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29,
>>> confirmed=true) ok
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice:
>>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog
>>> (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice:
>>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka
>>> (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice:
>>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi
>>> (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist:
>>> 'my-xen0 gethosts' returned an empty hostlist
>>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list
>>> hosts for external/my-xen0.
>>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist:
>>> 'my-xen0 gethosts' returned an empty hostlist
>>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list
>>> hosts for external/my-xen0.
>>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]: notice:
>>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu
>>> (1): failed: 255
>>>
>>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>>> duration 0 is too low" bug, could it be related ?
>
> Hi,
>
>> Doubtful, what does your config look like?
>
> i've restarted from scratch with a simpler setup:
> primitive dummy_01 ocf:heartbeat:Dummy \
>         meta allow-migrate="true" \
>         op monitor interval="180" timeout="20"
> primitive stonith-xen-eddu stonith:external/my-xen0 \
>         params hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
> clone clone-stonith-xen-eddu stonith-xen-eddu \
>         meta clone-max="3" clone-node-max="1"
> location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
>         rule $id="clone-stonith-xen-eddu-location-01-rule" inf: defined #uname
> location dummy_01-location-01 dummy_01 \
>         rule $id="dummy_01-location-01-rule" inf: defined #uname
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.8-56429db" \
>         cluster-infrastructure="corosync" \
>         stonith-timeout="120" \
>         symmetric-cluster="false" \
>         no-quorum-policy="stop" \
>         stonith-enabled="true"
>
> there're 6 nodes: elasticsearch-01 ... 06
> afaik pcmk_host_check defaults to "dynamic-list".
>
> when the external stonith agent is called with "gethosts" it checks if
> any of the guests are running on eddu (the xen dom0/host)
> In this case, there're none of them running on eddu, it then returns
> an empty hostlist.
> Looking at the logs there's a critical message concerning the empty
> hostlist.
> So I guess it's not valid to have a stonith primitive temporarily
> having no hosts to fence.
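
For reference, the gethosts behaviour described above could be sketched
roughly as follows. This is not the actual my-xen0 agent, which isn't shown
in this thread: list_running_guests and RUNNING_GUESTS are hypothetical
stand-ins for querying the dom0, and the sketch assumes the hostlist
parameter reaches the plugin as an environment variable.

```shell
#!/bin/sh
# Hypothetical sketch of a "gethosts" handler for an external stonith plugin.
# Assumption: the "hostlist" parameter is available as an environment variable.

# Stand-in for asking the Xen dom0 which guests it is running right now.
# RUNNING_GUESTS is a placeholder so the sketch can run without a Xen host.
list_running_guests() {
    printf '%s\n' $RUNNING_GUESTS
}

# Print each configured host that is currently running on the dom0, one per
# line. Prints nothing (and still returns 0) when none of them is present.
gethosts() {
    running=$(list_running_guests)
    for host in $hostlist; do
        for guest in $running; do
            if [ "$host" = "$guest" ]; then
                echo "$host"
            fi
        done
    done
    return 0
}
```

With none of the configured guests on the dom0 this prints nothing and
returns 0, which matches the case where the log shows "'my-xen0 gethosts'
returned an empty hostlist".
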

Just to be clear, that's the cluster-glue stonith binary complaining, not
Pacemaker.

>
> It's just I would certainly not expect that device to appear in the
> result of "stonith_admin -l nodename".
> And it does ! :)

Might be time to create a bug and attach logs.

> I've just reproduced it again starting a new cluster from scratch and
> using the above config.
> Let's say the stonith agent runs on nodes 02, 03 and 04.
> The first time I run stonith_admin -l "elasticsearch-01" on node 02,
> 03 or 04 it returns "No devices found". From the second attempt it
> does list "stonith-xen-eddu" as valid device.
>
> That's a behavior I did observe with the "stonith-timeout duration 0
> is too low" bug.
> I wouldn't be surprised if it was related: in case of a timeout or in
> case of an empty hostlist the stonith device is wrongly reported as
> a valid fencing device instead of being blacklisted/disabled.
>
> I hope it's a bit clearer now. If not i'll have to try to learn how to
> write a test case for it. (that would definitely make it clearer !)
> :-)
>
>
>> IIRC, these agents want to be told which machines they can fence
>
> I'd say that's true for the ipmi agent.
> But a xen guest might be migrated from one host to another.

Agreed. But I believe that's how most of them are written.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org