At what point did the node get fenced? It's not obvious from these logs. (A couple of rough command sketches for digging that out are appended below the quoted message.)

On Sat, Dec 15, 2012 at 12:15 PM, Lazy <lazy...@gmail.com> wrote:
> Hi,
>
> we have a 2-node corosync 1.4.2 and pacemaker 1.1.7 cluster running
> drbd, nfs, solr and redis in master/slave configurations.
>
> Currently node 2 is unable to rejoin the cluster after being fenced by stonith.
>
> The logs on node 2:
> Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message: Membership 0: quorum still lost
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2: id=33610762 state=member (new) addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000111312 (new)
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: get_ais_nodeid: Server details: id=33610762 uname=www2 cname=pcmk
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node www2 now has id: 33610762
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node 33610762 is now known as www2
> Dec 15 01:52:38 www2 attrd: [6708]: notice: main: Starting mainloop...
> Dec 15 01:52:38 www2 stonith-ng: [6706]: notice: setup_cib: Watching for stonith topology changes
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: main: Starting stonith-ng mainloop
> Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] Incrementing problem counter for seqid 1 iface 46.248.167.141 to [1 of 10]
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 11800: memb=0, new=0, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11800: memb=1, new=1, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: NEW: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 15 01:52:38 www2 corosync[6682]: [CPG ] chosen downlist: sender r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) ; members(old:0 left:0)
> Dec 15 01:52:38 www2 corosync[6682]: [MAIN ] Completed service synchronization, ready to provide service.
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 11804: memb=1, new=0, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: memb: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11804: memb=2, new=1, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Creating entry for node 16833546 born on 11804
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node 16833546/unknown is now: member
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 16833546
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 16833546
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: send_member_notification: Sending membership update 11804 to 1 children
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x200cdc0 Node 33610762 ((null)) born on: 11804
> Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message: Membership 11804: quorum still lost
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_new_peer: Node <null> now has id: 16833546
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node (null): id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=0 born=0 seen=11804 proc=00000000000000000000000000000000
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2: id=33610762 state=member addr=r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) (new) votes=1 born=0 seen=11804 proc=00000000000000000000000000111312
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x20157a0 Node 16833546 (www1) born on: 11708
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x20157a0 Node 16833546 now known as www1 (was: (null))
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node www1 now has process list: 00000000000000000000000000111312 (1118994)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node www1 now has 1 quorum votes (was 0)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: send_member_notification: Sending membership update 11804 to 1 children
> Dec 15 01:52:38 www2 cib: [6705]: notice: ais_dispatch_message: Membership 11804: quorum acquired
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_get_peer: Node 16833546 is now known as www1
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www1: id=16833546 state=member addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=1 (new) born=11708 seen=11804 proc=00000000000000000000000000111312 (new)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Dec 15 01:52:38 www2 corosync[6682]: [CPG ] chosen downlist: sender r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) ; members(old:1 left:0)
> Dec 15 01:52:38 www2 corosync[6682]: [MAIN ] Completed service synchronization, ready to provide service.
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_process_diff: Diff 0.407.26 -> 0.407.27 not applied to 0.406.0: current "epoch" is less than required
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_server_process_diff: Requesting re-sync from peer
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.27 -> 0.407.28 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.28 -> 0.407.29 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.29 -> 0.407.30 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.30 -> 0.407.31 (sync in progress)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.31 -> 0.407.32 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_process_diff: Diff 0.407.32 -> 0.407.33 not applied to 0.406.0: current "epoch" is less than required
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_server_process_diff: Requesting re-sync from peer
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.33 -> 0.407.34 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.34 -> 0.407.35 (sync in progress)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.35 -> 0.407.36 (sync in progress)
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_cib_control: CIB connection established
> Dec 15 01:52:39 www2 crmd: [6710]: info: get_cluster_type: Cluster type is: 'openais'
> Dec 15 01:52:39 www2 crmd: [6710]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
> Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_classic: AIS connection established
> Dec 15 01:52:39 www2 corosync[6682]: [pcmk ] info: pcmk_ipc: Recorded connection 0x2024af0 for crmd/6710
> Dec 15 01:52:39 www2 corosync[6682]: [pcmk ] info: pcmk_ipc: Sending membership update 11804 to crmd
> Dec 15 01:52:39 www2 crmd: [6710]: info: get_ais_nodeid: Server details: id=33610762 uname=www2 cname=pcmk
> Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node www2 now has id: 33610762
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node 33610762 is now known as www2
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www2 is now unknown
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_ha_control: Connected to the cluster
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_started: Delaying start, no membership data (0000000000100000)
> Dec 15 01:52:39 www2 crmd: [6710]: notice: ais_dispatch_message: Membership 11804: quorum acquired
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node www1 now has id: 16833546
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node 16833546 is now known as www1
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www1 is now unknown
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www1 is now member (was unknown)
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_update_peer: Node www1: id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=1
> Dec 15 01:52:39 www2 crmd: [6710]: notice: crmd_peer_update: Status update: Client www2/crmd now has status [online] (DC=<null>)
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www2 is now member (was unknown)
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_update_peer: Node www2: id=33610762 state=member (new) addr=r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) (new) votes=1 (new) born=11804 seen=11804 proc=00000000000000000000000000111312 (new)
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_started: Delaying start, Config not read (0000000000000040)
> Dec 15 01:52:39 www2 crmd: [6710]: notice: do_started: The local CRM is operational
> Dec 15 01:52:39 www2 crmd: [6710]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Dec 15 01:52:39 www2 corosync[6682]: [pcmk ] info: update_expected_votes: Expected quorum votes 2 -> 3
> Dec 15 01:52:40 www2 crmd: [6710]: info: ais_dispatch_message: Membership 11804: quorum retained
> Dec 15 01:52:40 www2 corosync[6682]: [TOTEM ] ring 1 active with no faults
> Dec 15 01:52:41 www2 crmd: [6710]: info: update_dc: Set DC to www1 (3.0.6)
> Dec 15 01:52:41 www2 cib: [6705]: info: cib_replace_notify: Replaced: -1.-1.-1 -> 0.407.36 from www1
> Dec 15 01:52:41 www2 crmd: [6710]: info: erase_status_tag: Deleting xpath: //node_state[@uname='www2']/transient_attributes
> Dec 15 01:52:41 www2 crmd: [6710]: info: update_attrd: Connecting to attrd...
> Dec 15 01:52:41 www2 crmd: [6710]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Dec 15 01:52:41 www2 attrd: [6708]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Dec 15 01:52:41 www2 crmd: [6710]: info: ais_dispatch_message: Membership 11804: quorum retained
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:drbd:1 probe[2] (pid 6718)
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:nfs_ip probe[3] (pid 6719)
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:fs_home probe[4] (pid 6720)
> Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:nfs-kernel-server probe[5] (pid 6722)
> Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[3] on nfs_ip for client 6710: pid 6719 exited with return code 7
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[4] on fs_home for client 6710: pid 6720 exited with return code 7
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation nfs_ip_monitor_0 (call=3, rc=7, cib-update=8, confirmed=true) not running
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation fs_home_monitor_0 (call=4, rc=7, cib-update=9, confirmed=true) not running
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[5] on nfs-kernel-server for client 6710: pid 6722 exited with return code 7 (mapped from 3)
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation nfs-kernel-server_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not running
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[2] on drbd:1 for client 6710: pid 6718 exited with return code 7
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation drbd:1_monitor_0 (call=2, rc=7, cib-update=11, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:nfs-common probe[6] (pid 6829)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:fs_nfs_home probe[7] (pid 6830)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:haproxy_ip probe[8] (pid 6831)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:redis_ip probe[9] (pid 6832)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[9] on redis_ip for client 6710: pid 6832 exited with return code 7
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation redis_ip_monitor_0 (call=9, rc=7, cib-update=12, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[8] on haproxy_ip for client 6710: pid 6831 exited with return code 7
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation haproxy_ip_monitor_0 (call=8, rc=7, cib-update=13, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[6] on nfs-common for client 6710: pid 6829 exited with return code 7 (mapped from 3)
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation nfs-common_monitor_0 (call=6, rc=7, cib-update=14, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[7] on fs_nfs_home for client 6710: pid 6830 exited with return code 7
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation fs_nfs_home_monitor_0 (call=7, rc=7, cib-update=15, confirmed=true) not running
> Dec 15 01:52:43 www2 lrmd: [6707]: info: rsc:redis_server probe[10] (pid 6922)
> Dec 15 01:52:43 www2 lrmd: [6707]: info: rsc:res_haproxy_1 probe[11] (pid 6923)
> Dec 15 01:52:43 www2 lrmd: [6707]: info: operation monitor[11] on res_haproxy_1 for client 6710: pid 6923 exited with return code 7 (mapped from 3)
> Dec 15 01:52:43 www2 crmd: [6710]: info: process_lrm_event: LRM operation res_haproxy_1_monitor_0 (call=11, rc=7, cib-update=16, confirmed=true) not running
> Dec 15 01:52:43 www2 redis[6922]: INFO: Could not read PID, redis-server is not running
> Dec 15 01:52:43 www2 lrmd: [6707]: info: operation monitor[10] on redis_server for client 6710: pid 6922 exited with return code 7
> Dec 15 01:52:43 www2 crmd: [6710]: info: process_lrm_event: LRM operation redis_server_monitor_0 (call=10, rc=7, cib-update=17, confirmed=true) not running
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 8 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 9 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 10 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 11 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 12 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 13 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 14 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 15 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:43 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 16 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:43 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 17 failed: (rc=-41) Remote node did not respond
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:res_php5-fpm_1 probe[12] (pid 6958)
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:res_php5-fpm_2:0 probe[13] (pid 6959)
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:solr_ip probe[14] (pid 6960)
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:solr_server:0 probe[15] (pid 6962)
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:www1-stonith probe[16] (pid 6963)
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:www2-stonith probe[17] (pid 6964)
> Dec 15 01:54:01 www2 stonith-ng: [6706]: notice: stonith_device_action: Device www1-stonith not found
> Dec 15 01:54:01 www2 stonith-ng: [6706]: info: stonith_command: Processed st_execute from lrmd: rc=-12
>
> Log on www1 (DC):
> Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-solr_server:1 (1355303185)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_rsc_op: Operation monitor found resource solr_server:1 active in master mode on www1
> Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_rsc_op: Operation monitor found resource drbd:0 active in master mode on www1
> Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd:0 (10000)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start drbd:1#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start fs_nfs_home#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move haproxy_ip#011(Started www1 -> www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move redis_ip#011(Started www1 -> www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Restart redis_server#011(Started www1)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move res_haproxy_1#011(Started www1 -> www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start res_php5-fpm_2:0#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start solr_server:0#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start www1-stonith#011(www2)
> Dec 15 01:52:41 www1 crmd: [2500]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Dec 15 01:52:41 www1 crmd: [2500]: info: do_te_invoke: Processing graph 3158 (ref=pe_calc-dc-1355532761-6586) derived from /var/lib/pengine/pe-input-880.bz2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 11: monitor drbd:1_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 12: monitor nfs_ip_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 13: monitor fs_home_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 14: monitor nfs-kernel-server_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 15: monitor nfs-common_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 16: monitor fs_nfs_home_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 17: monitor haproxy_ip_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 18: monitor redis_ip_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 19: monitor redis_server_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 20: monitor res_haproxy_1_monitor_0 on www2
> Dec 15 01:52:41 www1 pengine: [2499]: notice: process_pe_message: Transition 3158: PEngine Input stored in: /var/lib/pengine/pe-input-880.bz2
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false)
> Dec 15 01:54:01 www1 crmd: [2500]: ERROR: print_elem: Aborting transition, action lost: [Action 11]: In-flight (id: drbd:1_monitor_0, loc: www2, priority: 0)
> Dec 15 01:54:01 www1 crmd: [2500]: info: abort_transition_graph: action_timer_callback:535 - Triggered transition abort (complete=0) : Action lost
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: cib_action_update: rsc_op 11: drbd:1_monitor_0 on www2 timed out
> Dec 15 01:54:01 www1 crmd: [2500]: info: create_operation_update: cib_action_update: Updating resouce drbd:1 after Timed Out monitor op (interval=0)
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)
> Dec 15 01:54:01 www1 crmd: [2500]: ERROR: print_elem: Aborting transition, action lost: [Action 12]: In-flight (id: nfs_ip_monitor_0, loc: www2, priority: 0)
> Dec 15 01:54:01 www1 crmd: [2500]: info: abort_transition_graph: action_timer_callback:535 - Triggered transition abort (complete=0) : Action lost
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: cib_action_update: rsc_op 12: nfs_ip_monitor_0 on www2 timed out
> Dec 15 01:54:01 www1 crmd: [2500]: info: create_operation_update: cib_action_update: Updating resouce nfs_ip after Timed Out monitor op (interval=0)
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)
>
> cibadmin -Q doesn't work on www2 while corosync is up; on www1, www2 shows up as online in crm_mon.
>
> After the stonith reboot of www2, /var/lib/heartbeat/crm/cib.xml was owned by root. I fixed that, and the cib.xml files are now identical on both nodes.
>
> corosync-cfgtool -s reports no faults, and the corosync-objctl dump also looks OK:
> runtime.totem.pg.mrp.srp.members.33610762.ip=r(0) ip(10.220.0.2) r(1) ip(46.x.x.141)
> runtime.totem.pg.mrp.srp.members.33610762.join_count=1
> runtime.totem.pg.mrp.srp.members.33610762.status=joined
> runtime.totem.pg.mrp.srp.members.16833546.ip=r(0) ip(10.220.0.1) r(1) ip(46.x.x.140)
> runtime.totem.pg.mrp.srp.members.16833546.join_count=1
> runtime.totem.pg.mrp.srp.members.16833546.status=joined
>
> Kind Regards,
>
> Michal Grzedzicki
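
To pin down the fencing time, grepping the surviving node's logs around the moment www2 disappeared is usually enough. A rough sketch, assuming corosync and pacemaker log to /var/log/syslog (the path and rotation scheme are assumptions; adjust for your distro):

    # on www1: membership losses, and any fencing activity around them
    grep -E 'pcmk_peer_update.*lost=[1-9]' /var/log/syslog*
    grep -iE 'stonith|fenc' /var/log/syslog*

The first command should show when www2 was declared lost; the stonith-ng lines shortly after that should show which fencing device was used and whether the reboot was confirmed.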
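On the root-owned cib.xml: the files under /var/lib/heartbeat/crm/ normally belong to the cluster account, and a wrong owner there can stop the local cib daemon from writing out what the DC sends it. A minimal check/repair sketch, assuming the usual hacluster:haclient user and group (names are an assumption; verify what your packages created), run with pacemaker and corosync stopped on www2:

    ls -l /var/lib/heartbeat/crm/
    chown -R hacluster:haclient /var/lib/heartbeat/crm/
    # if the local copy still refuses to sync, it can be removed on this node only;
    # www2 will fetch a fresh CIB from the DC when it rejoins
    rm -f /var/lib/heartbeat/crm/cib*

The removal step is only meant for the node being brought back in line, never for the DC (www1).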
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org