At what point did the node get fenced? It's not obvious from these logs. (A couple of rough command sketches for digging that out are appended below the quoted message.)

On Sat, Dec 15, 2012 at 12:15 PM, Lazy <lazy...@gmail.com> wrote:
> Hi,
>
> we have a 2-node corosync 1.4.2 and pacemaker 1.1.7 cluster running
> drbd, nfs, solr and redis in master/slave configurations.
>
> Currently node 2 is unable to rejoin the cluster after being fenced by stonith.
>
> The logs on node 2:
> Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message: Membership 0: quorum still lost
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2: id=33610762 state=member (new) addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000111312 (new)
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: get_ais_nodeid: Server details: id=33610762 uname=www2 cname=pcmk
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node www2 now has id: 33610762
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node 33610762 is now known as www2
> Dec 15 01:52:38 www2 attrd: [6708]: notice: main: Starting mainloop...
> Dec 15 01:52:38 www2 stonith-ng: [6706]: notice: setup_cib: Watching for stonith topology changes
> Dec 15 01:52:38 www2 stonith-ng: [6706]: info: main: Starting stonith-ng mainloop
> Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] Incrementing problem counter for seqid 1 iface 46.248.167.141 to [1 of 10]
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 11800: memb=0, new=0, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11800: memb=1, new=1, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: NEW: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 15 01:52:38 www2 corosync[6682]: [CPG ] chosen downlist: sender r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) ; members(old:0 left:0)
> Dec 15 01:52:38 www2 corosync[6682]: [MAIN ] Completed service synchronization, ready to provide service.
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 11804: memb=1, new=0, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: memb: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11804: memb=2, new=1, lost=0
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Creating entry for node 16833546 born on 11804
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node 16833546/unknown is now: member
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 16833546
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 16833546
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: www2 33610762
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: send_member_notification: Sending membership update 11804 to 1 children
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x200cdc0 Node 33610762 ((null)) born on: 11804
> Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message: Membership 11804: quorum still lost
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_new_peer: Node <null> now has id: 16833546
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node (null): id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=0 born=0 seen=11804 proc=00000000000000000000000000000000
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2: id=33610762 state=member addr=r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) (new) votes=1 born=0 seen=11804 proc=00000000000000000000000000111312
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x20157a0 Node 16833546 (www1) born on: 11708
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x20157a0 Node 16833546 now known as www1 (was: (null))
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node www1 now has process list: 00000000000000000000000000111312 (1118994)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node www1 now has 1 quorum votes (was 0)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: send_member_notification: Sending membership update 11804 to 1 children
> Dec 15 01:52:38 www2 cib: [6705]: notice: ais_dispatch_message: Membership 11804: quorum acquired
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_get_peer: Node 16833546 is now known as www1
> Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www1: id=16833546 state=member addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=1 (new) born=11708 seen=11804 proc=00000000000000000000000000111312 (new)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Dec 15 01:52:38 www2 corosync[6682]: [CPG ] chosen downlist: sender r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) ; members(old:1 left:0)
> Dec 15 01:52:38 www2 corosync[6682]: [MAIN ] Completed service synchronization, ready to provide service.
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_process_diff: Diff 0.407.26 -> 0.407.27 not applied to 0.406.0: current "epoch" is less than required
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_server_process_diff: Requesting re-sync from peer
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.27 -> 0.407.28 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.28 -> 0.407.29 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.29 -> 0.407.30 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.30 -> 0.407.31 (sync in progress)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.31 -> 0.407.32 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_process_diff: Diff 0.407.32 -> 0.407.33 not applied to 0.406.0: current "epoch" is less than required
> Dec 15 01:52:38 www2 cib: [6705]: info: cib_server_process_diff: Requesting re-sync from peer
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.33 -> 0.407.34 (sync in progress)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.34 -> 0.407.35 (sync in progress)
> Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not applying diff 0.407.35 -> 0.407.36 (sync in progress)
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_cib_control: CIB connection established
> Dec 15 01:52:39 www2 crmd: [6710]: info: get_cluster_type: Cluster type is: 'openais'
> Dec 15 01:52:39 www2 crmd: [6710]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
> Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_classic: AIS connection established
> Dec 15 01:52:39 www2 corosync[6682]: [pcmk ] info: pcmk_ipc: Recorded connection 0x2024af0 for crmd/6710
> Dec 15 01:52:39 www2 corosync[6682]: [pcmk ] info: pcmk_ipc: Sending membership update 11804 to crmd
> Dec 15 01:52:39 www2 crmd: [6710]: info: get_ais_nodeid: Server details: id=33610762 uname=www2 cname=pcmk
> Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node www2 now has id: 33610762
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node 33610762 is now known as www2
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www2 is now unknown
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_ha_control: Connected to the cluster
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_started: Delaying start, no membership data (0000000000100000)
> Dec 15 01:52:39 www2 crmd: [6710]: notice: ais_dispatch_message: Membership 11804: quorum acquired
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node www1 now has id: 16833546
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node 16833546 is now known as www1
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www1 is now unknown
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www1 is now member (was unknown)
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_update_peer: Node www1: id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=1
> Dec 15 01:52:39 www2 crmd: [6710]: notice: crmd_peer_update: Status update: Client www2/crmd now has status [online] (DC=<null>)
> Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status: www2 is now member (was unknown)
> Dec 15 01:52:39 www2 crmd: [6710]: info: crm_update_peer: Node www2: id=33610762 state=member (new) addr=r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) (new) votes=1 (new) born=11804 seen=11804 proc=00000000000000000000000000111312 (new)
> Dec 15 01:52:39 www2 crmd: [6710]: info: do_started: Delaying start, Config not read (0000000000000040)
> Dec 15 01:52:39 www2 crmd: [6710]: notice: do_started: The local CRM is operational
> Dec 15 01:52:39 www2 crmd: [6710]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Dec 15 01:52:39 www2 corosync[6682]: [pcmk ] info: update_expected_votes: Expected quorum votes 2 -> 3
> Dec 15 01:52:40 www2 crmd: [6710]: info: ais_dispatch_message: Membership 11804: quorum retained
> Dec 15 01:52:40 www2 corosync[6682]: [TOTEM ] ring 1 active with no faults
> Dec 15 01:52:41 www2 crmd: [6710]: info: update_dc: Set DC to www1 (3.0.6)
> Dec 15 01:52:41 www2 cib: [6705]: info: cib_replace_notify: Replaced: -1.-1.-1 -> 0.407.36 from www1
> Dec 15 01:52:41 www2 crmd: [6710]: info: erase_status_tag: Deleting xpath: //node_state[@uname='www2']/transient_attributes
> Dec 15 01:52:41 www2 crmd: [6710]: info: update_attrd: Connecting to attrd...
> Dec 15 01:52:41 www2 crmd: [6710]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Dec 15 01:52:41 www2 attrd: [6708]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Dec 15 01:52:41 www2 crmd: [6710]: info: ais_dispatch_message: Membership 11804: quorum retained
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:drbd:1 probe[2] (pid 6718)
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:nfs_ip probe[3] (pid 6719)
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:fs_home probe[4] (pid 6720)
> Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:nfs-kernel-server probe[5] (pid 6722)
> Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[3] on nfs_ip for client 6710: pid 6719 exited with return code 7
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[4] on fs_home for client 6710: pid 6720 exited with return code 7
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation nfs_ip_monitor_0 (call=3, rc=7, cib-update=8, confirmed=true) not running
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation fs_home_monitor_0 (call=4, rc=7, cib-update=9, confirmed=true) not running
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[5] on nfs-kernel-server for client 6710: pid 6722 exited with return code 7 (mapped from 3)
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation nfs-kernel-server_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not running
> Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[2] on drbd:1 for client 6710: pid 6718 exited with return code 7
> Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM operation drbd:1_monitor_0 (call=2, rc=7, cib-update=11, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:nfs-common probe[6] (pid 6829)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:fs_nfs_home probe[7] (pid 6830)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:haproxy_ip probe[8] (pid 6831)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:redis_ip probe[9] (pid 6832)
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[9] on redis_ip for client 6710: pid 6832 exited with return code 7
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation redis_ip_monitor_0 (call=9, rc=7, cib-update=12, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[8] on haproxy_ip for client 6710: pid 6831 exited with return code 7
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation haproxy_ip_monitor_0 (call=8, rc=7, cib-update=13, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[6] on nfs-common for client 6710: pid 6829 exited with return code 7 (mapped from 3)
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation nfs-common_monitor_0 (call=6, rc=7, cib-update=14, confirmed=true) not running
> Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[7] on fs_nfs_home for client 6710: pid 6830 exited with return code 7
> Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM operation fs_nfs_home_monitor_0 (call=7, rc=7, cib-update=15, confirmed=true) not running
> Dec 15 01:52:43 www2 lrmd: [6707]: info: rsc:redis_server probe[10] (pid 6922)
> Dec 15 01:52:43 www2 lrmd: [6707]: info: rsc:res_haproxy_1 probe[11] (pid 6923)
> Dec 15 01:52:43 www2 lrmd: [6707]: info: operation monitor[11] on res_haproxy_1 for client 6710: pid 6923 exited with return code 7 (mapped from 3)
> Dec 15 01:52:43 www2 crmd: [6710]: info: process_lrm_event: LRM operation res_haproxy_1_monitor_0 (call=11, rc=7, cib-update=16, confirmed=true) not running
> Dec 15 01:52:43 www2 redis[6922]: INFO: Could not read PID, redis-server is not running
> Dec 15 01:52:43 www2 lrmd: [6707]: info: operation monitor[10] on redis_server for client 6710: pid 6922 exited with return code 7
> Dec 15 01:52:43 www2 crmd: [6710]: info: process_lrm_event: LRM operation redis_server_monitor_0 (call=10, rc=7, cib-update=17, confirmed=true) not running
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 8 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 9 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 10 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 11 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 12 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 13 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 14 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 15 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:43 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 16 failed: (rc=-41) Remote node did not respond
> Dec 15 01:53:43 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource update 17 failed: (rc=-41) Remote node did not respond
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:res_php5-fpm_1 probe[12] (pid 6958)
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:res_php5-fpm_2:0 probe[13] (pid 6959)
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:solr_ip probe[14] (pid 6960)
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:solr_server:0 probe[15] (pid 6962)
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:www1-stonith probe[16] (pid 6963)
> Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
> Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:www2-stonith probe[17] (pid 6964)
> Dec 15 01:54:01 www2 stonith-ng: [6706]: notice: stonith_device_action: Device www1-stonith not found
> Dec 15 01:54:01 www2 stonith-ng: [6706]: info: stonith_command: Processed st_execute from lrmd: rc=-12
>
> Log on www1 (DC):
> Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-solr_server:1 (1355303185)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_rsc_op: Operation monitor found resource solr_server:1 active in master mode on www1
> Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_rsc_op: Operation monitor found resource drbd:0 active in master mode on www1
> Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd:0 (10000)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start drbd:1#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start fs_nfs_home#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move haproxy_ip#011(Started www1 -> www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move redis_ip#011(Started www1 -> www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Restart redis_server#011(Started www1)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move res_haproxy_1#011(Started www1 -> www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start res_php5-fpm_2:0#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start solr_server:0#011(www2)
> Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start www1-stonith#011(www2)
> Dec 15 01:52:41 www1 crmd: [2500]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Dec 15 01:52:41 www1 crmd: [2500]: info: do_te_invoke: Processing graph 3158 (ref=pe_calc-dc-1355532761-6586) derived from /var/lib/pengine/pe-input-880.bz2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 11: monitor drbd:1_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 12: monitor nfs_ip_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 13: monitor fs_home_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 14: monitor nfs-kernel-server_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 15: monitor nfs-common_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 16: monitor fs_nfs_home_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 17: monitor haproxy_ip_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 18: monitor redis_ip_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 19: monitor redis_server_monitor_0 on www2
> Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating action 20: monitor res_haproxy_1_monitor_0 on www2
> Dec 15 01:52:41 www1 pengine: [2499]: notice: process_pe_message: Transition 3158: PEngine Input stored in: /var/lib/pengine/pe-input-880.bz2
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false)
> Dec 15 01:54:01 www1 crmd: [2500]: ERROR: print_elem: Aborting transition, action lost: [Action 11]: In-flight (id: drbd:1_monitor_0, loc: www2, priority: 0)
> Dec 15 01:54:01 www1 crmd: [2500]: info: abort_transition_graph: action_timer_callback:535 - Triggered transition abort (complete=0) : Action lost
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: cib_action_update: rsc_op 11: drbd:1_monitor_0 on www2 timed out
> Dec 15 01:54:01 www1 crmd: [2500]: info: create_operation_update: cib_action_update: Updating resouce drbd:1 after Timed Out monitor op (interval=0)
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)
> Dec 15 01:54:01 www1 crmd: [2500]: ERROR: print_elem: Aborting transition, action lost: [Action 12]: In-flight (id: nfs_ip_monitor_0, loc: www2, priority: 0)
> Dec 15 01:54:01 www1 crmd: [2500]: info: abort_transition_graph: action_timer_callback:535 - Triggered transition abort (complete=0) : Action lost
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: cib_action_update: rsc_op 12: nfs_ip_monitor_0 on www2 timed out
> Dec 15 01:54:01 www1 crmd: [2500]: info: create_operation_update: cib_action_update: Updating resouce nfs_ip after Timed Out monitor op (interval=0)
> Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)
>
> cibadmin -Q doesn't work on www2 while corosync is up; on www1, www2 shows up as online in crm_mon.
>
> After the stonith reboot of www2, /var/lib/heartbeat/crm/cib.xml was owned by root. I fixed that, and the cib.xml files are now identical on both nodes.
>
> corosync-cfgtool -s reports no faults, and the corosync-objctl dump also looks OK:
> runtime.totem.pg.mrp.srp.members.33610762.ip=r(0) ip(10.220.0.2) r(1) ip(46.x.x.141)
> runtime.totem.pg.mrp.srp.members.33610762.join_count=1
> runtime.totem.pg.mrp.srp.members.33610762.status=joined
> runtime.totem.pg.mrp.srp.members.16833546.ip=r(0) ip(10.220.0.1) r(1) ip(46.x.x.140)
> runtime.totem.pg.mrp.srp.members.16833546.join_count=1
> runtime.totem.pg.mrp.srp.members.16833546.status=joined
>
> Kind Regards,
>
> Michal Grzedzicki
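
To pin down the fencing time, grepping the surviving node's logs around the moment www2 disappeared is usually enough. A rough sketch, assuming corosync and pacemaker log to /var/log/syslog (the path and rotation scheme are assumptions; adjust for your distro):

    # on www1: membership losses, and any fencing activity around them
    grep -E 'pcmk_peer_update.*lost=[1-9]' /var/log/syslog*
    grep -iE 'stonith|fenc' /var/log/syslog*

The first command should show when www2 was declared lost; the stonith-ng lines shortly after that should show which fencing device was used and whether the reboot was confirmed.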
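On the root-owned cib.xml: the files under /var/lib/heartbeat/crm/ normally belong to the cluster account, and a wrong owner there can stop the local cib daemon from writing out what the DC sends it. A minimal check/repair sketch, assuming the usual hacluster:haclient user and group (names are an assumption; verify what your packages created), run with pacemaker and corosync stopped on www2:

    ls -l /var/lib/heartbeat/crm/
    chown -R hacluster:haclient /var/lib/heartbeat/crm/
    # if the local copy still refuses to sync, it can be removed on this node only;
    # www2 will fetch a fresh CIB from the DC when it rejoins
    rm -f /var/lib/heartbeat/crm/cib*

The removal step is only meant for the node being brought back in line, never for the DC (www1).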
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org