> On 11 Mar 2015, at 2:21 am, Dmitry Koterov <dmitry.kote...@gmail.com> wrote:
> 
> On Tue, Feb 24, 2015 at 2:07 AM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
> > I have a 3-node cluster where node1 and node2 are running 
> > corosync+pacemaker and node3 is running corosync only (for quorum). 
> > Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine the first couple 
> > of days.
> >
> > At some point I discovered the following situation: node2 thinks that 
> > both node1 and node2 are online, but node1 thinks that node2 is down. Could 
> > you tell me how this can happen? There are no connectivity problems between 
> > the nodes at the moment (perhaps there were earlier, but then why hasn't the 
> > system recovered?).
> 
> The logs show connectivity problems occurring, so no doubt there.
> As to why it hasn't recovered, first check corosync - if it does not have a 
> consistent view of the world, pacemaker has no hope.
> Alternatively, I recall there was a bug in your version that could be 
> preventing recovery.  So if corosync looks fine, perhaps try an upgrade.
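> For example, to confirm that corosync has a consistent view, you could run the 
> standard corosync 2.x tools on every node and compare the output (a quick 
> sketch, not an exhaustive check):
> 
>     corosync-quorumtool -s            # quorum state and membership as this node sees it
>     corosync-cmapctl | grep members   # runtime membership entries from the CMAP database
>     corosync-cfgtool -s               # ring/interface status
> 
> If the member lists or ring status differ between nodes, the problem is below 
> pacemaker.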
> 
> Thanks.
> Are you talking about this bug: 
> https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?
> 
> I believe I reproduced the problem one more time (it is quite intermittent); 
> the symptoms were as follows:
> 
> 1. At some point, node2 went down.
> 2. The last message from corosync on node1 was "Quorum lost" (I suspect there 
> was a temporary loss of connectivity to node3).
> 3. Then, a couple of days later, "service corosync stop" on node3 hung (only 
> killall -9 helped). Running strace while the service was stopping shows:
> 
> [pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183, 
> si_uid=0} ---
> [pid 19448] write(6, "\3\0\0\0", 4)     = 4
> [pid 19448] rt_sigreturn()              = 360
> ... <and repeats for 19448 again and again>
> 
> and pstree shows:
> 
> init,1
>   ├─corosync,19448
>   │   └─{corosync},19449
> 
> 4. The same happens on node1: "service corosync stop" hangs there too with the 
> same symptoms, and only killall -9 helps.
> 5. Restarting corosync & pacemaker on node1 and node2 solved the problem.
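> (For the record, the recovery sequence was roughly the following on each node; 
> the exact service names are whatever your packaging provides:)
> 
>     service pacemaker stop    # stop pacemaker first
>     killall -9 corosync       # "service corosync stop" hangs, so force-kill it
>     service corosync start
>     service pacemaker start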
> 
> Could you tell me whether this is related to the above libqb bug?

I'm no expert in libqb, but it certainly looks likely.
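
If you want to confirm whether you're running an affected libqb, checking the 
installed version is a reasonable first step (a rough sketch, assuming the 
Debian/Ubuntu packaging implied by that Launchpad link):

    dpkg -l | grep libqb    # installed libqb package(s) and version
    corosync -v             # corosync version, for reference

If that version predates the fix noted in the Launchpad report, upgrading libqb 
(and then restarting corosync) would be the obvious next step.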

> 
> 
>  
> > The "crm status" is below. What other logs should I attach for the 
> > diagnostics?
> >
> > Also, "service corosync stop" on node1 hangs forever with no additional 
> > lines in logs, so I cannot even stop the service. (But after "service 
> > corosync stop" on node1 the node node2 starts thinking that node1 is 
> > offline, although the command still hangs.)
> >
> >
> > root@node2:~# crm status
> > Current DC: node1 (1760315215) - partition with quorum
> > 2 Nodes configured
> > 6 Resources configured
> > Online: [ node1 node2 ]
> > Master/Slave Set: ms_drbd [drbd]
> >      Masters: [ node2 ]
> >      Slaves: [ node1 ]
> > Resource Group: server
> >      fs       (ocf::heartbeat:Filesystem):    Started node2
> >      postgresql       (lsb:postgresql):       Started node2
> >      bind9    (lsb:bind9):    Started node2
> >      nginx    (lsb:nginx):    Started node2
> >
> >
> > root@node1:/var/log# crm status
> > Current DC: node1 (1760315215) - partition with quorum
> > 2 Nodes configured
> > 6 Resources configured
> > Online: [ node1 ]
> > OFFLINE: [ node2 ]
> > Master/Slave Set: ms_drbd [drbd]
> >      Masters: [ node1 ]
> >      Stopped: [ node2 ]
> > Resource Group: server
> >      fs       (ocf::heartbeat:Filesystem):    Started node1
> >      postgresql       (lsb:postgresql):       Started node1
> >      bind9    (lsb:bind9):    Started node1
> >      nginx    (lsb:nginx):    Started node1
> > Failed actions:
> >     drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, 
> > last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown 
> > error
> >
> >
> > Part of the "crm configure show" output:
> >
> > property $id="cib-bootstrap-options" \
> >         dc-version="1.1.10-42f2063" \
> >         cluster-infrastructure="corosync" \
> >         stonith-enabled="false" \
> >         last-lrm-refresh="1421250983"
> > rsc_defaults $id="rsc-options" \
> >         resource-stickiness="100"
> >
> >
> > Also, I see the following in the logs on node1 (they may or may not be 
> > related to the issue):
> >
> > Jan 22 10:14:02 node1 pengine[2772]:  warning: pe_fence_node: Node node2 is 
> > unclean because it is partially and/or un-expectedly down
> > Jan 22 10:14:02 node1 pengine[2772]:  warning: determine_online_status: 
> > Node node2 is unclean
> > Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: Node node2 is 
> > unclean!
> > Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: YOUR RESOURCES ARE 
> > NOW LIKELY COMPROMISED
> > Jan 22 10:14:02 node1 pengine[2772]:    error: stage6: ENABLE STONITH TO 
> > KEEP YOUR RESOURCES SAFE
> >
> >
> > On node2 the logs are:
> >
> > Jan 22 10:13:57 node2 corosync[32761]:  [TOTEM ] A new membership 
> > (188.166.54.190:6276) was formed. Members left: 1760315215 13071578
> > Jan 22 10:13:57 node2 crmd[311]:   notice: peer_update_callback: Our peer 
> > on the DC is dead
> > Jan 22 10:13:57 node2 crmd[311]:   notice: do_state_transition: State 
> > transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION 
> > cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] This node is within the 
> > non-primary component and will NOT provide any services.
> > Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] Members[1]: 1017525950
> > Jan 22 10:13:57 node2 crmd[311]:   notice: pcmk_quorum_notification: 
> > Membership 6276: quorum lost (1)
> > Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was 
> > member)
> > Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was 
> > member)
> > Jan 22 10:13:57 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: 
> > Membership 6276: quorum lost (1)
> > Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was 
> > member)
> > Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was 
> > member)
> > Jan 22 10:13:57 node2 corosync[32761]:  [MAIN  ] Completed service 
> > synchronization, ready to provide service.
> > Jan 22 10:14:01 node2 corosync[32761]:  [TOTEM ] A new membership 
> > (104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
> > Jan 22 10:14:02 node2 crmd[311]:    error: pcmk_cpg_membership: Node 
> > node1[1760315215] appears to be online even though we think it is dead
> > Jan 22 10:14:02 node2 crmd[311]:   notice: crm_update_peer_state: 
> > pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
> > Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] This node is within the 
> > primary component and will provide service.
> > Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 
> > 13071578 1017525950
> > Jan 22 10:14:03 node2 crmd[311]:   notice: pcmk_quorum_notification: 
> > Membership 6288: quorum acquired (3)
> > Jan 22 10:14:03 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: 
> > Membership 6288: quorum acquired (3)
> > Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node node1[1760315215] - state is now member (was 
> > lost)
> > Jan 22 10:14:03 node2 corosync[32761]:  [MAIN  ] Completed service 
> > synchronization, ready to provide service.
> > Jan 22 10:14:03 node2 crmd[311]:   notice: corosync_node_name: Unable to 
> > get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 crmd[311]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node (null)[13071578] - state is now member (was 
> > lost)
> > Jan 22 10:14:03 node2 pacemakerd[302]:   notice: corosync_node_name: Unable 
> > to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: 
> > pcmk_quorum_notification: Node (null)[13071578] - state is now member (was 
> > lost)
> > Jan 22 10:14:03 node2 crmd[311]:  warning: do_log: FSA: Input I_JOIN_OFFER 
> > from route_message() received in state S_ELECTION
> > Jan 22 10:14:04 node2 crmd[311]:   notice: do_state_transition: State 
> > transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL 
> > origin=do_election_count_vote ]
> > Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_local_callback: Sending 
> > full refresh (origin=crmd)
> > Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending 
> > flush op to all hosts for: master-drbd (10000)
> > Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending 
> > flush op to all hosts for: probe_complete (true)
> > Jan 22 10:14:05 node2 crmd[311]:   notice: do_state_transition: State 
> > transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE 
> > origin=do_cl_join_finalize_respond ]
> > Jan 22 10:15:11 node2 corosync[32761]:  [TOTEM ] A new membership 
> > (104.236.71.79:6296) was formed. Members left: 13071578
> > Jan 22 10:15:14 node2 corosync[32761]:  [TOTEM ] A new membership 
> > (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
> > Jan 22 10:15:17 node2 corosync[32761]:  [TOTEM ] A new membership 
> > (104.236.71.79:6324) was formed. Members joined: 1760315215
> > Jan 22 10:15:19 node2 crmd[311]:   notice: peer_update_callback: Our peer 
> > on the DC is dead
> > Jan 22 10:15:19 node2 crmd[311]:   notice: do_state_transition: State 
> > transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION 
> > cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> 
> > Secondary )
> > Jan 22 10:15:20 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 
> > 13071578 1017525950
> > Jan 22 10:15:20 node2 corosync[32761]:  [MAIN  ] Completed service 
> > synchronization, ready to provide service.
> >


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
