> On 11 Mar 2015, at 2:21 am, Dmitry Koterov <dmitry.kote...@gmail.com> wrote:
>
> On Tue, Feb 24, 2015 at 2:07 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>
> > I have a 3-node cluster where node1 and node2 are running
> > corosync+pacemaker and node3 is running corosync only (for quorum).
> > Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine for the first
> > couple of days.
> >
> > At some point I discovered the following situation: node2 thinks that
> > both node1 and node2 are online, but node1 thinks that node2 is down.
> > Could you please explain how this could happen? There are no connectivity
> > problems between the nodes at the moment (maybe there were earlier, but
> > then why hasn't the system recovered?).
>
> The logs show connectivity problems occurring, so no doubt there.
> As to why it hasn't recovered, first check corosync - if it does not have a
> consistent view of the world, pacemaker has no hope.
> Alternatively, I recall there was a bug that could be preventing this in your
> version. So if corosync looks fine, perhaps try an upgrade.
>
> Thanks.
> Are you talking about this bug:
> https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?
>
> I believe I reproduced the problem one more time (it is hard to reproduce
> reliably); the symptoms were the following:
>
> 1. At some point node2 went down.
> 2. The last message from corosync on node1 was "Quorum lost" (I suspect
>    there was a temporary loss of connectivity with node3).
> 3. Then, a couple of days later, "service corosync stop" hung on node3
>    (only "killall -9" helped). I ran strace while the service was stopping;
>    it showed:
>
> [pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183, si_uid=0} ---
> [pid 19448] write(6, "\3\0\0\0", 4) = 4
> [pid 19448] rt_sigreturn() = 360
> ... <and repeats for 19448 again and again>
>
> where pstree shows:
>
> init,1
>   ├─corosync,19448
>   │   └─{corosync},19449
>
> 4. "service corosync stop" hangs on node1 too, with the same symptoms;
>    only "killall -9" helps.
> 5. Restarting corosync & pacemaker on node1 and node2 solved the problem.
>
> Could you please tell me whether this is related to the libqb bug above?
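
A minimal sketch of how the "first check corosync" advice above can be acted on, and of how to see where the hung corosync process is blocked. The node names are the ones from this thread; the commands are the standard corosync 2.x and gdb tools, and are offered only as an illustration, not a prescribed procedure:

  # Run on node1, node2 and node3 and compare the output; all three nodes
  # should report the same membership and quorum state.
  corosync-quorumtool -s            # votequorum view: votes, quorate or not
  corosync-cfgtool -s               # totem ring status for this node
  corosync-cmapctl | grep members   # runtime membership as corosync sees it

  # When "service corosync stop" hangs in a futex wait as shown above, a
  # thread backtrace of the stuck process (more useful with debug symbols
  # installed) shows which lock or IPC call it is blocked in, which can then
  # be compared against the libqb bug report.
  gdb -p "$(pidof corosync)" --batch -ex 'thread apply all bt'

If the three nodes disagree at the corosync level, pacemaker cannot be expected to converge on its own, which matches the behaviour described above.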
I'm no expert in libqb, but it certainly looks likely.

> >
> > The "crm status" is below. What other logs should I attach for the
> > diagnostics?
> >
> > Also, "service corosync stop" on node1 hangs forever with no additional
> > lines in logs, so I cannot even stop the service. (But after "service
> > corosync stop" on node1 the node node2 starts thinking that node1 is
> > offline, although the command still hangs.)
> >
> >
> > root@node2:~# crm status
> > Current DC: node1 (1760315215) - partition with quorum
> > 2 Nodes configured
> > 6 Resources configured
> > Online: [ node1 node2 ]
> > Master/Slave Set: ms_drbd [drbd]
> >     Masters: [ node2 ]
> >     Slaves: [ node1 ]
> > Resource Group: server
> >     fs (ocf::heartbeat:Filesystem): Started node2
> >     postgresql (lsb:postgresql): Started node2
> >     bind9 (lsb:bind9): Started node2
> >     nginx (lsb:nginx): Started node2
> >
> >
> > root@node1:/var/log# crm status
> > Current DC: node1 (1760315215) - partition with quorum
> > 2 Nodes configured
> > 6 Resources configured
> > Online: [ node1 ]
> > OFFLINE: [ node2 ]
> > Master/Slave Set: ms_drbd [drbd]
> >     Masters: [ node1 ]
> >     Stopped: [ node2 ]
> > Resource Group: server
> >     fs (ocf::heartbeat:Filesystem): Started node1
> >     postgresql (lsb:postgresql): Started node1
> >     bind9 (lsb:bind9): Started node1
> >     nginx (lsb:nginx): Started node1
> > Failed actions:
> >     drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error
> >
> >
> > A part of "crm configure show":
> >
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.1.10-42f2063" \
> >     cluster-infrastructure="corosync" \
> >     stonith-enabled="false" \
> >     last-lrm-refresh="1421250983"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="100"
> >
> >
> > Also I see in logs on node1 (maybe they're related to the issue, maybe not):
> >
> > Jan 22 10:14:02 node1 pengine[2772]: warning: pe_fence_node: Node node2 is unclean because it is partially and/or un-expectedly down
> > Jan 22 10:14:02 node1 pengine[2772]: warning: determine_online_status: Node node2 is unclean
> > Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: Node node2 is unclean!
> > Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
> > Jan 22 10:14:02 node1 pengine[2772]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
> >
> >
> > On node2 the logs are:
> >
> > Jan 22 10:13:57 node2 corosync[32761]: [TOTEM ] A new membership (188.166.54.190:6276) was formed. Members left: 1760315215 13071578
> > Jan 22 10:13:57 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
> > Jan 22 10:13:57 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] Members[1]: 1017525950
> > Jan 22 10:13:57 node2 crmd[311]: notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
> > Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
> > Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
> > Jan 22 10:13:57 node2 pacemakerd[302]: notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
> > Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
> > Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
> > Jan 22 10:13:57 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
> > Jan 22 10:14:01 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
> > Jan 22 10:14:02 node2 crmd[311]: error: pcmk_cpg_membership: Node node1[1760315215] appears to be online even though we think it is dead
> > Jan 22 10:14:02 node2 crmd[311]: notice: crm_update_peer_state: pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
> > Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] This node is within the primary component and will provide service.
> > Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
> > Jan 22 10:14:03 node2 crmd[311]: notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)
> > Jan 22 10:14:03 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
> > Jan 22 10:14:03 node2 crmd[311]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> > Jan 22 10:14:03 node2 crmd[311]: warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
> > Jan 22 10:14:04 node2 crmd[311]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> > Jan 22 10:14:05 node2 crmd[311]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> > Jan 22 10:15:11 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6296) was formed. Members left: 13071578
> > Jan 22 10:15:14 node2 corosync[32761]: [TOTEM ] A new membership (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
> > Jan 22 10:15:17 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6324) was formed. Members joined: 1760315215
> > Jan 22 10:15:19 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
> > Jan 22 10:15:19 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> Secondary )
> > Jan 22 10:15:20 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
> > Jan 22 10:15:20 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
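
One more observation, prompted by the pengine messages above ("ENABLE STONITH TO KEEP YOUR RESOURCES SAFE") and by stonith-enabled="false" in the configuration: without fencing, pacemaker has no safe way to recover resources such as the DRBD master/slave set when the two nodes disagree about membership, which is exactly the split visible in the two "crm status" outputs. A rough sketch of what enabling fencing could look like follows; the external/ipmi agent, addresses and credentials are placeholders rather than details from this cluster, so substitute whatever fencing mechanism these hosts actually provide:

  # Illustrative only: the agent choice and its parameters are assumptions.
  crm configure primitive fence-node1 stonith:external/ipmi \
      params hostname=node1 ipaddr=192.0.2.1 userid=admin passwd=secret
  crm configure primitive fence-node2 stonith:external/ipmi \
      params hostname=node2 ipaddr=192.0.2.2 userid=admin passwd=secret
  # Keep each fencing device off the node it is meant to fence.
  crm configure location fence-node1-placement fence-node1 -inf: node1
  crm configure location fence-node2-placement fence-node2 -inf: node2
  crm configure property stonith-enabled=true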
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org