We're having problems starting the 5th host (possibly a BIOS issue), so I
won't be able to recover its monitor any time soon.
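Since that host may be gone for good, I'm considering removing its monitor
from the monmap. This is roughly the procedure I have in mind, pieced
together from the docs (untested on this cluster; it assumes 618yl02 is the
dead monitor, as in the monmap quoted below, and that the admin keyring is
in the default location):

    # If the surviving monitors form a quorum, remove it online:
    ceph mon remove 618yl02

    # Otherwise, stop all monitors and edit the map offline on each
    # surviving monitor host (shown here for 60z0m02), then restart:
    ceph-mon -i 60z0m02 --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm 618yl02
    ceph-mon -i 60z0m02 --inject-monmap /tmp/monmap

With 618yl02 gone, the map would hold 4 monitors, so 3 would still be
needed for a majority.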
I knew having an even number of monitors wasn't ideal, which is why I
started 3 monitors first and waited until they reached quorum before
starting the 4th. I was hoping that once quorum was established, the 4th
monitor would simply join the other 3 instead of triggering new elections.
I didn't think having an odd number of monitors was a hard requirement,
though. I'm wondering if having one dead monitor in the map is complicating
the election. (I've put the command I'm using to check each monitor's
state, and my reading of the quorum math, in a PS at the bottom of this
message.)

On Mon, Jul 25, 2016 at 3:45 PM, Joshua M. Boniface <jos...@boniface.me> wrote:
> My understanding is that you need an odd number of monitors to reach
> quorum. This seems to match what you're seeing: with 3, there is a
> definite leader, but with 4, there isn't. Have you tried starting both
> the 4th and 5th simultaneously and letting them both vote?
>
> --
> Joshua M. Boniface
> Linux System Ærchitect
> Sigmentation fault. Core dumped.
>
> On 25/07/16 10:41 AM, Sergio A. de Carvalho Jr. wrote:
> > In the logs, 2 monitors are constantly reporting that they won the
> > leader election:
> >
> > 60z0m02 (monitor 0):
> > 2016-07-25 14:31:11.644335 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
> > 2016-07-25 14:31:44.521552 7f8760af7700 1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election
> >
> > 60zxl02 (monitor 1):
> > 2016-07-25 14:31:59.542346 7fefdeaed700 1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
> > 2016-07-25 14:32:04.583929 7fefdf4ee700 0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
> > 2016-07-25 14:32:33.440103 7fefdf4ee700 1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
> >
> > On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each
> > host, there is 1 monitor and 13 OSDs. We had an issue with the network,
> > and for some reason (I still don't know why) the servers were
> > restarted. One host is still down, but the monitors on the 4 remaining
> > servers are failing to enter a quorum.
> >
> > I managed to get a quorum of 3 monitors by stopping all Ceph monitors
> > and OSDs across all machines and bringing up the top 3 ranked monitors
> > in order of rank.
> > After a few minutes, the 60z0m02 monitor (the top-ranked one) became
> > the leader:
> >
> > {
> >     "name": "60z0m02",
> >     "rank": 0,
> >     "state": "leader",
> >     "election_epoch": 11328,
> >     "quorum": [
> >         0,
> >         1,
> >         2
> >     ],
> >     "outside_quorum": [],
> >     "extra_probe_peers": [],
> >     "sync_provider": [],
> >     "monmap": {
> >         "epoch": 5,
> >         "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
> >         "modified": "2016-04-28 22:26:48.604393",
> >         "created": "0.000000",
> >         "mons": [
> >             {
> >                 "rank": 0,
> >                 "name": "60z0m02",
> >                 "addr": "10.98.2.166:6789\/0"
> >             },
> >             {
> >                 "rank": 1,
> >                 "name": "60zxl02",
> >                 "addr": "10.98.2.167:6789\/0"
> >             },
> >             {
> >                 "rank": 2,
> >                 "name": "610wl02",
> >                 "addr": "10.98.2.173:6789\/0"
> >             },
> >             {
> >                 "rank": 3,
> >                 "name": "618yl02",
> >                 "addr": "10.98.2.214:6789\/0"
> >             },
> >             {
> >                 "rank": 4,
> >                 "name": "615yl02",
> >                 "addr": "10.98.2.216:6789\/0"
> >             }
> >         ]
> >     }
> > }
> >
> > The other 2 monitors became peons:
> >
> >     "name": "60zxl02",
> >     "rank": 1,
> >     "state": "peon",
> >     "election_epoch": 11328,
> >     "quorum": [
> >         0,
> >         1,
> >         2
> >     ],
> >
> >     "name": "610wl02",
> >     "rank": 2,
> >     "state": "peon",
> >     "election_epoch": 11328,
> >     "quorum": [
> >         0,
> >         1,
> >         2
> >     ],
> >
> > I then proceeded to start the fourth monitor, 615yl02 (618yl02 is
> > powered off), but after more than 2 hours and several election rounds,
> > the monitors still haven't reached a quorum. They mostly alternate
> > between the "electing" and "probing" states, and they often seem to be
> > in different election epochs.
> >
> > Is this normal?
> >
> > Is there anything I can do to help the monitors elect a leader? Should
> > I manually remove the dead host's monitor from the monitor map?
> >
> > I purposely left all OSD daemons stopped while the election is going
> > on. Is this the best thing to do? Would bringing the OSDs up help, or
> > would it complicate matters even more? Or does it make no difference?
> >
> > I don't see anything obviously wrong in the monitor logs. They're
> > mostly filled with messages like the following:
> >
> > 2016-07-25 14:17:57.806148 7fc1b3f7e700 1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
> > 2016-07-25 14:17:57.829198 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> > 2016-07-25 14:17:57.829200 7fc1b7caf700 0 log_channel(audit) do_log log to syslog
> > 2016-07-25 14:17:57.829254 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> >
> > Any help would be hugely appreciated.
> >
> > Thanks,
> >
> > Sergio
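PS: For anyone searching the archives later, the mon_status output quoted
above was collected through each monitor's admin socket, which responds
even when there's no quorum (cluster-wide commands like "ceph
quorum_status" hang without one). Assuming the default socket path:

    # run on each monitor host, substituting that monitor's name:
    ceph --admin-daemon /var/run/ceph/ceph-mon.60z0m02.asok mon_status

Also, as I understand the election rules (and I may be wrong), quorum
requires a strict majority of the monitors in the monmap, whether they're
alive or not: with 5 monitors in the map, any 3 of the 4 running ones
should be enough to agree on a leader.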