We're having problems starting the 5th host (possibly a BIOS issue), so I
won't be able to recover its monitor any time soon.
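Since that host may be gone for good, I'm considering removing its monitor
from the monmap. This is roughly the procedure I have in mind, pieced
together from the docs (untested on this cluster; it assumes 618yl02 is the
dead monitor, as in the monmap quoted below, and that the admin keyring is
in the default location):

    # If the surviving monitors form a quorum, remove it online:
    ceph mon remove 618yl02

    # Otherwise, stop all monitors and edit the map offline on each
    # surviving monitor host (shown here for 60z0m02), then restart:
    ceph-mon -i 60z0m02 --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm 618yl02
    ceph-mon -i 60z0m02 --inject-monmap /tmp/monmap

With 618yl02 gone, the map would hold 4 monitors, so 3 would still be
needed for a majority.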
I knew having an even number of monitors wasn't ideal, which is why I
started 3 monitors first and waited until they reached quorum before
starting the 4th. I was hoping that once quorum was established, the 4th
monitor would simply join the other 3 instead of triggering new elections.
I didn't think having an odd number of monitors was a hard requirement,
though. I'm wondering if having one dead monitor in the map is complicating
the election. (I've put the command I'm using to check each monitor's
state, and my reading of the quorum math, in a PS at the bottom of this
message.)

On Mon, Jul 25, 2016 at 3:45 PM, Joshua M. Boniface <jos...@boniface.me> wrote:
> My understanding is that you need an odd number of monitors to reach
> quorum. This seems to match what you're seeing: with 3, there is a
> definite leader, but with 4, there isn't. Have you tried starting both
> the 4th and 5th simultaneously and letting them both vote?
>
> --
> Joshua M. Boniface
> Linux System Ærchitect
> Sigmentation fault. Core dumped.
>
> On 25/07/16 10:41 AM, Sergio A. de Carvalho Jr. wrote:
> > In the logs, 2 monitors are constantly reporting that they won the
> > leader election:
> >
> > 60z0m02 (monitor 0):
> > 2016-07-25 14:31:11.644335 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
> > 2016-07-25 14:31:44.521552 7f8760af7700 1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election
> >
> > 60zxl02 (monitor 1):
> > 2016-07-25 14:31:59.542346 7fefdeaed700 1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
> > 2016-07-25 14:32:04.583929 7fefdf4ee700 0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
> > 2016-07-25 14:32:33.440103 7fefdf4ee700 1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
> >
> > On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each
> > host, there is 1 monitor and 13 OSDs. We had an issue with the network,
> > and for some reason (I still don't know why) the servers were
> > restarted. One host is still down, but the monitors on the 4 remaining
> > servers are failing to enter a quorum.
> >
> > I managed to get a quorum of 3 monitors by stopping all Ceph monitors
> > and OSDs across all machines and bringing up the top 3 ranked monitors
> > in order of rank.
> > After a few minutes, the 60z0m02 monitor (the top-ranked one) became
> > the leader:
> >
> > {
> >     "name": "60z0m02",
> >     "rank": 0,
> >     "state": "leader",
> >     "election_epoch": 11328,
> >     "quorum": [
> >         0,
> >         1,
> >         2
> >     ],
> >     "outside_quorum": [],
> >     "extra_probe_peers": [],
> >     "sync_provider": [],
> >     "monmap": {
> >         "epoch": 5,
> >         "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
> >         "modified": "2016-04-28 22:26:48.604393",
> >         "created": "0.000000",
> >         "mons": [
> >             {
> >                 "rank": 0,
> >                 "name": "60z0m02",
> >                 "addr": "10.98.2.166:6789\/0"
> >             },
> >             {
> >                 "rank": 1,
> >                 "name": "60zxl02",
> >                 "addr": "10.98.2.167:6789\/0"
> >             },
> >             {
> >                 "rank": 2,
> >                 "name": "610wl02",
> >                 "addr": "10.98.2.173:6789\/0"
> >             },
> >             {
> >                 "rank": 3,
> >                 "name": "618yl02",
> >                 "addr": "10.98.2.214:6789\/0"
> >             },
> >             {
> >                 "rank": 4,
> >                 "name": "615yl02",
> >                 "addr": "10.98.2.216:6789\/0"
> >             }
> >         ]
> >     }
> > }
> >
> > The other 2 monitors became peons:
> >
> >     "name": "60zxl02",
> >     "rank": 1,
> >     "state": "peon",
> >     "election_epoch": 11328,
> >     "quorum": [
> >         0,
> >         1,
> >         2
> >     ],
> >
> >     "name": "610wl02",
> >     "rank": 2,
> >     "state": "peon",
> >     "election_epoch": 11328,
> >     "quorum": [
> >         0,
> >         1,
> >         2
> >     ],
> >
> > I then proceeded to start the fourth monitor, 615yl02 (618yl02 is
> > powered off), but after more than 2 hours and several election rounds,
> > the monitors still haven't reached a quorum. They mostly alternate
> > between the "electing" and "probing" states, and they often seem to be
> > in different election epochs.
> >
> > Is this normal?
> >
> > Is there anything I can do to help the monitors elect a leader? Should
> > I manually remove the dead host's monitor from the monitor map?
> >
> > I purposely left all OSD daemons stopped while the election is going
> > on. Is this the best thing to do? Would bringing the OSDs up help, or
> > would it complicate matters even more? Or does it make no difference?
> >
> > I don't see anything obviously wrong in the monitor logs. They're
> > mostly filled with messages like the following:
> >
> > 2016-07-25 14:17:57.806148 7fc1b3f7e700 1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
> > 2016-07-25 14:17:57.829198 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> > 2016-07-25 14:17:57.829200 7fc1b7caf700 0 log_channel(audit) do_log log to syslog
> > 2016-07-25 14:17:57.829254 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> >
> > Any help would be hugely appreciated.
> >
> > Thanks,
> >
> > Sergio
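PS: For anyone searching the archives later, the mon_status output quoted
above was collected through each monitor's admin socket, which responds
even when there's no quorum (cluster-wide commands like "ceph
quorum_status" hang without one). Assuming the default socket path:

    # run on each monitor host, substituting that monitor's name:
    ceph --admin-daemon /var/run/ceph/ceph-mon.60z0m02.asok mon_status

Also, as I understand the election rules (and I may be wrong), quorum
requires a strict majority of the monitors in the monmap, whether they're
alive or not: with 5 monitors in the map, any 3 of the 4 running ones
should be enough to agree on a leader.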