Hi Ashley,

The only change I made was increasing osd_max_backfills from 3 to 10. That
ended up causing more problems than it helped, and it was lowering the setting
back down to 3 that took the cluster offline.
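For reference, the setting lives in ceph.conf under [osd]; on hammer it can
also be pushed to running OSDs with injectargs. A rough sketch (not the exact
commands we ran when we deployed the change):

    # ceph.conf, [osd] section
    osd max backfills = 3

    # or injected into the running OSDs without a restart
    ceph tell osd.* injectargs '--osd-max-backfills 3'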

I've actually been working on this issue for a week now, and my company
called in some outside help who walked me through some more intense
troubleshooting. The fix was to remove every ceph-mon except one and let it
boot by itself so it wouldn't be looking for a quorum, and that resolved the
majority of our problems. It turned out there was a networking issue that
appeared and then went away on its own; we weren't sure exactly what happened
or how to dig into it once things started working again, but we're fairly
sure it involved either the physical NICs or the switches, possibly
overloaded by the configuration changes and the spike in recovery traffic.
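The single-mon recovery was roughly the usual monmap surgery, something like
the following (a sketch from memory rather than the exact commands; the
hostnames are the placeholders from the mon_status output below):

    # stop all mons, then on the surviving one:
    ceph-mon -i <mgmt1> --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm <mon1> --rm <mon2> --rm <mon3> --rm <mon4>
    ceph-mon -i <mgmt1> --inject-monmap /tmp/monmap
    # start mon.<mgmt1> again; it then forms a quorum of one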


On Tue, Sep 3, 2019 at 9:17 AM Ashley Merrick <singap...@amerrick.co.uk>
wrote:

> What change did you make in ceph.conf?
>
> I'd check that hasn't caused an issue first.
>
>
> ---- On Tue, 27 Aug 2019 04:37:15 +0800 nkern...@gmail.com
> <nkern...@gmail.com> wrote ----
>
> Hello,
>
> I have an old ceph 0.94.10 cluster that had 10 storage nodes with one
> extra management node used for running commands on the cluster. Over time
> we'd had some hardware failures on some of the storage nodes, so we're down
> to 6, with ceph-mon running on the management server and 4 of the storage
> nodes. We attempted to deploy a ceph.conf change and restarted the ceph-mon
> and ceph-osd services, but the cluster went down on us. We found all the
> ceph-mons are stuck in the electing state; I can't get any response from any
> ceph commands, but I found I can contact the daemon directly and get this
> information (hostnames removed for privacy reasons):
>
> root@<mgmt1>:~# ceph daemon mon.<mgmt1> mon_status
> {
>     "name": "<mgmt1>",
>     "rank": 0,
>     "state": "electing",
>     "election_epoch": 4327,
>     "quorum": [],
>     "outside_quorum": [],
>     "extra_probe_peers": [],
>     "sync_provider": [],
>     "monmap": {
>         "epoch": 10,
>         "fsid": "69611c75-200f-4861-8709-8a0adc64a1c9",
>         "modified": "2019-08-23 08:20:57.620147",
>         "created": "0.000000",
>         "mons": [
>             {
>                 "rank": 0,
>                 "name": "<mgmt1>",
>                 "addr": "[fdc4:8570:e14c:132d::15]:6789\/0"
>             },
>             {
>                 "rank": 1,
>                 "name": "<mon1>",
>                 "addr": "[fdc4:8570:e14c:132d::16]:6789\/0"
>             },
>             {
>                 "rank": 2,
>                 "name": "<mon2>",
>                 "addr": "[fdc4:8570:e14c:132d::28]:6789\/0"
>             },
>             {
>                 "rank": 3,
>                 "name": "<mon3>",
>                 "addr": "[fdc4:8570:e14c:132d::29]:6789\/0"
>             },
>             {
>                 "rank": 4,
>                 "name": "<mon4>",
>                 "addr": "[fdc4:8570:e14c:132d::151]:6789\/0"
>             }
>         ]
>     }
> }
>
>
> Is there any way to force the cluster back into a quorum even if it's just
> one mon running to start it up? I've tried exporting the mgmt's monmap and
> injecting it into the other nodes, but it didn't make any difference.
>
> Thanks!
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
