Okay, and a hardware issue can be ruled out, I assume?
To get the cluster up again, I would also consider starting only one MON
with a modified monmap. I haven't looked into the tracker, though,
so maybe there's something in the logs.
Quoting Wesley Dillingham <w...@wesdillingham.com>:
Thanks for the reply Eugen
These are Cisco UCSB-B200-M4, all 5 MONs are the same hardware, and the mon
store is around 1.3 GB on all 5 mons.
I don't believe I can reach the contents of `ceph config` without a quorum,
but `ceph daemon config diff` on the out-of-quorum but responsive mons
shows nothing other than what you would expect for a bare-minimum diff. The
cluster operator believes no `ceph config set` changes were issued, for what
it's worth, and bash history on the nodes corroborates that. There may be a
way to inspect the monitor store offline for what the mons' config db
contained, but I'm not sure how to do that right now.
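In case it helps, that offline inspection can be sketched with the
ceph-kvstore-tool and ceph-monstore-tool utilities against a stopped mon.
The paths and mon ID below are assumptions; always work on a copy of the
store, never the live one.

```shell
# ASSUMPTIONS: mon ID "a", default /var/lib/ceph paths, stopped daemon.
cp -a /var/lib/ceph/mon/ceph-a /root/mon-a-copy

# List keys under the "config" prefix, where the mons persist
# values set via `ceph config set`.
ceph-kvstore-tool rocksdb /root/mon-a-copy/store.db list config

# ceph-monstore-tool can also dump stored structures, e.g. the monmap:
ceph-monstore-tool /root/mon-a-copy get monmap -- --out /tmp/monmap
monmaptool --print /tmp/monmap
```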
Presumably those syncing tunables you tweaked only come into play if/when a
mon reaches synchronizing?
Respectfully,
*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
w...@wesdillingham.com
On Fri, May 30, 2025 at 11:15 AM Eugen Block <ebl...@nde.ag> wrote:
Hi Wes,
although I haven't seen this exact issue, we did investigate a mon
sync issue two years ago. That customer also has 5 MONs, and two of them
regularly fall out of quorum in addition to the long sync times. For
the syncing issue we found some workarounds (paxos settings), but we
never got to investigate the failing quorum properly. What we did
find was that those servers have different hardware; the two failing
MON servers had weaker CPUs. They're currently in the process of
replacing the old hardware, so hopefully in a couple of months we'll
see if the quorum issue still persists. Unfortunately, they didn't want
to follow our recommendation to reduce the number of MONs to three.
So a couple of questions:
- are the MON servers on the same hardware?
- are there any configuration differences between the MONs in 'ceph
config dump'?
- how large is the mon store?
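Those questions can be checked roughly like this on each MON host (a
sketch only; the paths and the mon naming scheme are assumptions for a
package-based deployment):

```shell
# Mon store size (RocksDB directory under the mon data path).
du -sh /var/lib/ceph/mon/*/store.db

# Per-daemon config differences via the admin socket, which works
# even without quorum as long as the daemon responds.
ceph daemon mon.$(hostname -s) config diff
```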
Regards,
Eugen
Quoting Wesley Dillingham <w...@wesdillingham.com>:
> Tracker issue made here with some additional details:
> https://tracker.ceph.com/issues/71501
>
> Cluster version 18.2.4
>
> I came to assist with a non-functional cluster in which OSDs had been
> erroneously --force purged, leading to multiple (6) degraded + inactive
> PGs (4 remaining shards in a 4+2) and 1 remapped+incomplete PG (3 shards
> in a 4+2).
>
> In an effort to restore order to the cluster and get backfill working:
> the degraded PGs shared a common primary OSD, and that OSD was restarted.
> Additionally, min_size was dropped from 5 to 4 for this pool (a temporary
> measure while the cluster recovered).
>
> This caused the inactive degraded PGs to go active and start their
> backfill. The PGs steadily worked on backfill for a few hours.
>
> Almost immediately after the last of the 6 degraded PGs finished its
> backfill, the monitor quorum broke and the cluster became unresponsive.
> In this state 2 of the 5 mons showed 100% CPU usage.
>
> In an attempt to fix the mon quorum some combination of monitor service
> restart attempts occurred and ultimately all ceph services were brought
> down in order to isolate the MON issue.
>
> As the situation stands currently, any combination of starting 3 MONs
> (quorum-eligible at 3) causes the lowest-ranked MON (the to-be-leader) to
> hit 100% CPU in the fn_monstore thread and renders the admin socket of
> that mon only unresponsive (the other 2 respond via admin socket and show
> either probing or electing).
>
> I have captured logs of the to-be-leader mon (its daemon started first)
> with debug_mon = 20 and debug_ms = 20 (I should probably recapture with
> debug_paxos = 20). The pegged CPU only occurs once the third MON is
> started (it seems to be an election issue). The MON has been allowed to
> run for hours with no progress in that state. Eventually the leader does
> seem to enter the (leader) state (it claims "I win" in the logs), but the
> other 2 mons continue their election cycle, still in probing or electing
> states.
>
> I have verified NTP is fine and synced to the same source on all the
> mons, and connectivity between the mons on both ports is functional. The
> situation the PGs were in during the subsequent MON fault leads me to
> believe the problem is more complex than typical monitor election issues.
> Also of note: the backing disk of the to-be-leader MON is mostly idle.
>
> At this point I am interested in taking backups of all the mon stores and
> injecting a modified single-mon monmap into one mon to see if we can get
> back up, but I am also concerned that that single mon will be the
> de-facto leader and also be unresponsive. Interested in any suggestions
> from the wider community. Thanks!
>
>
> Respectfully,
>
> *Wes Dillingham*
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> w...@wesdillingham.com
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io