Hi Wes,

Although I haven't seen this exact issue, we did investigate a MON sync issue two years ago. That customer also has 5 MONs, and two of them regularly dropped out of quorum in addition to the long sync times. We found some workarounds for the syncing issue (paxos settings), but we never got to investigate the failing quorum properly. What we did find was that the servers have different hardware: the two failing MON servers had weaker CPUs. They're currently in the process of replacing the old hardware, so hopefully in a couple of months we'll see whether the quorum issue persists. Unfortunately, they didn't want to follow our recommendation to reduce the number of MONs to three.
So a couple of questions:

- are the MON servers on the same hardware?
- are there any configuration differences between the MONs in 'ceph config dump'?
- how large is the mon store?
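
For the last two questions, something like the following should answer them (the store path assumes a package-based deployment; under cephadm it lives below /var/lib/ceph/<fsid>/mon.<id>/ instead, and <id> is a placeholder for the monitor's name):

```shell
# Per-mon configuration overrides show up with a mon.<id> scope here;
# any differences between the five mons would be visible in this output.
ceph config dump | grep -i mon

# Size of the mon store on disk for one monitor.
du -sh /var/lib/ceph/mon/ceph-<id>/store.db
```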

Regards,
Eugen

Zitat von Wesley Dillingham <w...@wesdillingham.com>:

Tracker issue made here with some additional details:
https://tracker.ceph.com/issues/71501

Cluster version 18.2.4

I came to assist with a non-functional cluster which had OSDs erroneously
purged with --force, leading to multiple (6) degraded + inactive PGs (4
remaining shards in a 4+2 EC pool) and 1 remapped+incomplete PG (3 shards
in a 4+2).

In an effort to restore order to the cluster and get backfill working, the
common primary OSD shared by the degraded PGs was restarted. Additionally,
min_size was dropped from 5 to 4 for this pool (a temporary measure while
the cluster recovered).
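
For reference, the temporary min_size change described above amounts to the following (the pool name is a placeholder):

```shell
# Temporarily allow I/O with only 4 of the 6 EC shards present.
ceph osd pool set <pool> min_size 4

# Once recovery completes, restore the safer default for a 4+2 pool:
ceph osd pool set <pool> min_size 5
```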

This caused the inactive degraded PGs to go active and start their
backfill. The PGs steadily worked on backfill for a few hours.

Almost immediately after the last of the 6 degraded PGs finished its
backfill, the monitor quorum broke and the cluster became unresponsive. In
this state 2 of the 5 mons showed 100% CPU usage.

In an attempt to fix the mon quorum some combination of monitor service
restart attempts occurred and ultimately all ceph services were brought
down in order to isolate the MON issue.

As the situation stands currently, starting any combination of 3 MONs
(quorum-eligible at 3) causes the lowest-ranked MON (the to-be-leader) to
hit 100% CPU in the fn_monstore thread and renders the admin socket of
that mon unresponsive (the other 2 respond via admin socket and show
either probing or electing).

I have captured logs of the to-be-leader mon (its daemon was started first)
with debug_mon = 20 and debug_ms = 20 (I should probably recapture with
debug_paxos = 20). The pegged CPU only occurs once the third MON is started
(it seems to be an election issue). The MON has been allowed to run for
hours in that state with no progress. Eventually the leader appears to
enter the (leader) state (it claims "I win" in the logs), but the other 2
mons continue their election cycle, still probing or electing.
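
A sketch of how that recapture could be done (mon name is a placeholder; the second approach assumes a ceph.conf-based deployment, since an unresponsive admin socket rules out the first):

```shell
# Raise paxos logging at runtime on a mon whose admin socket still responds.
ceph daemon mon.<id> config set debug_paxos 20/20

# For the to-be-leader whose admin socket is dead, set it in ceph.conf
# instead, so it takes effect on the next daemon start:
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
debug mon = 20
debug paxos = 20
debug ms = 1
EOF
```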

I have verified the clocks are NTP-synced to the same source on all the
mons and that connectivity between the mons on both ports is functional.
The state the PGs were in during the subsequent MON fault leads me to
believe the problem is more complex than a typical monitor election issue.
Also of note: the backing disk of the to-be-leader MON is mostly idle.

At this point I am interested in taking backups of all the mon stores and
injecting a modified single-mon monmap into one mon to see if we can get
back up, but I am also concerned that that single mon will be the de-facto
leader and likewise become unresponsive. Interested in any suggestions
from the wider community. Thanks!
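
A sketch of that procedure, along the lines of the documented "removing monitors from an unhealthy cluster" steps (mon names and the backup path are placeholders; the store backup comes first):

```shell
# Stop the surviving mon and back up its store before touching anything.
systemctl stop ceph-mon@<id>
cp -a /var/lib/ceph/mon/ceph-<id> /root/mon-<id>-backup

# Extract the current monmap from the stopped mon.
ceph-mon -i <id> --extract-monmap /tmp/monmap

# Remove the other four monitors, leaving a single-mon map, and verify it.
monmaptool /tmp/monmap --rm <mon2> --rm <mon3> --rm <mon4> --rm <mon5>
monmaptool --print /tmp/monmap

# Inject the single-mon map and start only this mon.
ceph-mon -i <id> --inject-monmap /tmp/monmap
systemctl start ceph-mon@<id>
```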


Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
w...@wesdillingham.com
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

