Thanks for the update, glad you got it back up. It's still irritating
that a full OSD host would cause the MON quorum to break; that would
be highly unexpected. Maybe it was just a coincidence, it's hard to tell.
I'll try to reproduce this in a test cluster next week or so.
Quote from Wesley Dillingham <w...@wesdillingham.com>:
Upon generating and reviewing the more verbose logs [1] from the MON's
boot, it seemed the leader MON was hung up replaying, ad infinitum, an
orch/cephadm task which had previously failed with (OSError: [Errno 28]
No space left on device) on an OSD-only host. In the moments before the
MON quorum broke, /var on that OSD-only host became full (a
ceph-objectstore-tool pg export was being written into /var at the time
to attempt to fix the incomplete PG).
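For anyone following along, the export/import cycle was the standard
ceph-objectstore-tool procedure, roughly as below. OSD IDs, the PG ID,
and paths are placeholders, and each OSD must be stopped before touching
its store:

```shell
# On the source OSD host (OSD stopped), export the PG shard.
# Note: the export file must land on a filesystem with enough free
# space -- filling /var with such an export is what preceded the
# quorum break in our case.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
  --pgid 7.1a --op export --file /mnt/spare/7.1a.export

# On the target OSD host (OSD stopped), import the shard, then
# restart the OSD so the cluster can peer the PG again.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
  --op import --file /mnt/spare/7.1a.export
```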
This cluster was brought back to life by modifying the monmap down to a
single MON, which luckily didn't reproduce the issue when reduced to a
single MON, and then recreating more MONs from there to get back to 5.
Subsequent PG exports and imports fixed the incomplete PG. So the
ultimate root cause wasn't completely determined, but the cluster was
restored.
[1] - https://tracker.ceph.com/issues/71501#note-4
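The monmap surgery itself was the usual extract/modify/inject sequence,
roughly as below. The mon IDs and paths are placeholders, and all MONs
should be stopped first (with the mon stores backed up):

```shell
# With all MONs stopped, extract the current monmap from one mon store.
ceph-mon -i mon-a --extract-monmap /tmp/monmap

# Inspect it, then remove every MON except the one being kept.
monmaptool --print /tmp/monmap
monmaptool /tmp/monmap --rm mon-b --rm mon-c --rm mon-d --rm mon-e

# Inject the single-mon map back and start only that MON.
ceph-mon -i mon-a --inject-monmap /tmp/monmap
```

Once that lone MON forms a quorum of one, the others can be recreated
and will join one at a time.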
Respectfully,
*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
w...@wesdillingham.com
On Fri, May 30, 2025 at 12:34 PM Eugen Block <ebl...@nde.ag> wrote:
Okay, and a hardware issue can be ruled out, I assume?
To get the cluster up again I would also consider starting one MON
only with a modified monmap. I haven't looked into the tracker though,
so maybe there's something in the logs.
Quote from Wesley Dillingham <w...@wesdillingham.com>:
> Thanks for the reply Eugen
>
> These are Cisco UCSB-B200-M4; all 5 MONs are the same hardware, and the
> mon store is around 1.3GB on all 5 MONs.
>
> I don't believe I can reach the contents of `ceph config` without a
> quorum, but `ceph daemon config diff` on the out-of-quorum but
> responsive MONs shows nothing beyond the expected bare-minimum diff.
> The cluster operator believes no `ceph config set` changes were issued,
> for what it's worth, and bash history on the nodes corroborates that.
> There may be a way to inspect the monitor store offline for what the
> MONs' config db contained, but I'm not sure how to do that right now.
>
> Presumably those syncing tunables you tweaked only come into play
> if/when a MON reaches the synchronizing state?
>
> Respectfully,
>
> *Wes Dillingham*
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> w...@wesdillingham.com
>
>
>
>
> On Fri, May 30, 2025 at 11:15 AM Eugen Block <ebl...@nde.ag> wrote:
>
>> Hi Wes,
>>
>> although I haven't seen this exact issue, we did investigate a MON
>> sync issue two years ago. The customer also has 5 MONs, and two of
>> them regularly fall out of quorum in addition to the long sync times.
>> For the syncing issue we found some workarounds (paxos settings), but
>> we never got to investigate the failing quorum properly. What we did
>> find was that those servers have different hardware; the two failing
>> MON servers had weaker CPUs. They're currently in the process of
>> replacing the old hardware, so hopefully in a couple of months we'll
>> see if the quorum issue still persists. Unfortunately, they didn't
>> want to follow our recommendation to reduce the number of MONs to
>> three.
>> So a couple of questions:
>>
>> - are the MON servers on the same hardware?
>> - are there any configuration differences between the MONs in 'ceph
>> config dump'?
>> - how large is the mon store?
>>
>> Regards,
>> Eugen
>>
>> Quote from Wesley Dillingham <w...@wesdillingham.com>:
>>
>> > Tracker issue made here with some additional details:
>> > https://tracker.ceph.com/issues/71501
>> >
>> > Cluster version 18.2.4
>> >
>> > I came to assist with a non-functional cluster which had OSDs
>> > erroneously --force purged, which led to multiple (6) degraded +
>> > inactive PGs (4 remaining shards in a 4+2) and 1 remapped+incomplete
>> > PG (3 shards in a 4+2).
>> >
>> > In an effort to restore order to the cluster and get backfill
>> > working: the degraded PGs shared a common primary OSD, and that OSD
>> > was restarted. Additionally, min_size was dropped from 5 to 4 for
>> > this pool (a temporary measure while the cluster recovered).
>> >
>> > This caused the inactive degraded PGs to go active and start their
>> > backfill. The PGs steadily worked on backfill for a few hours.
>> >
>> > Almost immediately after the last of the 6 degraded PGs finished its
>> > backfill, the monitor quorum broke and the cluster became
>> > unresponsive. In this state, 2 of the 5 MONs showed 100% CPU usage.
>> >
>> > In an attempt to fix the MON quorum, some combination of monitor
>> > service restart attempts occurred, and ultimately all Ceph services
>> > were brought down in order to isolate the MON issue.
>> >
>> > As the situation stands currently, any combination of starting 3
>> > MONs (quorum-eligible at 3) causes the lowest-ranked MON (the
>> > to-be-leader) to hit 100% CPU in the fn_monstore thread and renders
>> > the admin socket of that MON unresponsive (the other 2 respond via
>> > admin socket and show either probing or electing).
>> >
>> > I have captured logs of the to-be-leader MON (its daemon started
>> > first) with debug_mon = 20 and debug_ms = 20 (I should probably
>> > recapture with debug_paxos = 20). The pegged CPU only occurs once
>> > the third MON is started (it seems to be an election issue). The MON
>> > has been allowed to run for hours with no progress in that state.
>> > Eventually the leader does go into the (leader) state (it claims "I
>> > win" in the logs), but the other 2 MONs continue their election
>> > cycle, still in states of probing or electing.
>> >
>> > I have verified that NTP is fine and synced to the same source on
>> > all the MONs, and that connectivity between the MONs on both ports
>> > is functional. The situation the PGs were in leading up to the MON
>> > fault leads me to believe the problem is more complex than typical
>> > monitor election issues. Also of note: the backing disk of the
>> > to-be-leader MON is mostly idle.
>> >
>> > At this point I am interested in taking backups of all the mon
>> > stores and injecting a modified single-mon monmap into one MON to
>> > see if we can get back up, but I am also concerned that that single
>> > MON will be the de facto leader and also be unresponsive. Interested
>> > in any suggestions from the wider community. Thanks!
>> >
>> >
>> > Respectfully,
>> >
>> > *Wes Dillingham*
>> > LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>> > w...@wesdillingham.com
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>