On Thu, 19 Jun 2025, at 18:39, Eugen Block wrote:
> Zitat von Miles Goodhew <c...@m0les.com>:
> 
> > On Thu, 19 Jun 2025, at 17:48, Eugen Block wrote:
> >> Too bad. :-/ Could you increase the debug log level to 20? Maybe it
> >> gets a bit clearer where exactly it fails.
> >
> > I guess that's in `ceph.conf` with:
> >
> > [mon]
> >     debug_mon = 20
> > ?
> 
> Correct.

Some progress has been made!
The mon's print_map() output shows an "in" set of [0,1,2,3] (i.e. size 4, but 
only 2 of those ranks are actually perceived as "up") and a max_mds value of 2.
With the log level increased to 20, the last dout log we see is the one on line 
1810 ("in 4 max 2"). None of the other four dout logs on lines 1818, 1834, 1847 
or 1855 appear, so the crash must be caused by one of the calls at lines 1816 
(mds_map.isresizable, called twice), 1845 (mds_map.get_info) or 1846 
(mds_map.is_active).

We toyed with ways of setting values to make the code drop out earlier, and 
decided that the MDS service was not the most important part of the cluster 
(the OpenStack cluster on top of it mattered more).

So, as a test, we used `ceph-kvstore-tool` to simply trim the "mds*" prefixes 
from the DB:

```
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mdsmap
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mds_health
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mds_metadata
  ceph-kvstore-tool leveldb ${DB_PATH} rm health mdsmap
```

(I suspect the "mdsmap" part was the most important, but we're mostly going by 
feel at this level.)
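
For anyone attempting something similar: before running destructive `rm-prefix` 
operations, I'd strongly suggest taking a copy of the mon store first. A minimal 
sketch, assuming the mon is stopped, `${DB_PATH}` points at its store as above, 
and the backup paths and systemd unit name are my own invention:

```shell
# Stop the monitor before touching its store (unit name is an assumption)
systemctl stop ceph-mon@$(hostname -s)

# Safest: a plain filesystem copy of the whole store directory
cp -a "${DB_PATH}" "${DB_PATH}.bak-$(date +%Y%m%d)"

# Alternatively, ceph-kvstore-tool can copy the store itself
ceph-kvstore-tool leveldb "${DB_PATH}" store-copy "${DB_PATH}.copy"
```

That way, if trimming the wrong prefix makes things worse, the mon's data 
directory can be restored and another approach tried.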

To our surprise and delight, the MONs all started up and formed a quorum. We 
then started all the MGRs without issue, and progressively started all the OSDs 
(with some minor rebalancing from a genuinely unhealthy disk).

The cluster got back to a "nominally operational except for CephFS" state, and 
the OpenStack cluster on top of it was verified and repaired. The RGW services 
were restarted and verified operational by their clients. All green at COB.

So we're leaving it like this for now and will conduct a review on Monday. 
There are a few immediate bits of maintenance to be done, but this whole 
incident lights a fire under the "let's get this updated and moved to a 
supported OS/hardware" plan.

Thanks again for all your help, Eugen - much appreciated!!!

M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io