It gets worse.

It looks like the physical disk backing the two failing OSDs is itself going bad. I destroyed the host for one of them, which threw me back into the Pacific-era nightmare of a deleted OSD getting permanently stuck in deletion. Because the OSD cannot be restarted, the deletion never completes.
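For anyone chasing the same stuck removal: the orchestrator keeps a removal queue you can inspect, and a --force retry sometimes unwedges it. A sketch, assuming the stuck OSD is osd.3 (substitute your own id; --zap only exists on recent releases, omit it to keep the device contents):

    # show removals the orchestrator has queued, and their progress
    ceph orch osd rm status
    # retry the removal even though the daemon can't be started
    ceph orch osd rm 3 --force --zap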

The destroyed host was running a standby mds, so I needed a new one and told the orchestrator to create it on the dell02 machine. I got the same behaviour as with prometheus: dell02 shows in ceph orch ls as having an un-started mds, an empty mds logfile gets created, but there are no systemd units, and nothing appears in the cephadm log about the creation of the mds.
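For reference, this was the general sequence, sketched from memory (the filesystem name "myfs" is a stand-in; exact syntax may differ slightly on your release):

    # ask the orchestrator to place one mds on dell02
    ceph orch apply mds myfs --placement="1 dell02"
    # the service/daemon shows up here...
    ceph orch ps --daemon-type mds
    # ...but no unit ever materializes on dell02
    systemctl list-units 'ceph-*@mds.*'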

The other cephadm log (/var/log/ceph/<fsid>/ceph.cephadm.log) indicates attempts to decommission the old (ceph06) mds, but that machine cannot be contacted as it no longer exists.
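If the orchestrator keeps beating on the dead host, cephadm can be told to forget it outright. I haven't run this yet, but it should look something like:

    # drop the unreachable host without attempting any cleanup on it
    ceph orch host rm ceph06.internal.mousetech.com --offline --force

That should also put an end to the doomed decommission attempts.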

I've posted yesterday's and today's ceph.cephadm.log:

https://www.mousetech.com/share/ceph.cephadm.log-20250326.gz

https://www.mousetech.com/share/ceph.cephadm.log

Latest health report is dismal:

HEALTH_ERR 1 failed cephadm daemon(s); 1 hosts fail cephadm check; insufficient standby MDS daemons available; 2 mgr modules have failed; too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
    host ceph06.internal.mousetech.com (10.0.1.56) failed check: Can't communicate with remote host `10.0.1.56`, possibly because the host is not reachable or python3 is not installed on the host. [Errno 113] Connect call failed ('10.0.1.56', 22)
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[ERR] MGR_MODULE_ERROR: 2 mgr modules have failed
    Module 'cephadm' has failed: 'ceph06.internal.mousetech.com'
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
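My guess is the failed cephadm module is still wedged on the stale ceph06 hostname. One thing I may try (a guess, not a verified fix) is failing over the active mgr so the modules reinitialize:

    # fail over to a standby mgr; modules restart on the new active
    ceph mgr fail
    # then see whether the module errors clear
    ceph health detail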

On 3/26/25 16:55, Tim Holloway wrote:
OSD mystery is solved.

Both OSDs were LVM-backed volumes imported as vdisks into the Ceph VMs. Apparently something scrambled either the VM manager or the host disk subsystem: the VM disks were getting I/O errors and even disappearing from the VMs.

I rebooted the physical machine and that cleared it. All OSDs now happy again.

...

Well, it looks like one OSD has been damaged permanently, so I purged it. (:
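For the record, the purge was the standard incantation (osd id from memory):

    ceph osd purge 3 --yes-i-really-mean-it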

On 3/26/25 15:08, Tim Holloway wrote:
Sorry, I duplicated a URL. The mgr log is:

https://www.mousetech.com/share/ceph-mgr.log
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io