It gets worse.

It looks like the physical disk backing the two failing OSDs is itself going bad. I destroyed the host for one of them, which threw me back into the Pacific-era nightmare of a deleted OSD getting permanently stuck in deletion. Because the OSD cannot be restarted, the deletion never completes.
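For anyone chasing the same stuck removal: the orchestrator keeps a removal queue you can inspect, and a --force retry sometimes unwedges it. A sketch, assuming the stuck OSD is osd.3 (substitute your own id; --zap only exists on recent releases, omit it to keep the device contents):

    # show removals the orchestrator has queued, and their progress
    ceph orch osd rm status
    # retry the removal even though the daemon can't be started
    ceph orch osd rm 3 --force --zap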

The destroyed host was running a standby mds, so I needed a new one and told the orchestrator to create it on the dell02 machine. I got the same behaviour as with prometheus: dell02 shows in ceph orch ls as having an un-started mds, an empty mds logfile gets created, but there are no systemd units, and nothing appears in the cephadm log about the creation of the mds.
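For reference, this was the general sequence, sketched from memory (the filesystem name "myfs" is a stand-in; exact syntax may differ slightly on your release):

    # ask the orchestrator to place one mds on dell02
    ceph orch apply mds myfs --placement="1 dell02"
    # the service/daemon shows up here...
    ceph orch ps --daemon-type mds
    # ...but no unit ever materializes on dell02
    systemctl list-units 'ceph-*@mds.*'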

The other cephadm log (/var/log/ceph/<fsid>/ceph.cephadm.log) indicates attempts to decommission the old (ceph06) mds, but that machine cannot be contacted as it no longer exists.
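If the orchestrator keeps beating on the dead host, cephadm can be told to forget it outright. I haven't run this yet, but it should look something like:

    # drop the unreachable host without attempting any cleanup on it
    ceph orch host rm ceph06.internal.mousetech.com --offline --force

That should also put an end to the doomed decommission attempts.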

I've posted yesterday's and today's ceph.cephadm.log:

https://www.mousetech.com/share/ceph.cephadm.log-20250326.gz

https://www.mousetech.com/share/ceph.cephadm.log

Latest health report is dismal:

HEALTH_ERR 1 failed cephadm daemon(s); 1 hosts fail cephadm check; insufficient standby MDS daemons available; 2 mgr modules have failed; too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon osd.3 on ceph06.internal.mousetech.com is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
    host ceph06.internal.mousetech.com (10.0.1.56) failed check: Can't communicate with remote host `10.0.1.56`, possibly because the host is not reachable or python3 is not installed on the host. [Errno 113] Connect call failed ('10.0.1.56', 22)
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[ERR] MGR_MODULE_ERROR: 2 mgr modules have failed
    Module 'cephadm' has failed: 'ceph06.internal.mousetech.com'
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
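My guess is the failed cephadm module is still wedged on the stale ceph06 hostname. One thing I may try (a guess, not a verified fix) is failing over the active mgr so the modules reinitialize:

    # fail over to a standby mgr; modules restart on the new active
    ceph mgr fail
    # then see whether the module errors clear
    ceph health detail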

On 3/26/25 16:55, Tim Holloway wrote:
OSD mystery is solved.

Both OSDs were LVM-backed volumes imported as vdisks into the Ceph VMs. Apparently something scrambled either the VM manager or the host disk subsystem: the VM disks were getting I/O errors and even disappearing from the VMs.

I rebooted the physical machine and that cleared it. All OSDs now happy again.

...

Well, it looks like one OSD has been damaged permanently, so I purged it. (:
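For the record, the purge was the standard incantation (osd id from memory):

    ceph osd purge 3 --yes-i-really-mean-it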

On 3/26/25 15:08, Tim Holloway wrote:
Sorry, I duplicated a URL. The mgr log is:

https://www.mousetech.com/share/ceph-mgr.log
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io