Hi,
I'm running a Ceph 19.2.2 cluster: 23 nodes, 152 OSDs, deployed with cephadm.
Most OSDs are SAS SSDs; 12 are NVMe SSDs.
Yesterday we experienced a total power failure and everything went down hard,
including our Ceph cluster. There were a couple of issues, but this one stood
out after the cluster came back up:
[ERR] OSD_UNREACHABLE: 2 osds(s) are not reachable
osd.53's public address is not in '192.168.11.0/24' subnet
osd.86's public address is not in '192.168.11.0/24' subnet
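For context, the health message is about subnet membership. Below is a minimal
Python sketch of that kind of test, using the standard ipaddress module. It is
not Ceph's implementation, and the two addresses are hypothetical placeholders,
not the actual addresses of osd.53 or osd.86:

```python
import ipaddress

# Public network as quoted in the health message above.
PUBLIC_NETWORK = ipaddress.ip_network("192.168.11.0/24")

def in_public_network(addr: str) -> bool:
    """Return True if addr falls inside PUBLIC_NETWORK."""
    return ipaddress.ip_address(addr) in PUBLIC_NETWORK

# Hypothetical example addresses, not taken from the cluster.
print(in_public_network("192.168.11.53"))  # True
print(in_public_network("192.168.12.53"))  # False
```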
ceph -s did not report reduced data availability or redundancy, which seems a
bit "off" given that the two OSDs are on separate hosts and the failure domain
is host. Surely there must have been PGs with fewer than 3 replicas, and
possibly PGs with just one replica left?
So I manually restarted those OSDs with systemctl, a recovery process started,
and all our VMs "magically" started booting. I'm also surprised that the
recovery process only started once those OSDs were back up.
I didn't make too much of the above, but this morning I was looking at the
kernel ring buffer of our PVE nodes and noticed the logs below. Just a single
"blip", at the same time on all of our PVE nodes (Ceph clients):
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
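To make the numbers in these lines easier to compare, here is a minimal Python
sketch that extracts the events and counts how many epochs passed between each
"down" and the following "up". It assumes the eNNN field is the OSD map epoch;
the function names are made up for illustration:

```python
import re

# Matches libceph lines of the form shown above. \s+ between the fsid and
# the epoch also tolerates a line wrapped at that point.
LINE_RE = re.compile(
    r"libceph \((?P<fsid>[0-9a-f-]+)\s+e(?P<epoch>\d+)\): "
    r"osd(?P<osd>\d+) (?P<state>up|down)"
)

def parse_events(text):
    """Return (epoch, osd_id, state) tuples found in text."""
    return [
        (int(m["epoch"]), int(m["osd"]), m["state"])
        for m in LINE_RE.finditer(text)
    ]

def epochs_down(events):
    """Map osd_id -> epochs elapsed between 'down' and the next 'up'."""
    went_down, result = {}, {}
    for epoch, osd, state in sorted(events):
        if state == "down":
            went_down[osd] = epoch
        elif osd in went_down:
            result[osd] = epoch - went_down.pop(osd)
    return result

sample = """\
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
"""

events = parse_events(sample)
print(epochs_down(events))  # {53: 15, 86: 17}
```

Under that assumption, the client saw osd53 marked down for 15 epochs and osd86
for 17, all logged within the same second.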
I don't see anything unusual in the Ceph cluster itself, nor in the log files
of the OSDs.
I'm not sure what to make of this. Why would this happen, and what would you
do?
Thanks for your insights,
Wannes Smet
_______________________________________________
ceph-users mailing list -- [email protected]