Hi,
the MGR doesn't always report the correct PG status, so don't rely on
it too much. Sometimes you end up restarting the primary OSDs of
stuck PGs, although a repeer might have been sufficient. Your Ceph
clients had to refresh their osdmap, and that's when they noticed that
OSDs had been down. The kernel log is not a real-time record in this
case, so there's no need to worry. It's a common question though, I
think we also asked it 8 to 10 years ago. ;-)
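To illustrate the point with the libceph lines quoted further down: a
minimal sketch, assuming the "e179035"-style field is the osdmap epoch
the client was processing when it printed the line. The parse_libceph
helper and its regex are made up for this example and are not part of
any Ceph tooling. It only shows that all four events carry the same
timestamp while spanning several dozen epochs, which is what you would
expect from a client catching up on old maps in one go.

```python
import re

# Kernel log lines from the quoted message, rejoined onto single lines.
LOG = """\
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
"""

# Hypothetical pattern for this one message shape; "e<number>" is
# assumed to be the osdmap epoch.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+libceph\s+"
    r"\((?P<fsid>[0-9a-f-]+)\s+e(?P<epoch>\d+)\):\s+"
    r"osd(?P<osd>\d+)\s+(?P<state>up|down)$"
)


def parse_libceph(text):
    """Return one dict per matching libceph up/down line, in input order."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append({
                "ts": match["ts"],
                "epoch": int(match["epoch"]),
                "osd": int(match["osd"]),
                "state": match["state"],
            })
    return events


events = parse_libceph(LOG)
timestamps = {e["ts"] for e in events}
epochs = [e["epoch"] for e in events]

# One wall-clock timestamp, but many epochs in between: the client
# replayed the map history at once, it did not watch the OSDs flap live.
print(f"{len(events)} events, {len(timestamps)} distinct timestamp(s), "
      f"epochs {min(epochs)}..{max(epochs)}")
```

If the same check on your nodes shows a single timestamp across a wide
epoch range, that matches a client catching up after the outage rather
than OSDs flapping at that moment.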
Regards,
Eugen
Quoting Wannes Smet via ceph-users <[email protected]>:
Hi,
I'm running a Ceph cluster 19.2.2, 23 nodes, 152 OSDs, cephadm
deployed. Mostly SAS SSDs, plus 12 NVMe SSDs.
Yesterday we experienced a total power failure and everything went
down hard, including our Ceph cluster. There were a couple of issues,
but this stood out after it came back up:
[ERR] OSD_UNREACHABLE: 2 osds(s) are not reachable
osd.53's public address is not in '192.168.11.0/24' subnet
osd.86's public address is not in '192.168.11.0/24' subnet
ceph -s did not report reduced data {availability,redundancy}, which
seems a bit "off", given that both OSDs are in separate hosts and the
failure domain is host. There must have been PGs with fewer than 3
replicas, and possibly PGs with just one replica left?
So I manually restarted those OSDs with systemctl, a recovery
process started, and all our VMs "magically" started booting.
I'm also surprised that the recovery process only started once those
OSDs were back up.
I didn't make too much of the above, but now this morning, I'm
looking at the kernel ring buffer of our PVE nodes and I notice the
logs below. Just a single "blip". All at the same time on all of our
PVE nodes (ceph clients):
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
I don't see anything weird in the Ceph cluster itself, nor in
the log files of the OSDs.
I'm not sure what to make of this. Why would this happen and what
would you do?
Thanks for your insights,
Wannes Smet
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]