Hi,
A. Will Ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.
If all OSDs come back (and stay stable), the recovery should eventually finish.
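One rough way to check whether recovery is actually converging is to poll the PG state counts over time. The following is only a sketch: it assumes the ceph CLI is in PATH and that the JSON output of "ceph status" contains a "pgmap" section with a "pgs_by_state" list, which is the layout recent releases use; the exact field names may differ on older clusters.

```python
#!/usr/bin/env python3
"""Poll 'ceph status' until all PGs report active+clean (sketch)."""
import json
import subprocess
import time

POLL_SECONDS = 30

def pg_states():
    """Return a dict mapping PG state name -> PG count."""
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    status = json.loads(out)
    return {s["state_name"]: s["count"]
            for s in status["pgmap"].get("pgs_by_state", [])}

def main():
    while True:
        states = pg_states()
        total = sum(states.values())
        clean = states.get("active+clean", 0)
        others = {k: v for k, v in states.items() if k != "active+clean"}
        print(f"{clean}/{total} PGs active+clean; other states: {others}")
        if total and clean == total:
            print("all PGs are active+clean, recovery finished")
            break
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```

If the degraded/backfill counts keep shrinking between polls, recovery is making progress; PGs that stay down or incomplete across many polls are the ones to look at individually.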
B. What caused the OSDs to go down and up during recovery after the
failed OSD node came back online? (step 2 above)
Here is the output of ceph health detail:
HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 134 pgs
backfill_wait; 11 pgs backfilling; 69 pgs degraded; 14 pgs down; 2 pgs
incomplete; 14 pgs peering; 6 pgs recovery_wait; 69 pgs stuck degraded; 16 pgs
stuck inactive; 167 pgs stuck
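For the 14 down and 2 incomplete PGs specifically, a per-PG query shows what each one is waiting for. The sketch below is hedged: it assumes "ceph pg dump_stuck inactive --format json" returns a JSON list of PG records with a "pgid" field (some releases wrap the list in a top-level key) and that "ceph pg <pgid> query" returns a document with a "recovery_state" section; adjust the field names to your release if needed.

```python
#!/usr/bin/env python3
"""List stuck-inactive PGs and print why each one is not peering (sketch)."""
import json
import subprocess

def ceph_json(*args):
    """Run a ceph subcommand and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

def main():
    stuck = ceph_json("pg", "dump_stuck", "inactive")
    if isinstance(stuck, dict):
        # some releases wrap the list of stuck PGs in a top-level key
        stuck = next(iter(stuck.values()))
    for pg in stuck:
        pgid = pg["pgid"]
        query = ceph_json("pg", pgid, "query")
        # recovery_state explains what the PG is waiting for, e.g. which
        # OSDs it is blocked by while down or incomplete
        states = [s.get("name") for s in query.get("recovery_state", [])]
        print(f"{pgid}: state={pg.get('state')} recovery_state={states}")

if __name__ == "__main__":
    main()
```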
Hi,
After a node failure, Ceph is unable to recover, i.e. it is unable to
reintegrate the failed node back into the cluster.
What happened?
1. A node with 11 OSDs crashed; the remaining 4 nodes (also with 11
OSDs each) rebalanced, although they kept reporting the following error
condition (a rough calculation of this ratio is sketched below):
too many PGs per OSD
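This warning is about the ratio of PG copies to OSDs: the same number of PG replicas spread over fewer OSDs means more PGs per OSD, so losing 11 of 55 OSDs can push the ratio over the monitor's warning threshold even if it was fine before. The numbers below are purely hypothetical (PG count, pool size and threshold are not taken from this cluster); they only illustrate the arithmetic.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope PGs-per-OSD check with made-up numbers (sketch)."""

def pgs_per_osd(total_pgs: int, pool_size: int, num_osds: int) -> float:
    """Average number of PG copies each OSD has to carry."""
    return total_pgs * pool_size / num_osds

# Hypothetical cluster: 5 nodes x 11 OSDs, one pool with 5120 PGs, size 3.
TOTAL_PGS = 5120
POOL_SIZE = 3
WARN_THRESHOLD = 300   # adjust to your configured mon warning threshold

before = pgs_per_osd(TOTAL_PGS, POOL_SIZE, num_osds=55)  # all 5 nodes up
after = pgs_per_osd(TOTAL_PGS, POOL_SIZE, num_osds=44)   # one node (11 OSDs) down

print(f"PGs per OSD with 55 OSDs: {before:.0f}")
print(f"PGs per OSD with 44 OSDs: {after:.0f}")
print(f"warning threshold exceeded after node loss: {after > WARN_THRESHOLD}")
```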