Wow, lots of good information there! The hosts are the failure domain, so
normally we just do one host at a time, as you suggested. Good point on
waiting for the rebalance; it's usually just me waiting for someone to go
replace the failed drive, though in the case of swapping out old working drives
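
For what it's worth, this is roughly the set of checks we run between hosts
before touching a drive. Treat it as a sketch; osd.12 is just a placeholder id:

    # wait for recovery to finish and all PGs to go active+clean
    ceph status
    ceph pg stat

    # before destroying the OSD on the drive being swapped (osd.12 here),
    # confirm the cluster can tolerate losing it
    ceph osd ok-to-stop osd.12
    ceph osd safe-to-destroy osd.12

If I understand it right, safe-to-destroy won't report clean until no PGs
still depend on that OSD, so it works as a gate for the swap-out case too.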
We recently upgraded all our clusters to Rocky 9.4 and Reef 18.2.4. Two of
the clusters show the RGW metrics in the Ceph dashboard and the other two
don't. I made sure the firewalls were open for ceph-exporter and that
Prometheus was gathering the stats on all 4 clusters. For the clusters that a
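
In case it helps anyone debugging the same thing, these are the sanity checks
I'd start with. The ports are the cephadm defaults as far as I know
(ceph-exporter on 9926, Prometheus on 9095), and HOST / PROM_HOST are
placeholders for your own hosts:

    # confirm a ceph-exporter daemon is deployed and running on every host
    ceph orch ps --daemon-type ceph-exporter

    # check the exporter actually answers on a given host
    curl -s http://HOST:9926/metrics | head

    # check Prometheus has the exporter targets and that they're up
    curl -s http://PROM_HOST:9095/api/v1/targets | grep -i ceph-exporter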
Thanks for noting this! I just imported our last cluster and couldn't get
ceph-exporter to start. I noticed that the images it was using for
node-exporter and ceph-exporter were not the same as on the other clusters!
Wish this were in the adoption documentation. I have a running list of all
the things
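
In case someone else hits the same mismatch, this is roughly how I'd compare
and pin the images across clusters. The node-exporter tag below is only an
example; use whatever your working clusters report:

    # image cephadm uses for node-exporter
    ceph config get mgr mgr/cephadm/container_image_node_exporter

    # ceph-exporter runs out of the main ceph image, as far as I can tell
    ceph config get mgr container_image

    # pin node-exporter to the known-good image and redeploy both daemons
    ceph config set mgr mgr/cephadm/container_image_node_exporter quay.io/prometheus/node-exporter:v1.5.0
    ceph orch redeploy node-exporter
    ceph orch redeploy ceph-exporter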