Dear list,

I'm currently maintaining several Ceph (prod) installations. One of them 
consists of 3 MON hosts and 6 OSD hosts hosting 40 OSDs in total. And there are 
5 separate Proxmox-Hosts - they only host the VMs and use the storage provided 
by Ceph, but they are not part of Ceph.

The worst case happened: due to an outage, all of these hosts crashed at almost
the same time.

Last week, I began restarting them (only the Ceph hosts; the Proxmox servers
are still down). Ceph was very unhappy with the situation as a whole: one OSD
host (and its 6 OSDs) is completely gone, there are some hardware issues (33
OSDs left, networking, PSU; I'm working on it), and 73 out of 129 PGs were
inconsistent.

Meanwhile, the overall status of the cluster is "HEALTHY" again.
But nearly every day, one or two PGs get damaged, never on the same OSDs. And
there is no traffic on the storage, as the virtualization hosts are not running.
I see no cause in the logs: everything looks fine, then a scrub starts and
leaves one or more PGs damaged. Repairing them succeeds, but maybe the next
night another PG is stuck.
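
To double-check the "never on the same OSDs" impression, the shard errors could
be tallied per OSD from the JSON printed by
`rados list-inconsistent-obj <pgid> --format=json`. Below is a minimal sketch,
assuming the output has the usual `inconsistents` / `shards` / `osd` / `errors`
layout. The function name and the sample data are made up for illustration and
are not taken from this cluster.

```python
import json
from collections import Counter


def shard_errors_per_osd(report: dict) -> Counter:
    """Count, per OSD id, how many inconsistent objects have a shard error there."""
    counts = Counter()
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            # A shard with a non-empty "errors" list is the copy scrub complained about.
            if shard.get("errors"):
                counts[shard["osd"]] += 1
    return counts


# Hypothetical sample, shaped like `rados list-inconsistent-obj` JSON output.
SAMPLE = json.loads("""
{
  "epoch": 1234,
  "inconsistents": [
    {
      "object": {"name": "rbd_data.example.0001"},
      "errors": [],
      "union_shard_errors": ["read_error"],
      "shards": [
        {"osd": 3,  "primary": true,  "errors": []},
        {"osd": 12, "primary": false, "errors": ["read_error"]},
        {"osd": 27, "primary": false, "errors": []}
      ]
    },
    {
      "object": {"name": "rbd_data.example.0002"},
      "errors": ["data_digest_mismatch"],
      "union_shard_errors": ["read_error", "data_digest_mismatch_info"],
      "shards": [
        {"osd": 5,  "primary": true,  "errors": []},
        {"osd": 12, "primary": false, "errors": ["read_error"]},
        {"osd": 27, "primary": false, "errors": ["data_digest_mismatch_info"]}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    for osd, n in shard_errors_per_osd(SAMPLE).most_common():
        print(f"osd.{osd}: {n} object(s) with shard errors")
```

Collected over several nights, a tally like this would show whether the errors
really are spread evenly or cluster on one OSD or host, which might point at
the remaining hardware issues.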

Do you have any hints on how to investigate this further? I would love to
understand more before starting the Proxmox cluster again. I'm using Ceph
18.2.4 (Proxmox packages).

Thanks a lot,
  Marianne


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
