[ceph-users] Re: EC cluster cascade failures and performance problems

2020-11-19 Thread Paul Kramme
Hi Igor, we store 400TB of backups (RBD snapshots) on the cluster. Depending on the schedule, we replace all data every one to two weeks, so we are deleting data every day. Yes, the OSDs are killed with messages like "heartbeat_check: no reply from 10.244.0.27:6852 osd.37 ever...", if that is what yo

[ceph-users] Re: EC cluster cascade failures and performance problems

2020-11-19 Thread Igor Fedotov
Hi Paul, any chance you initiated massive data removal recently? Are there any suicide timeouts in the OSD logs prior to the OSD failures? Any log output containing "slow operation observed" there? Please also note the following PR and tracker comments, which might be relevant for your case. https
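The two log signatures Igor asks about can be checked with a single grep over the OSD logs. This is a minimal sketch: the log excerpt below is a fabricated sample for illustration (real logs normally live under /var/log/ceph/ceph-osd.<id>.log, and the exact message wording varies by Ceph release).

```shell
# Write a hypothetical two-line OSD log sample to grep against.
cat > /tmp/ceph-osd.37.log <<'EOF'
2020-11-18 bluestore(/var/lib/ceph/osd/ceph-37) log_latency slow operation observed for submit_transact, latency = 12.3s
2020-11-18 heartbeat_map is_healthy 'OSD::osd_op_tp thread' had suicide timed out after 150
EOF

# Count lines matching either of the two failure signatures.
grep -cE 'slow operation observed|suicide timed out' /tmp/ceph-osd.37.log
```

Against a real deployment, the same pattern would be run over /var/log/ceph/ceph-osd.*.log on each host; a nonzero count for "suicide timed out" shortly before an OSD death is the symptom being asked about.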