Hi Igor,
We store 400 TB of backups (RBD snapshots) on the cluster. Depending on
the schedule, we replace all data every one to two weeks, so we are
deleting data every day.
Yes, the OSDs are killed with messages like "heartbeat_check: no reply
from 10.244.0.27:6852 osd.37 ever...", if that is what yo
Hi Paul,
Any chance you initiated a massive data removal recently?
Are there any suicide timeouts in the OSD logs prior to the OSD failures?
Is there any log output containing "slow operation observed"?
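
If it helps, a rough way to check the logs for these messages is a small
script like the one below. It is only a sketch: the function name
scan_lines and the pattern labels are made up for illustration, the
"suicide" pattern is deliberately loose because the exact wording of that
message may differ between Ceph releases, and it assumes the OSD logs are
available as plain text files (in a containerized deployment they may
first have to be exported from the container runtime).

```python
import re
import sys
from collections import Counter

# Substrings to look for. "slow operation observed" and
# "heartbeat_check: no reply" are quoted in this thread; the "suicide"
# pattern is an assumption and intentionally broad.
PATTERNS = {
    "suicide_timeout": re.compile(r"suicide", re.IGNORECASE),
    "slow_operation": re.compile(r"slow operation observed"),
    "heartbeat_no_reply": re.compile(r"heartbeat_check: no reply"),
}


def scan_lines(lines):
    """Count matching lines per pattern and remember the first match of each."""
    counts = Counter()
    first_hit = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                first_hit.setdefault(name, line.rstrip("\n"))
    return counts, first_hit


if __name__ == "__main__":
    # Usage: python scan_osd_logs.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            counts, first_hit = scan_lines(handle)
        print(path, dict(counts))
        for name, line in first_hit.items():
            print(f"  first {name}: {line}")
```

Comparing the timestamp of the first "slow operation observed" or suicide
line with the first heartbeat failure should show which of them precedes
the other.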
Please also note the following PR and tracker comments which might be
relevant for your case.
https