I think a deep scrub would eventually catch this, right?
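It should, since a deep scrub reads back every object replica and would
have to touch the dead device. If you don't want to wait for the scrub
interval, something like this should force one (osd.3 and pg 2.1f are
placeholders for your own IDs):

# ceph osd deep-scrub osd.3     # deep-scrub the PGs for which osd.3 is primary
# ceph pg deep-scrub 2.1f       # or target a single placement group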
On Wed, May 22, 2019 at 2:56 AM Eugen Block <ebl...@nde.ag> wrote:
> Hi Alex,
>
> > The cluster was idle at the time, being new and all. I noticed
> > some disk-related errors in dmesg, but that was about it. For the
> > next 20-30 minutes the failure went undetected: all OSDs were up
> > and in, and health was OK. The OSD logs had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start, as expected.
>
> If the cluster doesn't have to read from or write to a specific OSD
> (or to specific sectors on that OSD), the failure won't be detected
> immediately. We had an issue last year where one of the SSDs (used
> for rocksdb and wal) failed, but that was never reported. We only
> discovered it when we tried to migrate the LVM volume to a new
> device and got read errors.
>
> > Later on, I performed the same operation during a fio benchmark,
> > and the OSD failed immediately.
>
> This confirms our experience: if there is data to read or write on
> that disk, the failure will be detected.
> Please note that this was on a Luminous cluster; I don't know if or
> how Nautilus has improved at sensing disk failures.
>
> Regards,
> Eugen
>
>
> Zitat von Alex Litvak <alexander.v.lit...@gmail.com>:
>
> > Hello cephers,
> >
> > I know that a similar question was posted 5 years ago, but the
> > answer was inconclusive for me.
> > I installed a new Nautilus 14.2.1 cluster and started
> > pre-production testing. Following a Red Hat document, I simulated
> > a soft disk failure with:
> >
> > # echo 1 > /sys/block/sdc/device/delete
> >
> > The cluster was idle at the time, being new and all. I noticed
> > some disk-related errors in dmesg, but that was about it. For the
> > next 20-30 minutes the failure went undetected: all OSDs were up
> > and in, and health was OK. The OSD logs had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start, as expected.
> >
> > Later on, I performed the same operation during a fio benchmark,
> > and the OSD failed immediately.
> >
> > My question is: should the disk problem have been detected quickly
> > enough even on an idle cluster? I thought Nautilus had the means
> > to sense a failure before intensive I/O hit the disk. Am I wrong
> > to expect that?
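As an aside, for the benchmark step mentioned above, even a short fio
run against an RBD image tends to be enough load to surface the failure
within seconds. A minimal sketch, assuming fio was built with rbd
support and that the pool name ("rbd"), image name ("bench"), and
client name ("admin") are placeholders for your own setup:

# fio --name=rbdwrite --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=bench --rw=randwrite --bs=4k --iodepth=32 \
      --size=1G --runtime=60 --time_based

Random 4k writes at some queue depth spread I/O across the OSDs, so a
device that can no longer service requests should show up quickly.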