I think a deep scrub would eventually catch this, right?
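
It should, since a deep scrub reads and checksums the actual object
data, whereas a regular scrub only compares object metadata and sizes
and may never touch the dead disk. You can also kick one off by hand
instead of waiting for the schedule; a minimal sketch, where osd.3
and 2.1f are placeholders for the failed OSD and one of its PGs:

   # ceph osd deep-scrub osd.3
   # ceph pg deep-scrub 2.1f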

On Wed, May 22, 2019 at 2:56 AM Eugen Block <ebl...@nde.ag> wrote:

> Hi Alex,
>
> > The cluster was idle at the time, being new and all.  I noticed
> > some disk-related errors in dmesg, but that was about it.
> > It looked to me like the failure went undetected for the next
> > 20-30 minutes.  All OSDs were up and in, and health was OK.  The
> > OSD logs had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start, as expected.
>
> if the cluster doesn't have to read from or write to specific OSDs
> (or sectors on that OSD), the failure won't be detected immediately.
> We had an issue last year where one of the SSDs (used for rocksdb
> and wal) failed, but that was never reported. We only discovered it
> when we tried to migrate the LVM to a new device and got read errors.
>
> > Later on, I performed the same operation during the fio benchmark,
> > and the OSD failed immediately.
>
> This confirms our experience: if there's data to read/write on that
> disk, the failure will be detected.
> Please note that this was on a Luminous cluster; I don't know if and
> how Nautilus has improved at sensing disk failures.
>
> Regards,
> Eugen
>
>
> Zitat von Alex Litvak <alexander.v.lit...@gmail.com>:
>
> > Hello cephers,
> >
> > I know that a similar question was posted 5 years ago.  However,
> > the answer was inconclusive for me.
> > I installed a new Nautilus 14.2.1 cluster and started pre-production
> > testing.  I followed the Red Hat document and simulated a soft disk
> > failure by
> >
> > #  echo 1 > /sys/block/sdc/device/delete
> >
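
As an aside, the inverse of that simulation (bringing the soft-deleted
disk back without a reboot) is a SCSI host rescan; a sketch, assuming
the disk hung off host0 (check /sys/class/scsi_host/ for the right
entry):

   # echo "- - -" > /sys/class/scsi_host/host0/scan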
> > The cluster was idle at the time, being new and all.  I noticed
> > some disk-related errors in dmesg, but that was about it.
> > It looked to me like the failure went undetected for the next
> > 20-30 minutes.  All OSDs were up and in, and health was OK.  The
> > OSD logs had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start, as expected.
> >
> > Later on, I performed the same operation during the fio benchmark,
> > and the OSD failed immediately.
> >
> > My question is: should the disk problem have been detected quickly
> > enough, even on an idle cluster?  I thought Nautilus had the means
> > to sense a failure before intensive IO hit the disk.
> > Am I wrong to expect that?
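
As far as I know, the failure "sensing" in Nautilus is the mgr's
SMART-based device health monitoring, which polls periodically (and,
if I recall correctly, is off by default in 14.2.x), so it wouldn't
catch a hot-removed disk instantly. A rough sketch of enabling and
querying it; the device ID placeholder is whatever ceph device ls
reports on your cluster:

   # ceph device monitoring on
   # ceph device ls
   # ceph device get-health-metrics <devid>

Also note that after the echo 1 > .../delete trick the disk vanishes
from the OS entirely, so a SMART scrape of it will simply fail rather
than report degraded health.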
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
