I've seen something like this a few times.

Once, I lost the battery in my battery-backed RAID card.  That caused all
the OSDs on that host to be slow, which triggered slow request notices
pretty much cluster-wide.  It was only when I histogrammed the slow request
notices that I saw most of them were on a single node.  I compared the disk
latency graphs between nodes, and saw that one node had much higher write
latency.  This took me a while to track down.
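For what it's worth, histogramming those notices is easy to script.  Here's a rough Python sketch; the sample log lines and the regex are assumptions, since the exact "slow request" message format varies by Ceph version, so adjust it against your own logs:

```python
import re
from collections import Counter

def histogram_slow_requests(lines):
    """Count "slow request" notices per reporting OSD."""
    counts = Counter()
    for line in lines:
        if "slow request" not in line:
            continue
        m = re.search(r"osd\.(\d+)", line)  # first osd.N token on the line
        if m:
            counts["osd." + m.group(1)] += 1
    return counts

# Hypothetical log lines, for illustration only:
sample = [
    "osd.12 [WRN] slow request 30.51 seconds old",
    "osd.12 [WRN] slow request 31.02 seconds old",
    "osd.7 [WRN] slow request 30.20 seconds old",
]
for osd, n in histogram_slow_requests(sample).most_common():
    print(osd, n)
```

If one OSD (or all the OSDs on one host) dominates the counts, that's where I'd look first.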

Another time, I had a consumer HDD that was slowly failing.  It would hit a
group of bad sectors, remap, repeat.  SMART warned me about it, so I
replaced the disk after the second round of slow request alerts.  This was
pretty straightforward to diagnose, but only because smartd notified me.


In both cases, I saw "slow request" notices on the affected disks.  Your
osd.284 says osd.186 and osd.177 are being slow, but osd.186 and osd.177
don't claim to be slow themselves.

It's possible that there is another slow disk that is causing replication on
osd.186 and osd.177 to slow down.  Given how PGs are distributed over OSDs,
one disk being a little slow can affect a large number of OSDs.


If SMART doesn't show you a failing disk, I'd start looking for disks
(the disk itself, not the OSD daemon) with high latency around your
problem times.  When you focus on the problem times, give yourself a
+/- 10 minute window.  Sometimes it takes a little while for the disk
slowness to spread out enough for Ceph to complain.
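If you've captured `iostat -x` output around those windows, something like the following Python sketch can flag the outliers.  The `w_await` column name and the 20 ms threshold are assumptions here; column names and sane thresholds vary by sysstat version and hardware:

```python
def slow_devices(iostat_text, threshold_ms=20.0):
    """Return (device, w_await) pairs whose average write latency
    exceeds threshold_ms, parsed from `iostat -x` device-table output."""
    header = None
    slow = []
    for line in iostat_text.strip().splitlines():
        cols = line.split()
        if cols and cols[0] == "Device:":
            header = cols  # remember column names
            continue
        # Only parse rows that line up with the device-table header.
        if header and cols and len(cols) == len(header):
            row = dict(zip(header, cols))
            if "w_await" in row and float(row["w_await"]) > threshold_ms:
                slow.append((row["Device:"], float(row["w_await"])))
    return slow

# Hypothetical iostat -x capture, for illustration only:
sample = """\
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.20    0.50   12.30    20.00   150.00    26.56     0.04    3.10    2.00    3.20   0.90   1.15
sdb               0.00     0.80    0.40   11.90    16.00   140.00    25.36     2.10  180.00   10.00  190.00   8.00  98.50
"""
print(slow_devices(sample))
```

A disk sitting at hundreds of milliseconds of write latency while its neighbors are in single digits is a pretty strong lead, even when SMART is quiet.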


On Wed, Apr 15, 2015 at 3:20 PM, Dominik Mostowiec <
dominikmostow...@gmail.com> wrote:

> Hi,
> For a few days we have noticed many slow requests on our cluster.
> Cluster:
> ceph version 0.67.11
> 3 x mon
> 36 hosts -> 10 osd ( 4T ) + 2 SSD (journals)
> Scrubbing and deep scrubbing are disabled, but the count of slow requests
> is still increasing.
> Disk utilisation is very low since we disabled scrubbing.
> Log from one write with slow with debug osd = 20/20
> osd.284 - master: http://pastebin.com/xPtpNU6n
> osd.186 - replica: http://pastebin.com/NS1gmhB0
> osd.177 - replica: http://pastebin.com/Ln9L2Z5Z
>
> Can you help me find the reason for it?
>
> --
> Regards
> Dominik
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>