Thanks for the hint, but it seems that I/O is not the trigger. This is the I/O
over the last 3 hours: http://prntscr.com/mpy0g9 - nothing above 350 IOPS,
which is IMHO a very small load for SSDs. Within this 3-hour window I
experienced the REQUEST_SLOW three times (in different minutes, though).



These times are different each day, so it is not a periodic task.
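
For reference, a minimal sketch of how I could sample the numbers over time: it
appends timestamped "ceph osd perf" and "ceph -s" JSON to a file so the
slow-request windows can be lined up with per-OSD latency (the interval and the
output path are arbitrary placeholders):

import json
import subprocess
import time

OUT = "/var/log/ceph-perf-samples.jsonl"   # placeholder output path
INTERVAL = 60                              # placeholder sample interval (seconds)

def ceph_json(*args):
    # Run a ceph CLI command with JSON output and parse it.
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out.decode())

while True:
    sample = {
        "ts": time.strftime("%Y-%m-%d %H:%M:%S"),
        "osd_perf": ceph_json("osd", "perf"),  # per-OSD commit/apply latency
        "status": ceph_json("-s"),             # cluster state incl. client I/O rates
    }
    with open(OUT, "a") as f:
        f.write(json.dumps(sample) + "\n")
    time.sleep(INTERVAL)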




Martin


---------- Original e-mail ----------
From: David Turner <drakonst...@gmail.com>
To: mart.v <mar...@seznam.cz>
Date: 22. 2. 2019 12:23:37
Subject: Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time
"
Can you correlate the times to scheduled tasks inside any of the VMs? For
instance, if you have several Linux VMs with the updatedb command installed,
by default they will all scan their disks at the same time each day to index
where files are. Other common culprits could be scheduled backups, db cleanup,
etc. Do you track cluster I/O at all? When I first configured a graphing tool
on my home cluster I found the updatedb/locate command causing a drastic I/O
spike at the same time every day. I also found a spike when a couple of
Windows VMs were checking for updates automatically.
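
If you want to test the updatedb theory, a quick check inside a VM is when the
locate database was last rebuilt, for example (assuming the Debian/Ubuntu
mlocate default path; other distros differ):

import datetime
import os

# /var/lib/mlocate/mlocate.db is the Debian/Ubuntu default location of the
# locate database; the path differs on other distributions.
DB = "/var/lib/mlocate/mlocate.db"
mtime = os.stat(DB).st_mtime
print("updatedb last ran at:", datetime.datetime.fromtimestamp(mtime))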



On Fri, Feb 22, 2019, 4:28 AM mart.v <mar...@seznam.cz> wrote:

"

Hello everyone,




I'm experiencing a strange behaviour. My cluster is relatively small (43
OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are connected
via a 10 Gbit network (Nexus 6000). The cluster is mixed (SSD and HDD), but
with separate pools. The described error occurs only on the SSD part of the
cluster.




I noticed that a few times a day the cluster slows down a bit, and I
discovered this in the logs:




2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159 : cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec. Implicated osds 10,22,33 (REQUEST_SLOW)
2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169 : cluster [WRN] Health check update: 199 slow requests are blocked > 32 sec. Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41 (REQUEST_SLOW)
2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183 : cluster [WRN] Health check update: 448 slow requests are blocked > 32 sec. Implicated osds 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41 (REQUEST_SLOW)
2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210 : cluster [WRN] Health check update: 388 slow requests are blocked > 32 sec. Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests are blocked > 32 sec. Implicated osds 8,16)





"ceph health detail" shows nothing more




It is happening throughout the whole day, and the times cannot be linked to
any read- or write-intensive task (e.g. a backup). I also tried disabling
scrubbing, but the problem kept on going. These errors were not there from the
beginning, but unfortunately I cannot pin down the day they started (it is
beyond my log retention).





Any ideas?




Thank you!

Martin

"
"

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
