> > 
> > I've often wished for some sort of bottleneck finder for ceph. An easy
> > way for the system to say where it is experiencing critical latencies
> > e.g. network, journals, osd data disks, etc. This would assist
> > troubleshooting and initial deployments immensely.
> 
> As mentioned above, it's tricky. 
> Most certainly desirable, but the ole Mark I eyeball and wetware is quite
> good at spotting these when presented with appropriate input like atop.
> 

Are there any stats from the OSD perf dump that could help with that?
I've written a simple collectd wrapper to collect the op_ and subop_
rw/w/r_latency counters, but I'm not certain whether they will show
problems with the underlying storage; so far, every time I evicted a
"slow" OSD (3-5x higher latency than the others), another one took its
place.
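For reference, this is roughly what the wrapper does (a minimal sketch,
assuming the usual {"avgcount", "sum"} layout of the perf counters; the
counter names and admin socket access may differ on other versions):

    #!/usr/bin/env python
    # Sketch: read cumulative op/subop latency counters from an OSD's
    # admin socket (ceph daemon osd.N perf dump, or equivalently
    # ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump).
    import json
    import subprocess

    OSD_ID = 1  # hypothetical OSD id

    def perf_dump(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
        return json.loads(out)

    def avg_latency(counter):
        # counters are cumulative: {"avgcount": ops, "sum": seconds}
        if counter["avgcount"] == 0:
            return 0.0
        return counter["sum"] / counter["avgcount"]

    osd = perf_dump(OSD_ID)["osd"]
    for name in ("op_r_latency", "op_w_latency", "op_rw_latency",
                 "subop_latency", "subop_w_latency"):
        print("%s: %.6f s" % (name, avg_latency(osd[name])))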

My guess is that the OSD got the "short end of the CRUSH" and was
loaded with a few more requests than the rest, so the other OSDs ended
up waiting for it.
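To check that theory I count how many PGs each OSD is acting for,
roughly like this (a sketch; the JSON layout of `ceph pg dump` varies a
bit between releases):

    # Count acting PGs per OSD to spot an uneven CRUSH distribution.
    import json
    import subprocess
    from collections import Counter

    dump = json.loads(subprocess.check_output(
        ["ceph", "pg", "dump", "--format", "json"]))
    # depending on the release the list may be nested under "pg_map"
    pg_stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

    per_osd = Counter()
    for pg in pg_stats:
        for osd in pg["acting"]:
            per_osd[osd] += 1

    for osd, count in per_osd.most_common():
        print("osd.%d  %d PGs" % (osd, count))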

Also, is there any way to correlate the results of dump_historic_ops
between OSDs? I've noticed that in my case the longest ops are usually
"waiting for subops from X, Y", and apart from the timestamps there is
no other information to correlate them, e.g. to see that an op on osd.1
waited for a subop on osd.5, and that the subop on osd.5 was slow
because of y.
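What I'd like to automate looks roughly like the sketch below: pull
dump_historic_ops from a set of OSDs and print everything on one
timeline, so a primary op flagged "waiting for subops" can at least be
lined up by hand with the replica's entry. This is only a sketch: the
key names ("ops"/"Ops", "received_at"/"initiated_at") differ between
releases, the OSD ids are just the ones from my example, and
`ceph daemon` has to be run on the host where each OSD lives.

    import json
    import subprocess

    def historic_ops(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "dump_historic_ops"])
        data = json.loads(out)
        return data.get("ops") or data.get("Ops") or []

    timeline = []
    for osd_id in (1, 5):
        for op in historic_ops(osd_id):
            started = op.get("received_at") or op.get("initiated_at") or ""
            timeline.append((started, osd_id, op.get("duration"),
                             op["description"]))

    # Everything on one clock; the op flagged "waiting for subops"
    # should overlap in time with the replica's subop entry.
    for started, osd_id, duration, desc in sorted(timeline):
        print("%s  osd.%d  %s  %s" % (started, osd_id, duration, desc))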
-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com
