Hello,

We are running Ceph 0.80.5 on our production cluster, and we see slow requests
whenever we delete RBD snapshots. We have since reduced our snapshot count to
four weeklies, but the number of snapshots does not seem to be a factor in this
problem. While a deletion runs, the cluster is practically unresponsive for so
long that clients time out.

Here are last night's ten slowest requests, one per OSD (times in seconds):

1       /var/log/ceph/ceph-osd.46.log   1920
2       /var/log/ceph/ceph-osd.42.log   1455
3       /var/log/ceph/ceph-osd.74.log   1292
4       /var/log/ceph/ceph-osd.77.log   1170
5       /var/log/ceph/ceph-osd.48.log   1083
6       /var/log/ceph/ceph-osd.0.log    960
7       /var/log/ceph/ceph-osd.40.log   960
8       /var/log/ceph/ceph-osd.57.log   960
9       /var/log/ceph/ceph-osd.61.log   960
10      /var/log/ceph/ceph-osd.76.log   960
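A ranking like the one above can be produced with a short script along these
lines. This is a minimal sketch, assuming each OSD log contains warning lines
containing "slow request <seconds> seconds old" (the exact wording of the 0.80.x
message is an assumption here) and that the logs live at
/var/log/ceph/ceph-osd.*.log as in the table. The helper names
max_slow_request and top_slowest are hypothetical.

```python
import re
from pathlib import Path

# Assumed log format (not verified against 0.80.x): OSD logs contain
# warnings like "... [WRN] slow request 1920.123 seconds old, received at ..."
SLOW_RE = re.compile(r"slow request (\d+(?:\.\d+)?) seconds old")


def max_slow_request(path):
    """Return the largest slow-request age (seconds) in one log, or None."""
    worst = None
    with open(path, errors="replace") as f:
        for line in f:
            m = SLOW_RE.search(line)
            if m:
                age = float(m.group(1))
                if worst is None or age > worst:
                    worst = age
    return worst


def top_slowest(paths, n=10):
    """Rank logs by their single slowest request; logs with none are skipped."""
    results = []
    for p in paths:
        worst = max_slow_request(p)
        if worst is not None:
            results.append((str(p), worst))
    # Stable sort: logs with equal ages keep the order they were given in.
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:n]


if __name__ == "__main__":
    # Assumed log location; adjust the glob for your installation.
    paths = sorted(Path("/var/log/ceph").glob("ceph-osd.*.log"))
    for rank, (path, seconds) in enumerate(top_slowest(paths), start=1):
        print(f"{rank}\t{path}\t{int(seconds)}")
```

The glob only matches the current, uncompressed logs, so rotated files (for
example .gz archives) would need to be decompressed or listed separately to
cover a particular night.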

Some OSDs do not report slow requests at all; the slow requests are not evenly
distributed across the OSDs.

Currently our journals are on the OSD SATA drives, but we are considering
upgrading to SSD journals. However, we have no performance problems other than
when deleting snapshots.

Is there any way to mitigate the problem other than investing in SSD journals?

-- 
  Eino Tuominen
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
