Hi, folks,

I have a mid-sized Ceph cluster serving as the Cinder backend for OpenStack (Queens).
During testing, one Ceph node went down unexpectedly and was powered up again about 10
minutes later, at which point the cluster started PG recovery. To my surprise, VM IOPS
dropped dramatically during recovery, from roughly 13K to about 400, a factor of about
30, even though I had put stringent throttling on backfill and recovery with the
following Ceph parameters:

    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_client_op_priority = 63
    osd_recovery_op_priority = 1
    osd_recovery_sleep = 0.5
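
For reference, this is roughly how the values were pushed and verified at runtime; a
sketch only, assuming injectargs is available on this release and using osd.0 purely
as an example OSD id:

    # push the throttles to all running OSDs (same values as above)
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_sleep 0.5'

    # double-check what one OSD is actually running with
    # (run on the node hosting osd.0)
    ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep|osd_.*op_priority'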

The weirdest part is:
1) When there is no IO activity from any VM (all VMs are quiet, so the only IO is
recovery), recovery bandwidth is about 10 MiB/s, 2 objects/s. The recovery throttle
settings seem to be working properly.
2) When running fio inside a VM, recovery bandwidth quickly climbs above 200 MiB/s,
60 objects/s, while fio inside the VM only reaches about 400 IOPS (8 KiB block size),
around 3 MiB/s. Recovery throttling obviously does NOT work properly here (an
illustrative fio invocation is sketched after this list).
3) If I stop the fio test in the VM, recovery bandwidth drops back to 10 MiB/s,
2 objects/s, strangely enough.
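
For context, the kind of fio run used inside the VM looks roughly like this (the
parameters are illustrative, not my exact job file, and /dev/vdb just stands in for
the attached RBD-backed volume); the recovery figures above are the kind reported by
ceph -s and ceph osd pool stats:

    # illustrative 8 KiB random-write test inside the VM
    fio --name=randwrite --filename=/dev/vdb --rw=randwrite --bs=8k \
        --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=300

    # recovery bandwidth and objects/s as reported by the cluster
    ceph -s
    ceph osd pool stats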

How can this weird behavior happen? Is there a way to cap recovery bandwidth at a
specific value, or to cap the number of recovered objects per second? That would give
much better control of backfilling/recovery than relying on the apparently faulty
logic of relative priorities (osd_client_op_priority vs osd_recovery_op_priority).
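
For what it's worth, my back-of-envelope expectation, assuming osd_recovery_sleep is
a per-OSD sleep inserted between consecutive recovery ops as documented, is:

    osd_recovery_sleep = 0.5 s   =>  at most ~1/0.5 = 2 recovery ops/s per recovering OSD
    osd_recovery_max_active = 1  =>  only 1 recovery op in flight per OSD at a time

That matches the ~2 objects/s I see on an idle cluster, but clearly not the
60 objects/s observed while the VM is busy.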

Any ideas or suggestions for getting recovery under control?

best regards,

Samuel





huxia...@horebdata.cn