Building now

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel 
Just
Sent: 09 February 2017 19:22
To: Nick Fisk <n...@fisk.me.uk>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on 
master) passed a rados suite. It adds a configurable limit to the number of 
PGs which can be trimming on any OSD (default: 2). PGs that are trimming will 
be in the snaptrim state; PGs waiting to trim will be in the snaptrim_wait 
state. I suspect this will be adequate to throttle the amount of trimming. If 
not, I can try to add an explicit limit to the rate at which the work items 
trickle into the queue. Can someone test this branch? Tester beware: this has 
not merged into master yet and should only be run on a disposable cluster.
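
For anyone following along, the throttling model in the branch can be sketched roughly like this (a toy Python illustration of the idea described above, not the actual OSD code; the class and method names are made up, only the default cap of 2 and the snaptrim/snaptrim_wait state names come from the description):

```python
from collections import deque

# Toy model of the branch's behaviour: at most `max_trimming` PGs per OSD
# may be in the "snaptrim" state; the rest queue up in "snaptrim_wait".
class SnapTrimThrottle:
    def __init__(self, max_trimming=2):   # default: 2, as in the branch
        self.max_trimming = max_trimming
        self.trimming = set()             # PGs in snaptrim state
        self.waiting = deque()            # PGs in snaptrim_wait state

    def request_trim(self, pg):
        if len(self.trimming) < self.max_trimming:
            self.trimming.add(pg)         # starts trimming immediately
        else:
            self.waiting.append(pg)       # queued behind the cap

    def finish_trim(self, pg):
        self.trimming.discard(pg)
        if self.waiting:                  # promote the next waiter
            self.trimming.add(self.waiting.popleft())

t = SnapTrimThrottle()
for pg in ["1.0", "1.1", "1.2", "1.3"]:
    t.request_trim(pg)
print(sorted(t.trimming))   # → ['1.0', '1.1']  (two trim, two wait)
t.finish_trim("1.0")
print(len(t.trimming), len(t.waiting))   # → 2 1
```

The point is that trimming work is capped per OSD rather than paused globally, so client IO always has threads available.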

-Sam

 

On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <n...@fisk.me.uk> wrote:

Yeah, it’s probably just the fact that they have more PGs, so they hold more 
data and thus serve more IO. As they have a fixed IO limit, they will always 
hit that limit first and become the bottleneck.
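
As a back-of-the-envelope illustration of that (toy numbers, assuming IO spreads evenly across PGs and every spinner has the same IOPS ceiling):

```python
# A 4TB OSD carries ~33% more PGs than a 3TB one in a capacity-balanced
# cluster, so at the same per-PG IO rate it hits a fixed IOPS limit first.
disk_iops_limit = 150          # same ceiling for every disk (assumed)
pgs_3tb, pgs_4tb = 150, 200    # ~4/3 ratio, matching 3TB vs 4TB capacity

io_per_pg = disk_iops_limit / pgs_4tb   # per-PG rate when the big OSDs saturate
load_3tb = pgs_3tb * io_per_pg          # what the 3TB OSDs see at that point
print(f"4TB OSDs saturated while 3TB OSDs run at {load_3tb / disk_iops_limit:.0%}")
```

So the larger OSDs cap the whole cluster while the smaller ones still have headroom.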

 

The main problem with reducing the filestore queue is that I believe you will 
start to lose the benefit of having IOs queued up on the disk, where the 
scheduler can rearrange them and action them in the most efficient manner as 
the disk head moves across the platters. You might see up to a 20% hit on peak 
performance in exchange for more consistent client latency.

 

From: Steve Taylor [mailto:steve.tay...@storagecraft.com] 
Sent: 07 February 2017 20:35
To: n...@fisk.me.uk; ceph-users@lists.ceph.com


Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Thanks, Nick.

 

One other data point that has come up is that nearly all of the blocked 
requests that are waiting on subops are waiting for the OSDs with more PGs 
than the others. My test cluster has 184 OSDs, 177 of which are 3TB and 7 of 
which are 4TB. The cluster is well balanced based on OSD capacity, so those 7 
OSDs individually have 33% more PGs than the others, and they are causing 
almost all of the blocked requests. It appears that map updates are generally 
not blocking long enough to show up as blocked requests.

 

I set the reweight on those 7 OSDs to 0.75 and things are backfilling now. I’ll 
test some more when the PG counts per OSD are more balanced and see what I get. 
I’ll also play with the filestore queue. I was telling some of my colleagues 
yesterday that this looked likely to be related to buffer bloat somewhere. I 
appreciate the suggestion.

 

  _____  


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

  _____  


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.

  _____  

From: Nick Fisk [mailto:n...@fisk.me.uk] 
Sent: Tuesday, February 7, 2017 10:25 AM
To: Steve Taylor <steve.tay...@storagecraft.com>; ceph-users@lists.ceph.com
Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Hi Steve,

 

From what I understand, the issue is not with the queueing in Ceph, which is 
correctly moving client IO to the front of the queue. The problem lies below 
what Ceph controls, i.e. the scheduler and disk layer in Linux. Once the IOs 
leave Ceph it’s a bit of a free-for-all, and the client IOs tend to get lost 
in large disk queues surrounded by all the snap trim IOs.

 

The workaround Sam is working on will limit the number of snap trims that are 
allowed to run, which I believe will have a similar effect to the sleep 
parameters in pre-Jewel clusters, but without pausing the whole IO thread.

 

Ultimately the solution requires Ceph to be able to control the queuing of IOs 
at the lower levels of the kernel. Whether this is via some sort of tagging 
per IO (currently CFQ is only per thread/process) or some other method, I 
don’t know. I was speaking to Sage and he thinks the easiest method might be 
to shrink the filestore queue so that you don’t get buffer bloat at the disk 
level. You should be able to test this out pretty easily now by changing the 
parameter; a queue depth of around 5-10 would probably be about right for 
spinning disks. It’s a trade-off of peak throughput vs queue latency, though.
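
For anyone wanting to try that, a minimal ceph.conf fragment might look like this (the queue depth of 10 is picked from the 5-10 range above; verify the option name and default against your release before relying on it):

```ini
[osd]
# Shrink the filestore queue so IOs don't pile up below Ceph
# (trade-off: lower peak throughput for more consistent latency)
filestore queue max ops = 10
```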

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 07 February 2017 17:01
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

As I look at more of these stuck ops, it looks like more of them are actually 
waiting on subops than on osdmap updates, so maybe there is still some headway 
to be made with the weighted priority queue settings. I do see OSDs waiting for 
map updates all the time, but they aren’t blocking things as much as the subops 
are. Thoughts?

 


From: Steve Taylor 
Sent: Tuesday, February 7, 2017 9:13 AM
To: 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Sorry, I lost the previous thread on this. I apologize for the resulting 
incomplete reply.

 

The issue that we’re having with Jewel, as David Turner mentioned, is that we 
can’t seem to throttle snap trimming sufficiently to prevent it from blocking 
I/O requests. On further investigation, I encountered 
osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be 
used in conjunction with ‘osd_op_queue = wpq’ to govern the availability of 
queue positions for various operations based on their costs. I’m testing with 
RBDs using 4MB objects, so in order to leave plenty of room in the weighted 
priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to 
64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially 
reserve 32MB in the queue for client I/O operations, which are prioritized 
higher and therefore shouldn’t get blocked.
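
If I'm reading those settings right, the arithmetic works out like this (a sketch of the token accounting as I understand it, not the actual wpq code):

```python
# Token-bucket arithmetic for the settings described above.
MiB = 1024 * 1024
max_tokens = 64 * MiB           # osd_op_pq_max_tokens_per_priority
snap_trim_cost = 32 * MiB + 1   # osd_snap_trim_cost

# Because one trim costs just over half the bucket, a second trim op can
# never fit alongside the first, leaving just under 32MiB for client I/O.
concurrent_trims = max_tokens // snap_trim_cost
tokens_left_for_client = max_tokens - concurrent_trims * snap_trim_cost
print(concurrent_trims, tokens_left_for_client)
```

That matches the intent of reserving roughly half the queue for client ops; the question is why blocked requests still show up anyway.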

 

I still see blocked I/O requests, and when I dump in-flight ops, they show ‘op 
must wait for map.’ I assume this means that what’s blocking the I/O requests 
at this point is all of the osdmap updates caused by snap trimming, and not the 
actual snap trimming itself starving the ops of op threads. Hammer is able to 
mitigate this with osd_snap_trim_sleep by directly throttling snap trimming and 
therefore causing less frequent osdmap updates, but there doesn’t seem to be a 
good way to accomplish the same thing with Jewel.
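
For comparison, the Hammer-era throttle being referred to is just a per-trim pause set in ceph.conf (the 0.1s value here is an arbitrary example, not a recommendation):

```ini
[osd]
# Sleep (in seconds) between snap trim operations; throttles trimming
# directly, which in turn slows the resulting osdmap churn.
osd snap trim sleep = 0.1
```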

 

First of all, am I understanding these settings correctly? If so, are there 
other settings that could potentially help here, or do we just need something 
like what Sam already mentioned that can reserve threads for client I/O 
requests? Even then, it seems like we might still have issues if we can’t also 
throttle snap trimming. We delete a LOT of RBD snapshots on a daily basis, 
which we recognize is an extreme use case. Just wondering if there’s something 
else to try, or if we need to start working toward implementing something new 
ourselves to handle our use case better.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
