(Sorry for top posting, this email client isn't great at editing)
The mitigation strategy I mentioned before, forcing backfill, could be
backported to jewel, but I don't think it's a very good option for RBD
users without SSDs.

In luminous there is a command (something like 'ceph pg force-recovery')
that you can use to prioritize recovery of particular PGs (and thus RBD
images, with some scripting). This would at least let you limit the
scope of affected images; a rough sketch of such a script follows after
the quoted message below. A couple of folks from OVH added the command
for just this purpose.

Neither of these is an ideal workaround, but I haven't thought of a
better one for existing versions.

Josh

Sent from Nine

________________________________
From: Florian Haas <flor...@hastexo.com>
Sent: Sep 15, 2017 3:43 PM
To: Josh Durgin
Cc: ceph-users@lists.ceph.com; Christian Theune
Subject: Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

> On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin <jdur...@redhat.com> wrote:
> >> So this affects just writes. Then I'm really not following the
> >> reasoning behind the current behavior. Why would you want to wait for
> >> the recovery of an object that you're about to clobber anyway? Naïvely
> >> thinking, an object like that would look like a candidate for
> >> *eviction* from the recovery queue, not promotion to a higher
> >> priority. Is this because the write could be a partial write, whereas
> >> recovery would need to cover the full object?
> >
> > Generally most writes are partial writes - for RBD that's almost always
> > the case - often writes are 512b or 4kb. It's also true for e.g. RGW
> > bucket index updates (adding an omap key/value pair).
>
> Sure, makes sense.
>
> >> This is all under the disclaimer that I have no detailed
> >> knowledge of the internals, so this is all handwaving, but would a more
> >> logical sequence of events not look roughly like this:
> >>
> >> 1. Are all replicas of the object available? If so, goto 4.
> >> 2. Is the write a full object write? If so, goto 4.
> >> 3. Read the local copy of the object, splice in the partial write,
> >> making it a full object write.
> >> 4. Evict the object from the recovery queue.
> >> 5. Replicate the write.
> >>
> >> Forgive the silly use of goto; I'm wary of email clients mangling
> >> indentation if I were to write this as a nested if block. :)
> >
> > This might be a useful optimization in some cases, but it would be
> > rather complex to add to the recovery code. It may be worth considering
> > at some point - same with deletes or other cases where the previous data
> > is not needed.
>
> Uh, yeah, waiting for an object to recover just so you can then delete
> it, and blocking the delete I/O in the process, does also seem rather
> strange.
>
> I think we do agree that any instance of I/O being blocked upward of
> 30s in a VM is really, really bad, but the way you describe it, I see
> little chance for a Ceph-deploying cloud operator to ever make a
> compelling case to their customers that such a thing is unlikely to
> happen. And I'm not even sure that a knee-jerk reaction to buy faster
> hardware would be a very prudent investment: it's basically all just a
> function of (a) how much I/O happens on the cluster during an outage,
> and (b) how many nodes/OSDs are affected by that outage. Neither is
> very predictable, and only (b) is something you have any influence over
> in a cloud environment. Beyond a certain threshold of either (a) or
> (b), the probability of *recovery* slowing a significant number of VMs
> to a crawl approaches 1.
>
> For an rgw bucket index pool, that's usually a small enough amount of
> data that you can sprinkle a few fast drives throughout your cluster,
> create a ruleset with a separate root (pre-Luminous) or one that uses
> device classes (Luminous and later), and then assign that ruleset to
> the pool. But for RBD storage, that's usually not an option, at least
> not at non-prohibitive cost.
>
> Can you share your suggested workaround / mitigation strategy for
> users that are currently being bitten by this behavior? If async
> recovery lands in mimic with no chance of a backport, then it'll be a
> while before LTS users get any benefit out of it.
>
> Cheers,
> Florian
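
To make the "some scripting" above concrete, here is a rough, untested
sketch of how one might force-recover just the PGs backing a single RBD
image on Luminous. The pool name, image name, and the JSON field names
are placeholders/assumptions on my part; it simply shells out to the
rbd, rados, and ceph CLIs, so treat it as a starting point rather than
a finished tool.

#!/usr/bin/env python
# Untested sketch: prioritize recovery of the PGs backing one RBD image.
# Assumes Luminous or later ('ceph pg force-recovery'), a format 2 image,
# and admin access to the rbd/rados/ceph CLIs.
import json
import subprocess

POOL = "rbd"         # placeholder pool name
IMAGE = "vm-disk-1"  # placeholder image name

def run(*cmd):
    return subprocess.check_output(cmd).decode()

# The image's data objects are all named <block_name_prefix>.<object no.>.
info = json.loads(run("rbd", "info", "--format", "json",
                      "{}/{}".format(POOL, IMAGE)))
prefix = info["block_name_prefix"]

# Map each of the image's data objects to its PG. Listing the whole pool
# is slow on large pools, but avoids guessing which object numbers exist.
pgs = set()
for obj in run("rados", "-p", POOL, "ls").splitlines():
    if obj.startswith(prefix):
        mapping = json.loads(run("ceph", "osd", "map", POOL, obj,
                                 "--format", "json"))
        pgs.add(mapping["pgid"])

# Push those PGs to the front of the recovery queue.
if pgs:
    print(run("ceph", "pg", "force-recovery", *sorted(pgs)))

There is a matching 'ceph pg cancel-force-recovery' to undo the
prioritization once the image is healthy again.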
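
And for completeness, since the bucket-index-on-fast-drives approach
came up: on Luminous the device-class variant really is just a couple
of CLI calls. The rule name below is made up, the pool name is the
default-zone RGW index pool, and this assumes the fast drives already
report the 'ssd' device class; again, only a sketch.

#!/usr/bin/env python
# Untested sketch (Luminous device classes): steer the RGW bucket index
# pool onto SSDs. Rule name and pool name are placeholders.
import subprocess

def ceph(*args):
    subprocess.check_call(("ceph",) + args)

# Replicated rule rooted at 'default', host failure domain, ssd class only.
ceph("osd", "crush", "rule", "create-replicated",
     "fast-index", "default", "host", "ssd")

# Point the bucket index pool at the new rule.
ceph("osd", "pool", "set", "default.rgw.buckets.index",
     "crush_rule", "fast-index")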