(Sorry for top posting, this email client isn't great at editing)
The mitigation strategy I mentioned before, forcing backfill, could be
backported to jewel, but I don't think it's a very good option for RBD
users without SSDs.

In luminous there is a command (something like 'ceph pg force-recovery')
that you can use to prioritize recovery of particular PGs (and thus RBD
images, with some scripting). This would at least let you limit the
scope of affected images; a rough sketch of such a script follows after
the quoted message below. A couple of folks from OVH added the command
for just this purpose.

Neither of these is an ideal workaround, but I haven't thought of a
better one for existing versions.

Josh

Sent from Nine

________________________________
From: Florian Haas <flor...@hastexo.com>
Sent: Sep 15, 2017 3:43 PM
To: Josh Durgin
Cc: ceph-users@lists.ceph.com; Christian Theune
Subject: Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

> On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin <jdur...@redhat.com> wrote:
> >> So this affects just writes. Then I'm really not following the
> >> reasoning behind the current behavior. Why would you want to wait for
> >> the recovery of an object that you're about to clobber anyway? Naïvely
> >> thinking, an object like that would look like a candidate for
> >> *eviction* from the recovery queue, not promotion to a higher
> >> priority. Is this because the write could be a partial write, whereas
> >> recovery would need to cover the full object?
> >
> > Generally most writes are partial writes - for RBD that's almost always
> > the case - often writes are 512b or 4kb. It's also true for e.g. RGW
> > bucket index updates (adding an omap key/value pair).
>
> Sure, makes sense.
>
> >> This is all under the disclaimer that I have no detailed
> >> knowledge of the internals, so this is all handwaving, but would a more
> >> logical sequence of events not look roughly like this:
> >>
> >> 1. Are all replicas of the object available? If so, goto 4.
> >> 2. Is the write a full object write? If so, goto 4.
> >> 3. Read the local copy of the object, splice in the partial write,
> >> making it a full object write.
> >> 4. Evict the object from the recovery queue.
> >> 5. Replicate the write.
> >>
> >> Forgive the silly use of goto; I'm wary of email clients mangling
> >> indentation if I were to write this as a nested if block. :)
> >
> > This might be a useful optimization in some cases, but it would be
> > rather complex to add to the recovery code. It may be worth considering
> > at some point - same with deletes or other cases where the previous data
> > is not needed.
>
> Uh, yeah, waiting for an object to recover just so you can then delete
> it, and blocking the delete I/O in the process, does also seem rather
> strange.
>
> I think we do agree that any instance of I/O being blocked upward of
> 30s in a VM is really, really bad, but the way you describe it, I see
> little chance for a Ceph-deploying cloud operator to ever make a
> compelling case to their customers that such a thing is unlikely to
> happen. And I'm not even sure that a knee-jerk reaction to buy faster
> hardware would be a very prudent investment: it's basically all just a
> function of (a) how much I/O happens on the cluster during an outage,
> and (b) how many nodes/OSDs are affected by that outage. Neither is
> very predictable, and only (b) is something you have any influence over
> in a cloud environment. Beyond a certain threshold of either (a) or
> (b), the probability of *recovery* slowing a significant number of VMs
> to a crawl approaches 1.
>
> For an rgw bucket index pool, that's usually a small enough amount of
> data that you can sprinkle a few fast drives throughout your cluster,
> create a ruleset with a separate root (pre-Luminous) or one that uses
> device classes (Luminous and later), and then assign that ruleset to
> the pool. But for RBD storage, that's usually not an option, at least
> not at non-prohibitive cost.
>
> Can you share your suggested workaround / mitigation strategy for
> users that are currently being bitten by this behavior? If async
> recovery lands in mimic with no chance of a backport, then it'll be a
> while before LTS users get any benefit out of it.
>
> Cheers,
> Florian
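
To make the "some scripting" above concrete, here is a rough, untested
sketch of how one might force-recover just the PGs backing a single RBD
image on Luminous. The pool name, image name, and the JSON field names
are placeholders/assumptions on my part; it simply shells out to the
rbd, rados, and ceph CLIs, so treat it as a starting point rather than
a finished tool.

#!/usr/bin/env python
# Untested sketch: prioritize recovery of the PGs backing one RBD image.
# Assumes Luminous or later ('ceph pg force-recovery'), a format 2 image,
# and admin access to the rbd/rados/ceph CLIs.
import json
import subprocess

POOL = "rbd"         # placeholder pool name
IMAGE = "vm-disk-1"  # placeholder image name

def run(*cmd):
    return subprocess.check_output(cmd).decode()

# The image's data objects are all named <block_name_prefix>.<object no.>.
info = json.loads(run("rbd", "info", "--format", "json",
                      "{}/{}".format(POOL, IMAGE)))
prefix = info["block_name_prefix"]

# Map each of the image's data objects to its PG. Listing the whole pool
# is slow on large pools, but avoids guessing which object numbers exist.
pgs = set()
for obj in run("rados", "-p", POOL, "ls").splitlines():
    if obj.startswith(prefix):
        mapping = json.loads(run("ceph", "osd", "map", POOL, obj,
                                 "--format", "json"))
        pgs.add(mapping["pgid"])

# Push those PGs to the front of the recovery queue.
if pgs:
    print(run("ceph", "pg", "force-recovery", *sorted(pgs)))

There is a matching 'ceph pg cancel-force-recovery' to undo the
prioritization once the image is healthy again.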
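
And for completeness, since the bucket-index-on-fast-drives approach
came up: on Luminous the device-class variant really is just a couple
of CLI calls. The rule name below is made up, the pool name is the
default-zone RGW index pool, and this assumes the fast drives already
report the 'ssd' device class; again, only a sketch.

#!/usr/bin/env python
# Untested sketch (Luminous device classes): steer the RGW bucket index
# pool onto SSDs. Rule name and pool name are placeholders.
import subprocess

def ceph(*args):
    subprocess.check_call(("ceph",) + args)

# Replicated rule rooted at 'default', host failure domain, ssd class only.
ceph("osd", "crush", "rule", "create-replicated",
     "fast-index", "default", "host", "ssd")

# Point the bucket index pool at the new rule.
ceph("osd", "pool", "set", "default.rgw.buckets.index",
     "crush_rule", "fast-index")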