On Mon, Sep 18, 2017 at 4:11 AM Florian Haas <flor...@hastexo.com> wrote:
> On 09/16/2017 01:36 AM, Gregory Farnum wrote:
> > On Mon, Sep 11, 2017 at 1:10 PM Florian Haas <flor...@hastexo.com> wrote:
> > > On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick <patrick.mcl...@sony.com> wrote:
> > > > On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> > > > > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick <patrick.mcl...@sony.com> wrote:
> > > > > > On a related note, we are very curious why the snapshot id is
> > > > > > incremented when a snapshot is deleted; this creates lots of
> > > > > > phantom entries in the deleted snapshots set. Interleaved
> > > > > > deletions and creations will cause massive fragmentation in
> > > > > > the interval set. The only reason we can come up with for this
> > > > > > is to track if anything changed, but I suspect a different
> > > > > > value that doesn't inject entries into the interval set might
> > > > > > be better for this purpose.
> > > > >
> > > > > Yes, it's because having a sequence number tied in with the
> > > > > snapshots is convenient for doing comparisons. Those aren't
> > > > > leaked snapids that will make holes; when we increment the
> > > > > snapid to delete something we also stick it in the
> > > > > removed_snaps set. (I suppose if you alternate deleting a
> > > > > snapshot with adding one that does increase the size until you
> > > > > delete those snapshots; hrmmm. Another thing to avoid doing, I
> > > > > guess.)
> > > >
> > > > Fair enough, though it seems like these limitations of the
> > > > snapshot system should be documented.
> > >
> > > This is why I was so insistent on numbers, formulae or even
> > > rules-of-thumb to predict what works and what does not. Greg's "one
> > > snapshot per RBD per day is probably OK" from a few months ago
> > > seemed promising, but looking at your situation it's probably not
> > > that useful a rule.
> > >
> > > > We most likely would have used a completely different strategy
> > > > if it was documented that certain snapshot creation and removal
> > > > patterns could cause the cluster to fall over over time.
> > >
> > > I think right now there are probably very few people, if any, who
> > > could *describe* the pattern that causes this. That complicates
> > > matters of documentation. :)
> > >
> > > > > > > It might really just be the osdmap update processing -- that
> > > > > > > would make me happy as it's a much easier problem to
> > > > > > > resolve. But I'm also surprised it's *that* expensive, even
> > > > > > > at the scales you've described.
> > >
> > > ^^ This is what I mean. It's kind of tough to document things if
> > > we're still in "surprised that this is causing harm" territory.
> > >
> > > > > > That would be nice, but unfortunately all the data is
> > > > > > pointing to PGPool::Update(),
> > > > >
> > > > > Yes, that's the OSDMap update processing I referred to. This is
> > > > > good in terms of our ability to remove it without changing
> > > > > client interfaces and things.
> > > >
> > > > That is good to hear, hopefully this stuff can be improved soon
> > > > then.
> > >
> > > Greg, can you comment on just how much potential improvement you
> > > see here? Is it more like "oh we know we're doing this one thing
> > > horribly inefficiently, but we never thought this would be an issue
> > > so we shied away from premature optimization, but we can easily
> > > reduce 70% CPU utilization to 1%" or rather like "we might be able
> > > to improve this by perhaps 5%, but 100,000 RBDs is too many if you
> > > want to be using snapshotting at all, for the foreseeable future"?
> >
> > I got the chance to discuss this a bit with Patrick at the Open
> > Source Summit Wednesday (good to see you!).
> >
> > So the idea in the previously-referenced CDM talk essentially
> > involves changing the way we distribute snap deletion instructions
> > from a "deleted_snaps" member in the OSDMap to a "deleting_snaps"
> > member that gets trimmed once the OSDs report to the manager that
> > they've finished removing that snapid. This should entirely resolve
> > the CPU burn they're seeing during OSDMap processing on the nodes, as
> > it shrinks the intersection operation down from "all the snaps" to
> > merely "the snaps not-done-deleting".
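To make the difference concrete, here's a toy stand-alone sketch of that
trimming scheme (plain C++ with std::set/std::map standing in for the
interval set; every name in it is invented for illustration and none of
this is actual OSDMap, OSD or mgr code). The only point it tries to show
is that per-epoch work on an OSD would track the deletions still in
flight rather than every snapshot ever deleted:

// Toy sketch of the "deleting_snaps" idea described above: the map only
// carries deletions that are still in flight, and an entry is trimmed
// once every OSD has reported it purged. Invented names; not Ceph code.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <set>

using snapid = std::uint64_t;

struct ToyCluster {
  int num_osds;
  std::set<snapid> deleting_snaps;   // what the OSDMap would carry
  std::map<snapid, int> purge_acks;  // snapid -> number of OSDs done so far

  // mon/mgr side: a snapshot deletion is requested.
  void start_delete(snapid s) { deleting_snaps.insert(s); }

  // OSD side: per-epoch processing only walks the in-flight entries.
  std::size_t osd_work_per_epoch() const { return deleting_snaps.size(); }

  // mgr side: an OSD reports it finished purging a snapid; once all OSDs
  // have reported, the entry is dropped from subsequent maps.
  void osd_reports_purged(snapid s) {
    if (++purge_acks[s] == num_osds) {
      deleting_snaps.erase(s);
      purge_acks.erase(s);
    }
  }
};

int main() {
  ToyCluster c{/*num_osds=*/3, {}, {}};

  // 10,000 snapshots deleted over the pool's lifetime...
  for (snapid s = 1; s <= 10000; ++s) {
    c.start_delete(s);
    // ...each fully purged (and therefore trimmed) before the next one.
    for (int osd = 0; osd < c.num_osds; ++osd)
      c.osd_reports_purged(s);
  }

  // Only a couple of deletions are in flight right now.
  c.start_delete(10001);
  c.start_delete(10002);

  std::cout << "entries each OSD touches per epoch: "
            << c.osd_work_per_epoch() << "\n";  // 2, not 10000
}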
> > The other reason we maintain the full set of deleted snaps is to
> > prevent client operations from re-creating deleted snapshots — we
> > filter all client IO which includes snaps against the deleted_snaps
> > set in the PG. Apparently this is also big enough in RAM to be a real
> > (but much smaller) problem.
> >
> > Unfortunately eliminating that is a lot harder

> Just checking here, for clarification: what is "that" here? Are you
> saying that eliminating the full set of deleted snaps is harder than
> introducing a deleting_snaps member, or that both are harder than
> potential mitigation strategies that were previously discussed in this
> thread?

Eliminating the full set we store on the OSD node is much harder than
converting the OSDMap to specify deleting_ rather than deleted_snaps —
the former at minimum requires changes to the client protocol and we're
not actually sure how to do it; the latter can be done internally to the
cluster and has a well-understood algorithm to implement.

> > and a permanent fix will involve changing the client protocol in
> > ways nobody has quite figured out how to do. But Patrick did suggest
> > storing the full set of deleted snaps on-disk and only keeping
> > in-memory the set which covers snapids in the range we've actually
> > *seen* from clients. I haven't gone through the code but that seems
> > broadly feasible — the hard part will be working out the rules for
> > when you have to go to disk to read a larger part of the
> > deleted_snaps set. (Perfectly feasible.)
> >
> > PRs are of course welcome! ;)
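To give that suggestion a rough shape, here's another stand-alone sketch
(again plain C++, invented names, std::map as a stand-in for the interval
set, and the on-disk lookup reduced to a stub; it's a sketch of the idea,
not a design and certainly not Ceph code). The part it deliberately
glosses over, how the low-water mark gets chosen and maintained, is
exactly the "when do you have to go to disk" question above:

// One possible shape of the on-disk/in-memory split suggested above:
// keep in RAM only the part of the deleted-snaps set above a low-water
// mark derived from the snapids clients have actually been seen using,
// and fall back to a (stubbed-out) on-disk lookup for anything older.
#include <cstdint>
#include <iostream>
#include <map>

using snapid = std::uint64_t;
using IntervalSet = std::map<snapid, snapid>;  // interval start -> length

struct DeletedSnapsCache {
  snapid low_water_mark;  // snapids below this are only recorded on disk
  IntervalSet in_memory;  // deleted intervals at or above low_water_mark

  // Stub: a real version would read from the object store. Here it only
  // shows that a slow path was taken.
  bool deleted_on_disk(snapid s) const {
    std::cout << "  (slow path: on-disk lookup for snapid " << s << ")\n";
    return true;  // pretend the ancient snapid was indeed deleted
  }

  bool is_deleted(snapid s) const {
    if (s < low_water_mark)
      return deleted_on_disk(s);         // rare: very old snapid from a client
    auto it = in_memory.upper_bound(s);  // first interval starting after s
    if (it == in_memory.begin())
      return false;
    --it;
    return s < it->first + it->second;   // s inside [start, start + len)
  }
};

int main() {
  DeletedSnapsCache cache{/*low_water_mark=*/5000, {{5000, 200}, {6000, 1}}};

  std::cout << std::boolalpha;
  std::cout << "snap 5100 deleted? " << cache.is_deleted(5100) << "\n";  // true, from RAM
  std::cout << "snap 5500 deleted? " << cache.is_deleted(5500) << "\n";  // false, from RAM
  std::cout << "snap 42 deleted?   " << cache.is_deleted(42) << "\n";    // true, via the stub
}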
> Right, so all of the above is about how this can be permanently fixed
> by what looks to be a fairly invasive rewrite of some core
> functionality — which is of course a good discussion to have, but it
> would be good to also have a suggestion for users who want to avoid
> running into the situation that Patrick and team are in, right now. So
> at the risk of sounding obnoxiously repetitive, can I reiterate this
> earlier question of mine?
>
> > This is why I was so insistent on numbers, formulae or even
> > rules-of-thumb to predict what works and what does not. Greg's "one
> > snapshot per RBD per day is probably OK" from a few months ago
> > seemed promising, but looking at your situation it's probably not
> > that useful a rule.
>
> Is there something that you can suggest here, perhaps taking into
> account the discussion you had with Patrick last week?

I think I've already shared everything I have on this. Try to treat
sequential snaps the same way and don't create a bunch of holes in the
interval set.
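In toy form (plain C++, std::map standing in for the interval set; not
Ceph code), that's the difference between these two deletion patterns:

// Toy illustration of "holes": removed snapids are stored as
// [start, start + length) intervals, so what you pay for is the number
// of intervals, not the number of snapshots you have deleted.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

using snapid = std::uint64_t;
using IntervalSet = std::map<snapid, snapid>;  // interval start -> length

// Record one removed snapid, merging with the preceding interval when
// the new id is directly adjacent to it.
void record_removed(IntervalSet& s, snapid snap) {
  auto it = s.lower_bound(snap);
  if (it != s.begin()) {
    auto prev = std::prev(it);
    if (prev->first + prev->second == snap) {
      ++prev->second;  // extends the previous interval, no new entry
      return;
    }
  }
  s[snap] = 1;  // starts a new interval, i.e. a new "hole" boundary
}

int main() {
  // Pattern 1: snapshots removed in the same contiguous runs they were
  // created in -> the whole history collapses into one interval.
  IntervalSet sequential;
  for (snapid s = 1; s <= 1000; ++s)
    record_removed(sequential, s);

  // Pattern 2: a workload that ends up removing every other snapid
  // (e.g. interleaving creations and deletions) -> one interval per
  // removed snapshot.
  IntervalSet scattered;
  for (snapid s = 1; s <= 1000; s += 2)
    record_removed(scattered, s);

  std::cout << "sequential removals: " << sequential.size() << " interval(s)\n";
  std::cout << "scattered removals:  " << scattered.size() << " interval(s)\n";
}

The absolute numbers don't matter; the point is that the cost of carrying
the removed-snaps set around follows the interval count, not the snapshot
count, so contiguous runs of deleted snapids are cheap and scattered
holes are not.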
> Cheers,
> Florian

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com