On Mon, Sep 18, 2017 at 4:11 AM Florian Haas <flor...@hastexo.com> wrote:

> On 09/16/2017 01:36 AM, Gregory Farnum wrote:
> > On Mon, Sep 11, 2017 at 1:10 PM Florian Haas <flor...@hastexo.com> wrote:
> >
> >     On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick
> >     <patrick.mcl...@sony.com> wrote:
> >     >
> >     > On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> >     > > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick
> >     > > <patrick.mcl...@sony.com> wrote:
> >     > >
> >     > >> On a related note, we are very curious why the snapshot id is
> >     > >> incremented when a snapshot is deleted; this creates lots of
> >     > >> phantom entries in the deleted snapshots set. Interleaved
> >     > >> deletions and creations will cause massive fragmentation in
> >     > >> the interval set. The only reason we can come up with for this
> >     > >> is to track if anything changed, but I suspect a different
> >     > >> value that doesn't inject entries into the interval set might
> >     > >> be better for this purpose.
> >     > > Yes, it's because having a sequence number tied in with the snapshots
> >     > > is convenient for doing comparisons. Those aren't leaked snapids that
> >     > > will make holes; when we increment the snapid to delete something we
> >     > > also stick it in the removed_snaps set. (I suppose if you alternate
> >     > > deleting a snapshot with adding one that does increase the size until
> >     > > you delete those snapshots; hrmmm. Another thing to avoid doing I
> >     > > guess.)
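To make that concrete, here's a minimal Python sketch of the behaviour being described. It models removed_snaps as a plain list of half-open [start, end) ranges rather than Ceph's actual interval_set, and the create/delete bookkeeping is inferred from this thread, so treat it as an illustration only:

# Toy model: each delete inserts the deleted snapid plus the freshly bumped
# snap_seq into the removed set, so alternating deletes with creates leaves
# live snapids sitting between the removed ranges and the set keeps growing.

def insert(intervals, snapid):
    """Add snapid to a sorted list of [start, end) ranges, merging neighbours."""
    intervals.append((snapid, snapid + 1))
    intervals.sort()
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    intervals[:] = merged

seq, live, removed = 0, [], []
for _ in range(10):                  # start out with ten live snapshots
    seq += 1
    live.append(seq)

for _ in range(10):                  # alternate: delete the oldest, create a new one
    victim = live.pop(0)
    insert(removed, victim)          # the deleted snap itself
    seq += 1
    insert(removed, seq)             # the seq consumed by the deletion
    seq += 1
    live.append(seq)                 # the new snap lands between removed ranges

print(len(removed), "intervals:", removed)   # prints 10 intervals, not 1

Deleting those same ten snapshots back to back, with no creates in between, collapses everything into a single interval instead.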
> >     > >
> >     >
> >     >
> >     > Fair enough, though it seems like these limitations of the
> >     > snapshot system should be documented.
> >
> >     This is why I was so insistent on numbers, formulae or even
> >     rules-of-thumb to predict what works and what does not. Greg's "one
> >     snapshot per RBD per day is probably OK" from a few months ago seemed
> >     promising, but looking at your situation it's probably not that useful
> >     a rule.
> >
> >
> >     > We most likely would
> >     > have used a completely different strategy if it was documented
> >     > that certain snapshot creation and removal patterns could
> >     > cause the cluster to fall over over time.
> >
> >     I think right now there are probably very few people, if any, who
> >     could *describe* the pattern that causes this. That complicates
> >     matters of documentation. :)
> >
> >
> >     > >>> It might really just be the osdmap update processing -- that would
> >     > >>> make me happy as it's a much easier problem to resolve. But I'm also
> >     > >>> surprised it's *that* expensive, even at the scales you've described.
> >
> >     ^^ This is what I mean. It's kind of tough to document things if we're
> >     still in "surprised that this is causing harm" territory.
> >
> >
> >     > >> That would be nice, but unfortunately all the data is pointing
> >     > >> to PGPool::Update(),
> >     > > Yes, that's the OSDMap update processing I referred to. This is good
> >     > > in terms of our ability to remove it without changing client
> >     > > interfaces and things.
> >     >
> >     > That is good to hear, hopefully this stuff can be improved soon
> >     > then.
> >
> >     Greg, can you comment on just how much potential improvement you see
> >     here? Is it more like "oh we know we're doing this one thing horribly
> >     inefficiently, but we never thought this would be an issue so we shied
> >     away from premature optimization, but we can easily reduce 70% CPU
> >     utilization to 1%" or rather like "we might be able to improve this by
> >     perhaps 5%, but 100,000 RBDs is too many if you want to be using
> >     snapshotting at all, for the foreseeable future"?
> >
> >
> > I got the chance to discuss this a bit with Patrick at the Open Source
> > Summit Wednesday (good to see you!).
> >
> > So the idea in the previously-referenced CDM talk essentially involves
> > changing the way we distribute snap deletion instructions from a
> > "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that
> > gets trimmed once the OSDs report to the manager that they've finished
> > removing that snapid. This should entirely resolve the CPU burn they're
> > seeing during OSDMap processing on the nodes, as it shrinks the
> > intersection operation down from "all the snaps" to merely "the snaps
> > not-done-deleting".
> >
> > The other reason we maintain the full set of deleted snaps is to prevent
> > client operations from re-creating deleted snapshots — we filter all
> > client IO which includes snaps against the deleted_snaps set in the PG.
> > Apparently this is also big enough in RAM to be a real (but much
> > smaller) problem.
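As a toy sketch of what that filtering amounts to (hypothetical names, not Ceph's actual types or code paths), the PG checks every snapid a client op carries against the full removed set, which is why that set has to stay resident:

from bisect import bisect_right

class RemovedSnaps:
    """Deleted snapids kept as sorted, disjoint [start, end) intervals."""
    def __init__(self, intervals):
        self.intervals = sorted(intervals)

    def contains(self, snapid):
        i = bisect_right(self.intervals, (snapid, float("inf"))) - 1
        return i >= 0 and self.intervals[i][0] <= snapid < self.intervals[i][1]

def filter_snap_context(removed, client_snaps):
    """Drop any snapid the cluster has already deleted, so a stale client
    cannot re-create it."""
    return [s for s in client_snaps if not removed.contains(s)]

removed = RemovedSnaps([(1, 4), (6, 7), (9, 12)])      # snaps 1-3, 6, 9-11 deleted
print(filter_snap_context(removed, [2, 4, 6, 8, 10]))  # -> [4, 8]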
> >
> > Unfortunately eliminating that is a lot harder
>
> Just checking here, for clarification: what is "that" here? Are you
> saying that eliminating the full set of deleted snaps is harder than
> introducing a deleting_snaps member, or that both are harder than
> potential mitigation strategies that were previously discussed in this
> thread?


Eliminating the full set we store on the OSD node is much harder than
converting the OSDMap to specify deleting_ rather than deleted_snaps — the
former at minimum requires changes to the client protocol and we’re not
actually sure how to do it; the latter can be done internally to the
cluster and has a well-understood algorithm to implement.
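For what it's worth, here's a rough sketch of how that deleting_snaps trimming could look, with made-up names and no claim to match whatever the eventual implementation ends up being:

class MiniMon:
    """Tracks only snapids still being deleted; entries are trimmed once
    every OSD has reported that it finished purging them."""
    def __init__(self, osd_ids):
        self.osd_ids = set(osd_ids)
        self.deleting_snaps = {}     # snapid -> set of OSDs still working on it

    def request_snap_delete(self, snapid):
        self.deleting_snaps[snapid] = set(self.osd_ids)

    def osd_reports_purged(self, osd_id, snapid):
        pending = self.deleting_snaps.get(snapid)
        if pending is None:
            return
        pending.discard(osd_id)
        if not pending:                          # everyone is done: trim it
            del self.deleting_snaps[snapid]

    def publish_map(self):
        # Only in-flight deletions go out, not the whole deletion history,
        # so per-epoch processing scales with "snaps not-done-deleting".
        return sorted(self.deleting_snaps)

mon = MiniMon(osd_ids=[0, 1, 2])
mon.request_snap_delete(42)
mon.request_snap_delete(43)
print(mon.publish_map())             # [42, 43]
for osd in (0, 1, 2):
    mon.osd_reports_purged(osd, 42)
print(mon.publish_map())             # [43] -- 42 has been trimmed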


> > and a permanent fix will
> > involve changing the client protocol in ways nobody has quite figured
> > out how to do. But Patrick did suggest storing the full set of deleted
> > snaps on-disk and only keeping in-memory the set which covers snapids in
> > the range we've actually *seen* from clients. I haven't gone through the
> > code but that seems broadly feasible — the hard part will be working out
> > the rules when you have to go to disk to read a larger part of the
> > deleted_snaps set. (Perfectly feasible.)
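Sketching Patrick's suggestion the same way (all names hypothetical): keep the complete deleted-snaps history on disk and hold in memory only the part covering snapids recent client ops have actually touched, falling through to the on-disk set on a miss:

class DeletedSnapsCache:
    def __init__(self, full_set_on_disk):
        self._on_disk = set(full_set_on_disk)    # stand-in for an omap/kv lookup
        self._cached = {}                        # snapid -> bool, the hot window
        self.disk_reads = 0

    def is_deleted(self, snapid):
        if snapid in self._cached:               # fast path: recently seen snapid
            return self._cached[snapid]
        self.disk_reads += 1                     # slow path: consult the full set
        result = snapid in self._on_disk
        self._cached[snapid] = result
        return result

cache = DeletedSnapsCache(full_set_on_disk=range(1, 1_000_000, 2))
for s in (10, 11, 10, 11):                       # repeated snapids hit the cache
    cache.is_deleted(s)
print("disk reads:", cache.disk_reads)           # 2, not 4

The hard part mentioned above is exactly the slow path: deciding when an uncached snapid forces a read of a larger chunk of the on-disk set.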
> >
> > PRs are of course welcome! ;)
>
> Right, so all of the above is about how this can be permanently fixed by
> what looks to be a fairly invasive rewrite of some core functionality —
> which is of course a good discussion to have, but it would be good to
> also have a suggestion for users who want to avoid running into the
> situation that Patrick and team are in, right now. So at the risk of
> sounding obnoxiously repetitive, can I reiterate this earlier question
> of mine?
>
> > This is why I was so insistent on numbers, formulae or even
> > rules-of-thumb to predict what works and what does not. Greg's "one
> > snapshot per RBD per day is probably OK" from a few months ago seemed
> > promising, but looking at your situation it's probably not that useful
> > a rule.
>
> Is there something that you can suggest here, perhaps taking into
> account the discussion you had with Patrick last week?
>

I think I’ve already shared everything I have on this. Try to treat
sequential snaps the same way and don’t create a bunch of holes in the
interval set.
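For anyone trying to apply that rule of thumb, a quick back-of-the-envelope check (just a throwaway script, not anything Ceph ships) is to count how many disjoint intervals a planned set of deleted snapids collapses into:

def interval_count(removed_ids):
    ids = sorted(removed_ids)
    count, prev = 0, None
    for s in ids:
        if prev is None or s != prev + 1:   # a gap starts a new interval
            count += 1
        prev = s
    return count

# Deleting a contiguous run of snapids: one interval, cheap to carry around.
print(interval_count(range(1, 101)))        # 1

# Deleting every other snapid (still-live snaps left in between): one
# interval per deleted snap, which is what bloats the set over time.
print(interval_count(range(1, 201, 2)))     # 100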




> Cheers,
> Florian
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
