On Fri, Oct 7, 2016 at 4:46 AM, John Spray <jsp...@redhat.com> wrote:

> On Fri, Oct 7, 2016 at 1:05 AM, Kjetil Jørgensen <kje...@medallia.com>
> wrote:
> > Hi,
> >
> > context (i.e. what we're doing): We're migrating (or trying to migrate)
> > off of an nfs server onto cephfs, for a workload that's best described
> > as "big piles" of hardlinks. Essentially, we have a set of "sources":
> > foo/01/<01><rest-of-md5>
> > foo/0b/<0b><rest-of-md5>
> > .. and so on
> > bar/02/..
> > bar/0c/..
> > .. and so on
> >
> > foo/bar/friends have been "cloned" numerous times to a set of names that
> > over the course of weeks end up being recycled again; the clone is
> > essentially cp -l foo copy-1-of-foo.
> >
> > We're doing "incremental" rsyncs of this onto cephfs, so the sense of
> > "the original source of the hardlink" will end up moving around,
> > depending on the whims of rsync. (If it matters, I found some allusion
> > to "if the original file hardlinked is deleted, ...".)
>
> This might not be much help but... have you thought about making your
> application use hardlinks less aggressively?  They have an intrinsic
> overhead in any system that stores inodes locally to directories (like
> we do) because you have to take an extra step to resolve them.
>
>
Under "normal" circumstances, this isn't "all that bad", the serious
hammering is
coming from trying migrate to cephfs, where I think we've for the time being
abandoned using hardlinks and take the space-penalty for now. Under "normal"
circumstances it isn't that bad (if my nfs-server stats is to be believed,
it's between
5e5 - and 1.5e6 hardlinks created and unlinked per day, it actually seems a
bit low).


> In CephFS, resolving a hard link involves reading the dentry (where we
> would usually have the inode inline), and then going and finding an
> object from the data pool by the inode number, reading the "backtrace"
> (i.e. path) from that object and then going back to the metadata pool
> to traverse that path.  It's all very fast if your metadata fits in
> your MDS cache, but will slow down a lot otherwise, especially as your
> metadata IOs are now potentially getting held up by anything hammering
> your data pool.
>
> By the way, if your workload is relatively little code and you can
> share it, it sounds like it would be a useful hardlink stress test for
> our test suite.


I'll let you know if I manage to reproduce it; on and off I've been trying
to tease this out on a separate ceph cluster with a "synthetic" load that's
close to equivalent.
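As a side note, for anyone wanting to look at these by hand, below is a
rough sketch of how the backtrace John describes can be inspected from the
outside. Caveats: the "<inode-in-hex>.00000000" object name, the "parent"
xattr, and inode_backtrace_t being the right ceph-dencoder type are my
assumptions from reading docs/code, and the pool name is ours.

#!/usr/bin/env python3
# Sketch: dump the backtrace CephFS keeps for an inode, i.e. what the MDS
# has to read (and then walk back through the metadata pool) when it
# resolves a hard link whose primary dentry isn't cached.  Assumptions on
# my part: the first data object of a file is "<ino-in-hex>.00000000", the
# backtrace lives in its "parent" xattr, and ceph-dencoder can decode it
# as inode_backtrace_t.

import subprocess
import sys
import tempfile

DATA_POOL = "cephfs_data"   # our data pool name; adjust to taste


def dump_backtrace(ino):
    """Fetch and decode the backtrace for inode number `ino`."""
    obj = "%x.00000000" % ino
    raw = subprocess.check_output(
        ["rados", "-p", DATA_POOL, "getxattr", obj, "parent"])
    with tempfile.NamedTemporaryFile() as f:
        f.write(raw)
        f.flush()
        # dump_json shows the ancestry (dname/dirino pairs) the MDS would
        # have to traverse in the metadata pool to reach the primary dentry.
        return subprocess.check_output(
            ["ceph-dencoder", "type", "inode_backtrace_t",
             "import", f.name, "decode", "dump_json"])


if __name__ == "__main__":
    # e.g.:  ./dump_backtrace.py 10003f25eaf
    print(dump_backtrace(int(sys.argv[1], 16)).decode())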


> ...
>
> > For RBD the ceph cluster has mostly been rather well behaved; the
> > problems we have had have for the most part been self-inflicted. Before
> > introducing the hardlink spectacle to cephfs, the same filesystem was
> > used for light-ish read-mostly loads, being mostly uneventful. (That
> > being said, we did patch it for ...)
> >
> > Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
> > clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
> >
> > The problems we're facing:
> >
> > Maybe a "non-problem" I have ~6M strays sitting around
>
> So as you hint above, when the original file is deleted, the inode
> goes into a stray dentry.  The next time someone reads the file via
> one of its other links, the inode gets "reintegrated" (via
> eval_remote_stray()) into the dentry it was read from.
>
> > Slightly more problematic, I have duplicate stray(s)? See log excerpts
> > below. Also, rados -p cephfs_metadata listomapkeys 60X.00000000 did/does
> > seem to agree with there being duplicate strays (assuming 60X.00000000
> > are the directory indexes for the stray catalogs), caveat "not a perfect
> > snapshot", listomapkeys issued in serial fashion.
> > We stumbled across http://tracker.ceph.com/issues/17177 (mostly here for
> > more context).
>
> When you say you stumbled across it, do you mean that you actually had
> this same deep scrub error on your system, or just that you found the
> ticket?


No - we have done "ceph pg repair", as we did end up with single degraded
objects in the metadata pool during heavy rsyncs of "lots of hardlinks".
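For what it's worth, the listomapkeys check I mentioned above amounts to
roughly the following: list the omap keys of what I assume are mds.0's
stray dirfrag objects (600.00000000 through 609.00000000 in the metadata
pool, inferred from the object names rather than from documentation) and
flag any dentry that shows up in more than one of them.

#!/usr/bin/env python3
# Rough sketch of the duplicate-stray check: list omap keys for what I
# believe are mds.0's stray directory objects and report any dentry seen
# in more than one stray directory.  Not a perfect snapshot -- listings
# are taken one after another while the MDS is live, so some churn is
# expected.

import subprocess
from collections import defaultdict

METADATA_POOL = "cephfs_metadata"


def stray_objects():
    # Assumption: mds.0's stray dirs are inodes 0x600-0x609 and, being
    # unfragmented, each has a single dirfrag object "<ino>.00000000".
    return ["%x.00000000" % ino for ino in range(0x600, 0x60a)]


def list_omap_keys(obj):
    out = subprocess.check_output(
        ["rados", "-p", METADATA_POOL, "listomapkeys", obj])
    return [line for line in out.decode().splitlines() if line]


if __name__ == "__main__":
    seen = defaultdict(list)              # dentry key -> [stray objects]
    for obj in stray_objects():
        for key in list_omap_keys(obj):   # keys look like "<ino>_head"
            seen[key].append(obj)
    for key, objs in sorted(seen.items()):
        if len(objs) > 1:
            print("dup stray %s in %s" % (key, ", ".join(objs)))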


> > There's been a couple of instances of invalid backtrace(s), mostly
> > solved by either mds:scrub_path or just unlinking the files/directories
> > in question and re-rsync-ing.
> >
> > mismatch between head items and fnode.fragstat (see below for more of
> > the log excerpt), appeared to have been solved by mds:scrub_path
> >
> >
> > Duplicate stray(s), ceph-mds complains (a lot, during rsync):
> > 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
> > badness: got (but i already had) [inode 10003f25eaf [...2,head]
> > ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
> > (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.000000
> > 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> > loaded dup inode 10003f25eaf [2,head] v36792929 at ~mds0/stray3/10003f25eaf,
> > but inode 10003f25eaf.head v38836572 already exists at
> > ~mds0/stray0/10003f25eaf
>
> Is your workload doing lots of delete/create cycles of hard links to
> the same inode?
>

Yes. Essentially, every few days we create a snapshot of our application's
state and turn it into templates that can be deployed for testing. The
snapshot contains, among other things, this tree of files/hardlinks. The
individual files we hardlink never mutate; they're either created or
unlinked. The templates are instantiated a number of times (where we
hardlink back to the templates) and used for testing; some live 2 hours,
some live months/years. When we create the snapshots, we hardlink back to
the previous snapshot where possible, and the previous snapshot falls off a
cliff when it's 2 cycles old. So the "origin file" slides over time. (For
NFS-exported ext4 this worked out fabulously, as it saved us some terabytes
and some amount of network IO.)
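If it helps as a starting point for the stress test John asked about, the
synthetic load I've been trying boils down to something like the sketch
below. It's a simplified stand-in rather than our actual tooling; the
directory layout mirrors ours, but the mount point, counts and sizes are
made up.

#!/usr/bin/env python3
# Simplified sketch of the workload described above, as a possible
# synthetic reproducer: a "source" tree of md5-named files, "clones" made
# entirely of hardlinks (the cp -l), new sources hardlinking back to the
# previous cycle where possible, and everything recycled after two cycles.
# Counts, sizes and the mount point below are made up, not our real numbers.

import hashlib
import os
import shutil

ROOT = "/mnt/cephfs/hardlink-stress"    # hypothetical cephfs mount point
FILES_PER_CYCLE = 10000
KEEP_CYCLES = 2


def make_source(cycle, prev_cycle):
    """Create this cycle's source tree, hardlinking to the previous cycle
    where the content (and therefore the md5 name) already exists."""
    src = os.path.join(ROOT, "source-%d" % cycle)
    for i in range(FILES_PER_CYCLE):
        if i < FILES_PER_CYCLE * 9 // 10:
            content = ("stable-file-%d" % i).encode()             # shared
        else:
            content = ("cycle-%d-file-%d" % (cycle, i)).encode()  # churn
        name = hashlib.md5(content).hexdigest()
        d = os.path.join(src, name[:2])
        os.makedirs(d, exist_ok=True)
        path = os.path.join(d, name)
        prev = (os.path.join(ROOT, "source-%d" % prev_cycle, name[:2], name)
                if prev_cycle is not None else None)
        if prev and os.path.exists(prev):
            os.link(prev, path)          # hardlink back to previous snapshot
        else:
            with open(path, "wb") as f:
                f.write(content)
    return src


def clone(src, name):
    """Instantiate a template: cp -l equivalent, a whole tree of hardlinks."""
    dst = os.path.join(ROOT, name)
    shutil.copytree(src, dst, copy_function=os.link)
    return dst


if __name__ == "__main__":
    prev = None
    for cycle in range(10):
        src = make_source(cycle, prev)
        clone(src, "clone-%d-a" % cycle)
        clone(src, "clone-%d-b" % cycle)
        old = cycle - KEEP_CYCLES        # recycle: drop old sources/clones
        if old >= 0:
            for victim in ("source-%d" % old, "clone-%d-a" % old,
                           "clone-%d-b" % old):
                shutil.rmtree(os.path.join(ROOT, victim), ignore_errors=True)
        prev = cycle

(The churn ratio is obviously tunable; the point is lots of hardlink
creation/unlinking, clones made purely of hardlinks, and the "origin"
sliding forward as old cycles get unlinked.)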


>
> I wonder if we are seeing a bug where a new stray is getting created
> before the old one has been properly removed, due to some bogus
> assumption in the code that stray unlinks don't need to be persisted
> as rigorously.
>
> >
> > I briefly ran ceph-mds with debug_mds=20/20, which didn't yield anything
> > immediately useful beyond making the control flow of src/mds/CDir.cc
> > slightly easier to follow, without my becoming much wiser.
> > 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched pos
> > 310473 marker 'I' dname '100022e8617 [2,head]
> > 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
> > (head, '100022e8617')
> > 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
> > (10002a81c10,head)
> > 2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
> > badness: got (but i already had) [inode 100022e8617 [...2,head]
> > ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
> > (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.000000
> > 2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> > loaded dup inode 100022e8617 [2,head] v39284583 at ~mds0/stray6/100022e8617,
> > but inode 100022e8617.head v39303851 already exists at
> > ~mds0/stray9/100022e8617
> >
> >
> > 2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> > mismatch between head items and fnode.fragstat! printing dentries
> > 2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> > get_num_head_items() = 36; fnode.fragstat.nfiles=53
> > fnode.fragstat.nsubdirs=0
> > 2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> > mismatch between child accounted_rstats and my rstats!
> > 2016-09-25 06:23:50.947803 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> > total of child dentrys: n(v0 b19365007 36=36+0)
> > 2016-09-25 06:23:50.947806 7ffb653b8700  1 mds.0.cache.dir(10003439a33) my
> > rstats:              n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)
> >
> > The slightly sad thing is, I suspect all of this is probably from
> > something that "happened at some time in the past", and running mds with
> > debugging will make my users very unhappy, as writing/formatting all that
> > log is not exactly cheap (debug_mds=20/20 quickly ended up with the mds
> > beacon marked as laggy).
> >
> > Bonus question: in terms of "understanding how cephfs works", is
> > doc/dev/mds_internals it? :) Given that making "minimal reproducible
> > test-cases" is so far turning out to be quite elusive from the "top down"
> > approach, I'm finding myself looking inside the box to try to figure out
> > how we got where we are.
>
> There isn't a comprehensive set of up to date internals docs anywhere
> unfortunately.  The original papers are still somewhat useful for a
> high level view (http://ceph.com/papers/weil-ceph-osdi06.pdf) although
> in the case of hard links in particular the mechanism has changed
> completely since then.
>
> However you should feel free to ask about any specific things (either
> here or on IRC).
>
> If you could narrow down any of these issues into reproducers it would
> be extremely useful.
>
>
I'll let you know if/when we do :)
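One more practical note on the debug-logging cost mentioned further up:
rather than leaving debug_mds=20/20 on until the beacon goes laggy,
something along the lines of the sketch below (grab a short burst, then
revert) might keep the overhead bounded. The admin-socket "config
get"/"config set" usage is my assumption for the right interface, it has
to run on the MDS host, and "mds.ourmds" is a placeholder name.

#!/usr/bin/env python3
# Sketch: enable verbose MDS logging for a short burst and then back off,
# so debug_mds=20/20 isn't left on long enough to make the beacon laggy.

import subprocess
import time

MDS_NAME = "mds.ourmds"          # hypothetical daemon name; adjust
BURST_SECONDS = 30


def daemon(*args):
    return subprocess.check_output(["ceph", "daemon", MDS_NAME] + list(args))


if __name__ == "__main__":
    before = daemon("config", "get", "debug_mds")  # e.g. {"debug_mds":"1/5"}
    daemon("config", "set", "debug_mds", "20/20")
    try:
        time.sleep(BURST_SECONDS)    # window for the interesting operation
    finally:
        # crude restore; parsing `before` back out would be nicer
        daemon("config", "set", "debug_mds", "1/5")
    print(before.decode())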

Cheers,
-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com