Hi,

Context (i.e. what we're doing): we're migrating (or trying to migrate) off
of an nfs server onto cephfs, for a workload that's best described as "big
piles" of hardlinks. Essentially, we have a set of "sources":
foo/01/<01><rest-of-md5>
foo/0b/<0b><rest-of-md5>
.. and so on
bar/02/..
bar/0c/..
.. and so on

foo/bar/friends have been "cloned" numerous times to a set of names that,
over the course of weeks, end up being recycled; the clone is essentially
cp -al foo copy-1-of-foo.
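
Schematically, the lifecycle of one of these clones is roughly the
following (names made up for illustration):

cp -al foo copy-1-of-foo   # "clone": every file is a hardlink back into foo
# ... weeks later, the name gets recycled ...
rm -rf copy-1-of-foo
cp -al foo copy-1-of-foo   # fresh clone under the same (recycled) name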

We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
original source of the hardlink" will end up moving around, depending on
the whims of rsync. (If it matters, I found some allusion to "if the
original hardlinked file is deleted, ...".)
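
Concretely, each incremental pass is something along the lines of (paths
made up for illustration):

rsync -aH --delete /nfs/foo/ /cephfs/foo/
# -H so hardlinks within the transfer are preserved on the cephfs side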

For RBD the ceph cluster has mostly been rather well behaved; the problems
we have had have for the most part been self-inflicted. Before introducing
the hardlink spectacle to cephfs, the same filesystem was used for
light-ish, read-mostly loads, and was mostly uneventful. (That being said, we
did patch it for

Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
clients are Ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.

The problems we're facing:

   - Maybe a "non-problem": I have ~6M strays sitting around.
   - Slightly more problematic: I have duplicate stray(s)? See the log
   excerpts below. Also, rados -p cephfs_metadata listomapkeys 60X.00000000
   did/does seem to agree that there are duplicate strays (assuming
   60X.00000000 are the directory indexes for the stray catalogs), with the
   caveat that it's not a perfect snapshot, since the listomapkeys calls
   were issued serially (rough sketch of the check just after this list).
   - We stumbled across http://tracker.ceph.com/issues/17177 (mostly here
   for more context).
   - There have been a couple of instances of invalid backtrace(s), mostly
   solved by either mds:scrub_path (rough invocation just after this list)
   or just unlinking the files/directories in question and re-rsync-ing.
   - A mismatch between head items and fnode.fragstat (see below for more of
   the log excerpt), which appeared to have been solved by mds:scrub_path.
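
For reference, the (serial, so not a perfectly atomic snapshot) stray check
was roughly the following, assuming 600.00000000 through 609.00000000 are
the dirfrag objects backing ~mds0/stray0 through ~mds0/stray9:

for i in 0 1 2 3 4 5 6 7 8 9; do
  printf 'stray%d: ' "$i"
  rados -p cephfs_metadata listomapkeys "60${i}.00000000" | wc -l
done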

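The scrub_path repairs mentioned above were done via the MDS admin socket,
roughly as follows (option names from memory, so double-check against
"ceph daemon mds.<name> help" on 10.2.x):

ceph daemon mds.<name> scrub_path /path/to/offending/dir
# and, where a plain scrub wasn't enough, something like:
ceph daemon mds.<name> scrub_path /path/to/offending/dir recursive repair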

Duplicate stray(s), ceph-mds complains (a lot, during rsync):
2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
 badness: got (but i already had) [inode 10003f25eaf [...2,head]
~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
(iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.000000
2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
loaded dup inode 10003f25eaf [2,head] v36792929 at
~mds0/stray3/10003f25eaf, but inode 10003f25eaf.head v38836572 already
exists at ~mds0/stray0/10003f25eaf

I briefly ran ceph-mds with debug_mds=20/20, which didn't yield anything
immediately useful beyond making the control flow of src/mds/CDir.cc
slightly easier to follow, without leaving me much wiser.
2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched
pos 310473 marker 'I' dname '100022e8617 [2,head]
2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
(head, '100022e8617')
2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
(10002a81c10,head)
2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
 badness: got (but i already had) [inode 100022e8617 [...2,head]
~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
(iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.000000
2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
loaded dup inode 100022e8617 [2,head] v39284583 at
~mds0/stray6/100022e8617, but inode 100022e8617.head v39303851 already
exists at ~mds0/stray9/100022e8617


2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
mismatch between head items and fnode.fragstat! printing dentries
2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
get_num_head_items() = 36; fnode.fragstat.nfiles=53
fnode.fragstat.nsubdirs=0
2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
mismatch between child accounted_rstats and my rstats!
2016-09-25 06:23:50.947803 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
total of child dentrys: n(v0 b19365007 36=36+0)
2016-09-25 06:23:50.947806 7ffb653b8700  1 mds.0.cache.dir(10003439a33) my
rstats:              n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)

The slightly sad thing is that I suspect all of this probably stems from
something that "happened at some time in the past", and running the mds
with debugging turned up will make my users very unhappy, as writing and
formatting all that log is not exactly cheap (with debug_mds=20/20, the mds
beacon quickly ended up being marked as laggy).
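
For reference, turning that up and back down at runtime (rather than via
ceph.conf and a restart) is along the lines of:

ceph tell mds.0 injectargs '--debug_mds 20/20'
# ... reproduce for a short window ...
ceph tell mds.0 injectargs '--debug_mds 1/5'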

Bonus question: in terms of "understanding how cephfs works", is
doc/dev/mds_internals it? :) Given that producing "minimal reproducible
test-cases" from the "top down" approach has so far proven quite elusive,
I'm finding myself looking inside the box to try to figure out how we got
to where we are.

(And many thanks for ceph-dencoder, it satisfies my pathological need to
look inside of things).

Cheers,
-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
