On Fri, Oct 7, 2016 at 4:46 AM, John Spray <jsp...@redhat.com> wrote:
> On Fri, Oct 7, 2016 at 1:05 AM, Kjetil Jørgensen <kje...@medallia.com> wrote:
> > Hi,
> >
> > Context (i.e. what we're doing): we're migrating (or trying to migrate)
> > off of an nfs server onto cephfs, for a workload that's best described
> > as "big piles" of hardlinks. Essentially, we have a set of "sources":
> > foo/01/<aa><rest-of-md5>
> > foo/0b/<0b><rest-of-md5>
> > .. and so on
> > bar/02/..
> > bar/0c/..
> > .. and so on
> >
> > foo/bar/friends have been "cloned" numerous times to a set of names
> > that over the course of weeks end up being recycled again; the clone
> > is essentially cp -L foo copy-1-of-foo.
> >
> > We're doing "incremental" rsyncs of this onto cephfs, so the sense of
> > "the original source of the hardlink" will end up moving around,
> > depending on the whims of rsync. (If it matters, I found some allusion
> > to "if the original file hardlinked is deleted, ...".)
>
> This might not be much help but... have you thought about making your
> application use hardlinks less aggressively? They have an intrinsic
> overhead in any system that stores inodes locally to directories (like
> we do), because you have to take an extra step to resolve them.
>

Under "normal" circumstances this isn't all that bad; the serious
hammering comes from trying to migrate to cephfs, where I think we've
for the time being abandoned hardlinks and accepted the space penalty.
Under "normal" circumstances it's between 5e5 and 1.5e6 hardlinks
created and unlinked per day, if my nfs-server stats are to be believed
(which actually seems a bit low).

> In CephFS, resolving a hard link involves reading the dentry (where we
> would usually have the inode inline), then going and finding an object
> from the data pool by the inode number, reading the "backtrace" (i.e.
> path) from that object, and then going back to the metadata pool to
> traverse that path. It's all very fast if your metadata fits in your
> MDS cache, but will slow down a lot otherwise, especially as your
> metadata IOs are now potentially getting held up by anything hammering
> your data pool.
>
> By the way, if your workload is relatively little code and you can
> share it, it sounds like it would be a useful hardlink stress test for
> our test suite.

I'll let you know if I manage to reproduce; I'm on-and-off-again trying
to tease this out on a separate ceph cluster with a "synthetic" load
that's close to equivalent.

> ...
>
> > For RBD the ceph cluster has mostly been rather well behaved; the
> > problems we have had have for the most part been self-inflicted.
> > Before introducing the hardlink spectacle to cephfs, the same
> > filesystem was used for light-ish read-mostly loads, being mostly
> > uneventful. (That being said, we did patch it for
> >
> > Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
> > clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
> >
> > The problems we're facing:
> >
> > Maybe a "non-problem": I have ~6M strays sitting around.
>
> So as you hint above, when the original file is deleted, the inode
> goes into a stray dentry. The next time someone reads the file via
> one of its other links, the inode gets "reintegrated" (via
> eval_remote_stray()) into the dentry it was read from.
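For reference, one way to poke at backtraces and strays down at the
rados level. The data pool name and the file path below are
placeholders, the inode number is just borrowed from the log excerpts
further down, and addressing the stray dirs as 60X.00000000 assumes
they aren't fragmented:

  # inode number of one of the hardlinked files, in hex
  printf '%x\n' "$(stat -c %i /cephfs/foo/01/some-file)"

  # the backtrace lives in the "parent" xattr of the file's first object
  # in the data pool (pool name is an assumption)
  rados -p cephfs_data getxattr 10003f25eaf.00000000 parent > /tmp/bt
  ceph-dencoder type inode_backtrace_t import /tmp/bt decode dump_json

  # rough per-stray-dir counts (~mds0/stray0..stray9 are 600..609)
  for i in 0 1 2 3 4 5 6 7 8 9; do
    echo -n "stray$i: "
    rados -p cephfs_metadata listomapkeys 60$i.00000000 | wc -l
  done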
> > Slightly more problematic: I have duplicate stray(s)? See log excerpts
> > below. Also, rados -p cephfs_metadata listomapkeys 60X.00000000 did/does
> > seem to agree with there being duplicate strays (assuming 60X.00000000
> > is the directory indexes for the stray catalogs), caveat "not a perfect
> > snapshot", listomapkeys issued in serial fashion.
> > We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here
> > for more context)
>
> When you say you stumbled across it, do you mean that you actually had
> this same deep scrub error on your system, or just that you found the
> ticket?

No - we have done "ceph pg repair", as we did end up with single
degraded objects in the metadata pool during heavy rsyncs of "lots of
hardlinks".

> > There's been a couple of instances of invalid backtrace(s), mostly
> > solved by either mds:scrub_path or just unlinking the files/directories
> > in question and re-rsync-ing.
> >
> > Mismatch between head items and fnode.fragstat (see below for more of
> > the log excerpt), appeared to have been solved by mds:scrub_path.
> >
> > Duplicate stray(s), ceph-mds complains (a lot, during rsync):
> > 2016-09-30 20:00:21.978314 7ffb653b8700 0 mds.0.cache.dir(603) _fetched
> > badness: got (but i already had) [inode 10003f25eaf [...2,head]
> > ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
> > (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.000000
> > 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> > loaded dup inode 10003f25eaf [2,head] v36792929 at ~mds0/stray3/10003f25eaf,
> > but inode 10003f25eaf.head v38836572 already exists at
> > ~mds0/stray0/10003f25eaf
>
> Is your workload doing lots of delete/create cycles of hard links to
> the same inode?

Yes. Essentially, every few days we create a snapshot of our
application's state and turn it into templates that can be deployed for
testing. The snapshot contains, among other things, this tree of
files/hardlinks. The individual files we hardlink never mutate; they're
either created or unlinked. The templates are instantiated a number of
times (where we hardlink back to the templates) and used for testing;
some live 2 hours, some live months/years. When we do create the
snapshots, we hardlink back again to the previous snapshot where
possible, and the previous snapshot falls off a cliff when it's 2 cycles
old. So the "origin file" slides over time. (For NFS-exported ext4 this
worked out fabulously, as it saved us some terabytes and some amount of
network IO.)
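In case it's useful as a starting point for a stress test, the cycle
above boils down to roughly the sketch below. Paths and sizes are made
up, and cp -al stands in for the real cloning step: build a
content-addressed tree once, then keep cloning it via hardlinks and
dropping the generation from two cycles back, so the "first" link keeps
moving.

  #!/bin/bash
  # made-up paths/sizes; rough approximation of the clone/recycle cycle above
  top=/cephfs/stress
  mkdir -p "$top/src"

  # one-time: a tree of content-addressed files, bucketed by hash prefix
  for i in $(seq 1 10000); do
      f=$(echo "$i" | md5sum | cut -c1-32)
      mkdir -p "$top/src/${f:0:2}"
      head -c 8192 /dev/urandom > "$top/src/${f:0:2}/$f"
  done

  # each "cycle": clone via hardlinks, drop the generation from 2 cycles ago
  for gen in $(seq 1 100); do
      cp -al "$top/src" "$top/gen-$gen"
      rm -rf "$top/gen-$((gen - 2))"
  done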
> I wonder if we are seeing a bug where a new stray is getting created
> before the old one has been properly removed, due to some bogus
> assumption in the code that stray unlinks don't need to be persisted
> as rigorously.
>
> > I briefly ran ceph-mds with debug_mds=20/20, which didn't yield anything
> > immediately useful beyond making the control flow of src/mds/CDir.cc
> > slightly easier to follow, without becoming much wiser.
> >
> > 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched
> > pos 310473 marker 'I' dname '100022e8617 [2,head]
> > 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
> > (head, '100022e8617')
> > 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606) miss ->
> > (10002a81c10,head)
> > 2016-09-30 20:43:51.910762 7ffb653b8700 0 mds.0.cache.dir(606) _fetched
> > badness: got (but i already had) [inode 100022e8617 [...2,head]
> > ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
> > (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.000000
> > 2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
> > loaded dup inode 100022e8617 [2,head] v39284583 at ~mds0/stray6/100022e8617,
> > but inode 100022e8617.head v39303851 already exists at
> > ~mds0/stray9/100022e8617
> >
> >
> > 2016-09-25 06:23:50.947761 7ffb653b8700 1 mds.0.cache.dir(10003439a33)
> > mismatch between head items and fnode.fragstat! printing dentries
> > 2016-09-25 06:23:50.947779 7ffb653b8700 1 mds.0.cache.dir(10003439a33)
> > get_num_head_items() = 36; fnode.fragstat.nfiles=53
> > fnode.fragstat.nsubdirs=0
> > 2016-09-25 06:23:50.947782 7ffb653b8700 1 mds.0.cache.dir(10003439a33)
> > mismatch between child accounted_rstats and my rstats!
> > 2016-09-25 06:23:50.947803 7ffb653b8700 1 mds.0.cache.dir(10003439a33)
> > total of child dentrys: n(v0 b19365007 36=36+0)
> > 2016-09-25 06:23:50.947806 7ffb653b8700 1 mds.0.cache.dir(10003439a33) my
> > rstats: n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)
> >
> > The slightly sad thing is, I suspect all of this is probably from
> > something that "happened at some time in the past", and running the mds
> > with debugging will make my users very unhappy, as writing/formatting
> > all that log is not exactly cheap (debug_mds=20/20 quickly ended up
> > with the mds beacon marked as laggy).
> >
> > Bonus question: in terms of "understanding how cephfs works", is
> > doc/dev/mds_internals it? :) Given that making "minimal reproducible
> > test-cases" has so far turned out to be quite elusive from the "top
> > down" approach, I'm finding myself looking inside the box to try to
> > figure out how we got where we are.
>
> There isn't a comprehensive set of up-to-date internals docs anywhere,
> unfortunately. The original papers are still somewhat useful for a
> high-level view (http://ceph.com/papers/weil-ceph-osdi06.pdf), although
> in the case of hard links in particular the mechanism has changed
> completely since then.
>
> However, you should feel free to ask about any specific things (either
> here or on IRC).
>
> If you could narrow down any of these issues into reproducers it would
> be extremely useful.

I'll let you know if/when we do :)

Cheers,
--
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580