Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

Janek Bevendorff Wed, 24 Jul 2019 00:14:52 -0700


which version?


Nautilus, 14.2.2.

Try mounting CephFS on a machine/VM with little memory (4G~8G), then rsync your data into the mount point on that machine.

I could try running it in a memory-limited Docker container, but isn't there a better way to achieve the same thing? This sounds like a bug to me. A client having too much memory and failing to free its capabilities shouldn't crash the server. If the server decides to drop things from its cache, the client has to deal with it.

Also, in the long run, limiting the client's memory isn't a practical solution. We are planning to use the CephFS from our compute cluster, whose nodes have (and need) many times more RAM than our storage servers have.


            The MDS nodes have Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads
            (dual CPU with 8 physical cores each) and 128GB RAM. CPU usage is
            rather mild. While MDSs are trying to rejoin, they tend to saturate
            a single thread briefly, but nothing spectacular. During normal
            operation, none of the cores is particularly under load.

            > While migrating to a Nautilus cluster recently, we had up to 14
            > million inodes open, and we increased the cache limit to 16GiB.
            > Other than warnings about oversized cache, this caused no issues.

            I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than
            getting rid of the cache size warnings (and sometimes allowing an
            MDS to rejoin without being kicked again after a few seconds), it
            did not change much in terms of the actual problem. Right now I can
            change it to whatever I want, it doesn't do anything, because rank 0
            keeps being trashed anyway (the other ranks are fine, but the CephFS
            is down anyway). Is there anything useful I can give you to debug
            this? Otherwise I would try killing the MDS daemons so I can at
            least restore the CephFS to a semi-operational state.
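
For reference, a minimal sketch of how the cache sizes tried above translate into a setting. It assumes the limit is applied through the `mds_cache_memory_limit` option (which takes a byte count) using `ceph config set`, and that the sizes are meant as GiB; the message does not say how the limit was actually changed. The helper names are made up for illustration, and the script only prints the commands instead of running them.

```python
GIB = 1024 ** 3  # bytes per GiB


def cache_limit_bytes(gib):
    """Convert a cache size in GiB to the byte count the option expects."""
    return int(gib * GIB)


def cache_limit_command(gib):
    """Build (but do not run) the command that would set the MDS cache limit.

    Assumption: the limit is set cluster-wide for all MDS daemons.
    """
    return f"ceph config set mds mds_cache_memory_limit {cache_limit_bytes(gib)}"


# The sizes listed in the message, interpreted as GiB (an assumption).
for size in (1, 2, 5, 6, 10, 20, 50, 90):
    print(cache_limit_command(size))
```

Printing the commands first makes it easy to review the byte values before applying any of them.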


            >
            > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
            >> Hi,
            >>
            >> Disclaimer: I posted this before to the cheph.io <http://cheph.io>
            >> mailing list, but from the answers I didn't get and a look at the
            >> archives, I concluded that that list is very dead. So apologies if
            >> anyone has read this before.
            >>
            >> I am trying to copy our storage server to a CephFS. We have 5 MONs
            >> in our cluster and (now) 7 MDS with max_mds = 4. The list (!) of
            >> files I am trying to copy is about 23GB, so it's a lot of files. I
            >> am copying them in batches of 25k using 16 parallel rsync processes
            >> over a 10G link.
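
A rough sketch of what batching a large file list for rsync might look like. How the batches were actually produced and handed to rsync is not stated in the message; the helper names, the example paths, and the use of rsync's `--files-from` option are assumptions for illustration. The code only builds the command line and does not run it.

```python
from itertools import islice


def batches(paths, size=25_000):
    """Yield successive lists of at most `size` entries from an iterable."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def rsync_command(list_file, src_root, dest_root):
    """Build an rsync invocation that copies only the files named in list_file.

    Hypothetical helper: the paths in list_file are relative to src_root.
    """
    return ["rsync", "-a", f"--files-from={list_file}", src_root, dest_root]


# Example with made-up sizes and paths: 60,000 entries give three batches.
sizes = [len(b) for b in batches(range(60_000))]
print(sizes)
print(rsync_command("batch-0001.txt", "/srv/storage/", "/mnt/cephfs/"))
```

Each batch would be written to its own list file, and a pool of 16 workers would then pick up one list file at a time.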
            >>
            >> I started out with 5 MDSs / 2 active, but had repeated issues with
            >> immense and growing cache sizes far beyond the theoretical maximum
            >> of 400k inodes which the 16 rsync processes could keep open at the
            >> same time. The usual inode count was between 1 and 4 million and
            >> the cache size between 20 and 80GB on average.
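
The 400k figure follows from 16 rsync processes each working through a batch of 25k files. A small worked calculation, assuming one open inode per file in a batch (directories and other metadata are ignored, so this is only an approximation):

```python
def max_open_inodes(processes, batch_size):
    """Upper bound on inodes held open if each process holds one full batch."""
    return processes * batch_size


def overshoot(observed, budget):
    """How many times larger the observed inode count is than the budget."""
    return observed / budget


budget = max_open_inodes(16, 25_000)
print(budget)

# Observed range reported in the message: 1 to 4 million inodes.
for observed in (1_000_000, 4_000_000):
    print(observed, overshoot(observed, budget))
```

Under this assumption the reported inode counts are between 2.5 and 10 times the expected upper bound.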
            >>
            >> After a while, the MDSs started failing under this load by either
            >> crashing or being kicked from the quorum. I tried increasing the
            >> max cache size, max log segments, and beacon grace period, but to
            >> no avail. A crashed MDS often needs minutes to rejoin.
            >>
            >> The MDSs fail with the following message:
            >>
            >>    -21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
            >>    -20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
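
To spot these messages in a large MDS log, something like the following could help. It is only a sketch: the regular expression is written against the single log line quoted above, and the 15-second grace value is taken from the "timed out after 15" line rather than from any configuration in the message. The function names are hypothetical.

```python
import re

# Matches the "last acked <seconds>s ago" fragment of the beacon log line.
BEACON_RE = re.compile(r"last acked (?P<age>\d+(?:\.\d+)?)s ago")

SAMPLE = (
    "2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping "
    "beacon heartbeat to monitors (last acked 24.0042s ago); "
    "MDS internal heartbeat is not healthy!"
)


def last_acked_seconds(line):
    """Return the age of the last acknowledged beacon, or None if absent."""
    match = BEACON_RE.search(line)
    return float(match.group("age")) if match else None


def exceeds_grace(line, grace_seconds):
    """True if the line reports a beacon older than the given grace period."""
    age = last_acked_seconds(line)
    return age is not None and age > grace_seconds


print(last_acked_seconds(SAMPLE))
print(exceeds_grace(SAMPLE, 15))
```

Running it over a whole log would show how long the beacons had been failing before the monitors replaced the MDS.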
            >>
            >> I found the following thread, which seems to be about the same
            >> general issue:
            >>
            >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
            >>
            >> Unfortunately, it does not really contain a solution except things
            >> I have tried already. Though it does give some explanation as to
            >> why the MDSs pile up so many open inodes. It appears that Ceph
            >> can't handle many (write-only) operations on different files,
            >> since the clients keep their capabilities open and the MDS can't
            >> evict them from its cache. This is very baffling to me, since how
            >> am I supposed to use a CephFS if I cannot fill it with files first?
            >>
            >> The next thing I tried was increasing the number of active MDSs.
            >> Three seemed to make it worse, but four worked surprisingly well.
            >> Unfortunately, the crash came eventually and the rank-0 MDS got
            >> kicked. Since then the standbys have been (not very successfully)
            >> playing round-robin to replace it, only to be kicked repeatedly.
            >> This is the status quo right now and it has been going for hours
            >> with no end in sight. The only option might be to kill all MDSs
            >> and let them restart from empty caches.
            >>
            >> While trying to rejoin, the MDSs keep logging the above-mentioned
            >> error message followed by
            >>
            >> 2019-07-23 17:53:37.386 7f3b135a5700 0 mds.0.cache.ino(0x100019693f8) have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8 /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0 state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1 0x5642a2ff7700]
            >>
            >> and then finally
            >>
            >> 2019-07-23 17:53:48.786 7fb02bc08700 1 mds.XXX Map has assigned me to become a standby
            >>
            >> The other thing I noticed over the last few days is that after a
            >> sufficient number of failures, the client locks up completely and
            >> the mount becomes unresponsive, even after the MDSs are back.
            >> Sometimes this lock-up is so catastrophic that I cannot even
            >> unmount the share with umount -lf anymore and a reboot of the
            >> machine causes the kernel to panic. This looks like a bug to me.
            >>
            >> I hope somebody can provide me with tips to stabilize our setup. I
            >> can move data through our RadosGWs over 7x10Gbps from 130 nodes in
            >> parallel, no problem. But I cannot even rsync a few TB of files
            >> from a single node to the CephFS without knocking out the MDS
            >> daemons.
            >>
            >> Any help is greatly appreciated!
            >>
            >> Janek
            >>
            >> _______________________________________________
            >> ceph-users mailing list
            >> ceph-users@lists.ceph.com
            <mailto:ceph-users@lists.ceph.com>
            >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
