Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

Janek Bevendorff Wed, 24 Jul 2019 00:36:06 -0700

I mean kernel version


Oh, of course. 4.15.0-54 on Ubuntu 18.04 LTS.

Right now I am also experiencing a different phenomenon. Since I wrappedit up yesterday, the MDS machines have been trying to rejoin, but couldonly handle a few hundred up to a few hundred thousand inodes per secondbefore crashing.

I had a look at the machines and the daemons had trouble allocatingmemory. There weren't many processes running and none of them consumedmore than 5GB, yet all 128 GB were used (and not freeable, so it wasn'tjust the page cache). Thus I suppose there must also be a memory leaksomewhere. No running process had this much memory allocated, so it musthave been allocated from kernel space. I am rebooting the machines rightnow as a last resort.

    try mounting cephfs on a machine/vm with small memory (4G~8G),
    then rsync your date into mount point of that machine.


    I could try running it in a memory-limited Docker container, but
    isn't there a better way to achieve the same thing? This sounds
    like a bug to me. A client having too much memory and failing to
    free its capabilities shouldn't crash the server. If the server
    decides to drop things from its cache, the client has to deal with it.

    Also in the long run, limiting the client's memory isn't a
    practical solution. We are planning to use the CephFS from our
    compute cluster, whose nodes have (and need) many more times the
    RAM that our storage servers have.


                The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz
                with 32 threads (Dual
                CPU with 8 physical cores each) and 128GB RAM. CPU
                usage is rather mild.
                While MDSs are trying to rejoin, they tend to
                saturate a single thread
                shortly, but nothing spectacular. During normal
                operation, none of the
                cores is particularly under load.

                > While migrating to a Nautilus cluster recently, we
                had up to 14
                > million inodes open, and we increased the cache
                limit to 16GiB. Other
                > than warnings about oversized cache, this caused no
                issues.

                I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB.
                Other than getting
                rid of the cache size warnings (and sometimes
                allowing an MDS to rejoin
                without being kicked again after a few seconds), it
                did not change much
                in terms of the actual problem. Right now I can
                change it to whatever I
                want, it doesn't do anything, because rank 0 keeps
                being trashed anyway
                (the other ranks are fine, but the CephFS is down
                anyway). Is there
                anything useful I can give you to debug this?
                Otherwise I would try
                killing the MDS daemons so I can at least restore the
                CephFS to a
                semi-operational state.


                >
                > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
                >> Hi,
                >>
                >> Disclaimer: I posted this before to the cheph.io
                <http://cheph.io> mailing list, but from
                >> the answers I didn't get and a look at the
                archives, I concluded that
                >> that list is very dead. So apologies if anyone has
                read this before.
                >>
                >> I am trying to copy our storage server to a
                CephFS. We have 5 MONs in
                >> our cluster and (now) 7 MDS with max_mds = 4. The
                list (!) of files I am
                >> trying to copy is about 23GB, so it's a lot of
                files. I am copying them
                >> in batches of 25k using 16 parallel rsync
                processes over a 10G link.
                >>
                >> I started out with 5 MDSs / 2 active, but had
                repeated issues with
                >> immense and growing cache sizes far beyond the
                theoretical maximum of
                >> 400k inodes which the 16 rsync processes could
                keep open at the same
                >> time. The usual inode count was between 1 and 4
                million and the cache
                >> size between 20 and 80GB on average.
                >>
                >> After a while, the MDSs started failing under this
                load by either
                >> crashing or being kicked from the quorum. I tried
                increasing the max
                >> cache size, max log segments, and beacon grace
                period, but to no avail.
                >> A crashed MDS often needs minutes to rejoin.
                >>
                >> The MDSs fail with the following message:
                >>
                >>    -21> 2019-07-22 14:00:05.877 7f67eacec700  1
                heartbeat_map is_healthy
                >> 'MDSRank' had timed out after 15
                >>    -20> 2019-07-22 14:00:05.877 7f67eacec700  0
                mds.beacon.XXX Skipping
                >> beacon heartbeat to monitors (last acked 24.0042s
                ago); MDS internal
                >> heartbeat is not healthy!
                >>
                >> I found the following thread, which seems to be
                about the same general
                >> issue:
                >>
                >>
                
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
                >>
                >> Unfortunately, it does not really contain a
                solution except things I
                >> have tried already. Though it does give some
                explanation as to why the
                >> MDSs pile up so many open inodes. It appears like
                Ceph can't handle many
                >> (write-only) operations on different files, since
                the clients keep their
                >> capabilities open and the MDS can't evict them
                from its cache. This is
                >> very baffling to me, since how am I supposed to
                use a CephFS if I cannot
                >> fill it with files before?
                >>
                >> The next thing I tried was increasing the number
                of active MDSs. Three
                >> seemed to make it worse, but four worked
                surprisingly well.
                >> Unfortunately, the crash came eventually and the
                rank-0 MDS got kicked.
                >> Since then the standbys have been (not very
                successfully) playing
                >> round-robin to replace it, only to be kicked
                repeatedly. This is the
                >> status quo right now and it has been going for
                hours with no end in
                >> sight. The only option might be to kill all MDSs
                and let them restart
                >> from empty caches.
                >>
                >> While trying to rejoin, the MDSs keep logging the
                above-mentioned error
                >> message followed by
                >>
                >> 2019-07-23 17:53:37.386 7f3b135a5700  0
                mds.0.cache.ino(0x100019693f8)
                >> have open dirfrag * but not leaf in
                fragtree_t(*^3): [dir 0x100019693f8
                >> /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0
                cv=0/0
                >> state=1140850688 f() n() hs=17033+0,ss=0+0 |
                child=1 replicated=1
                >> 0x5642a2ff7700]
                >>
                >> and then finally
                >>
                >> 2019-07-23 17:53:48.786 7fb02bc08700  1 mds.XXX
                Map has assigned me to
                >> become a standby
                >>
                >> The other thing I noticed over the last few days
                is that after a
                >> sufficient number of failures, the client locks up
                completely and the
                >> mount becomes unresponsive, even after the MDSs
                are back. Sometimes this
                >> lock-up is so catastrophic that I cannot even
                unmount the share with
                >> umount -lf anymore and a reboot of the machine
                lets the kernel panic.
                >> This looks like a bug to me.
                >>
                >> I hope somebody can provide me with tips to
                stabilize our setup. I can
                >> move data through our RadosGWs over 7x10Gbps from
                130 nodes in parallel,
                >> no problem. But I cannot even rsync a few TB of
                files from a single node
                >> to the CephFS without knocking out the MDS daemons.
                >>
                >> Any help is greatly appreciated!
                >>
                >> Janek
                >>
                >> _______________________________________________
                >> ceph-users mailing list
                >> ceph-users@lists.ceph.com
                <mailto:ceph-users@lists.ceph.com>
                >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
                _______________________________________________
                ceph-users mailing list
                ceph-users@lists.ceph.com
                <mailto:ceph-users@lists.ceph.com>
                http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

Reply via email to