On Wed, Jul 24, 2019 at 3:13 PM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

>
> which version?
>
> Nautilus, 14.2.2.
>

I mean the kernel version.
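For reference, if the CephFS kernel client is in use, the client kernel version
can be checked on the mounting machine with

    uname -r

and "mount | grep ceph" will show whether the kernel client or ceph-fuse is mounted.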


> Try mounting cephfs on a machine/VM with small memory (4G~8G), then rsync
> your data into the mount point of that machine.
>
> I could try running it in a memory-limited Docker container, but isn't
> there a better way to achieve the same thing? This sounds like a bug to me.
> A client having too much memory and failing to free its capabilities
> shouldn't crash the server. If the server decides to drop things from its
> cache, the client has to deal with it.
>
> Also, in the long run, limiting the client's memory isn't a practical
> solution. We are planning to use the CephFS from our compute cluster, whose
> nodes have (and need) many times more RAM than our storage servers have.
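As a rough sketch of the suggested test, assuming ceph-fuse rather than the
kernel client (a container memory limit cannot constrain the kernel client's
caching, which lives in the host kernel); the image, monitor address, and
paths below are placeholders:

    docker run -it --rm --memory=6g --device /dev/fuse --cap-add SYS_ADMIN \
        -v /etc/ceph:/etc/ceph:ro -v /data/source:/data/source:ro \
        ceph/ceph:v14.2.2 bash
    # inside the container (rsync may need to be installed first):
    mkdir -p /mnt/cephfs
    ceph-fuse --id admin -m mon1:6789 /mnt/cephfs
    rsync -a /data/source/ /mnt/cephfs/target/

A small VM with 4-8 GB of RAM and the kernel client would be closer to the
test suggested above.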
>
>>
>> The MDS nodes have Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (dual
>> CPUs with 8 physical cores each) and 128GB RAM. CPU usage is rather mild.
>> While MDSs are trying to rejoin, they tend to briefly saturate a single
>> thread, but nothing spectacular. During normal operation, none of the
>> cores is particularly under load.
>>
>> > While migrating to a Nautilus cluster recently, we had up to 14
>> > million inodes open, and we increased the cache limit to 16GiB. Other
>> > than warnings about oversized cache, this caused no issues.
>>
>> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting
>> rid of the cache size warnings (and sometimes allowing an MDS to rejoin
>> without being kicked again after a few seconds), it did not change much
>> in terms of the actual problem. Right now I can change it to whatever I
>> want and it doesn't do anything, because rank 0 keeps being thrashed
>> (the other ranks are fine, but the CephFS is down regardless). Is there
>> anything useful I can give you to debug this? Otherwise I would try
>> killing the MDS daemons so I can at least restore the CephFS to a
>> semi-operational state.
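For reference, the limit can be changed and verified at runtime on Nautilus
roughly like this (the MDS daemon name is a placeholder; the daemon commands
run on the respective MDS host):

    ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB, value in bytes
    ceph daemon mds.<name> config get mds_cache_memory_limit
    ceph daemon mds.<name> cache status                      # actual cache memory usage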
>>
>>
>> >
>> > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>> >> Hi,
>> >>
>> >> Disclaimer: I posted this before to the ceph.io mailing list, but from
>> >> the answers I didn't get and a look at the archives, I concluded that
>> >> that list is very dead. So apologies if anyone has read this before.
>> >>
>> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in
>> >> our cluster and (now) 7 MDSs with max_mds = 4. The list (!) of files I am
>> >> trying to copy is about 23GB, so it's a lot of files. I am copying them
>> >> in batches of 25k using 16 parallel rsync processes over a 10G link.
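For illustration, a batching scheme like the one described might look roughly
like this; the file list, chunk size, and paths are placeholders, not the
actual commands used:

    split -l 25000 filelist.txt chunk_
    ls chunk_* | xargs -n1 -P16 -I{} \
        rsync -a --files-from={} /mnt/storage/ /mnt/cephfs/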
>> >>
>> >> I started out with 5 MDSs / 2 active, but had repeated issues with
>> >> immense and growing cache sizes far beyond the theoretical maximum of
>> >> 400k inodes which the 16 rsync processes could keep open at the same
>> >> time. The usual inode count was between 1 and 4 million and the cache
>> >> size between 20 and 80GB on average.
>> >>
>> >> After a while, the MDSs started failing under this load by either
>> >> crashing or being kicked from the quorum. I tried increasing the max
>> >> cache size, max log segments, and beacon grace period, but to no avail.
>> >> A crashed MDS often needs minutes to rejoin.
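For reference, the settings mentioned correspond to these options on Nautilus;
the values are only examples, not recommendations:

    ceph config set mds mds_cache_memory_limit 17179869184   # bytes
    ceph config set mds mds_log_max_segments 256
    ceph config set global mds_beacon_grace 60

mds_beacon_grace is evaluated by the monitors, so setting it only on the MDS
daemons may not be enough.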
>> >>
>> >> The MDSs fail with the following message:
>> >>
>> >>    -21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
>> >> 'MDSRank' had timed out after 15
>> >>    -20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
>> >> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
>> >> heartbeat is not healthy!
>> >>
>> >> I found the following thread, which seems to be about the same general
>> >> issue:
>> >>
>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>> >>
>> >> Unfortunately, it does not really contain a solution except things I
>> >> have tried already. Though it does give some explanation as to why the
>> >> MDSs pile up so many open inodes. It appears that Ceph can't handle many
>> >> (write-only) operations on different files, since the clients keep their
>> >> capabilities open and the MDS can't evict them from its cache. This is
>> >> very baffling to me, since how am I supposed to use a CephFS if I cannot
>> >> fill it with files first?
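For what it's worth, the capabilities held per client can be inspected on the
MDS side; the daemon name is a placeholder and the command runs on the MDS host:

    ceph daemon mds.<name> session ls    # per-client sessions, including num_caps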
>> >>
>> >> The next thing I tried was increasing the number of active MDSs. Three
>> >> seemed to make it worse, but four worked surprisingly well.
>> >> Unfortunately, the crash came eventually and the rank-0 MDS got kicked.
>> >> Since then the standbys have been (not very successfully) playing
>> >> round-robin to replace it, only to be kicked repeatedly. This is the
>> >> status quo right now and it has been going on for hours with no end in
>> >> sight. The only option might be to kill all MDSs and let them restart
>> >> from empty caches.
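For context, the active MDS count is a per-filesystem setting, and the current
rank and standby layout can be checked with (filesystem name is a placeholder):

    ceph fs set <fsname> max_mds 4
    ceph fs status
    ceph health detail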
>> >>
>> >> While trying to rejoin, the MDSs keep logging the above-mentioned error
>> >> message followed by
>> >>
>> >> 2019-07-23 17:53:37.386 7f3b135a5700  0 mds.0.cache.ino(0x100019693f8)
>> >> have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8
>> >> /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0
>> >> state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1
>> >> 0x5642a2ff7700]
>> >>
>> >> and then finally
>> >>
>> >> 2019-07-23 17:53:48.786 7fb02bc08700  1 mds.XXX Map has assigned me to
>> >> become a standby
>> >>
>> >> The other thing I noticed over the last few days is that after a
>> >> sufficient number of failures, the client locks up completely and the
>> >> mount becomes unresponsive, even after the MDSs are back. Sometimes this
>> >> lock-up is so catastrophic that I cannot even unmount the share with
>> >> umount -lf anymore, and rebooting the machine causes a kernel panic.
>> >> This looks like a bug to me.
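If a client session hangs like this, one option (with the usual caveat that the
evicted client gets blacklisted) may be to evict it from the MDS side; the rank
and client id below are placeholders:

    ceph tell mds.0 client ls                    # find the session id of the stuck client
    ceph tell mds.0 client evict id=<client-id>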
>> >>
>> >> I hope somebody can provide me with tips to stabilize our setup. I can
>> >> move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel,
>> >> no problem. But I cannot even rsync a few TB of files from a single node
>> >> to the CephFS without knocking out the MDS daemons.
>> >>
>> >> Any help is greatly appreciated!
>> >>
>> >> Janek
>> >>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
