Does a Ceph-fuse mount also have the same issue?

On Wed, Jul 24, 2019 at 3:35 AM Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:
>
> > I mean kernel version
>
> Oh, of course. 4.15.0-54 on Ubuntu 18.04 LTS.
>
> Right now I am also experiencing a different phenomenon. Since I wrapped it up yesterday, the MDS machines have been trying to rejoin, but could only handle a few hundred up to a few hundred thousand inodes per second before crashing.
>
> I had a look at the machines, and the daemons had trouble allocating memory. There weren't many processes running and none of them consumed more than 5GB, yet all 128 GB were used (and not freeable, so it wasn't just the page cache). Thus I suppose there must also be a memory leak somewhere. No running process had this much memory allocated, so it must have been allocated from kernel space. I am rebooting the machines right now as a last resort.
>
>> try mounting cephfs on a machine/vm with small memory (4G~8G), then rsync your data into the mount point of that machine.
>>
>> I could try running it in a memory-limited Docker container, but isn't there a better way to achieve the same thing? This sounds like a bug to me. A client having too much memory and failing to free its capabilities shouldn't crash the server. If the server decides to drop things from its cache, the client has to deal with it.
>>
>> Also in the long run, limiting the client's memory isn't a practical solution. We are planning to use the CephFS from our compute cluster, whose nodes have (and need) many times more RAM than our storage servers have.
>>
>>> The MDS nodes have Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (dual CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather mild. While MDSs are trying to rejoin, they tend to saturate a single thread briefly, but nothing spectacular. During normal operation, none of the cores is particularly under load.
>>>
>>> > While migrating to a Nautilus cluster recently, we had up to 14 million inodes open, and we increased the cache limit to 16GiB. Other than warnings about oversized cache, this caused no issues.
>>>
>>> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting rid of the cache size warnings (and sometimes allowing an MDS to rejoin without being kicked again after a few seconds), it did not change much in terms of the actual problem. Right now I can change it to whatever I want and it doesn't do anything, because rank 0 keeps being thrashed anyway (the other ranks are fine, but the CephFS is down regardless). Is there anything useful I can give you to debug this? Otherwise I would try killing the MDS daemons so I can at least restore the CephFS to a semi-operational state.
>>>
>>> > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>>> >> Hi,
>>> >>
>>> >> Disclaimer: I posted this before to the ceph.io mailing list, but from the answers I didn't get and a look at the archives, I concluded that that list is very dead. So apologies if anyone has read this before.
>>> >>
>>> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in our cluster and (now) 7 MDSs with max_mds = 4. The list (!) of files I am trying to copy is about 23GB, so it's a lot of files. I am copying them in batches of 25k using 16 parallel rsync processes over a 10G link [a sketch of this batching setup is appended at the end of this thread].
>>> >>
>>> >> I started out with 5 MDSs / 2 active, but had repeated issues with immense and growing cache sizes far beyond the theoretical maximum of 400k inodes which the 16 rsync processes could keep open at the same time. The usual inode count was between 1 and 4 million and the cache size between 20 and 80GB on average.
>>> >>
>>> >> After a while, the MDSs started failing under this load by either crashing or being kicked from the quorum. I tried increasing the max cache size, max log segments, and beacon grace period, but to no avail [a sketch of these settings is appended at the end of this thread]. A crashed MDS often needs minutes to rejoin.
>>> >>
>>> >> The MDSs fail with the following message:
>>> >>
>>> >> -21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>> >> -20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
>>> >>
>>> >> I found the following thread, which seems to be about the same general issue:
>>> >>
>>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>>> >>
>>> >> Unfortunately, it does not really contain a solution except things I have tried already. It does, however, give some explanation as to why the MDSs pile up so many open inodes: it appears Ceph can't handle many (write-only) operations on different files, since the clients keep their capabilities open and the MDS can't evict them from its cache [a sketch for inspecting per-client capability counts is appended at the end of this thread]. This is very baffling to me, since how am I supposed to use a CephFS if I cannot fill it with files first?
>>> >>
>>> >> The next thing I tried was increasing the number of active MDSs. Three seemed to make it worse, but four worked surprisingly well. Unfortunately, the crash came eventually and the rank-0 MDS got kicked. Since then the standbys have been (not very successfully) playing round-robin to replace it, only to be kicked repeatedly. This is the status quo right now and it has been going on for hours with no end in sight. The only option might be to kill all MDSs and let them restart from empty caches.
>>> >>
>>> >> While trying to rejoin, the MDSs keep logging the above-mentioned error message followed by
>>> >>
>>> >> 2019-07-23 17:53:37.386 7f3b135a5700 0 mds.0.cache.ino(0x100019693f8) have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8 /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0 state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1 0x5642a2ff7700]
>>> >>
>>> >> and then finally
>>> >>
>>> >> 2019-07-23 17:53:48.786 7fb02bc08700 1 mds.XXX Map has assigned me to become a standby
>>> >>
>>> >> The other thing I noticed over the last few days is that after a sufficient number of failures, the client locks up completely and the mount becomes unresponsive, even after the MDSs are back. Sometimes this lock-up is so catastrophic that I cannot even unmount the share with umount -lf anymore, and a reboot of the machine makes the kernel panic. This looks like a bug to me.
>>> >>
>>> >> I hope somebody can provide me with tips to stabilize our setup. I can move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel, no problem.
>>> >> But I cannot even rsync a few TB of files from a single node to the CephFS without knocking out the MDS daemons.
>>> >>
>>> >> Any help is greatly appreciated!
>>> >>
>>> >> Janek
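
[Editorial note: below is a minimal sketch of the batching setup Janek describes, a large file list split into chunks of 25k paths and fed to up to 16 parallel rsync processes. The directory paths, list file name, chunk size, and worker count are placeholders, not values taken from the original setup.]

#!/usr/bin/env python3
"""Rough sketch of the batched, parallel rsync run described in the thread.

Assumptions (not from the original post): the file list is a newline-separated
list of paths relative to SRC, and plain `rsync -a --files-from=...` is used.
"""
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

SRC = "/data/source/"          # hypothetical source directory
DST = "/mnt/cephfs/target/"    # hypothetical CephFS mount point
FILE_LIST = "file_list.txt"    # newline-separated paths relative to SRC
BATCH_SIZE = 25_000            # files per rsync invocation
WORKERS = 16                   # parallel rsync processes


def batches(path, size):
    """Yield lists of at most `size` lines from the file list."""
    with open(path) as f:
        while True:
            chunk = list(islice(f, size))
            if not chunk:
                return
            yield chunk


def run_rsync(chunk):
    """Write one batch to a temp file and hand it to rsync via --files-from."""
    with tempfile.NamedTemporaryFile("w", suffix=".list") as tmp:
        tmp.writelines(chunk)
        tmp.flush()
        subprocess.run(
            ["rsync", "-a", "--files-from", tmp.name, SRC, DST],
            check=True,
        )


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(run_rsync, batches(FILE_LIST, BATCH_SIZE)))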
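[Editorial note: the cache size, log segment, and beacon grace settings mentioned in the thread are typically adjusted with `ceph config set` on Nautilus. The option names below (`mds_cache_memory_limit`, `mds_log_max_segments`, `mds_beacon_grace`) are the usual ones, but the values are arbitrary examples and the appropriate config section can vary between releases; verify with `ceph config help <option>` before applying anything. A sketch:]

#!/usr/bin/env python3
"""Sketch only: bump the MDS cache/log/beacon settings discussed in the thread.

The option names are standard, but the example values are arbitrary and the
right section ("mds" vs. "global") may differ between releases -- check
`ceph config help <option>` first.
"""
import subprocess

SETTINGS = {
    "mds_cache_memory_limit": str(16 * 1024**3),  # 16 GiB cache target (example value)
    "mds_log_max_segments": "256",                # example value
    "mds_beacon_grace": "60",                     # seconds before an MDS is considered laggy
}

for option, value in SETTINGS.items():
    # `ceph config set mds <option> <value>` stores the value in the mon config database.
    subprocess.run(["ceph", "config", "set", "mds", option, value], check=True)

# Show what is now configured.
subprocess.run(["ceph", "config", "dump"], check=True)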
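[Editorial note: regarding the capability pile-up described above, one way to see which clients are holding caps is to ask an MDS for its session list. The sketch below shells out to `ceph daemon mds.<name> session ls` and sorts clients by capability count; the JSON field names assumed here (`id`, `num_caps`) can differ between releases, so treat this as an illustration rather than a recipe.]

#!/usr/bin/env python3
"""Sketch: list CephFS client sessions on one MDS, sorted by capability count.

Run on the MDS host (needs access to the daemon's admin socket). The JSON
field names ("id", "num_caps") are assumptions based on recent releases and
may differ on yours -- inspect the raw output first.
"""
import json
import subprocess
import sys

MDS_NAME = sys.argv[1] if len(sys.argv) > 1 else "mds-host-1"  # hypothetical daemon name

# `session ls` prints one JSON object per client session on this MDS rank.
out = subprocess.run(
    ["ceph", "daemon", f"mds.{MDS_NAME}", "session", "ls"],
    check=True, capture_output=True, text=True,
).stdout

sessions = json.loads(out)
for s in sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True):
    print(f"client.{s.get('id')}  caps={s.get('num_caps', 'unknown')}")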