Hi Francois,

thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It writes the cache contents to a file after rejoin, but does not drop the cache, so it will not help you. I think it was implemented recently to make it possible to send a cache dump file to the developers after an MDS crash, before the restarting MDS changes the cache.
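For completeness, a minimal sketch of how one could get such a cache dump on Nautilus (the underscore option name and the admin-socket variant are my understanding of the tooling, so double-check before relying on them; "mds.a" and the dump path are only placeholders):

    # debugging only: write the MDS cache to a file after rejoin (does not free memory)
    ceph config set mds mds_dump_cache_after_rejoin true

    # a cache dump of a running MDS can also be taken on demand via the admin socket,
    # run on the host of that MDS ("mds.a" and the path are placeholders)
    ceph daemon mds.a dump cache /tmp/mds.a.cache.dump

Either way, this only produces a file for analysis; it does not shrink the cache or speed up replay.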
In your case, I would set osd_op_queue_cut_off=high during the next regular cluster maintenance or upgrade. My best bet right now is to try to add swap. Maybe someone else reading this has a better idea, or you will find a hint in one of the other threads. Good luck!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <f...@lpnhe.in2p3.fr>
Sent: 05 June 2020 14:34:06
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted

Le 05/06/2020 à 14:18, Frank Schilder a écrit :
> Hi Francois,
>
>> I was also wondering if setting mds dump cache after rejoin could help ?
> Haven't heard of that option. Is there some documentation?

I found it on: https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/

mds dump cache after rejoin
    Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
    Type:        Boolean
    Default:     false

but I don't think it can help in my case, because rejoin occurs after replay, and in my case replay never ends!

>> I have :
>> osd_op_queue=wpq
>> osd_op_queue_cut_off=low
>> I can try to set osd_op_queue_cut_off to high, but it will be useful
>> only if the mds gets active, true ?
> I think so. If you have no clients connected, there should not be queue
> priority issues. Maybe it is best to wait until your cluster is healthy again,
> as you will need to restart all daemons. Make sure you set this in [global].
> When I applied that change, and after re-starting all OSDs, my MDSes had
> reconnect issues until I set it on them too. I think all daemons use that
> option (the prefix osd_ is misleading).

For sure I would prefer not to restart all daemons, because the second filesystem is up and running (with production clients).

>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB,
>> which seems reasonable for an mds server with 32/48GB).
> This sounds bad. 8GB should not cause any issues. Maybe you are hitting a
> bug; I believe there is a regression in Nautilus. There were recent threads
> on absurdly high memory use by MDSes. Maybe it's worth searching for these
> in the list.

I will have a look.

>> I already forced the clients to unmount (and even rebooted the ones from
>> which the rsync and the rmdir .snaps were launched).
> I don't know when the MDS acknowledges this. If it was a clean unmount (i.e.
> without -f or forced by reboot), the MDS should have dropped the clients
> already. If it was an unclean unmount, it might not be that easy to get the
> stale client session out. However, I don't know about that.

Moreover, when I did that, the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds!

>> I think that providing more swap may be the solution ! I will try that if
>> I cannot find another way to fix it.
> If the memory overrun is somewhat limited, this should allow the MDS to trim
> the logs. It will take a while, but it will get there eventually.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand <f...@lpnhe.in2p3.fr>
> Sent: 05 June 2020 13:46:03
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>
> I was also wondering if setting mds dump cache after rejoin could help ?
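As an aside on the osd_op_queue_cut_off change discussed above, a minimal sketch of how it could be applied (assuming the Nautilus centralized config database is in use; "osd.0" is only a placeholder, and the daemons still need a restart for the new value to take effect):

    # check what a running OSD currently uses (run on the host of that OSD)
    ceph daemon osd.0 config get osd_op_queue
    ceph daemon osd.0 config get osd_op_queue_cut_off

    # set it in the global section so OSDs and MDSes pick it up after a restart
    ceph config set global osd_op_queue_cut_off high

    # the equivalent ceph.conf entry would be:
    #   [global]
    #   osd_op_queue = wpq
    #   osd_op_queue_cut_off = high

    # then restart the daemons one by one during a maintenance window, e.g.
    systemctl restart ceph-osd@0

This is only a sketch of the setting itself; whether and when to restart the daemons is exactly the trade-off discussed above.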
>
>
> Le 05/06/2020 à 12:49, Frank Schilder a écrit :
>> Out of interest, I did the same on a mimic cluster a few months ago, running
>> up to 5 parallel rsync sessions without any problems. I moved about 120TB.
>> Each rsync was running on a separate client with its own cache. I made sure
>> that the sync dirs were all disjoint (no overlap of files/directories).
>>
>> How many rsync processes are you running in parallel?
>> Do you have these settings enabled:
>>
>> osd_op_queue=wpq
>> osd_op_queue_cut_off=high
>>
>> WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting
>> the latter removed any behind-on-trimming problems we had seen before.
>>
>> You are in a somewhat peculiar situation. I think you need to trim client
>> caches, which requires an active MDS. If your MDS becomes active for at
>> least some time, you could try the following (I'm not an expert here, so
>> take it with a grain of scepticism):
>>
>> - reduce the MDS cache memory limit to force recall of caps much earlier
>>   than now
>> - reduce the client cache size
>> - set "osd_op_queue_cut_off=high" if not already done; I think this
>>   requires a restart of the OSDs, so be careful
>>
>> At this point, you could watch your restart cycle to see if things improve
>> already. Maybe nothing more is required.
>>
>> If you have good SSDs, you could try to temporarily provide some swap space.
>> It saved me once. This will be very slow, but at least it might allow you to
>> move forward.
>>
>> Harder measures:
>>
>> - stop all I/O from the FS clients, throw users out if necessary
>> - ideally, try to cleanly (!) shut down clients or force trimming of the
>>   cache by either
>>   * umount, or
>>   * sync; echo 3 > /proc/sys/vm/drop_caches
>>   Either of these might hang for a long time. Do not interrupt, and do not
>>   do this on more than one client at a time.
>>
>> At some point, your active MDS should be able to hold a full session. You
>> should then tune the cache and other parameters such that the MDSes can
>> handle your rsync sessions.
>>
>> My experience is that MDSes overrun their cache limits quite a lot. Since I
>> reduced mds_cache_memory_limit to not more than half of what is physically
>> available, I have not had any problems again.
>>
>> Hope that helps.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>> Sent: 05 June 2020 11:42:42
>> To: ceph-users
>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>
>> Hi all,
>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
>> 3 mds (1 active for each fs + one failover).
>> We are transferring all the data (~600M files) from one FS (which was in
>> EC 3+2) to the other FS (in R3).
>> On the old FS we first removed the snapshots (to avoid strays problems
>> when removing files) and then ran some rsyncs, deleting the files after
>> the transfer.
>> The operation should take a few more weeks to complete.
>> But a few days ago, we started to get the warning "mds behind on trimming"
>> from the mds managing the old FS.
>> Yesterday, I restarted the active mds service to force a takeover by
>> the standby mds (basically because the standby is more powerful and
>> has more memory, i.e. 48GB instead of 32).
>> The standby mds took rank 0 and started to replay... the "mds behind on
>> trimming" warning came back, and the number of segments rose, as did the
>> memory usage of the server.
>> Finally, it exhausted the memory of the mds,
>> the service stopped, and the previous mds took rank 0 and started to
>> replay... until memory exhaustion and a new switch of mds, etc.
>> It thus seems that we are in a never-ending loop! And of course, as the
>> mds is always in replay, the data are not accessible and the transfers
>> are blocked.
>> I stopped all the rsyncs and unmounted the clients.
>>
>> My questions are:
>> - Does the mds trim during replay, so we could hope that after a while
>>   it will purge everything and the mds will be able to become active in
>>   the end?
>> - Is there a way to accelerate the operation or to fix this situation?
>>
>> Thanks for your help.
>> F.
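Since temporary swap comes up several times in this thread, here is a minimal sketch of what that could look like on the MDS host (the path and size are only placeholders, an SSD-backed filesystem is assumed, and the cache-limit value is just an example):

    # create and enable a temporary swap file (use dd instead of fallocate if
    # swapon rejects a fallocated file on your filesystem)
    fallocate -l 64G /var/tmp/mds-swap
    chmod 600 /var/tmp/mds-swap
    mkswap /var/tmp/mds-swap
    swapon /var/tmp/mds-swap
    swapon --show

    # optionally lower the MDS cache limit so caps are recalled earlier
    ceph config set mds mds_cache_memory_limit 4294967296

    # once the filesystem is healthy again, remove the swap file
    swapoff /var/tmp/mds-swap && rm /var/tmp/mds-swap

Replay with part of the cache in swap will be very slow, as noted above, but it may let the MDS get through the journal instead of running out of memory again.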