That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on?
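For example, something like the following (standard commands, nothing cluster-specific assumed) would show whether anything else is degraded or being moved around:

ceph status            # degraded/misplaced objects, recovery and client I/O
ceph health detail     # every active warning with the daemons involved
ceph osd pool stats    # per-pool recovery/backfill and client I/O rates

If recovery or backfill shows up there, the extra load on the metadata pool could explain the slow trimming.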
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <f...@lpnhe.in2p3.fr>
Sent: 08 June 2020 15:27:59
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted

Thanks again for the hint!
Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck, with an op of age 34987.5 (while the other osds have ages < 1).
I tried a ceph osd down 27, which reset the age, but I can see that the age of osd.27's ops is rising again. I think I will restart it (btw our osd servers and mds are different machines).
F.

Le 08/06/2020 à 15:01, Frank Schilder a écrit :
> Hi Francois,
>
> this sounds great. At least it's operational. I guess it is still using a lot of swap while trying to replay operations.
>
> I would cleanly disconnect all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming.
>
> The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSD's OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out.
>
> If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful; if in doubt, ask the list first.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand <f...@lpnhe.in2p3.fr>
> Sent: 08 June 2020 14:45:13
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>
> Hi Frank,
> Finally I did:
> ceph config set global mds_beacon_grace 600000
> and created /etc/sysctl.d/sysctl-ceph.conf with
> vm.min_free_kbytes=4194303
> and then
> sysctl --system
>
> After that, the mds stayed in rejoin for a very long time (almost 24 hours) with errors like:
> 2020-06-07 04:10:36.802 7ff866e2e700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2020-06-07 04:10:36.802 7ff866e2e700  0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy!
> 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271)
> and also
> 2020-06-07 04:10:44.942 7ff86d63b700  0 auth: could not find secret_id=10363
> 2020-06-07 04:10:44.942 7ff86d63b700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363
>
> but in the end the mds went active! :-)
> I let it rest from Sunday afternoon until this morning.
> Indeed I was able to connect clients (in read-only for now) and read the data.
> I checked the connected clients with ceph tell mds.lpnceph-mds02.in2p3.fr client ls, disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command.
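A side note on stale sessions: if one ever survives an unmount, it can also be dropped from the MDS side, roughly like this (the id comes from the "client ls" output; note that eviction blacklists the client by default):

ceph tell mds.lpnceph-mds02.in2p3.fr client ls              # list the remaining sessions
ceph tell mds.lpnceph-mds02.in2p3.fr client evict id=<id>   # force out a single stale session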
> But I still have the following warnings:
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>     mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs
> MDS_TRIM 1 MDSs behind on trimming
>     mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836
>
> and the number of segments is still rising (slowly).
> F.
>
>
> Le 08/06/2020 à 12:00, Frank Schilder a écrit :
>> Hi Francois,
>>
>> did you manage to get any further with this?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <fr...@dtu.dk>
>> Sent: 06 June 2020 15:21:59
>> To: ceph-users; f...@lpnhe.in2p3.fr
>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS: it stopped reporting to the MONs due to a laggy connection. This lagginess is a result of swapping:
>>
>>> 2020-06-05 21:39:06.015 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>
>> Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>> Sent: 06 June 2020 11:11
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> Thanks for the tip,
>> I will try that. For now vm.min_free_kbytes = 90112.
>> Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0, but this didn't change anything...
>> -27> 2020-06-06 06:15:07.373 7f83e3626700  1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon
>> which is the same time since the last acked beacon as I had before changing the parameter.
>> As the mds beacon interval is 4 s, setting mds_beacon_grace to 240 should lead to 960 s (16 min). Thus I think that the bottleneck is elsewhere.
>> F.
>>
>>
>> Le 06/06/2020 à 09:47, Frank Schilder a écrit :
>>> Hi Francois,
>>>
>>> there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value of the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with
>>>
>>> sysctl vm.min_free_kbytes
>>>
>>> In your case, with heavy swap usage, this value should probably be somewhere between 2-4GB.
>>>
>>> Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM-kill your machine. Make sure that plenty of pages are unused. Drop the page cache if necessary or reboot the machine before setting this value.
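A minimal sketch of that sequence, assuming the ~4 GB upper end mentioned above (the file name under /etc/sysctl.d/ is arbitrary):

sync; echo 3 > /proc/sys/vm/drop_caches                        # free the page cache first so the higher watermark can be met
sysctl vm.min_free_kbytes                                      # check the current value
sysctl -w vm.min_free_kbytes=4194304                           # ~4 GB, takes effect immediately
echo vm.min_free_kbytes=4194304 > /etc/sysctl.d/90-vm.conf     # persist across reboots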
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Frank Schilder <fr...@dtu.dk>
>>> Sent: 06 June 2020 00:36:13
>>> To: ceph-users; f...@lpnhe.in2p3.fr
>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>
>>> Hi Francois,
>>>
>>> yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and watch the cluster health closely.
>>>
>>> Good luck and best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>> Sent: 05 June 2020 23:51:04
>>> To: Frank Schilder; ceph-users
>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>
>>> Hi,
>>> Unfortunately, adding swap did not solve the problem!
>>> I added 400 GB of swap. The MDS used about 18 GB of swap after consuming all the RAM and stopped with the following logs:
>>>
>>> 2020-06-05 21:33:31.967 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1
>>> 2020-06-05 21:33:40.355 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1
>>> 2020-06-05 21:33:59.787 7f251b7e5700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>> 2020-06-05 21:33:59.787 7f251b7e5700  0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy!
>>> 2020-06-05 21:34:00.287 7f251b7e5700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>> 2020-06-05 21:34:00.287 7f251b7e5700  0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy!
>>> ....
>>> 2020-06-05 21:39:05.991 7f251bfe6700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
>>> 2020-06-05 21:39:06.015 7f251bfe6700  1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon
>>> 2020-06-05 21:39:06.015 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>> 2020-06-05 21:39:19.838 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>> 2020-06-05 21:39:19.869 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1
>>> 2020-06-05 21:39:19.869 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning
>>> 2020-06-05 21:39:19.870 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr respawn!
>>> --- begin dump of recent events ---
>>> -9999> 2020-06-05 19:28:07.982 7f25217f1700  5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951
>>> -9998> 2020-06-05 19:28:11.053 7f251b7e5700  5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132
>>> -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0
>>> -9996> 2020-06-05 19:28:12.176 7f25217f1700  5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294
>>> -9995> 2020-06-05 19:28:12.176 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1
>>> -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick
>>> -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995)
>>> ...
>>> 2020-06-05 21:39:31.092 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1
>>> 2020-06-05 21:39:35.257 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1
>>> 2020-06-05 21:39:35.257 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby
>>>
>>> However, the mons don't seem particularly loaded!
>>> So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I set it both for the mds and mon daemons because it seems to be present in both confs).
>>> I will tell you if it works.
>>>
>>> Any other clue?
>>> F.
>>>
>>> Le 05/06/2020 à 14:44, Frank Schilder a écrit :
>>>> Hi Francois,
>>>>
>>>> thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash, before the restarting MDS changes the cache.
>>>>
>>>> In your case, I would set osd_op_queue_cut_off during the next regular cluster service or upgrade.
>>>>
>>>> My best bet right now is to try to add swap. Maybe someone else reading this has a better idea, or you find a hint in one of the other threads.
>>>>
>>>> Good luck!
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>> Sent: 05 June 2020 14:34:06
>>>> To: Frank Schilder; ceph-users
>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>
>>>> Le 05/06/2020 à 14:18, Frank Schilder a écrit :
>>>>> Hi Francois,
>>>>>
>>>>>> I was also wondering if setting mds dump cache after rejoin could help?
>>>>> Haven't heard of that option. Is there some documentation?
>>>> I found it on:
>>>> https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
>>>> mds dump cache after rejoin
>>>> Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
>>>> Type: Boolean
>>>> Default: false
>>>>
>>>> but I don't think it can help in my case, because rejoin occurs after replay, and in my case replay never ends!
>>>>
>>>>>> I have:
>>>>>> osd_op_queue=wpq
>>>>>> osd_op_queue_cut_off=low
>>>>>> I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds gets active, true?
>>>>> I think so.
>>>>> If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again, as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change, and after re-starting all OSDs, my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading).
>>>> For sure I would prefer not to restart all daemons, because the second filesystem is up and running (with production clients).
>>>>
>>>>>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8 GB, which seems reasonable for an mds server with 32/48 GB).
>>>>> This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug; I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
>>>> I will have a look.
>>>>
>>>>>> I already forced the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched).
>>>>> I don't know when the MDS acknowledges this. If it was a clean unmount (i.e. without -f or forced by reboot), the MDS should have dropped the clients already. If it was an unclean unmount, it might not be that easy to get the stale client session out. However, I don't know about that.
>>>> Moreover, when I did that the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds!
>>>>
>>>>>> I think that providing more swap may be the solution! I will try that if I cannot find another way to fix it.
>>>>> If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. It will take a while, but it will get there eventually.
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>> Sent: 05 June 2020 13:46:03
>>>>> To: Frank Schilder; ceph-users
>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> I was also wondering if setting mds dump cache after rejoin could help?
>>>>>
>>>>>
>>>>> Le 05/06/2020 à 12:49, Frank Schilder a écrit :
>>>>>> Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120 TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories).
>>>>>>
>>>>>> How many rsync processes are you running in parallel?
>>>>>> Do you have these settings enabled:
>>>>>>
>>>>>> osd_op_queue=wpq
>>>>>> osd_op_queue_cut_off=high
>>>>>>
>>>>>> WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind-on-trimming problems we had seen before.
>>>>>>
>>>>>> You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS.
>>>>>> If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take it with a grain of scepticism):
>>>>>>
>>>>>> - reduce the MDS cache memory limit to force recall of caps much earlier than now
>>>>>> - reduce the client cache size
>>>>>> - set "osd_op_queue_cut_off=high" if not already done; I think this requires a restart of the OSDs, so be careful
>>>>>>
>>>>>> At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required.
>>>>>>
>>>>>> If you have good SSDs, you could try to temporarily provide some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward.
>>>>>>
>>>>>> Harder measures:
>>>>>>
>>>>>> - stop all I/O from the FS clients, throw users out if necessary
>>>>>> - ideally, try to cleanly (!) shut down clients or force trimming of the cache by either
>>>>>>     * umount or
>>>>>>     * sync; echo 3 > /proc/sys/vm/drop_caches
>>>>>>   Either of these might hang for a long time. Do not interrupt, and do not do this on more than one client at a time.
>>>>>>
>>>>>> At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions.
>>>>>>
>>>>>> My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again.
>>>>>>
>>>>>> Hope that helps.
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>>> Sent: 05 June 2020 11:42:42
>>>>>> To: ceph-users
>>>>>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>
>>>>>> Hi all,
>>>>>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and 3 mds (1 active for each fs + one failover).
>>>>>> We are transferring all the data (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3).
>>>>>> On the old FS we first removed the snapshots (to avoid stray problems when removing files) and then ran some rsyncs, deleting the files after the transfer.
>>>>>> The operation should take a few more weeks to complete.
>>>>>> But a few days ago, we started to get the warning "mds behind on trimming" from the mds managing the old FS.
>>>>>> Yesterday, I restarted the active mds service to force a takeover by the standby mds (basically because the standby is more powerful and has more memory, i.e. 48 GB vs 32).
>>>>>> The standby mds took rank 0 and started to replay... the "mds behind on trimming" came back and the number of segments rose, as well as the memory usage of the server. Finally, it exhausted the memory of the mds, the service stopped, the previous mds took rank 0 and started to replay... until memory exhaustion and a new switch of mds, etc...
>>>>>> It thus seems that we are in a never-ending loop! And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked.
>>>>>> I stopped all the rsyncs and unmounted the clients.
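One way to see whether the replay is actually making progress rather than looping (counter names quoted from memory, so treat this as a sketch; run on the host of the replaying MDS):

ceph daemon mds.lpnceph-mds04.in2p3.fr status                # during up:replay this includes the journal read position
ceph daemon mds.lpnceph-mds04.in2p3.fr perf dump mds_log     # rdpos should keep growing towards wrpos between checks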
>>>>>>
>>>>>> My questions are:
>>>>>> - Does the mds trim during the replay, so we could hope that after a while it will purge everything and the mds will be able to become active in the end?
>>>>>> - Is there a way to accelerate the operation or to fix this situation?
>>>>>>
>>>>>> Thanks for your help.
>>>>>> F.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io