That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on?
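For example, something like the following (standard commands, nothing cluster-specific assumed) would show whether anything else is degraded or being moved around:

ceph status            # degraded/misplaced objects, recovery and client I/O
ceph health detail     # every active warning with the daemons involved
ceph osd pool stats    # per-pool recovery/backfill and client I/O rates

If recovery or backfill shows up there, the extra load on the metadata pool could explain the slow trimming.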
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <f...@lpnhe.in2p3.fr>
Sent: 08 June 2020 15:27:59
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted

Thanks again for the hint!
Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck, with an op of age 34987.5 (while the other osds have ages < 1).
I tried a ceph osd down 27, which reset the age, but I can see that the age of osd.27's ops is rising again. I think I will restart it (btw our osd servers and mds are different machines).
F.

Le 08/06/2020 à 15:01, Frank Schilder a écrit :
> Hi Francois,
>
> this sounds great. At least it's operational. I guess it is still using a lot of swap while trying to replay operations.
>
> I would cleanly disconnect all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming.
>
> The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSD's OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out.
>
> If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful; if in doubt, ask the list first.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand <f...@lpnhe.in2p3.fr>
> Sent: 08 June 2020 14:45:13
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>
> Hi Frank,
> Finally I did:
> ceph config set global mds_beacon_grace 600000
> and created /etc/sysctl.d/sysctl-ceph.conf with
> vm.min_free_kbytes=4194303
> and then
> sysctl --system
>
> After that, the mds stayed in rejoin for a very long time (almost 24 hours) with errors like:
> 2020-06-07 04:10:36.802 7ff866e2e700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2020-06-07 04:10:36.802 7ff866e2e700  0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy!
> 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271)
> and also
> 2020-06-07 04:10:44.942 7ff86d63b700  0 auth: could not find secret_id=10363
> 2020-06-07 04:10:44.942 7ff86d63b700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363
>
> but in the end the mds went active! :-)
> I let it rest from Sunday afternoon until this morning.
> Indeed I was able to connect clients (in read-only for now) and read the data.
> I checked the connected clients with ceph tell mds.lpnceph-mds02.in2p3.fr client ls, disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command.
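A side note on stale sessions: if one ever survives an unmount, it can also be dropped from the MDS side, roughly like this (the id comes from the "client ls" output; note that eviction blacklists the client by default):

ceph tell mds.lpnceph-mds02.in2p3.fr client ls              # list the remaining sessions
ceph tell mds.lpnceph-mds02.in2p3.fr client evict id=<id>   # force out a single stale session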
> But I still have the following warnings:
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>     mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs
> MDS_TRIM 1 MDSs behind on trimming
>     mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836
>
> and the number of segments is still rising (slowly).
> F.
>
>
> Le 08/06/2020 à 12:00, Frank Schilder a écrit :
>> Hi Francois,
>>
>> did you manage to get any further with this?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <fr...@dtu.dk>
>> Sent: 06 June 2020 15:21:59
>> To: ceph-users; f...@lpnhe.in2p3.fr
>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS: it stopped reporting to the MONs due to a laggy connection. This lagginess is a result of swapping:
>>
>>> 2020-06-05 21:39:06.015 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>
>> Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>> Sent: 06 June 2020 11:11
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> Thanks for the tip,
>> I will try that. For now vm.min_free_kbytes = 90112.
>> Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0, but this didn't change anything...
>> -27> 2020-06-06 06:15:07.373 7f83e3626700  1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon
>> which is the same time since the last acked beacon as I had before changing the parameter.
>> As the mds beacon interval is 4 s, setting mds_beacon_grace to 240 should lead to 960 s (16 min). Thus I think that the bottleneck is elsewhere.
>> F.
>>
>>
>> Le 06/06/2020 à 09:47, Frank Schilder a écrit :
>>> Hi Francois,
>>>
>>> there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value of the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with
>>>
>>> sysctl vm.min_free_kbytes
>>>
>>> In your case, with heavy swap usage, this value should probably be somewhere between 2-4GB.
>>>
>>> Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM-kill your machine. Make sure that plenty of pages are unused. Drop the page cache if necessary or reboot the machine before setting this value.
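A minimal sketch of that sequence, assuming the ~4 GB upper end mentioned above (the file name under /etc/sysctl.d/ is arbitrary):

sync; echo 3 > /proc/sys/vm/drop_caches                        # free the page cache first so the higher watermark can be met
sysctl vm.min_free_kbytes                                      # check the current value
sysctl -w vm.min_free_kbytes=4194304                           # ~4 GB, takes effect immediately
echo vm.min_free_kbytes=4194304 > /etc/sysctl.d/90-vm.conf     # persist across reboots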
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Frank Schilder <fr...@dtu.dk>
>>> Sent: 06 June 2020 00:36:13
>>> To: ceph-users; f...@lpnhe.in2p3.fr
>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>
>>> Hi Francois,
>>>
>>> yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and watch the cluster health closely.
>>>
>>> Good luck and best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>> Sent: 05 June 2020 23:51:04
>>> To: Frank Schilder; ceph-users
>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>
>>> Hi,
>>> Unfortunately, adding swap did not solve the problem!
>>> I added 400 GB of swap. The MDS used about 18 GB of swap after consuming all the RAM and stopped with the following logs:
>>>
>>> 2020-06-05 21:33:31.967 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1
>>> 2020-06-05 21:33:40.355 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1
>>> 2020-06-05 21:33:59.787 7f251b7e5700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>> 2020-06-05 21:33:59.787 7f251b7e5700  0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy!
>>> 2020-06-05 21:34:00.287 7f251b7e5700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>> 2020-06-05 21:34:00.287 7f251b7e5700  0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy!
>>> ....
>>> 2020-06-05 21:39:05.991 7f251bfe6700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
>>> 2020-06-05 21:39:06.015 7f251bfe6700  1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon
>>> 2020-06-05 21:39:06.015 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>> 2020-06-05 21:39:19.838 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>> 2020-06-05 21:39:19.869 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1
>>> 2020-06-05 21:39:19.869 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning
>>> 2020-06-05 21:39:19.870 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr respawn!
>>> --- begin dump of recent events ---
>>> -9999> 2020-06-05 19:28:07.982 7f25217f1700  5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951
>>> -9998> 2020-06-05 19:28:11.053 7f251b7e5700  5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132
>>> -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0
>>> -9996> 2020-06-05 19:28:12.176 7f25217f1700  5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294
>>> -9995> 2020-06-05 19:28:12.176 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1
>>> -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick
>>> -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995)
>>> ...
>>> 2020-06-05 21:39:31.092 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1
>>> 2020-06-05 21:39:35.257 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1
>>> 2020-06-05 21:39:35.257 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby
>>>
>>> However, the mons don't seem particularly loaded!
>>> So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I set it both for the mds and mon daemons because it seems to be present in both confs).
>>> I will tell you if it works.
>>>
>>> Any other clue?
>>> F.
>>>
>>> Le 05/06/2020 à 14:44, Frank Schilder a écrit :
>>>> Hi Francois,
>>>>
>>>> thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash, before the restarting MDS changes the cache.
>>>>
>>>> In your case, I would set osd_op_queue_cut_off during the next regular cluster service or upgrade.
>>>>
>>>> My best bet right now is to try to add swap. Maybe someone else reading this has a better idea, or you find a hint in one of the other threads.
>>>>
>>>> Good luck!
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>> Sent: 05 June 2020 14:34:06
>>>> To: Frank Schilder; ceph-users
>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>
>>>> Le 05/06/2020 à 14:18, Frank Schilder a écrit :
>>>>> Hi Francois,
>>>>>
>>>>>> I was also wondering if setting mds dump cache after rejoin could help?
>>>>> Haven't heard of that option. Is there some documentation?
>>>> I found it on:
>>>> https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
>>>> mds dump cache after rejoin
>>>> Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
>>>> Type: Boolean
>>>> Default: false
>>>>
>>>> but I don't think it can help in my case, because rejoin occurs after replay, and in my case replay never ends!
>>>>
>>>>>> I have:
>>>>>> osd_op_queue=wpq
>>>>>> osd_op_queue_cut_off=low
>>>>>> I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds gets active, true?
>>>>> I think so.
>>>>> If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again, as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change, and after re-starting all OSDs, my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading).
>>>> For sure I would prefer not to restart all daemons, because the second filesystem is up and running (with production clients).
>>>>
>>>>>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8 GB, which seems reasonable for an mds server with 32/48 GB).
>>>>> This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug; I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
>>>> I will have a look.
>>>>
>>>>>> I already forced the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched).
>>>>> I don't know when the MDS acknowledges this. If it was a clean unmount (i.e. without -f or forced by reboot), the MDS should have dropped the clients already. If it was an unclean unmount, it might not be that easy to get the stale client session out. However, I don't know about that.
>>>> Moreover, when I did that the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds!
>>>>
>>>>>> I think that providing more swap may be the solution! I will try that if I cannot find another way to fix it.
>>>>> If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. It will take a while, but it will get there eventually.
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>> Sent: 05 June 2020 13:46:03
>>>>> To: Frank Schilder; ceph-users
>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> I was also wondering if setting mds dump cache after rejoin could help?
>>>>>
>>>>>
>>>>> Le 05/06/2020 à 12:49, Frank Schilder a écrit :
>>>>>> Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120 TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories).
>>>>>>
>>>>>> How many rsync processes are you running in parallel?
>>>>>> Do you have these settings enabled:
>>>>>>
>>>>>> osd_op_queue=wpq
>>>>>> osd_op_queue_cut_off=high
>>>>>>
>>>>>> WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind-on-trimming problems we had seen before.
>>>>>>
>>>>>> You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS.
>>>>>> If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take it with a grain of scepticism):
>>>>>>
>>>>>> - reduce the MDS cache memory limit to force recall of caps much earlier than now
>>>>>> - reduce the client cache size
>>>>>> - set "osd_op_queue_cut_off=high" if not already done; I think this requires a restart of the OSDs, so be careful
>>>>>>
>>>>>> At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required.
>>>>>>
>>>>>> If you have good SSDs, you could try to temporarily provide some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward.
>>>>>>
>>>>>> Harder measures:
>>>>>>
>>>>>> - stop all I/O from the FS clients, throw users out if necessary
>>>>>> - ideally, try to cleanly (!) shut down clients or force trimming of the cache by either
>>>>>>     * umount or
>>>>>>     * sync; echo 3 > /proc/sys/vm/drop_caches
>>>>>>   Either of these might hang for a long time. Do not interrupt, and do not do this on more than one client at a time.
>>>>>>
>>>>>> At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions.
>>>>>>
>>>>>> My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again.
>>>>>>
>>>>>> Hope that helps.
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>>> Sent: 05 June 2020 11:42:42
>>>>>> To: ceph-users
>>>>>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>
>>>>>> Hi all,
>>>>>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and 3 mds (1 active for each fs + one failover).
>>>>>> We are transferring all the data (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3).
>>>>>> On the old FS we first removed the snapshots (to avoid stray problems when removing files) and then ran some rsyncs, deleting the files after the transfer.
>>>>>> The operation should take a few more weeks to complete.
>>>>>> But a few days ago, we started to get the warning "mds behind on trimming" from the mds managing the old FS.
>>>>>> Yesterday, I restarted the active mds service to force a takeover by the standby mds (basically because the standby is more powerful and has more memory, i.e. 48 GB vs 32).
>>>>>> The standby mds took rank 0 and started to replay... the "mds behind on trimming" came back and the number of segments rose, as well as the memory usage of the server. Finally, it exhausted the memory of the mds, the service stopped, the previous mds took rank 0 and started to replay... until memory exhaustion and a new switch of mds, etc...
>>>>>> It thus seems that we are in a never-ending loop! And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked.
>>>>>> I stopped all the rsyncs and unmounted the clients.
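One way to see whether the replay is actually making progress rather than looping (counter names quoted from memory, so treat this as a sketch; run on the host of the replaying MDS):

ceph daemon mds.lpnceph-mds04.in2p3.fr status                # during up:replay this includes the journal read position
ceph daemon mds.lpnceph-mds04.in2p3.fr perf dump mds_log     # rdpos should keep growing towards wrpos between checks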
>>>>>>
>>>>>> My questions are:
>>>>>> - Does the mds trim during the replay, so we could hope that after a while it will purge everything and the mds will be able to become active in the end?
>>>>>> - Is there a way to accelerate the operation or to fix this situation?
>>>>>>
>>>>>> Thanks for your help.
>>>>>> F.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io