Hi Francois,

thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It writes the cache contents to a file after rejoin, but does not drop the cache, so it will not help you. I think it was implemented recently to make it possible to send a cache dump file to the developers after an MDS crash, before the restarting MDS changes the cache.
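For completeness, a minimal sketch of how one could get such a cache dump on Nautilus (the underscore option name and the admin-socket variant are my understanding of the tooling, so double-check before relying on them; "mds.a" and the dump path are only placeholders):

    # debugging only: write the MDS cache to a file after rejoin (does not free memory)
    ceph config set mds mds_dump_cache_after_rejoin true

    # a cache dump of a running MDS can also be taken on demand via the admin socket,
    # run on the host of that MDS ("mds.a" and the path are placeholders)
    ceph daemon mds.a dump cache /tmp/mds.a.cache.dump

Either way, this only produces a file for analysis; it does not shrink the cache or speed up replay.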
In your case, I would set osd_op_queue_cut_off=high during the next regular cluster maintenance or upgrade. My best bet right now is to try to add swap. Maybe someone else reading this has a better idea, or you will find a hint in one of the other threads. Good luck!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <f...@lpnhe.in2p3.fr>
Sent: 05 June 2020 14:34:06
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted

Le 05/06/2020 à 14:18, Frank Schilder a écrit :
> Hi Francois,
>
>> I was also wondering if setting mds dump cache after rejoin could help ?
> Haven't heard of that option. Is there some documentation?

I found it on: https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/

mds dump cache after rejoin
    Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
    Type:        Boolean
    Default:     false

but I don't think it can help in my case, because rejoin occurs after replay, and in my case replay never ends!

>> I have :
>> osd_op_queue=wpq
>> osd_op_queue_cut_off=low
>> I can try to set osd_op_queue_cut_off to high, but it will be useful
>> only if the mds gets active, true ?
> I think so. If you have no clients connected, there should not be queue
> priority issues. Maybe it is best to wait until your cluster is healthy again,
> as you will need to restart all daemons. Make sure you set this in [global].
> When I applied that change, and after re-starting all OSDs, my MDSes had
> reconnect issues until I set it on them too. I think all daemons use that
> option (the prefix osd_ is misleading).

For sure I would prefer not to restart all daemons, because the second filesystem is up and running (with production clients).

>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB,
>> which seems reasonable for an mds server with 32/48GB).
> This sounds bad. 8GB should not cause any issues. Maybe you are hitting a
> bug; I believe there is a regression in Nautilus. There were recent threads
> on absurdly high memory use by MDSes. Maybe it's worth searching for these
> in the list.

I will have a look.

>> I already forced the clients to unmount (and even rebooted the ones from
>> which the rsync and the rmdir .snaps were launched).
> I don't know when the MDS acknowledges this. If it was a clean unmount (i.e.
> without -f or forced by reboot), the MDS should have dropped the clients
> already. If it was an unclean unmount, it might not be that easy to get the
> stale client session out. However, I don't know about that.

Moreover, when I did that, the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds!

>> I think that providing more swap may be the solution ! I will try that if
>> I cannot find another way to fix it.
> If the memory overrun is somewhat limited, this should allow the MDS to trim
> the logs. It will take a while, but it will get there eventually.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand <f...@lpnhe.in2p3.fr>
> Sent: 05 June 2020 13:46:03
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>
> I was also wondering if setting mds dump cache after rejoin could help ?
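As an aside on the osd_op_queue_cut_off change discussed above, a minimal sketch of how it could be applied (assuming the Nautilus centralized config database is in use; "osd.0" is only a placeholder, and the daemons still need a restart for the new value to take effect):

    # check what a running OSD currently uses (run on the host of that OSD)
    ceph daemon osd.0 config get osd_op_queue
    ceph daemon osd.0 config get osd_op_queue_cut_off

    # set it in the global section so OSDs and MDSes pick it up after a restart
    ceph config set global osd_op_queue_cut_off high

    # the equivalent ceph.conf entry would be:
    #   [global]
    #   osd_op_queue = wpq
    #   osd_op_queue_cut_off = high

    # then restart the daemons one by one during a maintenance window, e.g.
    systemctl restart ceph-osd@0

This is only a sketch of the setting itself; whether and when to restart the daemons is exactly the trade-off discussed above.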
>
>
> Le 05/06/2020 à 12:49, Frank Schilder a écrit :
>> Out of interest, I did the same on a mimic cluster a few months ago, running
>> up to 5 parallel rsync sessions without any problems. I moved about 120TB.
>> Each rsync was running on a separate client with its own cache. I made sure
>> that the sync dirs were all disjoint (no overlap of files/directories).
>>
>> How many rsync processes are you running in parallel?
>> Do you have these settings enabled:
>>
>> osd_op_queue=wpq
>> osd_op_queue_cut_off=high
>>
>> WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting
>> the latter removed any behind-on-trimming problems we had seen before.
>>
>> You are in a somewhat peculiar situation. I think you need to trim client
>> caches, which requires an active MDS. If your MDS becomes active for at
>> least some time, you could try the following (I'm not an expert here, so
>> take it with a grain of scepticism):
>>
>> - reduce the MDS cache memory limit to force recall of caps much earlier
>>   than now
>> - reduce the client cache size
>> - set "osd_op_queue_cut_off=high" if not already done; I think this
>>   requires a restart of the OSDs, so be careful
>>
>> At this point, you could watch your restart cycle to see if things improve
>> already. Maybe nothing more is required.
>>
>> If you have good SSDs, you could try to temporarily provide some swap space.
>> It saved me once. This will be very slow, but at least it might allow you to
>> move forward.
>>
>> Harder measures:
>>
>> - stop all I/O from the FS clients, throw users out if necessary
>> - ideally, try to cleanly (!) shut down clients or force trimming of the
>>   cache by either
>>   * umount, or
>>   * sync; echo 3 > /proc/sys/vm/drop_caches
>>   Either of these might hang for a long time. Do not interrupt, and do not
>>   do this on more than one client at a time.
>>
>> At some point, your active MDS should be able to hold a full session. You
>> should then tune the cache and other parameters such that the MDSes can
>> handle your rsync sessions.
>>
>> My experience is that MDSes overrun their cache limits quite a lot. Since I
>> reduced mds_cache_memory_limit to not more than half of what is physically
>> available, I have not had any problems again.
>>
>> Hope that helps.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>> Sent: 05 June 2020 11:42:42
>> To: ceph-users
>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>
>> Hi all,
>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
>> 3 mds (1 active for each fs + one failover).
>> We are transferring all the data (~600M files) from one FS (which was in
>> EC 3+2) to the other FS (in R3).
>> On the old FS we first removed the snapshots (to avoid strays problems
>> when removing files) and then ran some rsyncs, deleting the files after
>> the transfer.
>> The operation should take a few more weeks to complete.
>> But a few days ago, we started to get the warning "mds behind on trimming"
>> from the mds managing the old FS.
>> Yesterday, I restarted the active mds service to force a takeover by
>> the standby mds (basically because the standby is more powerful and
>> has more memory, i.e. 48GB instead of 32).
>> The standby mds took rank 0 and started to replay... the "mds behind on
>> trimming" warning came back, and the number of segments rose, as did the
>> memory usage of the server.
>> Finally, it exhausted the memory of the mds,
>> the service stopped, and the previous mds took rank 0 and started to
>> replay... until memory exhaustion and a new switch of mds, etc.
>> It thus seems that we are in a never-ending loop! And of course, as the
>> mds is always in replay, the data are not accessible and the transfers
>> are blocked.
>> I stopped all the rsyncs and unmounted the clients.
>>
>> My questions are:
>> - Does the mds trim during replay, so we could hope that after a while
>>   it will purge everything and the mds will be able to become active in
>>   the end?
>> - Is there a way to accelerate the operation or to fix this situation?
>>
>> Thanks for your help.
>> F.
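Since temporary swap comes up several times in this thread, here is a minimal sketch of what that could look like on the MDS host (the path and size are only placeholders, an SSD-backed filesystem is assumed, and the cache-limit value is just an example):

    # create and enable a temporary swap file (use dd instead of fallocate if
    # swapon rejects a fallocated file on your filesystem)
    fallocate -l 64G /var/tmp/mds-swap
    chmod 600 /var/tmp/mds-swap
    mkswap /var/tmp/mds-swap
    swapon /var/tmp/mds-swap
    swapon --show

    # optionally lower the MDS cache limit so caps are recalled earlier
    ceph config set mds mds_cache_memory_limit 4294967296

    # once the filesystem is healthy again, remove the swap file
    swapoff /var/tmp/mds-swap && rm /var/tmp/mds-swap

Replay with part of the cache in swap will be very slow, as noted above, but it may let the MDS get through the journal instead of running out of memory again.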