[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frédéric Nass
Hi Frank, It's possible that certain parameters you modified at some point, which may have helped the MDS to start up, are now slowing down its operation or preventing it from going further. In that case, resetting these parameters to their default values could help. Just a thought. Another th

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
Hi Eugen, as promised, the result. Unfortunately, increasing this parameter does not seem to help. It was worth a try, though. I will keep the MDS running and check again tomorrow. It's really annoying that it doesn't come back. Following the reports of other people who were in a similar situation it sh

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
Hi Eugen, thanks and yes, let's try one thing at a time. I will report back.
Best regards, Frank Schilder, AIT Risø Campus, Bygning 109, rum S14
From: Eugen Block Sent: Saturday, January 11, 2025 10:39 PM To: Frank Schilder Cc: ceph-users

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Eugen Block
Personally, I would only try one change at a time and wait for a result. Otherwise it can get difficult to tell what exactly helped and what did not. I have not played with auth_service_ticket_ttl yet, so I can only refer to the docs here: When the Ceph Storage Cluster sends a ticket for auth

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
Hi Eugen, thanks for your reply! It's a long shot, but worth trying. I will give it a go. Since you are following: I also observed cephx timeouts. I'm considering increasing the TTL for auth tickets. Do you think auth_service_ticket_ttl (default 3600) is the right parameter? If so, can I just c

[ceph-users] Re: OSDs won't come back after upgrade

2025-01-11 Thread Alvaro Soto
But why do you need to disable SELinux for the service to work? You shouldn't have an issue.
On Fri, Jan 10, 2025, 6:20 PM Jorge Garcia wrote:
> Actually, stupid mistake on my part. I had selinux mode as enforcing.
> Changed it to disabled, and everything works again. Thanks for the
> help!

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Eugen Block
Hi Frank, not sure if this has already been mentioned, but this one has a 60-second timeout: mds_beacon_mon_down_grace
    ceph config help mds_beacon_mon_down_grace
    mds_beacon_mon_down_grace - tolerance in seconds for missed MDS beacons to monitors
      (secs, advanced)
      Default: 60
      Can updat
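
As a rough sketch of how the option named above could be inspected and overridden, the Python helpers below only build `ceph config get` and `ceph config set` command lines so they can be reviewed before anyone runs them. The `mds` target and the value 120 are assumptions chosen for illustration; the message does not recommend a specific value.

```python
import shlex

OPTION = "mds_beacon_mon_down_grace"  # option name taken from the message above


def config_get_cmd(who: str = "mds", option: str = OPTION) -> list[str]:
    """Build the argv that reads the current value of an option."""
    return ["ceph", "config", "get", who, option]


def config_set_cmd(value: int, who: str = "mds", option: str = OPTION) -> list[str]:
    """Build the argv that overrides an option in the cluster config database."""
    if value <= 0:
        raise ValueError("grace period must be a positive number of seconds")
    return ["ceph", "config", "set", who, option, str(value)]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    print(shlex.join(config_get_cmd()))
    print(shlex.join(config_set_cmd(120)))  # 120 is a hypothetical value
```

Changing one option at a time and recording the previous value from the `get` command makes it easier to revert if the change does not help.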

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
And another small piece of information: I needed to do another restart. This time I managed to capture the approximate length of the period for which the MDS is up and responsive after loading the cache (it reports stats). It's pretty much exactly 60 seconds. This smells like a timeout. Is there a
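
One way to check a suspected timeout like this is to measure the responsive window from two log timestamps and compare it with known option defaults. The sketch below assumes timestamps in the form `2025-01-11T12:35:50.712+0100`; the helper names, the tolerance, and the timestamps in the usage note are hypothetical, and the candidate table has to be filled in by the operator.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. 2025-01-11T12:35:50.712+0100
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def window_seconds(first_ts: str, last_ts: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    start = datetime.strptime(first_ts, LOG_TS_FORMAT)
    end = datetime.strptime(last_ts, LOG_TS_FORMAT)
    return (end - start).total_seconds()


def matching_timeouts(
    window: float, candidates: dict[str, float], tolerance: float = 2.0
) -> list[str]:
    """Return the names of candidate options whose value is within
    `tolerance` seconds of the observed window."""
    return sorted(
        name for name, secs in candidates.items() if abs(window - secs) <= tolerance
    )
```

For example, calling `window_seconds("2025-01-11T12:00:00.000+0100", "2025-01-11T12:01:00.500+0100")` on two made-up timestamps and passing the result to `matching_timeouts` with a table such as `{"mds_beacon_mon_down_grace": 60, "auth_service_ticket_ttl": 3600}` would narrow the search to the 60-second option. A match is only a hint, not proof that the option is the cause.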

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
Hi all, my hopes are down again. The MDS might look busy, but I'm not sure it's doing anything interesting. I now see a lot of these in the log (stripped the heartbeat messages): 2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expir
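
Stripping heartbeat messages from a log before looking for other patterns can be sketched as follows. This is a minimal example, assuming heartbeat lines can be recognised by the substring `heartbeat`; the actual wording of those lines is not shown in the message, so the marker is a guess that may need adjusting.

```python
from typing import Iterable, Iterator


def drop_heartbeats(lines: Iterable[str], marker: str = "heartbeat") -> Iterator[str]:
    """Yield log lines that do not contain the heartbeat marker.

    The default marker is an assumption about how heartbeat lines look.
    """
    for line in lines:
        if marker not in line:
            yield line.rstrip("\n")


def count_matching(lines: Iterable[str], needle: str) -> int:
    """Count the lines that contain `needle`, e.g. '_check_auth_rotating'."""
    return sum(1 for line in lines if needle in line)
```

Feeding an open log file through `drop_heartbeats` and then calling `count_matching(..., "_check_auth_rotating")` on the result gives a quick measure of how often the auth warning appears once the heartbeat noise is removed.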

[ceph-users] Re: Help needed, ceph fs down due to large stray dir

2025-01-11 Thread Frank Schilder
Hi all, new update: after getting some sleep following the final MDS restart, the MDS is doing something! It is still unresponsive, but it does show a CPU load between 150% and 200%, and I really, really hope that this is the trimming of stray items. I will try to find out if I get perf to work inside the container