Hi Xiubo.

> IMO evicting the corresponding client could also resolve this issue
> instead of restarting the MDS.

Yes, eviction does get rid of the stuck caps release request, but it also makes 
every process accessing the file system on that client crash. After a client 
eviction we usually have to reboot the server to get everything back into a 
clean state. An MDS restart would achieve the same in a transparent way: when 
replaying the journal it executes the pending caps recall successfully without 
crashing any processes - if it weren't for the wrong peer issue.
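
For clarity, by "evicting" versus "restarting" I mean roughly the following 
(MDS name, rank and client id are just placeholders):

    # find and evict the client session that holds the caps
    ceph tell mds.<name> client ls
    ceph tell mds.<name> client evict id=<client-id>

    # versus failing the rank so a standby takes over and replays the journal
    ceph mds fail <rank>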

As far as I can tell, the operation is stuck in the MDS because it is never 
re-scheduled/re-tried and nobody re-checks whether the condition still exists 
(whether the client still holds the requested caps). An MDS restart re-schedules 
all pending operations and then the recall succeeds. Every ceph version so far 
has had examples where the hand-shaking between a client and an MDS had small 
flaws. For situations like that I would really like to have a light-weight MDS 
daemon command to force a re-schedule/re-play without having to restart the 
entire MDS and reconnect all its clients from scratch.
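
Right now the closest I can get is just looking at the stuck request on the 
admin socket, e.g. (MDS name again a placeholder)

    ceph daemon mds.<name> dump_ops_in_flight
    ceph daemon mds.<name> dump_blocked_ops

which shows the operation sitting there, but offers no way to kick or 
re-schedule it.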

It would be great to have light-weight tools available to rectify such simple 
conditions in the least disruptive way possible.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Wednesday, May 10, 2023 4:01 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), 
pending pAsLsXsFsc issued pAsLsXsFsc


On 5/9/23 16:23, Frank Schilder wrote:
> Dear Xiubo,
>
> both issues will cause problems, the one reported in the subject 
> (https://tracker.ceph.com/issues/57244) and the potential follow-up on MDS 
> restart 
> (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XPR6X6TD7372I2YEPJO2L6F).
> Either one will cause compute jobs on our HPC cluster to hang and users will 
> need to run the jobs again. Our queues are full, so losing your spot is not 
> popular.
>
> The process in D-state is a user process. Interestingly, it is often possible 
> to kill it despite the D-state (if one can find the process) and the stuck 
> recall gets resolved. If I restart the MDS, the stuck process might continue 
> working, but we run a significant risk of other processes getting stuck due 
> to the libceph/MDS wrong peer issue. We actually see these kinds of messages
>
> [Mon Mar  6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
> address
> [Mon Mar  6 13:05:18 2023] libceph: wrong peer, want 
> 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
>
> all over the HPC cluster, and each of them means that some files/dirs are 
> inaccessible on the compute node and jobs there either died or got stuck. 
> Every MDS restart bears the risk of such events happening, and with many 
> nodes this probability approaches 1 - every time we restart an MDS, jobs get 
> stuck.
>
> I have a reproducer for an instance of https://tracker.ceph.com/issues/57244. 
> Unfortunately, it is a big one that I would need to pack into a container. 
> I was not able to reduce it to something small; it seems to depend on a very 
> specific combination of codes with certain internal latencies between threads 
> that triggers a race.
>
> It sounds like you have a patch for https://tracker.ceph.com/issues/57244, 
> although it's not linked from the tracker item.

IMO evicting the corresponding client could also resolve this issue
instead of restarting the MDS.

Have you tried this ?

Thanks

- Xiubo

>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiu...@redhat.com>
> Sent: Friday, May 5, 2023 2:40 AM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), 
> pending pAsLsXsFsc issued pAsLsXsFsc
>
>
> On 5/1/23 17:35, Frank Schilder wrote:
>> Hi all,
>>
>> I think we might be hitting a known problem 
>> (https://tracker.ceph.com/issues/57244). I don't want to fail the MDS yet, 
>> because we have trouble with older kclients that miss the MDS restart and 
>> hold on to cache entries referring to the killed instance, leading to 
>> hanging jobs on our HPC cluster.
> Will this cause any issue in your case ?
>
>> I have seen this issue before and there was a process in D-state that 
>> dead-locked itself. Usually, killing this process succeeded and resolved the 
>> issue. However, this time I can't find such a process.
> BTW, what's the D-state process ? A ceph one ?
>
> Thanks
>
>> The tracker mentions that one can delete the file/folder. I have the inode 
>> number, but really don't want to start a find on a 1.5PB file system. Is 
>> there a better way to find what path is causing the issue (ask the MDS 
>> directly, look at a cache dump, or similar)? Is there an alternative to 
>> deletion or MDS fail?
>>
>> Thanks and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
