Hi Xiubo,

> IMO evicting the corresponding client could also resolve this issue
> instead of restarting the MDS.
Yes, it can get rid of the stuck caps release request, but it will also make any process accessing the file system crash. After a client eviction we usually have to reboot the server to get everything back to a clean state. An MDS restart would achieve this transparently and, when replaying the journal, execute the pending caps recall successfully without making processes crash - if it weren't for the wrong-peer issue.

As far as I can tell, the operation is stuck in the MDS because it is never re-scheduled/re-tried, nor is it checked whether the condition still exists (the client still holding the requested caps). An MDS restart re-schedules all pending operations and then the recall succeeds.

In every Ceph version so far there have been examples where the hand-shaking between a client and an MDS had small flaws. For situations like that I would really like to have a light-weight MDS daemon command to force a re-schedule/re-play without having to restart the entire MDS and reconnect all its clients from scratch. It would be great to have light-weight tools available to rectify such simple conditions in as non-disruptive a way as possible.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Wednesday, May 10, 2023 4:01 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc

On 5/9/23 16:23, Frank Schilder wrote:
> Dear Xiubo,
>
> both issues will cause problems, the one reported in the subject
> (https://tracker.ceph.com/issues/57244) and the potential follow-up on MDS
> restart
> (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XPR6X6TD7372I2YEPJO2L6F).
> Either one will cause compute jobs on our HPC cluster to hang and users will
> need to run the jobs again. Our queues are full, so it is not very popular to
> lose your spot.
>
> The process in D-state is a user process. Interestingly, it is often possible
> to kill it despite the D-state (if one can find the process) and the stuck
> recall gets resolved. If I restart the MDS, the stuck process might continue
> working, but we run a significant risk of other processes getting stuck due
> to the libceph/MDS wrong peer issue. We actually have these kinds of messages
>
> [Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
> [Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
>
> all over the HPC cluster, and each of them means that some files/dirs are
> inaccessible on the compute node and jobs either died or are stuck there.
> Every MDS restart bears the risk of such events happening, and with many nodes
> this probability approaches 1 - every time we restart an MDS, jobs get stuck.
>
> I have a reproducer for an instance of https://tracker.ceph.com/issues/57244.
> Unfortunately, this is a big one that I would need to pack into a container.
> I was not able to reduce it to something small; it seems to depend on a very
> specific combination of codes with certain internal latencies between threads
> that trigger a race.
>
> It sounds like you have a patch for https://tracker.ceph.com/issues/57244,
> although it's not linked from the tracker item.

IMO evicting the corresponding client could also resolve this issue
instead of restarting the MDS.

Have you tried this?
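Something along these lines should do it - the mds rank and the client id below are only placeholders, adjust them to your setup:

  ceph tell mds.0 session ls                   # find the session/client id that still holds the caps
  ceph tell mds.0 client evict id=<client_id>  # evict that client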
Thanks

- Xiubo

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiu...@redhat.com>
> Sent: Friday, May 5, 2023 2:40 AM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke),
> pending pAsLsXsFsc issued pAsLsXsFsc
>
>
> On 5/1/23 17:35, Frank Schilder wrote:
>> Hi all,
>>
>> I think we might be hitting a known problem
>> (https://tracker.ceph.com/issues/57244). I don't want to fail the MDS yet,
>> because we have trouble with older kclients that miss the MDS restart and
>> hold on to cache entries referring to the killed instance, leading to
>> hanging jobs on our HPC cluster.
>
> Will this cause any issue in your case?
>
>> I have seen this issue before, and there was a process in D-state that
>> dead-locked itself. Usually, killing this process succeeded and resolved
>> the issue. However, this time I can't find such a process.
>
> BTW, what's the D-state process? A ceph one?
>
> Thanks
>
>> The tracker mentions that one can delete the file/folder. I have the inode
>> number, but really don't want to start a find on a 1.5 PB file system. Is
>> there a better way to find what path is causing the issue (ask the MDS
>> directly, look at a cache dump, or similar)? Is there an alternative to
>> deletion or MDS fail?
>>
>> Thanks and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io