On 8/7/23 21:54, Frank Schilder wrote:
Dear Xiubo,
I managed to collect some information. It looks like there is nothing in the
dmesg log around the time the client failed to advance its TID. I collected
short snippets around the critical time below. I have full logs in case you are
interested. These are large files, so I will need to upload them.
I also have a dump of "mds session ls" output for clients that showed the same
issue later. Unfortunately, no consistent log information for a single incident.
Here is the summary; please let me know whether uploading the full package makes sense:
- Status:
On July 29, 2023
ceph status/df/pool stats/health detail at 01:05:14:
cluster:
health: HEALTH_WARN
1 pools nearfull
ceph status/df/pool stats/health detail at 01:05:28:
cluster:
health: HEALTH_WARN
1 clients failing to advance oldest client/flush tid
1 pools nearfull
Okay, then this could be the root cause.
If the pool is nearfull, it could block flushing the journal logs to the pool,
and then the MDS couldn't safely reply to the requests and would block them
like this.
Could you fix the pool nearfull issue first and then check whether you see
it again?
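Something like this could be used to check it first (just a sketch; the pool name cephfs_data and the 0.90 ratio are placeholders, and freeing space or adding capacity is the proper fix):

  # see which pools/OSDs are close to the nearfull threshold
  ceph health detail
  ceph df
  ceph osd df
  # check whether a pool quota is the trigger (pool name is a placeholder)
  ceph osd pool get-quota cephfs_data
  # stopgap only, if the OSD nearfull ratio is the trigger:
  ceph osd set-nearfull-ratio 0.90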
[...]
On July 31, 2023
ceph status/df/pool stats/health detail at 10:36:16:
cluster:
health: HEALTH_WARN
1 clients failing to advance oldest client/flush tid
1 pools nearfull
cluster:
health: HEALTH_WARN
1 pools nearfull
- client evict command (date, time, command):
2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457
There is a 1h offset between the date stamp of the command and the dmesg date
stamps. On top of that, there seems to be a weird 10-minute delay from issuing
the evict command until it shows up in dmesg on the client.
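Roughly the sequence one can use to correlate the two sides (a sketch only; on older releases 'blocklist' is spelled 'blacklist'):

  # admin node: evict and confirm the client address got blocklisted
  ceph tell mds.ceph-11 client evict id=145678457
  ceph osd blocklist ls
  ceph health detail
  # client node: dmesg with wall-clock timestamps for comparison
  dmesg -T | grep -E 'libceph|ceph:'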
- dmesg:
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
This is a known bug and we have fixed it in both kclient and ceph side:
https://tracker.ceph.com/issues/61200
https://tracker.ceph.com/issues/61217
Thanks
- Xiubo
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
[Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds3
[Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds7
[Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds2
[Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds4
[Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds1
[Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds0
[Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds5
[Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Monday, July 31, 2023 12:14 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin
On 7/31/23 16:50, Frank Schilder wrote:
Hi Xiubo,
it's a kernel client. I actually made a mistake when trying to evict the client
and my command didn't do anything. I did another evict and this time the client
IP showed up in the blacklist. Furthermore, the warning disappeared. I asked
for the dmesg logs from the client node.
Yeah, after the client's sessions are closed the corresponding warning
should be cleared.
Thanks
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Monday, July 31, 2023 4:12 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin
Hi Frank,
On 7/30/23 16:52, Frank Schilder wrote:
Hi Xiubo,
it happened again. This time, we might be able to pull logs from the client
node. Please take a look at my intermediate action below - thanks!
I am in a bit of a calamity: I'm on holiday with a terrible network connection
and can't do much. My first priority is securing the cluster to avoid damage
caused by this issue. I evicted the client by ID on the MDS reporting the
warning, using the client ID from the warning. For some reason the client got
blocked on two MDSes after this command; one of them is an ordinary stand-by
daemon. I'm not sure whether this is expected.
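A rough way to see which MDS daemons still hold a session for the client (a sketch; the daemon names are placeholders, jq is assumed to be installed, and the id shown is just an example):

  # example client id; use the id from the warning
  id=145678457
  for mds in ceph-11 ceph-12; do   # daemon names are placeholders
    echo "== mds.$mds =="
    ceph tell mds.$mds session ls | jq --argjson id "$id" '.[] | select(.id == $id) | {id, state, inst}'
  done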
Main question: is this sufficient to prevent any damaging IO on the cluster?
I'm thinking here about the MDS eating through all its RAM until it crashes
hard in an irrecoverable state (that was described as a consequence in an old
post about this warning). If this is a safe state, I can leave it like this
until I return from holidays.
Yeah, I think so.
BTW, are you using kernel clients or user-space clients? I checked both the
kclient and libcephfs; libcephfs seems to have a bug that could cause this
issue, but the kclient looks okay so far.
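If it helps, this can also be checked from the MDS side (a sketch; the metadata field names are from memory, so treat them as an assumption). Kernel clients usually report a kernel_version in their session metadata, while user-space clients report a ceph_version:

  ceph tell mds.ceph-11 session ls \
    | jq '.[] | {id, kernel: .client_metadata.kernel_version, ceph: .client_metadata.ceph_version}'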
Thanks
- Xiubo
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin
On 7/26/23 22:13, Frank Schilder wrote:
Hi Xiubo.
... I am more interested in the kclient side logs. Just want to
know why that oldest request got stuck for so long.
I'm afraid I was a bad admin in this case: I don't have the logs from the host
any more. I would have needed the dmesg output, and that is gone. In case it
happens again, I will try to pull the info out.
The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent
than our situation. We had no problems with the MDSes: the cache didn't grow,
and the relevant MDS was not put into read-only mode either. It was just this
warning showing all the time; health was OK otherwise. I think the warning had
been there for at least 16h before I failed the MDS.
The MDS log contains nothing useful; this is the only line mentioning this client:
2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 100000 completed requests recorded in session
Okay, if so it's hard to say, and hard to dig out what happened in the client
and why it didn't advance the tid.
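If it happens again, something like the following on the client node might show where it got stuck (a sketch; it needs debugfs mounted, and the exact file names can differ between kernel versions):

  # save the kernel log with wall-clock timestamps
  dmesg -T > dmesg.txt
  # in-flight MDS requests and currently held caps of the ceph kernel client
  cat /sys/kernel/debug/ceph/*/mdsc
  cat /sys/kernel/debug/ceph/*/caps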
Thanks
- Xiubo
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14