On 8/7/23 21:54, Frank Schilder wrote:
Dear Xiubo,
I managed to collect some information. It looks like there is nothing in the
dmesg log around the time the client failed to advance its TID. I collected
short snippets around the critical time below. I have full logs in case you are
interested. These are large files, so I will need to upload them.
I also have a dump of "mds session ls" output for clients that showed the same
issue later. Unfortunately, no consistent log information for a single incident.
Here is the summary; please let me know whether uploading the full package makes sense:
- Status:
On July 29, 2023
ceph status/df/pool stats/health detail at 01:05:14:
cluster:
health: HEALTH_WARN
1 pools nearfull
ceph status/df/pool stats/health detail at 01:05:28:
cluster:
health: HEALTH_WARN
1 clients failing to advance oldest client/flush tid
1 pools nearfull
Okay, then this could be the root cause.
If the pool is nearfull, it could block flushing the journal logs to the pool,
and then the MDS couldn't safely reply to the requests and would block them
like this.
Could you fix the pool nearfull issue first and then check whether you see
it again?
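Something like this could be used to check it first (just a sketch; the pool name cephfs_data and the 0.90 ratio are placeholders, and freeing space or adding capacity is the proper fix):

  # see which pools/OSDs are close to the nearfull threshold
  ceph health detail
  ceph df
  ceph osd df
  # check whether a pool quota is the trigger (pool name is a placeholder)
  ceph osd pool get-quota cephfs_data
  # stopgap only, if the OSD nearfull ratio is the trigger:
  ceph osd set-nearfull-ratio 0.90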
[...]
On July 31, 2023
ceph status/df/pool stats/health detail at 10:36:16:
cluster:
health: HEALTH_WARN
1 clients failing to advance oldest client/flush tid
1 pools nearfull
cluster:
health: HEALTH_WARN
1 pools nearfull
- client evict command (date, time, command):
2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457
There is a 1h offset between the date stamp of the command and the dmesg date
stamps. On top of that, there seems to be a weird 10-minute delay from issuing
the evict command until it shows up in dmesg on the client.
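Roughly the sequence one can use to correlate the two sides (a sketch only; on older releases 'blocklist' is spelled 'blacklist'):

  # admin node: evict and confirm the client address got blocklisted
  ceph tell mds.ceph-11 client evict id=145678457
  ceph osd blocklist ls
  ceph health detail
  # client node: dmesg with wall-clock timestamps for comparison
  dmesg -T | grep -E 'libceph|ceph:'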
- dmesg:
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
This is a known bug and we have fixed it in both kclient and ceph side:
https://tracker.ceph.com/issues/61200
https://tracker.ceph.com/issues/61217
Thanks
- Xiubo
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
[Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds3
[Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds7
[Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds2
[Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds4
[Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds1
[Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds0
[Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds5
[Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Monday, July 31, 2023 12:14 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin
On 7/31/23 16:50, Frank Schilder wrote:
Hi Xiubo,
it's a kernel client. I actually made a mistake when trying to evict the client
and my command didn't do anything. I did another evict and this time the client
IP showed up in the blacklist. Furthermore, the warning disappeared. I asked
for the dmesg logs from the client node.
Yeah, after the client's sessions are closed the corresponding warning
should be cleared.
Thanks
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Monday, July 31, 2023 4:12 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin
Hi Frank,
On 7/30/23 16:52, Frank Schilder wrote:
Hi Xiubo,
it happened again. This time, we might be able to pull logs from the client
node. Please take a look at my intermediate action below - thanks!
I am in a bit of a calamity: I'm on holiday with a terrible network connection
and can't do much. My first priority is securing the cluster to avoid damage
caused by this issue. I evicted the client by ID on the MDS reporting the
warning, using the client ID from the warning. For some reason the client got
blocked on two MDSes after this command; one of them is an ordinary stand-by
daemon. I'm not sure whether this is expected.
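A rough way to see which MDS daemons still hold a session for the client (a sketch; the daemon names are placeholders, jq is assumed to be installed, and the id shown is just an example):

  # example client id; use the id from the warning
  id=145678457
  for mds in ceph-11 ceph-12; do   # daemon names are placeholders
    echo "== mds.$mds =="
    ceph tell mds.$mds session ls | jq --argjson id "$id" '.[] | select(.id == $id) | {id, state, inst}'
  done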
Main question: is this sufficient to prevent any damaging IO on the cluster?
I'm thinking here about the MDS eating through all its RAM until it crashes
hard in an irrecoverable state (that was described as a consequence in an old
post about this warning). If this is a safe state, I can leave it like this
until I return from holidays.
Yeah, I think so.
BTW, are you using kernel clients or user-space clients? I checked both the
kclient and libcephfs; libcephfs seems to have a bug that could cause this
issue, but the kclient looks okay so far.
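If it helps, this can also be checked from the MDS side (a sketch; the metadata field names are from memory, so treat them as an assumption). Kernel clients usually report a kernel_version in their session metadata, while user-space clients report a ceph_version:

  ceph tell mds.ceph-11 session ls \
    | jq '.[] | {id, kernel: .client_metadata.kernel_version, ceph: .client_metadata.ceph_version}'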
Thanks
- Xiubo
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin
On 7/26/23 22:13, Frank Schilder wrote:
Hi Xiubo.
... I am more interested in the kclient side logs. Just want to
know why that oldest request got stuck for so long.
I'm afraid I was a bad admin in this case: I don't have the logs from the host
any more. I would have needed the dmesg output, and that is gone. In case it
happens again, I will try to pull the info out.
The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent
than our situation. We had no problems with the MDSes: the cache didn't grow,
and the relevant MDS was not put into read-only mode either. It was just this
warning showing all the time; health was OK otherwise. I think the warning had
been there for at least 16h before I failed the MDS.
The MDS log contains nothing useful; this is the only line mentioning this client:
2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 100000 completed requests recorded in session
Okay, if so it's hard to say, and hard to dig out what happened in the client
and why it didn't advance the tid.
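If it happens again, something like the following on the client node might show where it got stuck (a sketch; it needs debugfs mounted, and the exact file names can differ between kernel versions):

  # save the kernel log with wall-clock timestamps
  dmesg -T > dmesg.txt
  # in-flight MDS requests and currently held caps of the ceph kernel client
  cat /sys/kernel/debug/ceph/*/mdsc
  cat /sys/kernel/debug/ceph/*/caps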
Thanks
- Xiubo
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14