Hi all,
Our filesystem hung for about 10 minutes and then recovered automatically.
OS: CentOS 7.6
Lustre version: 2.12.4
Network: Intel Omni-Path


MDS log:
Nov 19 16:54:17 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 
seconds
Nov 19 16:54:17 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 3 previous 
similar messages
Nov 19 16:54:17 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (1): c: 0, oc: 0, rc: 32
Nov 19 16:54:17 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 3 previous similar 
messages
Nov 19 16:54:17 hwmds1 kernel: LNetError: 
25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60@o2ib added 
to recovery queue. Health = 900
Nov 19 16:54:17 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.5@tcp1, removing former export from same NID
Nov 19 16:54:23 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 
seconds
Nov 19 16:54:23 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (6): 
c: 0, oc: 1, rc: 32
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.4@tcp1 from 172.18.0.60@o2ib
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 31 previous similar messages
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.4@tcp1: -113
Nov 19 16:54:23 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages
Nov 19 16:54:29 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 
seconds
Nov 19 16:54:29 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (6): 
c: 0, oc: 0, rc: 32
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.3@tcp1 from 172.18.0.60@o2ib
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 31 previous similar messages
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.3@tcp1: -113
Nov 19 16:54:29 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages
Nov 19 16:54:41 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 
seconds
Nov 19 16:54:41 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (5): 
c: 0, oc: 0, rc: 32
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60@o2ib added to recovery queue. Health = 900
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.3@tcp1 from 172.18.0.60@o2ib
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 30 previous similar messages
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.3@tcp1: -113
Nov 19 16:54:41 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 30 previous similar messages
Nov 19 16:54:43 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.6@tcp1, removing former export from same NID
Nov 19 16:54:43 hwmds1 kernel: Lustre: MGS: Connection restored to 
427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6@tcp1)
Nov 19 16:54:43 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:54:53 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 
seconds
Nov 19 16:54:53 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 3 previous 
similar messages
Nov 19 16:54:53 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (6): 
c: 0, oc: 0, rc: 32
Nov 19 16:54:53 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 3 previous similar 
messages
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.0.5.4@tcp1 from 172.18.0.60@o2ib
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:1999:lnet_handle_find_routed_path()) Skipped 31 previous similar messages
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.0.5.4@tcp1: -113
Nov 19 16:54:53 hwmds1 kernel: LNetError: 25121:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 31 previous similar messages
Nov 19 16:55:00 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
852ed01f-1ce7-4684-ff85-f72325b5891f (at 10.0.5.8@tcp1) reconnecting
Nov 19 16:55:00 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:00 hwmds1 kernel: Lustre: sjtu-MDT0000: Connection restored to 
bbfdb2d3-9c4e-73f3-49cd-f7c519481608 (at 10.0.5.8@tcp1)
Nov 19 16:55:00 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:06 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
8a066f59-a368-98c0-39fb-3e7ecdccf7c3 (at 10.0.5.4@tcp1) reconnecting
Nov 19 16:55:06 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.4@tcp1, removing former export from same NID
Nov 19 16:55:06 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:08 hwmds1 kernel: Lustre: sjtu-MDT0000: Connection restored to 
427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6@tcp1)
Nov 19 16:55:08 hwmds1 kernel: Lustre: Skipped 4 previous similar messages
Nov 19 16:55:13 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 
seconds
Nov 19 16:55:13 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 6 previous 
similar messages
Nov 19 16:55:13 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (4):
c: 0, oc: 1, rc: 32
Nov 19 16:55:13 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 6 previous similar 
messages
Nov 19 16:55:33 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.6@tcp1, removing former export from same NID
Nov 19 16:55:33 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 16:55:33 hwmds1 kernel: Lustre: MGS: Connection restored to 
427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6@tcp1)
Nov 19 16:55:33 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 16:55:47 hwmds1 kernel: LustreError: 37012:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.5.4@tcp1  ns: mdt-sjtu-MDT0000_UUID lock: ffff970132f7bcc0/0xc6adca8fa239d9b9 lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x60200400000020 nid: 10.0.5.4@tcp1 remote: 0x1c8f6a4e7b5421fc expref: 11682370 pid: 11858 timeout: 2794572 lvb_type: 0
Nov 19 16:55:49 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 
seconds
Nov 19 16:55:49 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 11 previous 
similar messages
Nov 19 16:55:49 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (6): 
c: 0, oc: 0, rc: 32
Nov 19 16:55:49 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 11 previous similar 
messages
Nov 19 16:55:53 hwmds1 kernel: LNetError: 25121:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.18.0.58@o2ib added to recovery queue. Health = 900
Nov 19 16:55:57 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.4@tcp1, removing former export from same NID
Nov 19 16:55:57 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:55:58 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6@tcp1) reconnecting
Nov 19 16:55:58 hwmds1 kernel: Lustre: Skipped 3 previous similar messages
Nov 19 16:56:23 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.6@tcp1, removing former export from same NID
Nov 19 16:56:23 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 16:56:23 hwmds1 kernel: Lustre: MGS: Connection restored to 
427c85e3-c0dd-55df-867f-180b784b009a (at 10.0.5.6@tcp1)
Nov 19 16:56:23 hwmds1 kernel: Lustre: Skipped 5 previous similar messages
Nov 19 16:56:24 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
852ed01f-1ce7-4684-ff85-f72325b5891f (at 10.0.5.8@tcp1) reconnecting
Nov 19 16:56:30 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.18.0.58@o2ib: 0 seconds
Nov 19 16:56:48 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6@tcp1) reconnecting
Nov 19 16:56:56 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 
seconds
Nov 19 16:56:56 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 12 previous 
similar messages
Nov 19 16:56:56 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (5): 
c: 0, oc: 0, rc: 32
Nov 19 16:56:56 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 12 previous similar 
messages
Nov 19 16:57:14 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.6@tcp1, removing former export from same NID
Nov 19 16:57:14 hwmds1 kernel: Lustre: Skipped 4 previous similar messages
Nov 19 16:57:28 hwmds1 kernel: LNet: Service thread pid 37786 was inactive for 
200.65s. The thread might be hung, or it might only be
 slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 19 16:57:28 hwmds1 kernel: Pid: 37786, comm: mdt00_004 
3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
Nov 19 16:57:28 hwmds1 kernel: Call Trace:
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f60c90>] 
ldlm_completion_ast+0x430/0x860 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f617b1>] 
ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc140c5cb>] 
mdt_object_local_lock+0x50b/0xb20 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc140cc50>] 
mdt_object_lock_internal+0x70/0x360 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc140de0a>] 
mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc1415fe5>] 
mdt_intent_getattr+0x2b5/0x480 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc1412d35>] 
mdt_intent_policy+0x435/0xd80 [mdt]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f47e06>] 
ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0f70516>] 
ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0ff8bb2>] tgt_enqueue+0x62/0x210 
[ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0fff9ca>] 
tgt_request_handle+0xada/0x1570 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0fa447b>] 
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffc0fa7de4>] ptlrpc_main+0xb34/0x1470 
[ptlrpc]
Nov 19 16:57:28 hwmds1 kernel: [<ffffffff9b6c61f1>] kthread+0xd1/0xe0
Nov 19 16:57:28 hwmds1 kernel: [<ffffffff9bd8dd1d>] 
ret_from_fork_nospec_begin+0x7/0x21
Nov 19 16:57:28 hwmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Nov 19 16:57:28 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312248.37786
Nov 19 16:57:29 hwmds1 kernel: LNet: Service thread pid 37928 was inactive for 
200.87s. The thread might be hung, or it might only be
 slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 19 16:57:29 hwmds1 kernel: Pid: 37928, comm: mdt01_006 
3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
Nov 19 16:57:29 hwmds1 kernel: Call Trace:
Nov 19 16:57:29 hwmds1 kernel: LNet: Service thread pid 486 was inactive for 
200.88s. Watchdog stack traces are limited to 3 per 300 
seconds, skipping this one.
Nov 19 16:57:29 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312249.42759
Nov 19 16:57:30 hwmds1 kernel: LNet: Service thread pid 6301 was inactive for 
200.16s. Watchdog stack traces are limited to 3 per 300
 seconds, skipping this one.
Nov 19 16:57:30 hwmds1 kernel: LNet: Skipped 143 previous similar messages
Nov 19 16:57:30 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312250.6301
Nov 19 16:57:31 hwmds1 kernel: LNet: Service thread pid 6383 was inactive for 
200.42s. Watchdog stack traces are limited to 3 per 300
 seconds, skipping this one.
Nov 19 16:57:31 hwmds1 kernel: LNet: Skipped 22 previous similar messages
Nov 19 16:57:31 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312251.6383
Nov 19 16:57:32 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312252.6336
Nov 19 16:57:33 hwmds1 kernel: LNet: Service thread pid 42111 was inactive for 200.16s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 19 16:57:33 hwmds1 kernel: LNet: Skipped 28 previous similar messages
Nov 19 16:57:33 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312253.42111
Nov 19 16:57:34 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312254.6297
Nov 19 16:57:35 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312255.6258
Nov 19 16:57:36 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312256.6315
Nov 19 16:57:37 hwmds1 kernel: LNet: Service thread pid 6391 was inactive for 
200.26s. Watchdog stack traces are limited to 3 per 300
 seconds, skipping this one.
Nov 19 16:57:37 hwmds1 kernel: LNet: Skipped 32 previous similar messages
Nov 19 16:57:37 hwmds1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1637312257.6391
Nov 19 16:57:37 hwmds1 kernel: Lustre: MGS: Connection restored to 
d602f485-3852-99ed-1518-2f5a73c38bd9 (at 10.0.5.4@tcp1)
Nov 19 16:57:37 hwmds1 kernel: Lustre: Skipped 9 previous similar messages
Nov 19 16:58:04 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6@tcp1) reconnecting
Nov 19 16:58:04 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:58:29 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.8@tcp1, removing former export from same NID
Nov 19 16:58:29 hwmds1 kernel: Lustre: Skipped 6 previous similar messages
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6237:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312047, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff96c81f715f80/0xc6adca8fa23ea12a lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 700 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6237 timeout: 0 lvb_type: 0
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6237:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
Nov 19 16:59:07 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 9 
seconds
Nov 19 16:59:07 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 26 previous 
similar messages
Nov 19 16:59:07 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (14):
 c: 0, oc: 0, rc: 32
Nov 19 16:59:07 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 26 previous similar 
messages
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6253:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312047, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9709bd320480/0xc6adca8fa23f872e lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6253 timeout: 0 lvb_type: 0
Nov 19 16:59:07 hwmds1 kernel: LustreError: 6253:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 70 previous similar messages
Nov 19 16:59:08 hwmds1 kernel: LustreError: 42188:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312048, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff96ed996a5b00/0xc6adca8fa2414f0e lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42188 timeout: 0 lvb_type: 0
Nov 19 16:59:08 hwmds1 kernel: LustreError: 42188:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 29 previous similar messages
Nov 19 16:59:10 hwmds1 kernel: LustreError: 6298:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312050, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff970dc9331d40/0xc6adca8fa2455cb9 lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6298 timeout: 0 lvb_type: 0
Nov 19 16:59:10 hwmds1 kernel: LustreError: 6298:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 71 previous similar messages
Nov 19 16:59:15 hwmds1 kernel: LustreError: 6335:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1637312055, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff96c732bd4fc0/0xc6adca8fa247ba24 lrc: 3/1,0 mode: --/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 699 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6335 timeout: 0 lvb_type: 0
Nov 19 16:59:15 hwmds1 kernel: LustreError: 6335:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 28 previous similar messages
Nov 19 16:59:19 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6@tcp1) reconnecting
Nov 19 16:59:19 hwmds1 kernel: Lustre: Skipped 1 previous similar message
Nov 19 16:59:39 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.18.0.58@o2ib: 2 seconds
Nov 19 16:59:39 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 9 previous similar 
messages
Nov 19 17:00:10 hwmds1 kernel: Lustre: MGS: Connection restored to 
bbfdb2d3-9c4e-73f3-49cd-f7c519481608 (at 10.0.5.8@tcp1)
Nov 19 17:00:10 hwmds1 kernel: Lustre: Skipped 13 previous similar messages
Nov 19 17:00:36 hwmds1 kernel: Lustre: sjtu-MDT0000: haven't heard from client 37d3a752-14aa-0392-4653-9c396f5f2f4e (at 10.0.5.3@tcp1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff96afbaa27400, cur 1637312436 expire 1637312286 last 1637312209
Nov 19 17:00:48 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.18.0.58@o2ib: 3 seconds
Nov 19 17:00:48 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar 
messages
Nov 19 17:00:58 hwmds1 kernel: Lustre: MGS: Received new LWP connection from 
10.0.5.4@tcp1, removing former export from same NID
Nov 19 17:00:58 hwmds1 kernel: Lustre: Skipped 5 previous similar messages
Nov 19 17:01:31 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.18.0.58@o2ib: 1 seconds
Nov 19 17:01:31 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar 
messages
Nov 19 17:01:50 hwmds1 kernel: Lustre: sjtu-MDT0000: Client 
eba02ec3-4f91-bc12-39d7-32641f8e78d0 (at 10.0.5.6@tcp1) reconnecting
Nov 19 17:01:50 hwmds1 kernel: Lustre: Skipped 2 previous similar messages
Nov 19 17:02:51 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.18.0.58@o2ib: 0 seconds
Nov 19 17:02:51 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar 
messages
Nov 19 17:03:23 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 
seconds
Nov 19 17:03:23 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 34 previous 
similar messages
Nov 19 17:03:23 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 
172.18.0.58@o2ib (8): 
c: 0, oc: 0, rc: 32
Nov 19 17:03:23 hwmds1 kernel: LNetError: 
25082:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 34 previous similar 
messages
Nov 19 17:03:23 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.18.0.58@o2ib: 3 seconds
Nov 19 17:03:23 hwmds1 kernel: LNet: 
25082:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 31 previous similar 
messages
Nov 19 17:03:32 hwmds1 kernel: Lustre: sjtu-MDT0000: haven't heard from client 852ed01f-1ce7-4684-ff85-f72325b5891f (at 10.0.5.8@tcp1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff970e82d5e000, cur 1637312612 expire 1637312462 last 1637312385
Nov 19 17:03:35 hwmds1 kernel: LNet: Service thread pid 6322 completed after 558.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Nov 19 17:03:35 hwmds1 kernel: LNet: Skipped 3 previous similar messages
Nov 19 17:03:35 hwmds1 kernel: LustreError: 6337:0:(service.c:2128:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.0.5.4@tcp1: deadline 100:358s ago#012  req@ffff96b20f0a6300 x1714140739458176/t0(0) o38-><?>@<?>:0/0 lens 520/0 e 0 to 0 dl 1637312257 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Nov 19 17:03:35 hwmds1 kernel: Lustre: 
6389:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer 
than estimated 
(100:357s); client may timeout.  req@ffff96bf13b18d80 x1713999142500672/t0(0) 
o38-><?>@<?>:0/0 lens 520/0 e 0 to 0 dl 1637312258 ref 
1 fl Interpret:/0/ffffffff rc 0/-1
Nov 19 17:03:35 hwmds1 kernel: Lustre: 
6389:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous 
similar message
Nov 19 17:03:35 hwmds1 kernel: LustreError: 6337:0:(service.c:2128:ptlrpc_server_handle_request()) Skipped 2 previous similar messages
Nov 19 17:03:35 hwmds1 kernel: LustreError: 
51417:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 
10.0.5.8@tcp1 arrived 
at 1637312615 with bad export cookie 14316320842890482053
Nov 19 17:03:40 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60@o2ib added to recovery queue. Health = 900
Nov 19 17:03:44 hwmds1 kernel: Lustre: 6281:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1637312617/real 0]  req@ffff96ba37651680 x1713962196237696/t0(0) o104->[email protected]@tcp1:15/16 lens 296/224 e 0 to 1 dl 1637312624 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Nov 19 17:03:44 hwmds1 kernel: Lustre: 
6281:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 11 previous similar 
messages
Nov 19 17:03:51 hwmds1 kernel: LNetError: 25082:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.18.0.60@o2ib added to recovery queue. Health = 900
Nov 19 17:04:38 hwmds1 kernel: LNetError: 25121:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.18.0.58@o2ib added to recovery queue. Health = 900
Nov 19 17:04:47 hwmds1 kernel: Lustre: sjtu-MDT0000: Connection restored to 
bbfdb2d3-9c4e-73f3-49cd-f7c519481608 (at 10.0.5.8@tcp1)
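For anyone reading along: the -113 in the "Error sending PUT" lines is a negated Linux errno (EHOSTUNREACH, "No route to host"), which matches the "no route to ... @tcp1" messages, and the eviction lines are self-consistent (cur minus last equals the reported silence). A quick sanity check, nothing Lustre-specific assumed:

```python
import errno
import os

# -113 from "Error sending PUT to 12345-10.0.5.4@tcp1: -113" is a negated errno.
code = 113
print(errno.errorcode[code], "->", os.strerror(code))  # EHOSTUNREACH -> No route to host

# Cross-check the eviction arithmetic from the "haven't heard from client"
# line: cur 1637312436, last 1637312209 -> the reported 227 seconds of silence.
cur, last = 1637312436, 1637312209
print(cur - last, "seconds since last contact")
```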






 
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org