Hi Shane,

I realize this is quite an old post, but I think it's worth responding to for posterity, and because I suspect others who upgrade may run into the same issue.
I'm observing issues similar to what you describe. They started this weekend for us on two of our servers, which were upgraded to Rocky 8 and Lustre 2.15.5 a couple of months ago. Like you, we run InfiniBand. We also route to Ethernet clients, and our LNet routers are now running Rocky 8 and either Lustre 2.15.5 or 2.15.3. The behavior we observe is broadly similar to yours, though we have a few observations that may be relevant. I suspect that if we limit the InfiniBand communication to certain segments or LIDs, it behaves fine.

Today I was able to reproduce the problem when routing to an Ethernet client that runs our Robinhood policy engine. With the file systems behaving normally for our InfiniBand clients, I re-enabled routing for those file systems and tested with all LNet routers enabled and with only individual routers enabled. In every case I tested, mounting one of those file systems resulted in the communication problems on the MDS and OSS servers for that file system; in one case an MDS panicked as soon as that client mounted its file system. I then put an IB card into the policy engine client and removed all LNet routing for those file systems, and LNet behavior was stable with that client mounting the file systems and ingesting the changelogs.

When LNet communication has failed on the servers, they can sometimes be fixed by stopping the Lustre services (unmounting the OSTs and MDTs) and then restarting and reconfiguring LNet. When that isn't possible, rebooting the affected systems seems to be the only solution, and they sometimes hang indefinitely trying to shut down LNet, necessitating a power cycle.

When the issue first started I suspected the InfiniBand fabric and looked for problems there, but, like you, I didn't find anything conclusive. One recent change is that our OpenSM subnet manager restarted and moved when we updated the kernel and lustre-client versions on two of our LNet routers. However, even when LNet is unable to communicate, the nodes can all still reach each other with IPv4 ping over IPoIB, which suggests the subnet manager and InfiniBand itself are working. I also attempted manual LNet peer discovery on individual systems while they were affected, to see whether I could recover them that way. Discovery failed on those systems, but only while we were already observing the LNet communication problem.
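For anyone else who hits this, the checks and recovery steps above map roughly onto commands like the following; the NID is a placeholder and the sequence is illustrative rather than a recipe:

    # IPoIB-level reachability, which keeps working for us even when LNet is wedged
    ping -c 3 10.0.0.1

    # LNet-level checks on an affected node; "discover" is what I mean
    # by manual peer discovery
    lnetctl ping 10.0.0.1@o2ib
    lnetctl discover 10.0.0.1@o2ib
    lnetctl peer show
    lnetctl net show --verbose
    lnetctl route show

    # the stop / restart / reconfigure LNet sequence on a server
    umount -a -t lustre          # unmount the OSTs/MDTs on this server
    lnetctl lnet unconfigure
    lustre_rmmod
    modprobe lnet
    lnetctl lnet configure --all
    # then remount the targets

None of that is a fix; it's just the sequence we've been using to poke at the problem and, when we're lucky, to recover a server without a full reboot.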
Jesse

________________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Nehring, Shane R [LAS] via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Wednesday, December 21, 2022 11:49 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Connectivity issues after client crash

Hello all,

I'm hoping that someone might be able to help me with an issue I've been seeing periodically since updating to 2.15.x. Back in July we remade our scratch storage volume with 2.15 after running 2.12 for a long time. As part of the upgrade we reinstalled our OSS and MDS nodes with RHEL 8 so we'd finally be able to take advantage of project quotas. For background, our nodes are connected via Omni-Path and we have a mix of el7, el8, and el9 clients. We've been piecemeal updating our clients to el9 over the past few months with the goal of being 99% el9 clients by mid January.

Since upgrading we've seen recurring connectivity issues arise in the cluster from time to time, which seem to be very strongly correlated with a client crashing. The fabric itself seems fine; there's no evidence of errors or packet loss, so I cannot confidently blame it.

On an OSS that's having trouble communicating we see the following messages for various OSTs:

kernel: LNetError: 31282:0:(o2iblnd_cb.c:3358:kiblnd_check_txs_locked()) Timed out tx: active_txs(WSQ:100), 19 seconds
kernel: LNetError: 31282:0:(o2iblnd_cb.c:3358:kiblnd_check_txs_locked()) Skipped 6 previous similar messages
kernel: LNetError: 31282:0:(o2iblnd_cb.c:3428:kiblnd_check_conns()) Timed out RDMA with 172.16.100.19@o2ib (4): c: 31, oc: 0, rc: 31
kernel: LNetError: 31282:0:(o2iblnd_cb.c:3428:kiblnd_check_conns()) Skipped 6 previous similar messages
kernel: LustreError: 109351:0:(ldlm_lib.c:3543:target_bulk_io()) @@@ network error on bulk WRITE req@00000000f848e208 x1751146351599040/t0(0) o4->10ccde33-01ef-47cf-873d-da9a1b6bb1ea@172.16.100.19@o2ib:19/0 lens 488/448 e 0 to 0 dl 1671634949 ref 1 fl Interpret:/2/0 rc 0/0 job:'873032'
kernel: Lustre: work-OST0007: Bulk IO write error with 10ccde33-01ef-47cf-873d-da9a1b6bb1ea (at 172.16.100.19@o2ib), client will retry: rc = -110
kernel: Lustre: Skipped 5 previous similar messages
kernel: LustreError: 109351:0:(ldlm_lib.c:3543:target_bulk_io()) @@@ network error on bulk WRITE req@000000008176189a x1751153771751616/t0(0) o4->63e88958-d0b1-4c7f-8413-da96b181cd92@172.16.100.25@o2ib:94/0 lens 488/448 e 0 to 0 dl 1671635024 ref 1 fl Interpret:/0/0 rc 0/0 job:'873641'
Lustre: work-OST000d: Bulk IO write error with 63e88958-d0b1-4c7f-8413-da96b181cd92 (at 172.16.100.25@o2ib), client will retry: rc = -110

An already connected client will see messages along the lines of:

kernel: Lustre: 3736:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1671634931/real 1671634931] req@ffff94a0ed490000 x1751139078233344/t0(0) o9->work-OST0001-osc-ffff95552570a000@172.16.100.253@o2ib:28/4 lens 224/224 e 0 to 1

A stack trace on one of the clients, from an ls against the volume that hangs:

[<ffffffffc1576d05>] cl_sync_io_wait+0x1c5/0x480 [obdclass]
[<ffffffffc1572943>] cl_lock_request+0x1d3/0x210 [obdclass]
[<ffffffffc17db1d9>] cl_glimpse_lock+0x329/0x380 [lustre]
[<ffffffffc17db5a5>] cl_glimpse_size0+0x255/0x280 [lustre]
[<ffffffffc1793cdc>] ll_getattr_dentry+0x50c/0x9c0 [lustre]
[<ffffffffc17941ae>] ll_getattr+0x1e/0x20 [lustre]
[<ffffffff99a53d49>] vfs_getattr+0x49/0x80
[<ffffffff99a53e55>] vfs_fstatat+0x75/0xc0
[<ffffffff99a54261>] SYSC_newlstat+0x31/0x60
[<ffffffff99a546ce>] SyS_newlstat+0xe/0x10
[<ffffffff99f99f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
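As an aside, the import state of the filesystem's targets is a quick way to see what a hung client thinks of its server connections; for our "work" filesystem that is something along the lines of:

    # prints each target, its import state (FULL, DISCONN, ...) and the NID currently in use
    lctl get_param osc.work-*.import mdc.work-*.import | grep -E 'target:|state:|current_connection:'

The wildcards are there because the full parameter names include a per-mount instance id, as in the osc name in the log line above.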
If we try to reboot a client to clear the issue, it won't be able to mount the filesystem; the mount process hangs in D state until it times out. A stack trace from the mount:

[<0>] llog_process_or_fork+0x2de/0x570 [obdclass]
[<0>] llog_process+0x10/0x20 [obdclass]
[<0>] class_config_parse_llog+0x1eb/0x3e0 [obdclass]
[<0>] mgc_process_cfg_log+0x659/0xc90 [mgc]
[<0>] mgc_process_log+0x667/0x800 [mgc]
[<0>] mgc_process_config+0x42b/0x6e0 [mgc]
[<0>] obd_process_config.constprop.0+0x76/0x1a0 [obdclass]
[<0>] lustre_process_log+0x562/0x8f0 [obdclass]
[<0>] ll_fill_super+0x6ec/0x1020 [lustre]
[<0>] lustre_fill_super+0xe4/0x470 [lustre]
[<0>] mount_nodev+0x41/0x90
[<0>] legacy_get_tree+0x24/0x40
[<0>] vfs_get_tree+0x22/0xb0
[<0>] do_new_mount+0x176/0x310
[<0>] __x64_sys_mount+0x103/0x140
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

When this has happened in the past we've had to shut down the entire Lustre backend, and if we were lucky clients would start recovering once everything was back up again. The last time this happened we actually had to shut down the entire cluster. In this particular case, rebooting the OSSes and the MDS/MGS has stopped the errors on the server side, but el9 and el8 clients are now seeing:

kernel: LNetError: 149017:0:(o2iblnd_cb.c:966:kiblnd_post_tx_locked()) Error -22 posting transmit to 172.16.100.250@o2ib
kernel: LNetError: 149017:0:(o2iblnd_cb.c:966:kiblnd_post_tx_locked()) Skipped 31 previous similar messages

and

LustreError: 3585788:0:(file.c:5096:ll_inode_revalidate_fini()) work: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

el7 clients have recovered seemingly without issue.

I'm wondering if anyone has suggestions on where to look to figure out why this is breaking, or pointers on how to recover from this situation without rebooting, if that's possible.

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org