Continuous network failures are a very challenging environment for a network filesystem. Even though the server resends lock callbacks, eventually the client will miss two or three callbacks in a row, and the server has no choice but to evict it from the filesystem if it wants to make progress with other clients' requests.
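As a toy illustration of that server-side behaviour (a hand-written sketch, not Lustre code; the names max_misses and deliver are invented, and the real resend budget and timers differ), the logic amounts to: resend the callback a bounded number of times, then evict so that waiters can proceed:

```shell
#!/bin/sh
# Toy model of lock-callback resend and eviction (NOT Lustre code).

max_misses=3   # hypothetical resend budget before eviction
misses=0
link_up=0      # simulate a flapping link: callbacks never arrive

deliver() {
    # the callback reaches the client only while its link is up
    [ "$link_up" -eq 1 ]
}

while [ "$misses" -lt "$max_misses" ]; do
    if deliver; then
        echo "callback acked; lock cancelled, waiters proceed"
        break
    fi
    misses=$((misses + 1))
    echo "callback attempt $misses timed out"
done

if [ "$misses" -ge "$max_misses" ]; then
    echo "evicting client after $max_misses missed callbacks"
fi
```

With the link down for the whole run, the loop exhausts its budget and the final line announces the eviction; the clients queued behind the stuck lock are the "waiters" that the eviction unblocks.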
This can also cause problems for other clients, since they are waiting for a lock that the broken client is holding, which makes the whole filesystem "hang" until that client finally gets the callback, or is evicted.

We've discussed a few potential solutions for this, but nothing has been implemented yet:
- put clients with continual network errors into the "dog house", so they cannot use the filesystem until their network is repaired, which is drastic for that client (though it improves life for other clients)
- change clients with continual network errors from writeback cache to cacheless/lockless/sync mode, which will hurt their performance but still allow them to access the filesystem without impacting other clients

Cheers, Andreas

On May 29, 2025, at 00:19, zufei chen via lustre-discuss <[email protected]> wrote:

I. Background:
1. Four physical nodes; each physical machine deploys 2 virtual machines: lustre-mds-nodexx (containing 2 MDTs) and lustre-oss-nodexx (containing 8 OSTs; the MGS resides in one of them).
2. Two RoCE network interfaces on the physical machines, ens6f0np0 and ens6f1np1, are virtualized and passed through to the virtual machines (as service1 and service2).
3. Lustre version 2.15.5 with Pacemaker.
4. A client is running vdbench workloads.
5. Network interface flapping is simulated on ens6f0np0 on one of the physical nodes using the following script:

for i in {1..10}; do
    ifconfig ens6f0np0 down
    sleep 20
    ifconfig ens6f0np0 up
    sleep 30
done

II. Problem:
1. After the network flapping script has run for a while, the workload hits EIO errors, leading to service interruption.
2. The issue is reproducible almost every time.

III. Preliminary Analysis:
The issue is suspected to be caused by lock callback timeouts, which lead to the server evicting the client.

IV. Relevant Logs:

Server:
May 27 12:09:19 lustre-oss-node40 kernel: LustreError: 13958:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 268s: evicting client at 10.255.153.118@o2ib ns: filter-PFStest-OST0005_UUID lock: 00000000d705f0d0/0x7bcb4583f93039cb lrc: 3/0,0 mode: PR/PR res: [0x6936:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.118@o2ib remote: 0x977d715b44c72ae8 expref: 12723 pid: 14457 timeout: 60814 lvb_type: 1

Client:
May 27 12:09:27 rocky9vm2 kernel: Lustre: PFStest-OST0005-osc-ff49d5028d989800: Connection to PFStest-OST0005 (at 10.255.153.242@o2ib) was lost; in-progress operations using this service will wait for recovery to complete.

V. Additional Information
1. IP Configuration in Virtual Machines:

| Virtual Machine   | Service  | IP Address     |
| ----------------- | -------- | -------------- |
| lustre-mds-node32 | service1 | 10.255.153.236 |
|                   | service2 | 10.255.153.237 |
| lustre-oss-node32 | service1 | 10.255.153.238 |
|                   | service2 | 10.255.153.239 |
| lustre-mds-node40 | service1 | 10.255.153.240 |
|                   | service2 | 10.255.153.241 |
| lustre-oss-node40 | service1 | 10.255.153.242 |
|                   | service2 | 10.255.153.243 |
| lustre-mds-node41 | service1 | 10.255.153.244 |
|                   | service2 | 10.255.153.245 |
| lustre-oss-node41 | service1 | 10.255.153.246 |
|                   | service2 | 10.255.153.247 |
| lustre-mds-node42 | service1 | 10.255.153.248 |
|                   | service2 | 10.255.153.249 |
| lustre-oss-node42 | service1 | 10.255.153.250 |
|                   | service2 | 10.255.153.251 |

2. Policy Routing Configuration on the Server (Example: lustre-oss-node40):

cat /etc/iproute2/rt_tables
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep
263     service1
271     service2

[root@lustre-oss-node40 ~]# ip route show table service1
10.255.153.0/24 dev service1 scope link src 10.255.153.242
[root@lustre-oss-node40 ~]# ip route show table service2
10.255.153.0/24 dev service2 scope link src 10.255.153.243
[root@lustre-oss-node40 ~]# ip rule list
0:      from all lookup local
32764:  from 10.255.153.243 lookup service2
32765:  from 10.255.153.242 lookup service1
32766:  from all lookup main
32767:  from all lookup default
[root@lustre-oss-node40 ~]# ip route
10.255.153.0/24 dev service2 proto kernel scope link src 10.255.153.243 metric 101
10.255.153.0/24 dev service1 proto kernel scope link src 10.255.153.242 metric 102

3. /etc/modprobe.d/lustre.conf:

options lnet networks="o2ib(service2)[0,1],o2ib(service1)[0,1]"
options libcfs cpu_npartitions=2
options mdt max_mod_rpcs_per_client=128
options mdt mds_io_num_cpts=[0,1]
options mdt mds_num_cpts=[0,1]
options mdt mds_rdpg_num_cpts=[0,1]
options mds mds_num_threads=512
options ost oss_num_threads=512
options ost oss_cpts=[0,1]
options ost oss_io_cpts=[0,1]
options lnet portal_rotor=1
options lnet lnet_recovery_limit=10
options ptlrpc ldlm_enqueue_min=260

VI. Other Attempts
1. Reduced LNet timeout and increased retry count: both server and client reduced the LNet timeout and increased the retry count, but the issue persists.

lnetctl set transaction_timeout 10
lnetctl set retry_count 3
lnetctl set health_sensitivity 1

2. Set recovery limit: both server and client set the recovery limit, but the issue persists.

lnetctl set recovery_limit 10

3. Simulated network flapping using iptables inside the virtual machines instead, but the issue persists:
#!/bin/bash
for j in {1..1000}; do
    date
    echo -e "\nIteration $j: Starting single-port network flapping\n"
    for i in {1..10}; do
        echo -e " ==== Iteration $i down ===="
        date
        sudo iptables -I INPUT 1 -i service1 -j DROP
        sudo iptables -I OUTPUT 1 -o service1 -j DROP
        sleep 20
        echo -e " ==== Iteration $i up ===="
        date
        sudo iptables -D INPUT -i service1 -j DROP
        sudo iptables -D OUTPUT -o service1 -j DROP
        sleep 30s
    done
    echo -e "\nIteration $j: Ending single-port network flapping\n"
    date
    sudo iptables -L INPUT -v | grep -i service1
    sudo iptables -L OUTPUT -v | grep -i service1
    sleep 120
done

VII. Any Suggestions?
Dear all, I would appreciate any suggestions or insights you might have regarding this issue. Thank you!

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
