Thank you for your response.
May I ask whether retries of failed lock callback messages are handled by the 
LNet layer or by the lock (LDLM) module itself? If it is the LNet layer that 
does the retrying, then in the case of a single network interface flap the 
retry should theoretically succeed eventually, and for the lock module this 
would just mean a bit of delay, right?
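
For what it's worth, here is a rough sketch of how I would check whether 
LNet-level resends actually happen during a flap (the exact health counters 
shown vary across LNet versions):

lnetctl global show    # current transaction_timeout / retry_count / health settings
lnetctl net show -v    # per-NI status and health statistics
lnetctl stats show     # global LNet message and drop counters
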
Also, what is your opinion on configuring the two network interfaces in 
bonding mode 4 (802.3ad/LACP) to address the single-interface flapping, as 
sketched below?
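
For reference, by "bond4" I mean kernel bonding mode 4 (802.3ad/LACP). A 
minimal sketch of such a bond with nmcli follows; it assumes the switch ports 
are configured for LACP and that the NIC/driver supports RoCE LAG (without 
that, RDMA traffic will not fail over between member ports). The address is 
just the service1 IP of lustre-oss-node40 as an example.

# Sketch: mode-4 (802.3ad/LACP) bond of the two RoCE ports.
# Assumes LACP on the switch and RoCE LAG support in the NIC driver.
nmcli con add type bond ifname bond0 con-name bond0 \
    bond.options "mode=802.3ad,miimon=100,lacp_rate=fast,xmit_hash_policy=layer3+4"
nmcli con add type ethernet ifname ens6f0np0 con-name bond0-p0 master bond0
nmcli con add type ethernet ifname ens6f1np1 con-name bond0-p1 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 10.255.153.242/24
nmcli con up bond0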



[email protected]
 
From: Andreas Dilger
Date: 2025-05-31 14:28
To: zufei chen
CC: lustre-discuss
Subject: Re: [lustre-discuss] Client Eviction and EIO Errors During Simulated 
Network Flapping (Lustre 2.15.5 + RoCE)
Continuous network failures are a very challenging environment for a network 
filesystem.  Even though there are server-side resends of lock callbacks, 
eventually the client will miss two or three callbacks, and the server has no 
choice but to evict it from the filesystem if it wants to make progress with 
other client requests.  

This can also cause problems for other clients, since they are waiting to get a 
lock that the broken client is holding, which makes the whole filesystem "hang" 
until the client finally gets the callback, or is evicted. 

We've discussed a few potential solutions for this, but nothing has been 
implemented yet:
- put clients with continual network errors into the "dog house", so they 
cannot use the filesystem until their network is repaired, which is drastic 
for that client (though it improves life for other clients)
- change clients with continual network errors from writeback cache to 
cacheless/lockless/sync mode, which will hurt their performance but still 
allow the client to access the filesystem without impacting other clients. 

Cheers, Andreas

On May 29, 2025, at 00:19, zufei chen via lustre-discuss 
<[email protected]> wrote:


I. Background:
1 Four physical nodes; each physical machine hosts 2 virtual machines: 
lustre-mds-nodexx (containing 2 MDTs) and lustre-oss-nodexx (containing 8 
OSTs; one of them also hosts the MGS).
2 Two RoCE network interfaces on the physical machines, ens6f0np0 and 
ens6f1np1, are virtualized and passed through to the virtual machines (as 
service1 and service2).
3 Using Lustre version 2.15.5 with Pacemaker.
4 A client is running vdbench workloads.
5 Simulating network interface flapping on ens6f0np0 on one of the physical 
nodes using the following script:
for i in {1..10}; do ifconfig ens6f0np0 down; sleep 20; ifconfig ens6f0np0 up; 
sleep 30; done

II. Problem:
1 After the network flapping script has been running for a while, the 
workload hits EIO errors, causing a service interruption.
2 The issue is reproducible almost every time.

III. Preliminary Analysis:
The issue is suspected to be caused by lock callback timeouts, which lead to 
the server evicting the client.
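
If this suspicion is correct, the eviction should correlate with the lock 
callback timeout tunables. A quick sketch of how to inspect the relevant 
parameters with lctl (parameter availability may vary by version):

# On the servers: timeouts that govern callbacks and eviction.
lctl get_param timeout at_min at_max
# On the client: import status / state history after a flap.
lctl get_param osc.*.import osc.*.state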

IV. Relevant Logs:
Server:
May 27 12:09:19 lustre-oss-node40 kernel: LustreError: 
13958:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired 
after 268s: evicting client at 10.255.153.118@o2ib  ns: 
filter-PFStest-OST0005_UUID lock: 00000000d705f0d0/0x7bcb4583f93039cb 
        lrc: 3/0,0 mode: PR/PR res: [0x6936:0x0:0x0].0x0 rrc: 3 type: EXT 
[0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 
nid: 10.255.153.118@o2ib remote: 0x977d715b44c72ae8 expref: 12723 pid: 14457 
timeout: 60814 lvb_type: 1

Client:
May 27 12:09:27 rocky9vm2 kernel: Lustre: PFStest-OST0005-osc-ff49d5028d989800: 
Connection to PFStest-OST0005 (at 10.255.153.242@o2ib) was lost; in-progress 
operations using this service will wait for recovery to complete.

V. Additional Information
1 IP Configuration in Virtual Machines:
| Virtual Machine   | Service  | IP Address     |
| ----------------- | -------- | -------------- |
| lustre-mds-node32 | service1 | 10.255.153.236 |
|                   | service2 | 10.255.153.237 |
| lustre-oss-node32 | service1 | 10.255.153.238 |
|                   | service2 | 10.255.153.239 |
| lustre-mds-node40 | service1 | 10.255.153.240 |
|                   | service2 | 10.255.153.241 |
| lustre-oss-node40 | service1 | 10.255.153.242 |
|                   | service2 | 10.255.153.243 |
| lustre-mds-node41 | service1 | 10.255.153.244 |
|                   | service2 | 10.255.153.245 |
| lustre-oss-node41 | service1 | 10.255.153.246 |
|                   | service2 | 10.255.153.247 |
| lustre-mds-node42 | service1 | 10.255.153.248 |
|                   | service2 | 10.255.153.249 |
| lustre-oss-node42 | service1 | 10.255.153.250 |
|                   | service2 | 10.255.153.251 |

2 Policy Routing Configuration on Server (Example: lustre-oss-node40):

cat /etc/iproute2/rt_tables
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep
263     service1
271     service2

[root@lustre-oss-node40 ~]# ip route show table service1
10.255.153.0/24 dev service1 scope link src 10.255.153.242
[root@lustre-oss-node40 ~]# ip route show table service2
10.255.153.0/24 dev service2 scope link src 10.255.153.243
[root@lustre-oss-node40 ~]# ip rule list
0:      from all lookup local
32764:  from 10.255.153.243 lookup service2
32765:  from 10.255.153.242 lookup service1
32766:  from all lookup main
32767:  from all lookup default
[root@lustre-oss-node40 ~]# ip route
10.255.153.0/24 dev service2 proto kernel scope link src 10.255.153.243 metric 
101
10.255.153.0/24 dev service1 proto kernel scope link src 10.255.153.242 metric 
102

3 /etc/modprobe.d/lustre.conf:
options lnet networks="o2ib(service2)[0,1],o2ib(service1)[0,1]"
options libcfs cpu_npartitions=2
options mdt max_mod_rpcs_per_client=128
options mdt mds_io_num_cpts=[0,1]
options mdt mds_num_cpts=[0,1]
options mdt mds_rdpg_num_cpts=[0,1]
options mds mds_num_threads=512
options ost oss_num_threads=512
options ost oss_cpts=[0,1]
options ost oss_io_cpts=[0,1]
options lnet portal_rotor=1
options lnet lnet_recovery_limit=10
options ptlrpc ldlm_enqueue_min=260
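
The resulting multi-rail setup (two NIDs on the same o2ib network per node) 
can be confirmed at runtime, for example:

lnetctl net show                    # should list both o2ib NIDs
lnetctl ping 10.255.153.242@o2ib    # per-NID reachability from the client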

VI. Other Attempts
1 Reduced LNet Timeout and Increased Retry Count:
Both the server and the client reduced the LNet transaction timeout and 
increased the retry count, but the issue persists:
lnetctl set transaction_timeout 10
lnetctl set retry_count 3
lnetctl set health_sensitivity 1
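
The active values can be confirmed afterwards with:
lnetctl global show    # reports transaction_timeout, retry_count, health_sensitivity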

2 Set Recovery Limit:
Both server and client have set the recovery limit, but the issue persists.
lnetctl set recovery_limit 10

3 Simulated Network Flapping Using iptables:
Simulated the network flapping with iptables inside the virtual machines 
instead, but the issue persists.
#!/bin/bash
# Repeatedly drop and restore all traffic on service1 to simulate flapping.
for j in {1..1000}; do
    date
    echo -e "\nIteration $j: Starting single-port network flapping\n"
    for i in {1..10}; do
        echo -e " ==== Iteration $i down ===="; date
        sudo iptables -I INPUT 1 -i service1 -j DROP
        sudo iptables -I OUTPUT 1 -o service1 -j DROP
        sleep 20
        echo -e " ==== Iteration $i up ===="; date
        sudo iptables -D INPUT -i service1 -j DROP
        sudo iptables -D OUTPUT -o service1 -j DROP
        sleep 30
    done
    echo -e "\nIteration $j: Ending single-port network flapping\n"; date
    # Verify that no stale DROP rules remain.
    sudo iptables -L INPUT -v | grep -i service1
    sudo iptables -L OUTPUT -v | grep -i service1
    sleep 120
done

VII. Any Suggestions?
Dear all, I would appreciate any suggestions or insights you might have 
regarding this issue. Thank you!
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org