Hi Riccardo,

I would check if the OSTs on this OSS have been registered with the correct NIDs (o2ib1) on the MGS:
$ lctl --device MGS llog_print <fsname>-client

and look for the NIDs in setup/add_conn for the OSTs in question.

Best,
Stephane

> On Sep 28, 2021, at 9:52 AM, Riccardo Veraldi <[email protected]>
> wrote:
>
> Hello.
>
> I have a lustre setup where the MDS (172.21.156.112) is on tcp1 while the
> OSSes are on o2ib1.
>
> I am using Lustre 2.12.7 on RHEL 7.9
>
> All the clients can see the MDS correctly as a tcp1 peer:
>
> peer:
>     - primary nid: 172.21.156.112@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.112@tcp1
>           state: NA
>
> This is by design because the MDS has no IB interface. So the MDS to OSSes
> traffic and MDS to Clients traffic is on tcp1, while clients to OSSes traffic
> is meant to be on o2ib1.
>
> I have 1 MDS (tcp1) And 12 OSSes (tcp1, o2ib1) and a bunch of 20 clients
> (tcp1, o2ib1).
>
> All is fine but not for one of the OSSes (172.21.164.116@o2ib1,
> 172.21.156.102@tcp1).
>
> Even though it is configured the same as all the other ones, traffic only
> goes through tcp1 and not o2ib1.
>
> Even if I force the peer settings to use o2ib, it ignores it and the tcp1
> peer is added anyway
>
> this is lnet.conf on the MDS
>
> p2nets:
>   - net-spec: o2ib1
>     interfaces:
>       0: ib0
>   - net-spec: tcp1
>     interfaces:
>       0: eno1
> global:
>   discovery: 0
>
> this is lnet.conf on OSSes
>
> ip2nets:
>   - net-spec: o2ib1
>     interfaces:
>       0: ib0
>   - net-spec: tcp1
>     interfaces:
>       0: enp1s0f0
> global:
>   discovery: 0
>
> I also tried this on the lustre clients side:
>
> peer:
>     - primary nid: 172.21.164.116@o2ib1
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.164.116@o2ib1
>
> enforcing the peer settings to o2ib1.
>
> This is ignored and the peer is added by its tcp1 LNET interface.
>
>     - primary nid: 172.21.156.102@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.102@tcp1
>           state: NA
>
> All of the hosts involved have discovery set to 0.
>
> Nevertheless the peer setting for that specific OSS is using tcp1 and not
> o2ib.
>
> This is disrupting because traffic goes to tcp1 for that specific OSS and it
> is of course slower than IB.
>
> I had to deactivate the OSTs on that specific OSS.
>
> How may I Fix this issue ?
>
> Here is the complete peer list from the lustre client side and as you can see
> there is that specific OSS included as tcp1 peer.
>
> even if I do "lnetctl peer del --nid 172.21.156.102@tcp1 --prim_nid
> 172.21.156.102@tcp1" the entry is added automatically after a while.
>
> lnetctl peer show
> peer:
>     - primary nid: 172.21.156.112@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.112@tcp1
>           state: NA
>     - primary nid: 172.21.164.111@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.111@o2ib1
>           state: NA
>     - primary nid: 172.21.164.117@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.117@o2ib1
>           state: NA
>     - primary nid: 172.21.164.112@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.112@o2ib1
>           state: NA
>     - primary nid: 172.21.164.119@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.119@o2ib1
>           state: NA
>     - primary nid: 172.21.164.114@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.114@o2ib1
>           state: NA
>     - primary nid: 172.21.164.120@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.120@o2ib1
>           state: NA
>     - primary nid: 172.21.156.102@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.102@tcp1
>           state: NA
>     - primary nid: 172.21.164.116@o2ib1
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.164.116@o2ib1
>           state: NA
>     - primary nid: 172.21.164.110@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.110@o2ib1
>           state: NA
>     - primary nid: 172.21.164.115@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.115@o2ib1
>           state: NA
>     - primary nid: 172.21.164.118@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.118@o2ib1
>           state: NA
>     - primary nid: 172.21.164.113@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.113@o2ib1
>           state: NA
>     - primary nid: 172.21.164.121@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.121@o2ib1
>           state: NA
>
> thanks for looking at this.
>
> Rick
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
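
To see at a glance which peers a client has picked up on a network other than o2ib1, the primary NIDs can be filtered out of the `lnetctl peer show` listing quoted above. This is only a rough sketch: `primary_nids_not_on` is a made-up helper name, not a Lustre tool, and it assumes the output keeps the `primary nid:` layout shown in the quoted peer list.

```shell
# Hypothetical helper (not part of Lustre): reads `lnetctl peer show` output on
# stdin and prints every primary NID that is NOT on the given LNet network.
primary_nids_not_on() {
    net="$1"
    awk -v net="$net" '
        /primary nid:/ {
            nid = $NF    # last field is the NID, e.g. 172.21.156.102@tcp1
            if (nid !~ ("@" net "$")) print nid
        }'
}

# Usage on a client (assumes lnetctl is in PATH):
#   lnetctl peer show | primary_nids_not_on o2ib1
```

Run against the peer list quoted above, this would print 172.21.156.112@tcp1 (the MDS, which is expected to be tcp1 only) and 172.21.156.102@tcp1 (the OSS in question).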
