Hi Makie,
Yes, sorry, that should be:
From the client (172.18.178.216):
lnetctl ping 172.18.185.8@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.8@o2ib: Input/output error
From the server (172.18.185.8):
lnetctl ping 172.18.178.216@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.216@o2ib: Input/output error
And yet a standard ping works.
Pinging to/from other clients and other OSSs works, i.e. the file system
is fully functional and in production; just this client and one or two
others are having problems.
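
To keep track of which NIDs fail when the same check is repeated from several hosts, the pasted lnetctl ping output can be scanned with a short script. This is only a minimal sketch in Python: the helper name failed_pings is hypothetical, and it assumes the failure line is worded exactly as in the output above.

```python
import re

# Pattern for the failure line as it appears in this thread; other Lustre
# versions may word it differently (assumption).
FAIL_RE = re.compile(r"descr:\s*failed to ping (\S+): (.+)")


def failed_pings(text):
    """Return (nid, reason) pairs for every failed ping found in text."""
    return [(m.group(1), m.group(2).strip()) for m in FAIL_RE.finditer(text)]


# Output pasted from the two hosts in this thread.
CAPTURED = """\
manage:
- ping:
errno: -1
descr: failed to ping 172.18.185.8@o2ib: Input/output error
manage:
- ping:
errno: -1
descr: failed to ping 172.18.178.216@o2ib: Input/output error
"""

for nid, reason in failed_pings(CAPTURED):
    print(nid, "->", reason)
```

The regular expression ignores leading whitespace, so it should work whether or not the mail client preserved the indentation of the output.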
We have one core-edge link down on the edge switch that this client is
attached to. Given that a standard ping works, connectivity is there.
But perhaps there is some RDMA issue?
Cheers,
Alastair.
On Wed, 4 Sep 2024, Makia Minich wrote:
The IP for the nid in your “net show” isn’t any of the nids you pinged. Is an
address misconfigured somewhere?
On Sep 4, 2024, at 2:52 AM, Alastair Basden via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:
Hi,
We are having some LNet issues, and wonder if anyone can advise.
Client is 2.15.5, server is 2.12.6.
Fabric is IB.
The file system mounts, but OSTs on a couple of OSSs are not contactable.
Client and servers can ping each other over the IB network.
However, an lnetctl ping fails to/from the bad OSSs to this client. To other
clients it's all fine, i.e. for most of the clients it is working well, with
just one or two affected.
Server to client:
lnetctl ping 172.18.178.201@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.201@o2ib: Input/output error
Client to server:
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.10@o2ib: Input/output error
And the o2ib network is noted as down:
lnetctl net show --net o2ib --verbose
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
          statistics:
              send_count: 45032
              recv_count: 45030
              drop_count: 0
          tunables:
              peer_timeout: 100
              peer_credits: 32
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 16
              map_on_demand: 1
              concurrent_sends: 32
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: 0
          CPT: "[0,1]"
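
For checking many clients at once, the status field in the output above can be picked out mechanically. The following is a rough sketch in Python, not an official tool: the function name down_nids is hypothetical, and it assumes each "nid:" line is followed by its own "status:" line, as in the output shown here.

```python
import re


def down_nids(text):
    """Return the NIDs whose status line reports anything other than 'up'."""
    result = []
    current = None  # most recent NID seen without a status yet
    for line in text.splitlines():
        stripped = line.strip()
        m = re.match(r"-\s*nid:\s*(\S+)", stripped)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"status:\s*(\S+)", stripped)
        if m and current is not None:
            if m.group(1) != "up":
                result.append(current)
            current = None
    return result


# Excerpt of the net show output from this thread.
SAMPLE = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
"""

print(down_nids(SAMPLE))
```

Lines are stripped before matching, so the sketch does not depend on the YAML indentation surviving a copy and paste.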
Could this be a hardware error, even though the IB is working?
Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?
Are there any suggestions on how to bring up the LNet network or fix the
problems?
Thanks,
Alastair.
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org