Thank you for the reply!
The consultant pulled logs during the errors, but I have not. I will do
some tests with logging on and at least review those myself.
It's been a few days since we did the testing, and I've been restoring backups
to the new Lustre from one of the clients that works.
The command you sent does show some high drops on the MDS (storage04) in my
log files. The others have less than a dozen drop_count. Interestingly,
nearly all of the drops are on ACK, assuming I'm reading this correctly. I'm not
sure if that's consistent with the description of the client connection to the
server locking up. I could see the server responding with ACKs that
are never heard, but that seems like a lot of them.
Maybe it's just related to the 30 TB of restored data?
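As a quick sanity check, I pulled the per-NI drop counts from each server with something like the sketch below (the storage01-04 hostnames and the parsing of the lnetctl YAML layout are assumptions; adjust to your environment):

```shell
# summarize_drops: print one "NID drop_count: N" line per network interface
# from "lnetctl net show -v 3" YAML read on stdin.
summarize_drops() {
  awk '
    /- nid:/      { nid = $3 }                      # remember the current NI
    /drop_count:/ { print nid, "drop_count:", $2 }  # report its drop total
  '
}

# Typical use (hostnames are assumptions):
#   for h in storage01 storage02 storage03 storage04; do
#     echo "== $h =="
#     ssh "$h" lnetctl net show -v 3 | summarize_drops
#   done
```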
[root@storage03 rea]# ssh storage04 lnetctl net show -v 3
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 183350
              recv_count: 183346
              drop_count: 4
          sent_stats:
              put: 183350
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 183325
              get: 0
              reply: 0
              ack: 21
              hello: 0
          dropped_stats:
              put: 4
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11]"
    - net type: o2ib
      local NI(s):
        - nid: 10.200.200.54@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 521383624
              recv_count: 694477804
              drop_count: 3749557
          sent_stats:
              put: 521380810
              get: 2814
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 523296327
              get: 1415
              reply: 1397
              ack: 171178665
              hello: 0
          dropped_stats:
              put: 25
              get: 0
              reply: 0
              ack: 3749532
              hello: 0
          health stats:
              health value: 1000
              interrupts: 0
              dropped: 136
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 1
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11]"
Robert E. Anderson
University of NH / Research Computing Center / Data Center Operations
John Villa wrote:
Hello Robert,
Sorry to hear that you're experiencing this issue. All signs here point to an
LNet configuration issue. Just to be safe, did you verify that the IB cables
connecting this storage are clear of error counters? Also, would it be possible
to recreate the error you're experiencing and capture some debugging data? In
case you're not familiar, here is how you'd generate the debug data:
$ lctl set_param debug=+net (can verify if this is here with get_param prior)
$ lctl dk > lnet.dk
# reproduce the error #
$ lctl dk > lnet.dk
Also, what does an "lnetctl net show -v 3" on both a failed client and one of
the storage nodes show?
Thank You,
John Villa
On Tue, Sep 20, 2022 at 5:36 PM Robert Anderson via lustre-discuss
<[email protected]> wrote:
Looking for any suggestions on where to look, or what to change, to gain more
information on RDMA timeout errors.
We have a four-system (2 MDS + 2 OSS), ZFS-backed Lustre 2.12.9 server running
CentOS (release 7.9.2009, kernel 3.10.0-1160.66.1.el7.x86_64) on AMD hardware
(EPYC 7402P). Each of the four systems has a single-connection EDR card
(Mellanox Technologies MT27800 Family [ConnectX-5]).
This Lustre should serve 54 Linux compute node clients (heterogeneous
hardware) running the same CentOS & kernel over 3 EDR switches (4 cables each
from center to edge switches). The client hardware is a mix of Intel CPUs and a
few AMD, with at least 5 variations of EDR and FDR single-connection cards.
All of these clients have been successfully utilizing a ClusterStor2000
appliance, as well as a BeeGFS (7.2.6) storage cluster, over the same fabric.
The IB network is not reporting any errors and is running at full rate to each
client. The only errors appear to be at the LNet level, and they start on the
client side. The server eventually times out the client for no response.
We set up this server with the help of a very competent consultant and have had
them back to help debug this problem. But we are having difficulty finding
anything to log or look at that helps.
On some nodes (mostly older ones) an RDMA timeout occurs on the client that
locks up the client's communication to one of the Lustre targets. We found that
Bonnie++ and fio tests could cause these client lockups to one of the OSS
nodes. Swapping hardware between server nodes didn't reveal any pattern, and
eventually we moved in an entirely new node, which failed in the same manner.
About 2/3 of the clients can be made to lock up, but no pattern in their
individual hardware or network location seems to exist. The remaining nodes
seem to work perfectly with this new Lustre storage. The same sort of client
RDMA lockup has also been triggered against an MDS with a simple "ls" of the
Lustre root after mounting.
Below are the client and MDS logs from that simplest of cases.
Sep 15 10:05:56 node113 systemd: Started Session 85 of user root.
Sep 15 10:06:08 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:08 node113 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24,
npartitions: 2
Sep 15 10:06:09 node113 kernel: alg: No test for adler32 (adler32-zlib)
Sep 15 10:06:09 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:09 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:09 node113 kernel: Lustre: Lustre: Build Version: 2.12.9
Sep 15 10:06:09 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: LNet: Using FMR for registration
Sep 15 10:06:10 node113 kernel: LNet: Added LNI 10.200.200.113@o2ib
[8/256/0/180]
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module
signing key: b4c2eda7b66318967d50ba12b9cb1
8dcee995f32' err -11
Sep 15 10:06:11 node113 kernel: Lustre: Mounted lz01-client
Sep 15 10:06:56 node113 kernel: Lustre:
13799:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for slow reply: [sent 1663250809/real 1663250809] req@ffff899095d04800
x1744044838356992/t0(0) o101->lz01-MDT000
[email protected]@o2ib:12/10 lens 576/968 e 0 to 1 dl
1663250816 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 15 10:06:56 node113 kernel: Lustre: lz01-MDT0000-mdc-ffff8990be2fb800:
Connection to lz01-MDT0000 (at 10.200.200.5
4@o2ib) was lost; in progress operations using this service will wait for
recovery to complete
Sep 15 10:07:02 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_
txs, 0 seconds
Sep 15 10:07:02 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with 10.200
.200.54@o2ib (12): c: 4, oc: 0, rc: 8
Sep 15 10:07:03 node113 kernel: Lustre:
13725:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for sent delay: [sent 1663250816/real 0] req@ffff899095d04c80
x1744044838357056/t0(0) o400->MGC10.200.200.54@[email protected]@o2ib:26/25
lens 224/224 e 0 to 1 dl 1663250823 ref 2 fl Rpc:XN/0/
ffffffff rc 0/-1
Sep 15 10:07:03 node113 kernel: LustreError: 166-1: MGC10.200.200.54@o2ib:
Connection to MGS (at 10.200.200.54@o2ib) was lost; in progress operations
using this service will fail
Sep 15 10:07:09 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs,
0 seconds
Sep 15 10:07:09 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with
10.200.200.54@o2ib (5): c: 7, oc: 0, rc: 8
Sep 15 10:07:10 node113 systemd-logind: New session 86 of user root.
Sep 15 10:07:10 node113 systemd: Started Session 86 of user root.
Sep 15 10:07:54 node113 systemd-logind: New session 87 of user root.
Sep 15 10:07:54 node113 systemd: Started Session 87 of user root.
Sep 15 10:10:34 node113 systemd-logind: Removed session 86.
Sep 15 10:12:47 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs,
0 seconds
Sep 15 10:12:47 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with
10.200.200.54@o2ib (5): c: 7, oc: 0, rc: 8
Sep 15 10:13:37 node113 systemd-logind: New session 88 of user root.
Sep 15 10:13:37 node113 systemd: Started Session 88 of user root.
Sep 15 10:13:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:14:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:14:50 node113 systemd-logind: Removed session 88.
Sep 15 10:15:33 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs,
0 seconds
Sep 15 10:15:33 node113 kernel: LNetError:
13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with
10.200.200.54@o2ib (106): c: 8, oc: 0, rc: 8
Sep 15 10:15:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:15:53 node113 systemd-logind: New session 89 of user root.
Sep 15 10:15:53 node113 systemd: Started Session 89 of user root.
Sep 15 10:16:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:17:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:17:58 node113 systemd-logind: Removed session 89.
Sep 15 10:18:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:19:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:19:53 node113 systemd-logind: New session 90 of user root.
Sep 15 10:19:53 node113 systemd: Started Session 90 of user root.
Sep 15 10:20:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:21:52 node113 kernel: LustreError:
14973:0:(lmv_obd.c:1415:lmv_statfs()) lz01-MDT0000-mdc-ffff8990be2fb800: can't
stat MDS #0: rc = -4
Sep 15 10:21:56 node113 kernel: LustreError:
14977:0:(lmv_obd.c:1415:lmv_statfs()) lz01-MDT0000-mdc-ffff8990be2fb800: can't
stat MDS #0: rc = -4
Sep 15 10:22:29 node113 systemd-logind: New session 91 of user root.
Sep 15 10:22:29 node113 systemd: Started Session 91 of user root.
Sep 15 10:22:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping
12345-10.200.200.54@o2ib: late network completion
Sep 15 10:22:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping())
Skipped 1 previous similar message
Sep 15 10:22:50 node113 systemd-logind: Removed session 91.
Sep 15 10:23:30 node113 systemd-logind: New session 92 of user root.
Sep 15 10:23:30 node113 systemd: Started Session 92 of user root.
Sep 15 10:01:04 storage04 systemd: Removed slice User Slice of root.
Sep 15 10:06:16 storage04 kernel: Lustre: MGS: Connection restored to
5b462af8-e997-25fa-7aac-62ef6f834520 (at 10.200.200.113@o2ib)
Sep 15 10:07:03 storage04 kernel: Lustre: lz01-MDT0000: Client
fd07e75b-dbc1-20b1-ecbf-6f3a96a62d2a (at 10.200.200.113@o2ib) reconnecting
Sep 15 10:07:10 storage04 kernel: Lustre: MGS: Client
01e3bd51-26ab-f2bb-6515-6afe3a181876 (at 10.200.200.113@o2ib) reconnecting
Sep 15 10:10:01 storage04 systemd: Created slice User Slice of root.
Sep 15 10:10:01 storage04 systemd: Started Session 2738 of user root.
Sep 15 10:10:01 storage04 systemd: Removed slice User Slice of root.
Sep 15 10:10:44 storage04 systemd: Created slice User Slice of root.
Sep 15 10:10:44 storage04 systemd-logind: New session 2739 of user root.
Sep 15 10:10:44 storage04 systemd: Started Session 2739 of user root.
Sep 15 10:10:51 storage04 kernel: Lustre: lz01-MDT0000: haven't heard from
client fd07e75b-dbc1-20b1-ecbf-6f3a96a62d2a (at 10.200.200.113@o2ib) in 228
seconds. I think it's dead, and I am evicting it. exp ffff8fc5496cd800, cur
1663251051 expire 1663250901 last 1663250823
Sep 15 10:10:51 storage04 kernel: Lustre: Skipped 1 previous similar message
Sep 15 10:11:07 storage04 kernel: Lustre: MGS: haven't heard from client
01e3bd51-26ab-f2bb-6515-6afe3a181876 (at 10.200.200.113@o2ib) in 237 seconds. I
think it's dead, and I am evicting it. exp ffff8fc6f0df9800, cur 1663251067
expire 1663250917 last 1663250830
Sep 15 10:12:07 storage04 systemd-logind: Removed session 2739.
Robert E. Anderson
University of NH / Research Computing Center / Data Center Operations
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
--
Sincerely,
John Villa
Lead Research Systems Engineer
347-213-0339