I think I figured out the problem.
My problem is related to the LNet Network Health feature: https://jira.whamcloud.com/browse/LU-9120
The Lustre MDS and the Lustre client, both running 2.12.0, negotiate a Multi-Rail peer connection, which does not happen with the other clients (2.10.5). As a result, both IB and TCP are used during transfers. TCP is meant only for connecting to the MDS and IB only for connecting to the OSS, but Multi-Rail is enabled by default between the MDS, OSS and client, and this messes things up. The MDS has only one TCP interface and cannot communicate over IB, yet "lnetctl peer show" lists an @o2ib NID that should not be there. The MDS then tries to reach the client over IB, which can never work because there is no IB on the MDS.
MDS LNet configuration:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 172.21.49.233@tcp
          status: up
          interfaces:
              0: eth0

But if I look at "lnetctl peer show" I see:

    - primary nid: 172.21.52.88@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.48.250@tcp
          state: NA
        - nid: 172.21.52.88@o2ib
          state: NA
        - nid: 172.21.48.250@tcp1
          state: NA
        - nid: 172.21.48.250@tcp2
          state: NA

There should be no o2ib NID, but Multi-Rail for some reason enables it.
I do not have problems with the other (non-2.12.0) clients.
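
To make the mismatch easier to spot, here is a minimal sketch that compares the two outputs above. It assumes the text of "lnetctl net show" and "lnetctl peer show" has been pasted into strings. The helper names (normalize_net, local_net_types, unreachable_peer_nids) are made up for illustration and are not part of lnetctl. It matches lines with regular expressions rather than parsing the YAML, and it treats "tcp" and "tcp0" as the same network.

```python
import re

# Sample text copied from the lnetctl output shown above (trimmed).
NET_SHOW = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
    - net type: tcp
      local NI(s):
        - nid: 172.21.49.233@tcp
"""

PEER_SHOW = """\
    - primary nid: 172.21.52.88@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.48.250@tcp
        - nid: 172.21.52.88@o2ib
        - nid: 172.21.48.250@tcp1
        - nid: 172.21.48.250@tcp2
"""


def normalize_net(net):
    """Hypothetical helper: treat 'tcp' as 'tcp0', 'o2ib' as 'o2ib0'."""
    return net if net[-1].isdigit() else net + "0"


def local_net_types(net_show_text):
    """Collect the networks listed under 'net type:' on this node."""
    nets = set()
    for line in net_show_text.splitlines():
        m = re.match(r"\s*-\s*net type:\s*(\S+)", line)
        if m:
            nets.add(normalize_net(m.group(1)))
    return nets


def unreachable_peer_nids(peer_show_text, local_nets):
    """Return peer NIDs whose network is not configured locally.

    Only '- nid:' entries under 'peer ni:' are checked; the
    'primary nid' line is skipped because the pattern requires
    'nid:' directly after the list dash.
    """
    bad = []
    for line in peer_show_text.splitlines():
        m = re.match(r"\s*-\s*nid:\s*([^@\s]+)@(\S+)", line)
        if m and normalize_net(m.group(2)) not in local_nets:
            bad.append(f"{m.group(1)}@{m.group(2)}")
    return bad


if __name__ == "__main__":
    for nid in unreachable_peer_nids(PEER_SHOW, local_net_types(NET_SHOW)):
        print("peer NID on a network this node does not have:", nid)
```

On the MDS output above this flags the @o2ib NID, and also the @tcp1 and @tcp2 NIDs, since the MDS only has the plain tcp network.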

How can I disable Multi-Rail on 2.12.0?

Thank you



On 3/5/19 12:14 PM, Patrick Farrell wrote:
Riccardo,

Since 2.12 is still a relatively new maintenance release, it would be helpful
if you could open an LU and provide more detail there, such as what the clients
were doing, whether you were using any new features (like DoM or FLR), and the
full dmesg from the clients and servers involved in these evictions.

- Patrick

On 3/5/19, 11:50 AM, "lustre-discuss on behalf of Riccardo Veraldi" 
<[email protected] on behalf of [email protected]> 
wrote:

     Hello,
     I have quite a big issue on my Lustre 2.12.0 MDS/MDT. Clients moving
     data to the OSS run into a locking problem I have never seen before.
     The clients are mostly 2.10.5 except for one which is 2.12.0, but
     the problem occurs regardless of the client version.
     These are the errors I see on the MDS/MDT. When this happens
     everything just hangs. If I reboot the MDS everything goes back to
     normal, but it has already happened 2 times in 3 days and it is disruptive.
     Any hints? Is it feasible to downgrade from 2.12.0 to 2.10.6? Thanks.

     Mar  5 11:10:33 psmdsana1501 kernel: Lustre:
     7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has
     failed due to network error: [sent 1551813033/real 1551813033]
     req@ffff9fdcbecd0300 x1626845000210688/t0(0)
     o104->[email protected]@o2ib:15/16 lens 296/224 e 0 to 1 dl
     1551813044 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
     Mar  5 11:10:33 psmdsana1501 kernel: Lustre:
     7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 50552576
     previous similar messages
     Mar  5 11:13:03 psmdsana1501 kernel: LustreError:
     7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid
     172.21.52.87@o2ib) failed to reply to blocking AST (req@ffff9fdcbecd0300
     x1626845000210688 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID
     lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc: 4/0,0 mode: PR/PR res:
     [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags:
     0x60200400000020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621e63
     expref: 8 pid: 7898 timeout: 333081 lvb_type: 0
     Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000:
     A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking
     callback time out: rc -110
     Mar  5 11:13:03 psmdsana1501 kernel: LustreError:
     5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
     expired after 150s: evicting client at 172.21.52.87@o2ib ns:
     mdt-ana15-MDT0000_UUID lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc:
     3/0,0 mode: PR/PR res: [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc:
     5 type: IBT flags: 0x60200400000020 nid: 172.21.52.87@o2ib remote:
     0xd8efecd6e7621e63 expref: 9 pid: 7898 timeout: 0 lvb_type: 0
     Mar  5 11:13:04 psmdsana1501 kernel: Lustre: ana15-MDT0000: Connection
     restored to 59c5a826-f4e9-0dd0-8d4f-08c204f25941 (at 172.21.52.87@o2ib)
     Mar  5 11:15:34 psmdsana1501 kernel: LustreError:
     7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid
     172.21.52.142@o2ib) failed to reply to blocking AST
     (req@ffff9fde2d393600 x1626845000213776 status 0 rc -110), evict it ns:
     mdt-ana15-MDT0000_UUID lock: ffff9fde9b6858c0/0x9824623d2148efee lrc:
     4/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3
     type: IBT flags: 0x60200400000020 nid: 172.21.52.142@o2ib remote:
     0xbb35541ea6663082 expref: 9 pid: 7898 timeout: 333232 lvb_type: 0
     Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000:
     A client on nid 172.21.52.142@o2ib was evicted due to a lock blocking
     callback time out: rc -110
     Mar  5 11:15:34 psmdsana1501 kernel: LustreError:
     5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
     expired after 151s: evicting client at 172.21.52.142@o2ib ns:
     mdt-ana15-MDT0000_UUID lock: ffff9fde9b6858c0/0x9824623d2148efee lrc:
     3/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3
     type: IBT flags: 0x60200400000020 nid: 172.21.52.142@o2ib remote:
     0xbb35541ea6663082 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
     Mar  5 11:15:34 psmdsana1501 kernel: Lustre: ana15-MDT0000: Connection
     restored to 9d49a115-646b-c006-fd85-000a4b90019a (at 172.21.52.142@o2ib)
     Mar  5 11:20:33 psmdsana1501 kernel: Lustre:
     7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has
     failed due to network error: [sent 1551813633/real 1551813633]
     req@ffff9fdcc2a95100 x1626845000222624/t0(0)
     o104->[email protected]@o2ib:15/16 lens 296/224 e 0 to 1 dl
     1551813644 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
     Mar  5 11:20:33 psmdsana1501 kernel: Lustre:
     7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 23570550
     previous similar messages
     Mar  5 11:22:46 psmdsana1501 kernel: LustreError:
     7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid
     172.21.52.87@o2ib) failed to reply to blocking AST (req@ffff9fdcc2a95100
     x1626845000222624 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID
     lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc: 4/0,0 mode: PR/PR res:
     [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags:
     0x60200400000020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621eb7
     expref: 9 pid: 7898 timeout: 333665 lvb_type: 0
     Mar  5 11:22:46 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000:
     A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking
     callback time out: rc -110
     Mar  5 11:22:46 psmdsana1501 kernel: LustreError:
     5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
     expired after 150s: evicting client at 172.21.52.87@o2ib ns:
     mdt-ana15-MDT0000_UUID lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc:
     3/0,0 mode: PR/PR res: [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3
     type: IBT flags: 0x60200400000020 nid: 172.21.52.87@o2ib remote:
     0xd8efecd6e7621eb7 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
_______________________________________________
     lustre-discuss mailing list
     [email protected]
     http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org