Hi,

Last week we upgraded from Lustre 2.12.9 to 2.15.5. It went almost without issues. However, clients using TCP log the following messages when mounting one of the two filesystems:

Issue #1:
-------------

Aug  1 09:39:41 fend08 kernel: Lustre: Lustre: Build Version: 2.15.5
Aug  1 09:39:41 fend08 kernel: LustreError: 31623:0:(mgc_request.c:1566:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.21.10.122@o2ib': rc = -2
Aug  1 09:39:41 fend08 kernel: Lustre: 31623:0:(mgc_request.c:1784:mgc_process_recover_nodemap_log()) MGC172.20.10.101@tcp1: error processing recovery log hpc-cliir: rc = -2
Aug  1 09:39:41 fend08 kernel: Lustre: 31623:0:(mgc_request.c:2150:mgc_process_log()) MGC172.20.10.101@tcp1: IR log hpc-cliir failed, not fatal: rc = -2
Aug  1 09:39:41 fend08 root[31712]: ksocklnd-config: skip setting up route for bond0: don't overwrite existing route
Aug  1 09:39:42 fend08 kernel: Lustre: Mounted hpc-client

This does not happen when mounting over InfiniBand.

How can we fix this?


Issue #2 (might or might not be related):
---------------------------------------------------------

The status of target connections after mounting is:

# lfs check all
hpc-OST0003-osc-ffff90532327f000 active.
hpc-OST0004-osc-ffff90532327f000 active.
hpc-OST0005-osc-ffff90532327f000 active.
hpc-OST0006-osc-ffff90532327f000 active.
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource temporarily unavailable (11)
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource temporarily unavailable (11)
hpc-OST0009-osc-ffff90532327f000 active.
hpc-OST000a-osc-ffff90532327f000 active.
hpc-OST000b-osc-ffff90532327f000 active.
hpc-OST000c-osc-ffff90532327f000 active.
hpc-OST000d-osc-ffff90532327f000 active.
hpc-OST000e-osc-ffff90532327f000 active.
hpc-MDT0000-mdc-ffff90532327f000 active.
MGC172.20.10.101@tcp1 active.

OST000[7-e] are on host 172.20.10.122@tcp1 (10.21.10.122@o2ib).
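
With more targets, the failing ones are easier to spot with a small filter. This is only a sketch: `failed_targets` is a hypothetical helper name, and it assumes the error line format shown in the output above.

```shell
# Hypothetical helper: read `lfs check` output on stdin and print only the
# names of targets that were reported with an error.
# Assumes error lines of the form:
#   lfs check: error: check '<target>': <message>
failed_targets() {
    # grep -o emits each quoted target separately, even when several error
    # messages end up on one line; cut keeps the text between the quotes.
    grep -o "check '[^']*'" | cut -d"'" -f2
}

# Usage (assumption: the error lines are written to stderr, hence 2>&1):
#   lfs check all 2>&1 | failed_targets
```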

As a result, client access hangs as soon as it hits OST000[7-8].

Unmounting and mounting again clears the error on OST000[7-8] and makes them usable (Issue #1 still showing). After a clean LNet start the issue comes back.

Disabling 'discovery' in LNet makes this issue go away (Issue #1 still showing).
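
For anyone wanting to reproduce this, discovery can be disabled roughly as follows. This is a sketch of the usual lnetctl and module-parameter route, not a record of the exact commands run here, and the modprobe file name is only an example.

```shell
# Runtime: turn off LNet peer discovery on the client.
# Do this after LNet is up but before the filesystem is mounted.
lnetctl set discovery 0

# Check the current setting ("discovery: 0" in the global section).
lnetctl global show

# Persistent alternative: set the lnet module parameter, e.g. in
# /etc/modprobe.d/lnet.conf (example file name), then reload the modules:
#   options lnet lnet_peer_discovery_disabled=1
```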

Reverting to Lustre 2.15.3 also makes it go away (Issue #1 still showing). Perhaps not all of the TCP issues in 2.15.4 were fixed by LU-17664.


A few notes about our system:
------------------------------------------

- It's ZFS based.
- It was created back in 2015. The MGS and MDTs have survived since then (zfs send/receive), while new OSTs have been added over time and old ones have been taken out.
- There are 2 filesystems on an MDS pair. One MDT on each MDS.
- Dual network stack with InfiniBand and TCP. For historical reasons we are using tcp1 and not the default tcp0. No routers.

Cheers,
Hans Henrik
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
