Hi,
Last week we upgraded from Lustre 2.12.9 to 2.15.5. The upgrade went
almost without issues. However, clients using TCP log the following
messages when mounting one of the two filesystems:
Issue #1:
-------------
Aug 1 09:39:41 fend08 kernel: Lustre: Lustre: Build Version: 2.15.5
Aug 1 09:39:41 fend08 kernel: LustreError:
31623:0:(mgc_request.c:1566:mgc_apply_recover_logs()) mgc: cannot find
UUID by nid '10.21.10.122@o2ib': rc = -2
Aug 1 09:39:41 fend08 kernel: Lustre:
31623:0:(mgc_request.c:1784:mgc_process_recover_nodemap_log())
MGC172.20.10.101@tcp1: error processing recovery log hpc-cliir: rc = -2
Aug 1 09:39:41 fend08 kernel: Lustre:
31623:0:(mgc_request.c:2150:mgc_process_log()) MGC172.20.10.101@tcp1: IR
log hpc-cliir failed, not fatal: rc = -2
Aug 1 09:39:41 fend08 root[31712]: ksocklnd-config: skip setting up
route for bond0: don't overwrite existing route
Aug 1 09:39:42 fend08 kernel: Lustre: Mounted hpc-client
This does not happen when using InfiniBand.
How can we fix this?
Issue #2 (might or might not be related):
---------------------------------------------------------
The status of target connections after mounting is:
# lfs check all
hpc-OST0003-osc-ffff90532327f000 active.
hpc-OST0004-osc-ffff90532327f000 active.
hpc-OST0005-osc-ffff90532327f000 active.
hpc-OST0006-osc-ffff90532327f000 active.
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource
temporarily unavailable (11)
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource
temporarily unavailable (11)
hpc-OST0009-osc-ffff90532327f000 active.
hpc-OST000a-osc-ffff90532327f000 active.
hpc-OST000b-osc-ffff90532327f000 active.
hpc-OST000c-osc-ffff90532327f000 active.
hpc-OST000d-osc-ffff90532327f000 active.
hpc-OST000e-osc-ffff90532327f000 active.
hpc-MDT0000-mdc-ffff90532327f000 active.
MGC172.20.10.101@tcp1 active.
OST000[7-e] are on host 172.20.10.122@tcp1 (10.21.10.122@o2ib).
Because of this, any access that touches OST000[7-8] hangs.
Unmounting and mounting again clears the error on OST000[7-8] and makes
them usable (Issue #1 still showing). After a clean LNet start the issue
comes back.
Disabling 'discovery' in LNet makes this issue go away (Issue #1 still
showing).
Reverting to Lustre 2.15.3 also makes it go away (Issue #1 still
showing). Perhaps not all of the TCP issues in 2.15.4 were fixed by LU-17664.
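In case it helps anyone trying to reproduce this across many clients, here is a minimal sketch of how the `lfs check all` output shown above could be split into active and failing targets. It assumes the output format is exactly as pasted in this mail; `parse_lfs_check` is a hypothetical helper name, not part of Lustre.

```python
import re

# Hypothetical helper, not a Lustre tool: assumes the `lfs check all`
# output format pasted in this mail.
ACTIVE_RE = re.compile(r"(\S+) active\.")
ERROR_RE = re.compile(r"lfs check: error: check '([^']+)': (.*?) \((\d+)\)")


def parse_lfs_check(output):
    """Return (active, failed) from `lfs check all` output.

    active is a list of target names; failed is a list of
    (target, message, errno) tuples.
    """
    # Collapse all whitespace so error lines wrapped by a mail client
    # or terminal are matched the same way as unwrapped ones.
    text = " ".join(output.split())
    active = ACTIVE_RE.findall(text)
    failed = [(name, msg, int(code)) for name, msg, code in ERROR_RE.findall(text)]
    return active, failed
```

The output of `lfs check all` can be captured on each client and passed to the function as a string; a non-empty `failed` list then points at the targets that did not connect.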
A few notes about our system:
------------------------------------------
- It's ZFS based.
- It was created back in 2015. The MGS and MDTs have survived since then
(zfs send/receive), while new OSTs have been added over time and old ones
have been taken out.
- There are 2 filesystems on an MDS pair. One MDT on each MDS.
- Dual network stack with Infiniband and TCP. For historical reasons we
are using tcp1 and not the default tcp0. No routers.
Cheers,
Hans Henrik