Hi Colin,

I’ve done some more digging and found that on the affected nodes the messages repeat at ~10-minute intervals. I can also see a lot of these errors in the MDS log:

Nov 25 10:56:02 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11
Nov 25 10:56:02 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) Skipped 4 previous similar messages
Nov 25 11:08:39 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11
Nov 25 11:08:39 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) Skipped 4 previous similar messages
Nov 25 11:21:16 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11

As you can see, these refer to another OST (OST000c), and they are repeated roughly every 12.5 minutes (12 min 37 s between the entries above).

On oss03 (serving OST000a – OST000e), no errors are logged after rebooting the clients, but I can see these messages:

Nov 25 19:08:02 oss03 kernel: Lustre: 19728:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply#012 req@ffff9c6ec5550850 x1713320906932288/t0(0) >[email protected]@o2ib:662/0 lens 432/0 e 0 to 0 dl 1637863687 ref 2 fl New:/0/ffffffff rc 0/-1
Nov 25 19:08:02 oss03 kernel: Lustre: 19728:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 4 previous similar messages
Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000b: Export ffff9c42c996fc00 already connecting from 192.168.1.13@o2ib
Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000a: Export ffff9c4f43fb3c00 already connecting from 192.168.1.13@o2ib

I also checked the InfiniBand network and found no errors.

Servers are running CentOS 7.9 with Lustre 2.12.6 / zfs 3.10.0
Clients are running CentOS 7.2 with Lustre 2.8.0

Looks like a problem on oss03?
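
For reference, the return codes and the repeat interval can be read straight off the log lines. The following is only a rough Python sketch, not a Lustre tool. It assumes that rc is a negated Linux errno (as is usual for Lustre kernel messages), hard-codes the three relevant Linux errno values instead of asking the local platform, and assumes the year 2021 because syslog timestamps omit it. The helper names (parse_line, describe_rc, gaps_seconds) are made up for illustration.

```python
import re
from datetime import datetime

# Linux errno values, hard-coded so the result does not depend on the
# platform this script runs on (assumption: rc is a negated Linux errno).
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    114: ("EALREADY", "Operation already in progress"),
    115: ("EINPROGRESS", "Operation now in progress"),
}

MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()

# Matches syslog kernel lines that end in "rc = <number>".
LINE_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) "
    r"(?P<host>\S+) kernel: .*\brc = (?P<rc>-?\d+)\s*$"
)


def parse_line(line, year=2021):
    """Return (timestamp, host, rc), or None if the line carries no 'rc = N'."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    when = datetime(year, MONTHS.index(match["mon"]) + 1, int(match["day"]),
                    int(match["h"]), int(match["m"]), int(match["s"]))
    return when, match["host"], int(match["rc"])


def describe_rc(rc):
    """Translate a negative return code into its errno name and text."""
    name, text = LINUX_ERRNO.get(abs(rc), ("unknown", "not in this table"))
    return f"rc = {rc}: {name} ({text})"


def gaps_seconds(times):
    """Seconds between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# The MDS entries quoted above; the shared prefix is factored out for width.
PREFIX = ("mds01 kernel: LustreError: 10370:0:"
          "(osp_precreate.c:964:osp_precreate_cleanup_orphans())")
MDS_LOG = [
    f"Nov 25 10:56:02 {PREFIX} lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11",
    f"Nov 25 10:56:02 {PREFIX} Skipped 4 previous similar messages",
    f"Nov 25 11:08:39 {PREFIX} lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11",
    f"Nov 25 11:08:39 {PREFIX} Skipped 4 previous similar messages",
    f"Nov 25 11:21:16 {PREFIX} lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11",
]

if __name__ == "__main__":
    events = [e for e in map(parse_line, MDS_LOG) if e is not None]
    for when, host, rc in events:
        print(when, host, describe_rc(rc))
    print("gaps (s):", gaps_seconds([when for when, _, _ in events]))
    print(describe_rc(-114))  # the code from the client-side ost_connect error
```

Run against the MDS entries above, the gaps come out at 757 s each, i.e. 12 min 37 s. Note that -114 maps to EALREADY ("already in progress"), which is distinct from EINPROGRESS (115).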

Regards,
Hallstein

From: Colin Faber <[email protected]>
Sent: Thursday, 25 November 2021 18:11
To: Hallstein Løhre <[email protected]>
Cc: [email protected]
Subject: Re: [lustre-discuss] ost_connect to node failed

-114 == operation in progress, what's the logging look like on both sides of the connection?

-cf

On Thu, Nov 25, 2021 at 5:18 AM Hallstein Løhre <[email protected]> wrote:

Hi,

After some trouble with runaway processes yesterday, I had to reboot several Lustre clients. Now some of these show the following entries in /var/log/messages:

Nov 25 11:09:51 nodexx kernel: LustreError: 11-0: lustre01-OST000a-osc-ffff887ee3207800: operation ost_connect to node 192.168.1.xxx@o2ib failed: rc = -114

The filesystem seems OK, but the stuck processes might have accessed file(s) on OST000a. No hardware problem seems to exist; the OSTs are all ZFS volumes with status OK. I suspended writing to OST000a, but after rebooting the clients and checking for hardware problems, I have re-enabled writing.

Any explanation of rc = -114?

Best Regards
Hallstein Løhre
ALPHA SYSTEM AS
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
