Hi Sid,

What version of Lustre?
-cf

On Mon, Mar 1, 2021, 6:37 PM Sid Young via lustre-discuss <[email protected]> wrote:

> G'Day all,
>
> I've been doing some file create/delete testing on our new Lustre storage,
> which results in the OSS nodes crashing and rebooting due to high-latency
> issues.
>
> I can reproduce it by running "dd" commands on the /lustre file system in
> a for loop and then doing an rm -f testfile-*.text at the end.
> This results in console errors on our DL385 OSS nodes (running CentOS 7.9),
> which basically show a stack of
> mlx5_core and bnxt_en error messages (mlx5 being the Mellanox driver
> for the 100G ConnectX-5 cards), followed by a stack of
> "NMI watchdog: BUG: soft lockup - CPU#N stuck for XXs"
> where N is one of around 4 different CPUs and XX is typically
> 20-24 seconds... then the boxes reboot!
>
> Before I log a support ticket with HPE, I'm going to try disabling the
> 100G cards and see if it's repeatable via the 10G interfaces on
> the motherboards. Before I do that, though: does anyone use Mellanox
> ConnectX-5 cards on their Lustre storage nodes with Ethernet only, and if so,
> which driver are you using and on which OS?
>
> Thanks in advance!
>
> Sid Young
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
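For anyone trying to reproduce this, the create/delete test described in the quoted report (dd in a for loop on /lustre, then rm -f testfile-*.text) might look roughly like the bash sketch below. The report does not give the exact dd arguments, so the function name create_delete_cycle, the default file count of 100, the 1 MiB block size and the default per-file size of 1024 MiB are all hypothetical choices for illustration, not Sid's actual values.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the create/delete test from the report.
# The function name, file count and file size are assumptions, not values from the thread.
set -euo pipefail

create_delete_cycle() {
  local target_dir="${1:-/lustre}"   # Lustre mount point named in the report
  local file_count="${2:-100}"       # hypothetical number of test files
  local size_mb="${3:-1024}"         # hypothetical size of each file, in MiB

  local i
  for ((i = 1; i <= file_count; i++)); do
    # Write size_mb blocks of 1 MiB zeros into one test file (status=none is GNU dd).
    dd if=/dev/zero of="${target_dir}/testfile-${i}.text" bs=1M count="${size_mb}" status=none
  done

  # Bulk delete at the end, as in the report.
  rm -f "${target_dir}"/testfile-*.text
}
```

Usage, on a client with the file system mounted: `create_delete_cycle /lustre 100 1024`. Passing the directory, count and size as arguments makes it easy to scale the load up or down while watching the OSS consoles.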
