We have seen this, and have had it come and go with no clear explanation. I'd watch out for MTU and netmask trouble, sysctl limits that might be relevant (apparently the kernel's default network settings are really tuned for <1 Gb Ethernet, not for anything faster), hot spots on the network, etc.
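For what it's worth, the sort of sysctl tuning usually suggested for 10GbE hosts looks like the sketch below; the file name and values are illustrative assumptions, not something validated on your hardware, so treat them as a starting point rather than a recommendation.

    # /etc/sysctl.d/90-10gbe.conf -- example socket buffer limits for 10GbE
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # apply with: sysctl --system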
--
#BlackLivesMatter
 ____
 || \\UTGERS,     |---------------------------*O*---------------------------
 ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
      `'

On Oct 10, 2023, at 22:29, James Lam <unison2...@gmail.com> wrote:

> We have a cluster of 176 nodes consisting of an InfiniBand switch and 10GbE, and we use the 10GbE network for SSH. Currently we have the older Marvell 10GbE cards from launch,
> https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886
> and the current batch of QLogic 10GbE cards,
> https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory
>
> We are running Slurm 20.11.4 on the server, and the node health check daemon is also deployed using the OpenHPC method. We have no issue with the nodes using the Marvell 10GbE cards, which never flap between the Slurm down <--> idle states. However, the nodes with the QLogic cards do show this flip-flop between down <--> idle.
>
> We tried increasing the ARP caching and upgrading the clients to minor version 20.11.9, which didn't help with the situation. We would like to know if anyone has faced a similar situation.
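For reference, "increasing the ARP caching" on a cluster this size typically means raising the kernel's neighbour-table garbage-collection thresholds; a minimal sketch follows, with values that are illustrative assumptions rather than what the original poster actually used.

    # keep ~176 compute nodes plus switches/BMCs comfortably below the ARP GC thresholds
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384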