We have seen this, and have had it come and go with no clear explanation. I'd watch out for MTU and netmask trouble, sysctl limits that might be relevant (apparently the kernel's default network settings are really tuned for <1 Gb Ethernet, not for anything faster), hot spots on the network, etc.
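For what it's worth, the sort of sysctl tuning usually suggested for 10GbE hosts looks like the sketch below; the file name and values are illustrative assumptions, not something validated on your hardware, so treat them as a starting point rather than a recommendation.

    # /etc/sysctl.d/90-10gbe.conf -- example socket buffer limits for 10GbE
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # apply with: sysctl --system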
--
#BlackLivesMatter
 ____
 || \\UTGERS,     |---------------------------*O*---------------------------
 ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
      `'

On Oct 10, 2023, at 22:29, James Lam <unison2...@gmail.com> wrote:

> We have a cluster of 176 nodes consisting of an InfiniBand switch and 10GbE, and we use the 10GbE network for SSH. Currently we have the older Marvell 10GbE cards from launch,
> https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_117b0672d7ef4c5bb0eca02886
> and the current batch of QLogic 10GbE cards,
> https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_9bd8f647238c4a5f8c72a5221b&tab=revisionHistory
>
> We are running Slurm 20.11.4 on the server, and the node health check daemon is also deployed using the OpenHPC method. We have no issue with the nodes using the Marvell 10GbE cards, which never flap between the Slurm down <--> idle states. However, the nodes with the QLogic cards do show this flip-flop between down <--> idle.
>
> We tried increasing the ARP caching and upgrading the clients to minor version 20.11.9, which didn't help with the situation. We would like to know if anyone has faced a similar situation.
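For reference, "increasing the ARP caching" on a cluster this size typically means raising the kernel's neighbour-table garbage-collection thresholds; a minimal sketch follows, with values that are illustrative assumptions rather than what the original poster actually used.

    # keep ~176 compute nodes plus switches/BMCs comfortably below the ARP GC thresholds
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384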