Hello Slurm community,

We are using Slurm to deploy training jobs on a large GPU cluster, but we 
have run into a strange behavior. As newcomers, we wonder whether this is a 
known issue. Below is some more info:

  *   We are running a relatively old version, 22.0.5
  *   At relatively high load, we encounter hangs. It is particularly 
puzzling in the following sense: assume we have nodelist1 with 6 hosts and 
nodelist2 with 7 hosts, and we run a simple ‘hostname’. Deploying on 
nodelist1 alone or nodelist2 alone is fine, but with all 13 hosts, the debug 
messages show that execution hangs after the last task has completed. It 
then hangs for exactly 180 seconds. For concreteness, the commands look 
roughly like the sketch below.
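
For illustration only (the node names are placeholders, not our real 
hostnames, and our actual jobs pass a few more options):

    # nodelist1 alone (6 hosts): completes normally
    srun -w node[01-06] hostname

    # nodelist2 alone (7 hosts): completes normally
    srun -w node[07-13] hostname

    # all 13 hosts together: every task prints its hostname, then srun
    # sits for ~180 seconds before returning
    srun -w node[01-13] hostname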

Does anyone know what the potential issue might be? We would be happy to 
post more config details or debug messages.

Thank you so much!
Richard
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com