Hello Slurm users,

We have suddenly started seeing the errors below when launching interactive jobs on our CPU partitions. Has anyone encountered this problem before? Please let us know.

[darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Best regards,
Durai Arasan
MPI Tuebingen