I’m coming to this question late, and this may not be the answer to your 
problem (well, maybe tangentially), but it may help someone else: my 
recollection is that the compute node assigned the job must be able to 
contact the node you’re starting the interactive job from (bg-slurmb-login1 
here) on a wide range of ports in the case of interactive jobs. For us, a 
firewall config didn’t allow that, and all interactive jobs failed until we 
resolved it. I guess having the wrong address someplace could mimic that 
behavior.
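
For what it’s worth, one way to make that reachability requirement 
manageable is to pin srun’s listening ports to a fixed range and open only 
that range in the firewall. A minimal sketch below — the 60001-63000 range, 
firewalld, and the hostname are assumptions for illustration; adjust to 
your site:

```shell
# Assumption: srun on the submit host listens on ports that slurmstepd on
# the compute node must reach. Pinning the range makes firewalling possible.

# In slurm.conf (propagated to all nodes), restrict srun's listening ports:
#   SrunPortRange=60001-63000

# Then open that range on the submit/login hosts (firewalld example):
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload

# Quick check from a compute node back to the login node (hypothetical host):
nc -zv bg-slurmb-login1 60001
```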

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 20, 2022, at 9:40 AM, Durai Arasan <arasan.du...@gmail.com> wrote:
> 
> Hello Slurm users,
> 
> We are suddenly encountering strange errors while trying to launch 
> interactive jobs on our cpu partitions. Have you encountered this problem 
> before? Kindly let us know.
> 
> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 
> --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G  
> --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: 
> Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> 
> Best regards,
> Durai Arasan
> MPI Tuebingen
