I’m coming to this question late, and this is not the answer to your problem (well, maybe tangentially), but it may help someone else: my recollection is that, for interactive jobs, the compute node assigned to the job must be able to contact the node you’re starting the interactive job from (bg-slurmb-login1 here) on a wide range of ports. We had a firewall configuration that didn’t allow that, and all interactive jobs failed until we resolved it. I suppose having the wrong address somewhere could mimic that behavior.
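For anyone hitting the firewall variant of this: by default srun listens on ephemeral ports for step communication, which is hard to firewall. Slurm's SrunPortRange option in slurm.conf pins srun to a fixed range that you can then open explicitly on the submit hosts. The specific range and firewalld commands below are just an illustration; pick a range wide enough for your concurrent job count, and keep it consistent across the cluster:

    # slurm.conf (same on all nodes): confine srun's listening ports
    SrunPortRange=60001-63000

    # then allow that range inbound on the login/submit hosts,
    # e.g. with firewalld:
    firewall-cmd --permanent --add-port=60001-63000/tcp
    firewall-cmd --reload

Each srun typically needs a few ports from this range, so size it to several times the number of simultaneous job steps you expect.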
--
#BlackLivesMatter
 ____
 || \\UTGERS,      |---------------------------*O*---------------------------
 ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 || \\   of NJ     | Office of Advanced Research Computing - MSB C630, Newark
      `'

> On Jan 20, 2022, at 9:40 AM, Durai Arasan <arasan.du...@gmail.com> wrote:
>
> Hello Slurm users,
>
> We are suddenly encountering strange errors while trying to launch
> interactive jobs on our cpu partitions. Have you encountered this problem
> before? Kindly let us know.
>
> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1
> --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G
> --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7:
> Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> Best regards,
> Durai Arasan
> MPI Tuebingen