Communication from the compute nodes to the login nodes may be blocked by the firewall. That will prevent srun from running properly.
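If you want to test that, here is a rough sketch rather than anything official. As far as I understand it, srun listens on ports on the node where you launch it and the job step on the compute node connects back to them, so a blocked connection in that direction would look like a launch timeout. The small Python script below only checks whether a TCP connection to a given host and port succeeds. The function name `port_reachable` and the script name are my own, and the host and port are values you have to supply: the port must be one that something (for example a waiting srun) is actually listening on, and if SrunPortRange is set in slurm.conf it should fall in that range.

```python
import socket
import sys


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (typical of a firewall that
        # silently drops packets) and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical usage, run ON A COMPUTE NODE:
    #   python3 check_port.py <login-node> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    state = "reachable" if port_reachable(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

A "NOT reachable" result for a port that is definitely being listened on would point at the firewall; a "reachable" result only tells you that one port is open, not that the whole range srun may use is open.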
Sent from my iPhone

> On 17 Jul 2018, at 10:16, John Hearns <hear...@googlemail.com> wrote:
>
> Ronan, as far as I can see this means that you cannot launch a job.
>
> What state are the compute nodes in when you run sinfo?
>
>> On 17 July 2018 at 10:08, Buckley, Ronan <ronan.buck...@dell.com> wrote:
>> Yes, srun just hangs. Commands like sinfo and squeue run fine.
>>
>> I also have no slurm logs in /var/log ??
>>
>> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of John Hearns
>> Sent: Tuesday, July 17, 2018 8:57 AM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] 'srun hostname' hangs on the command line
>>
>> Ronan, sorry to ask but this is a bit unclear.
>>
>> Are you unable to launch ANY sessions with srun?
>>
>> In which case you need to look at the logs to see why the job is not being scheduled.
>>
>> Is it only the hostname command which fails?
>>
>> I would guess very much you have already run an ssh into a node and run the hostname command manually.
>>
>> On 17 July 2018 at 09:50, Buckley, Ronan <ronan.buck...@dell.com> wrote:
>>
>> Yes I do.
>>
>> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Williams, Gareth (IM&T, Clayton)
>> Sent: Tuesday, July 17, 2018 12:33 AM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] 'srun hostname' hangs on the command line
>>
>> Do you get the same problem as a non-root user?
>>
>> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Buckley, Ronan
>> Sent: Tuesday, 17 July 2018 12:53 AM
>> To: slurm-users@lists.schedmd.com
>> Subject: [slurm-users] 'srun hostname' hangs on the command line
>>
>> Hi All,
>>
>> Verbose mode doesn’t show much.
>>
>> I hashed out the hostnames.
>>
>> Any ideas/suggestions?
>>
>> # srun hostname
>> ^Csrun: interrupt (one more within 1 sec to abort)
>> srun: task 0: unknown
>> ^Z
>> [1]+ Stopped srun hostname
>> #
>>
>> # srun -v hostname
>> srun: defined options for program `srun'
>> srun: --------------- ---------------------
>> srun: user : `root'
>> srun: uid : 0
>> srun: gid : 0
>> srun: cwd : /root
>> srun: ntasks : 1 (default)
>> srun: nodes : 1 (default)
>> srun: jobid : 4294967294 (default)
>> srun: partition : default
>> srun: profile : `NotSet'
>> srun: job name : `(null)'
>> srun: reservation : `(null)'
>> srun: burst_buffer : `(null)'
>> srun: wckey : `(null)'
>> srun: cpu_freq_min : 4294967294
>> srun: cpu_freq_max : 4294967294
>> srun: cpu_freq_gov : 4294967294
>> srun: switches : -1
>> srun: wait-for-switches : -1
>> srun: distribution : unknown
>> srun: cpu_bind : default (0)
>> srun: mem_bind : default (0)
>> srun: verbose : 1
>> srun: slurmd_debug : 0
>> srun: immediate : false
>> srun: label output : false
>> srun: unbuffered IO : false
>> srun: overcommit : false
>> srun: threads : 60
>> srun: checkpoint_dir : /var/slurm/checkpoint
>> srun: wait : 0
>> srun: nice : -2
>> srun: account : (null)
>> srun: comment : (null)
>> srun: dependency : (null)
>> srun: exclusive : false
>> srun: bcast : false
>> srun: qos : (null)
>> srun: constraints :
>> srun: geometry : (null)
>> srun: reboot : yes
>> srun: rotate : no
>> srun: preserve_env : false
>> srun: network : (null)
>> srun: propagate : NONE
>> srun: prolog : (null)
>> srun: epilog : (null)
>> srun: mail_type : NONE
>> srun: mail_user : (null)
>> srun: task_prolog : (null)
>> srun: task_epilog : (null)
>> srun: multi_prog : no
>> srun: sockets-per-node : -2
>> srun: cores-per-socket : -2
>> srun: threads-per-core : -2
>> srun: ntasks-per-node : -2
>> srun: ntasks-per-socket : -2
>> srun: ntasks-per-core : -2
>> srun: plane_size : 4294967294
>> srun: core-spec : NA
>> srun: power :
>> srun: remote command : `hostname'
>> srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
>> srun: Nodes ####### are ready for job
>> srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
>> srun: launching 50871.0 on host #######, 1 tasks: 0
>> srun: route default plugin loaded
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>> srun: Job step 50871.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>> #
>>
>> Rgds