I'm having the same issue. The salloc command hangs on my login nodes, but works fine on the head node. My default salloc command is:

SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env $SHELL"

I'm on the OpenHPC Slurm 17.02.9-69.2 package. The log shows the job being assigned, and then it eventually times out. I have tried srun directly with various tweaks, but it hangs every time. You can't Ctrl-C or Ctrl-Z out of it either, but the shell does return after the job times out. I shut down the firewall on the login nodes, but that made no difference.
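In case it helps with narrowing this down, here is roughly what I have been checking while an srun sits there hanging. The hostname and port below are placeholders from my own setup, so treat this as a sketch rather than a recipe:

# On the login node, while srun/salloc is hanging:
# list the TCP ports the srun process is listening on for the task I/O callback
ss -ltnp | grep srun

# From the allocated compute node, test whether it can actually reach one of
# those ports back on the login node (substitute your own hostname and port)
nc -zv loginnode01 60123

If that connect fails even with the firewall off, I would read it as a routing or name-resolution problem between the compute network and the login node rather than anything in Slurm itself.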
On Fri, Mar 9, 2018 at 10:17 AM, Andy Georges <andy.geor...@ugent.be> wrote:
> Hi,
>
> Adding --pty makes no difference. I do not get a prompt and on the node the
> logs show an error. If --pty is used, the error is somewhat different
> compared to not using it but the end result is the same.
>
> My main issue is that giving the same command on the machines running
> slurmd and slurmctld just works.
>
> As far as srun is concerned, that’s what is advised for an interactive
> job, no?
>
> — Andy.
>
> Sent from my iPhone
>
> > On 9 Mar 2018, at 19:07, Michael Robbert <mrobb...@mines.edu> wrote:
> >
> > I think that the piece you may be missing is --pty, but I also don't
> > think that salloc is necessary.
> >
> > The most simple command that I typically use is:
> >
> > srun -N1 -n1 --pty bash -i
> >
> > Mike
> >
> >> On 3/9/18 10:20 AM, Andy Georges wrote:
> >> Hi,
> >>
> >> I am trying to get interactive jobs to work from the machine we use as
> >> a login node, i.e., where the users of the cluster log into and from
> >> where they typically submit jobs.
> >>
> >> I submit the job as follows:
> >>
> >> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun bash -i
> >> salloc: Granted job allocation 41
> >> salloc: Waiting for resource configuration
> >> salloc: Nodes node2801 are ready for job
> >>
> >> …
> >> hangs
> >>
> >> On node2801, the slurmd log has the following information:
> >>
> >> [2018-03-09T18:16:08.820] _run_prolog: run job script took usec=10379
> >> [2018-03-09T18:16:08.820] _run_prolog: prolog with lock for job 41 ran for 0 seconds
> >> [2018-03-09T18:16:08.829] [41.extern] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:08.830] [41.extern] task/cgroup: /slurm/uid_2540075/job_41/step_extern: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:11.824] launch task 41.0 request from 2540075.2540075@10.141.21.202 (port 61928)
> >> [2018-03-09T18:16:11.824] lllp_distribution jobid [41] implicit auto binding: cores,one_thread, dist 1
> >> [2018-03-09T18:16:11.824] _task_layout_lllp_cyclic
> >> [2018-03-09T18:16:11.824] _lllp_generate_cpu_bind jobid [41]: mask_cpu,one_thread, 0x1
> >> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41/step_0: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:11.836] [41.0] error: connect io: Connection refused
> >> [2018-03-09T18:16:11.836] [41.0] error: IO setup failed: Connection refused
> >> [2018-03-09T18:16:11.905] [41.0] _oom_event_monitor: oom-kill event count: 1
> >> [2018-03-09T18:16:11.905] [41.0] error: job_manager exiting abnormally, rc = 4021
> >> [2018-03-09T18:16:11.905] [41.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
> >> [2018-03-09T18:16:11.907] [41.0] done with job
> >>
> >> We are running slurm 17.11.4.
> >>
> >> When I change to the same user on both the master node (running
> >> slurmctld) and worker nodes (running slurmd), things work just fine. I
> >> would assume I need not run slurmd on the login node for this to work?
> >>
> >> Any pointers are appreciated,
> >> — Andy
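P.S. Andy, the "connect io: Connection refused" and the failed RESPONSE_LAUNCH_TASKS in your slurmd log look like node2801 being unable to open a connection back to the srun process on your login node, which matches what I am seeing here. As far as I know you do not need slurmd on the login node for this; it just needs a matching slurm.conf, working munge, and a route back from the compute nodes. One thing on my list to try is pinning the ports srun listens on with SrunPortRange so there is a known range to check against. Roughly (the range below is only an example I picked, not a recommendation):

# slurm.conf, kept identical on login and compute nodes
SrunPortRange=60001-60100

# if a firewall is running on the login node, allow that range, e.g. with firewalld:
firewall-cmd --permanent --add-port=60001-60100/tcp
firewall-cmd --reload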