I'm having the same issue. The salloc command hangs on my login nodes, but works fine on the head node. My default salloc command is:

SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env $SHELL"

I'm on the OpenHPC Slurm 17.02.9-69.2 package. The log shows the job being assigned, and then it eventually times out. I have tried srun directly with various tweaks, but it hangs every time. You can't Ctrl-C or Ctrl-Z out of it either, but the shell does return after the job times out. I shut down the firewall on the login nodes, but that made no difference.
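In case it helps with narrowing this down, here is roughly what I have been checking while an srun sits there hanging. The hostname and port below are placeholders from my own setup, so treat this as a sketch rather than a recipe:

# On the login node, while srun/salloc is hanging:
# list the TCP ports the srun process is listening on for the task I/O callback
ss -ltnp | grep srun

# From the allocated compute node, test whether it can actually reach one of
# those ports back on the login node (substitute your own hostname and port)
nc -zv loginnode01 60123

If that connect fails even with the firewall off, I would read it as a routing or name-resolution problem between the compute network and the login node rather than anything in Slurm itself.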
On Fri, Mar 9, 2018 at 10:17 AM, Andy Georges <andy.geor...@ugent.be> wrote:
> Hi,
>
> Adding --pty makes no difference. I do not get a prompt and on the node the
> logs show an error. If --pty is used, the error is somewhat different
> compared to not using it but the end result is the same.
>
> My main issue is that giving the same command on the machines running
> slurmd and slurmctld just works.
>
> As far as srun is concerned, that’s what is advised for an interactive
> job, no?
>
> — Andy.
>
> Sent from my iPhone
>
> > On 9 Mar 2018, at 19:07, Michael Robbert <mrobb...@mines.edu> wrote:
> >
> > I think that the piece you may be missing is --pty, but I also don't
> > think that salloc is necessary.
> >
> > The most simple command that I typically use is:
> >
> > srun -N1 -n1 --pty bash -i
> >
> > Mike
> >
> >> On 3/9/18 10:20 AM, Andy Georges wrote:
> >> Hi,
> >>
> >> I am trying to get interactive jobs to work from the machine we use as
> >> a login node, i.e., where the users of the cluster log into and from
> >> where they typically submit jobs.
> >>
> >> I submit the job as follows:
> >>
> >> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun bash -i
> >> salloc: Granted job allocation 41
> >> salloc: Waiting for resource configuration
> >> salloc: Nodes node2801 are ready for job
> >>
> >> …
> >> hangs
> >>
> >> On node2801, the slurmd log has the following information:
> >>
> >> [2018-03-09T18:16:08.820] _run_prolog: run job script took usec=10379
> >> [2018-03-09T18:16:08.820] _run_prolog: prolog with lock for job 41 ran for 0 seconds
> >> [2018-03-09T18:16:08.829] [41.extern] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:08.830] [41.extern] task/cgroup: /slurm/uid_2540075/job_41/step_extern: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:11.824] launch task 41.0 request from 2540075.2540075@10.141.21.202 (port 61928)
> >> [2018-03-09T18:16:11.824] lllp_distribution jobid [41] implicit auto binding: cores,one_thread, dist 1
> >> [2018-03-09T18:16:11.824] _task_layout_lllp_cyclic
> >> [2018-03-09T18:16:11.824] _lllp_generate_cpu_bind jobid [41]: mask_cpu,one_thread, 0x1
> >> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41/step_0: alloc=800MB mem.limit=800MB memsw.limit=880MB
> >> [2018-03-09T18:16:11.836] [41.0] error: connect io: Connection refused
> >> [2018-03-09T18:16:11.836] [41.0] error: IO setup failed: Connection refused
> >> [2018-03-09T18:16:11.905] [41.0] _oom_event_monitor: oom-kill event count: 1
> >> [2018-03-09T18:16:11.905] [41.0] error: job_manager exiting abnormally, rc = 4021
> >> [2018-03-09T18:16:11.905] [41.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
> >> [2018-03-09T18:16:11.907] [41.0] done with job
> >>
> >> We are running slurm 17.11.4.
> >>
> >> When I change to the same user on both the master node (running
> >> slurmctld) and worker nodes (running slurmd), things work just fine. I
> >> would assume I need not run slurmd on the login node for this to work?
> >>
> >> Any pointers are appreciated,
> >> — Andy
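P.S. Andy, the "connect io: Connection refused" and the failed RESPONSE_LAUNCH_TASKS in your slurmd log look like node2801 being unable to open a connection back to the srun process on your login node, which matches what I am seeing here. As far as I know you do not need slurmd on the login node for this; it just needs a matching slurm.conf, working munge, and a route back from the compute nodes. One thing on my list to try is pinning the ports srun listens on with SrunPortRange so there is a known range to check against. Roughly (the range below is only an example I picked, not a recommendation):

# slurm.conf, kept identical on login and compute nodes
SrunPortRange=60001-60100

# if a firewall is running on the login node, allow that range, e.g. with firewalld:
firewall-cmd --permanent --add-port=60001-60100/tcp
firewall-cmd --reload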