srun: error: Application launch failed: Invalid node name specified

Hearns' Law: all batch system problems are DNS problems.

Seriously though - check your name resolution on both the head node and the compute nodes.
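For example - a rough sketch, assuming your node names really are lxclient10
and lxclient11 and that NodeName/NodeAddr in slurm.conf match them - run
something like this on the head node and on each compute node, and check that
all three answers agree:

    # Forward lookup via whatever NSS is configured (/etc/hosts, DNS, ...)
    getent hosts lxclient10 lxclient11

    # Reverse lookup of the address(es) each node thinks it owns
    getent hosts $(hostname -I)

    # What Slurm itself believes the node is called
    scontrol show node lxclient10 | grep -E 'NodeAddr|NodeHostName'

If the names come back differently on different nodes, or differ from what
slurmctld has configured, that kind of mismatch is one way to end up with
"Invalid node name specified" errors.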
On Tue, 16 Jul 2019 at 08:49, Pär Lundö <par.lu...@foi.se> wrote:

> Hi,
>
> I have now had the time to look at some of your suggestions.
>
> First I tried running "srun -N1 hostname" via an sbatch script, while
> having two nodes up and running.
> "sinfo" showed that both nodes were up and idle prior to submitting the
> sbatch script.
> After submitting the job, I received an error stating:
>
> "srun: error: Task launch for 86.0 failed on node lxclient11: Invalid node
> name specified.
> srun: error: Application launch failed: Invalid node name specified
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete"
>
> From the log file on the client I get a more detailed error:
> "Launching batch job 86 for UID 1000
> [86.batch] error: Invalid host_index -1 for job 86
> [86.batch] error: Host lxclient10 not in hostlist lxclient11
> [86.batch] task_pre_launch: Using sched_affinity for tasks
> rpc_launch_tasks: Invalid node list (lxclient10 not in lxclient11)"
>
> My two nodes are called lxclient10 and lxclient11.
> Why is my batch job launched with UID 1000? Shouldn't it be launched by
> the slurm user (which in my case has UID 64030)?
> What is meant by the different nodes not being in the node list?
> The two nodes and the server share the same set of IP addresses in the
> "/etc/hosts" file.
>
> -> This was resolved: lxclient10 was marked as down. After getting it
> back up, submitting the same sbatch script resulted in no error.
> However, running it on two nodes I get an error:
> "srun: error: Job step 88.0 aborted before step completely launched.
> srun: error: Job step aborted: Waiting up to 32 seconds for job step to
> finish.
> srun: error: task 1 launch failed: Unspecified error
> srun: error: lxclient10: task 0: Killed"
>
> And in the slurmctld.log file from the client I get an error similar to
> the one previously stated: pmix cannot bind the UNIX socket
> /var/spool/slurmd/stepd.slurm.pmix.88.0: Address already in use (98)
>
> I ran the lsof command, but I don't really know what I am looking for.
> If I grep for the different node names I can see that the two nodes have
> mounted the NFS partition and that a link is established.
>
> "As an aside, have you checked that your username exists on that compute
> server? getent passwd par
> Also that your home directory is mounted - or something substituting for
> your home directory?"
> Yes, the user slurm exists on both nodes and has the same UID.
>
> "Have you tried
>
> srun -N# -n# mpirun python3 ....
>
> Perhaps you have no MPI environment being set up for the processes? There
> was no "--mpi" flag in your "srun" command and we don't know if you have a
> default value for that or not."
>
> In my slurm.conf file I do specify "MpiDefault=pmix" (and it can be seen
> in the log file that there is something wrong with pmix: the address is
> already in use).
>
> One thing that struck me now is that I run these nodes as a pair of
> diskless nodes, which boot and mount the same filesystem supplied by a
> server. They run different PIDs for different processes, which should not
> affect one another(?), right?
>
> Best regards,
>
> Palle
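Regarding "I don't really know what I am looking for" with lsof: what you are
after is whichever process (or leftover file) is holding the stepd pmix socket
that the new job step wants to bind. A rough sketch - assuming the default
SlurmdSpoolDir of /var/spool/slurmd and the step id 88.0 from your log; adjust
both to whatever your logs actually name - run this on the node that reports
"Address already in use":

    # Any process with a stepd pmix UNIX socket open?
    lsof -U | grep stepd.slurm.pmix

    # Same question via ss: UNIX sockets plus the owning process
    ss -xap | grep stepd.slurm.pmix

    # Leftover socket files from earlier (possibly dead) steps
    ls -l /var/spool/slurmd/ | grep pmix

    # Orphaned slurmstepd processes that might still own them
    ps -eaf --forest | grep -i slurmstepd

If the socket file exists but nothing shows up as holding it, it is most
likely a stale file left behind by an earlier step; if a process does show
up, that is the one fighting over the socket that Chris described.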
> On 2019-07-12 19:34, Pär Lundö wrote:
>
> Hi,
>
> Thank you so much for your quick responses!
> It is much appreciated.
> I don't have access to the cluster until next week, but I'll be sure to
> follow up on all of your suggestions and get back to you next week.
>
> Have a nice weekend!
> Best regards
> Palle
>
> ------------------------------
> *From:* "slurm-users" <slurm-users-boun...@lists.schedmd.com>
> *Sent:* 12 July 2019 17:37
> *To:* "Slurm User Community List" <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] Running pyMPI on several nodes
>
> Par, by 'poking around' Chris means using tools such as netstat and lsof.
> I would also look at ps -eaf --forest to make sure there are no 'orphaned'
> jobs sitting on that compute node.
>
> Having said that though, I have a dim memory of a classic PBSPro error
> message which says something about a network connection,
> but really means that you cannot open a remote session on that compute
> server.
>
> As an aside, have you checked that your username exists on that compute
> server? getent passwd par
> Also that your home directory is mounted - or something substituting for
> your home directory?
>
>
> On Fri, 12 Jul 2019 at 15:55, Chris Samuel <ch...@csamuel.org> wrote:
>
>> On 12/7/19 7:39 am, Pär Lundö wrote:
>>
>> > Presumably, the first 8 tasks originate from the first node (in this
>> > case lxclient11), and the other node (lxclient10) responds as
>> > predicted.
>>
>> That looks right; it seems the other node has two processes fighting
>> over the same socket and that's breaking Slurm there.
>>
>> > Is it necessary to have passwordless SSH communication alongside the
>> > munge authentication?
>>
>> No, srun doesn't need (or use) that at all.
>>
>> > In addition I checked the slurmctld log from both the server and client
>> > and found something (noted in bold):
>>
>> This is from the slurmd log on the client from the look of it.
>>
>> > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
>> > for tasks lurm.pmix.83.0: Address already in use[98]*
>> > [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
>> > [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
>> > [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
>> > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>>
>> That indicates that something else has grabbed the socket it wants and
>> that's why the setup of the MPI ranks on the second node fails.
>>
>> You'll want to poke around there to see what's using it.
>>
>> Best of luck!
>> Chris
>> --
>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>
> --
> Regards, Pär
> ________________________________
> Pär Lundö
> Researcher
> Avdelningen för Ledningssystem
>
> FOI
> Totalförsvarets forskningsinstitut
> 164 90 Stockholm
>
> Visiting address:
> Olau Magnus väg 33, Linköping
>
> Tel: +46 13 37 86 01
> Mob: +46 734 447 815
> Switchboard: +46 13 37 80 00
> par.lu...@foi.se
> www.foi.se
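One more thought on the diskless setup: different PIDs on the two nodes are
not a problem in themselves, but if the directory that slurmd (and the pmix
plugin) writes its per-step sockets into lives on the shared, server-exported
filesystem rather than somewhere node-local, the two slurmds can step on each
other's socket files. That is only a guess from here, but it is cheap to check
- a minimal sketch, assuming the /var/spool/slurmd path from your logs; run it
on both nodes and compare:

    # Where does slurmd think its spool directory is?
    scontrol show config | grep -i SlurmdSpoolDir

    # Is that path on a local filesystem (tmpfs, local disk) or on the NFS root?
    findmnt -T /var/spool/slurmd
    df -h /var/spool/slurmd

If it does turn out to be shared, giving each node its own node-local
/var/spool/slurmd (tmpfs is a common choice for diskless nodes) is the usual
arrangement.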