FWIW, that isn't a Slurm issue. Open MPI's mpirun will kill the whole job if any MPI process terminates abnormally, unless told otherwise. See the mpirun man page for the options that let a run continue instead of being terminated.
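If it helps, here is a rough sketch for finding the relevant options on your own install rather than guessing at flag names. It just greps a launcher's `--help` output for lines mentioning "abort" or "recover". The function name `list_abort_options` and the search pattern are my own assumptions, not anything from Open MPI or Slurm, and the options you are after may be worded differently in your version, so treat the man page as authoritative.

```shell
# Hypothetical helper (not part of Open MPI or Slurm): print the lines of a
# launcher's --help output that mention aborting or recovery.
# Assumption: the relevant options contain "abort" or "recover" in their
# description; adjust the pattern if your version words them differently.
list_abort_options() {
    cmd="$1"
    # Bail out early if the launcher is not available on this machine.
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: not found in PATH" >&2
        return 1
    fi
    # Case-insensitive search; print a notice when nothing matches.
    "$cmd" --help 2>&1 | grep -i -E 'abort|recover' \
        || echo "no abort/recovery options listed by $cmd --help"
}
```

Run it as `list_abort_options mpirun` on a node where Open MPI is in your PATH, then look up whatever it prints in the man page before adding anything to your salloc command line.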

> On Oct 29, 2014, at 12:34 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> Hello, Steven.
>
> As one option, you can give the DMTCP project
> (http://dmtcp.sourceforge.net/) a try. I was the one who added SLURM
> support there, and it is relatively stable now (still under development,
> though). Let me know if you have any problems.
>
> 2014-10-29 13:10 GMT+06:00 Steven Chow <wulingaoshou_...@163.com>:
> Hi,
> I am new to Slurm.
> I have a problem with failure tolerance when running an MPI
> application on a cluster with Slurm.
>
> My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
> I didn't use the Checkpoint or Nonstop plugins.
>
> I submit the job with the command "salloc -N 10 --no-kill mpirun
> ./my-mpi-application".
>
> While the job is running, if one node crashes, the WHOLE job is
> killed on all allocated nodes.
> It seems that the "--no-kill" option doesn't work.
>
> I want the job to keep running without being killed, even when some nodes
> fail or a network connection breaks,
> because I will handle the node failures myself.
>
> Can anyone give some suggestions?
>
> Besides, if I want to use the Nonstop plugin to handle failures, according to
> http://slurm.schedmd.com/nonstop.html, an additional package named smd
> also needs to be installed.
> How can I get this package?
>
> Thanks!
>
> -Steven Chow
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov