FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI 
process abnormally terminates unless told otherwise. See the mpirun man page 
for options on how to run without termination.
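
As a rough illustration of the two policies, here is a toy Python sketch. It is
not how mpirun is implemented and uses no MPI or Slurm at all; `run_job` and
`abort_on_failure` are names made up for this example. A supervisor launches a
few child processes and either tears down the rest when one exits abnormally
(the default-style behavior described above) or lets the survivors continue.

```python
import subprocess
import sys
import time


def run_job(children, abort_on_failure=True, poll_interval=0.05):
    """Supervise toy child processes given as (delay_seconds, exit_code).

    Hypothetical helper for illustration only. Returns the final return
    code of every child, in launch order.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c",
             f"import sys, time; time.sleep({delay}); sys.exit({code})"]
        )
        for delay, code in children
    ]
    while True:
        # Poll every child on each pass; None means "still running".
        codes = [p.poll() for p in procs]
        if all(c is not None for c in codes):
            return codes
        if abort_on_failure and any(c not in (None, 0) for c in codes):
            # One abnormal exit takes down every child that is still running.
            for p, c in zip(procs, codes):
                if c is None:
                    p.terminate()
            for p in procs:
                p.wait()
            return [p.returncode for p in procs]
        time.sleep(poll_interval)
```

With `abort_on_failure=False`, `run_job([(0.1, 1), (0.5, 0)])` lets the second
child finish normally and returns `[1, 0]`. With the default, a long-running
second child is terminated as soon as the first one fails, so its return code
is non-zero.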


> On Oct 29, 2014, at 12:34 AM, Artem Polyakov <artpo...@gmail.com> wrote:
> 
> Hello, Steven.
> 
> One option is to give the DMTCP project (http://dmtcp.sourceforge.net/) a 
> try. I was the one who added SLURM support there, and it is relatively stable 
> now (still under development, though). Let me know if you have any problems.
> 
> 2014-10-29 13:10 GMT+06:00 Steven Chow <wulingaoshou_...@163.com>:
> Hi,
> I am new to Slurm. 
> I have a problem with fault tolerance when running an MPI application on a 
> cluster with Slurm. 
> 
> My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
> I did not use the Checkpoint or Nonstop plugins.
> 
> I submit the job with the command "salloc -N 10 --no-kill mpirun 
> ./my-mpi-application".
> 
> While the job is running, if one node crashes, the WHOLE job is killed on 
> all allocated nodes.
> It seems that the "--no-kill" option doesn't work.
> 
> I want the job to keep running without being killed, even if some nodes 
> fail or network connections break, 
> because I will handle the node failures myself.
> 
> Can anyone give some suggestions?
> 
> Besides, if I want to use the Nonstop plugin to handle failures, according to 
> http://slurm.schedmd.com/nonstop.html, an additional package named smd 
> also needs to be installed. 
> How can I get this package?
> 
> Thanks!
> 
> -Steven Chow
> 
> 
>  
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> 
