Not yet, but maybe I can give it a try.
You said "(b) I verified that it does indeed behave as designed." What version of Open MPI did you use when you ran that test? I am now using Open MPI 1.6.5 and Slurm 14.03.6. Also, can you give further guidance on how to set the MCA param? Thank you!

At 2014-10-29 21:57:40, "Ralph Castain" <r...@open-mpi.org> wrote:

Hmm…perhaps I misunderstood. I thought your application was an MPI application, in which case OMPI would definitely abort the job if something fails. So I’m puzzled by your observation, as (a) I wrote the logic that responds to that problem, and (b) I verified that it does indeed behave as designed. Are you setting an MCA param to tell it not to terminate upon bad proc exit?

On Oct 29, 2014, at 3:55 AM, Steven Chow <wulingaoshou_...@163.com> wrote:

If I run my MPI application without Slurm, for example with "mpirun --machinefile host_list --pernode ./my_mpi_application", and a node crashes while it is running, the application keeps working. But when I used "salloc -N 10 --no-kill mpirun ./my-mpi-application" and the whole job was killed, I got the following logs.

slurmctld.log:

[2014-10-29T09:20:58.150] sched: _slurm_rpc_allocate_resources JobId=6 NodeList=vsc-00-03-[00-01],vsc-server-204 usec=8221
[2014-10-29T09:21:08.103] sched: _slurm_rpc_job_step_create: StepId=6.0 vsc-00-03-[00-01] usec=5004
[2014-10-29T09:23:05.066] sched: Cancel of StepId=6.0 by UID=0 usec=5054
[2014-10-29T09:23:05.084] sched: Cancel of StepId=6.0 by UID=0 usec=3904
[2014-10-29T09:23:07.005] sched: Cancel of StepId=6.0 by UID=0 usec=5114

And slurmd.log:

[2014-10-29T09:21:13.000] launch task 6.0 request from 0.0@10.0.3.204 (port 60118)
[2014-10-29T09:21:13.054] Received cpu frequency information for 2 cpus
[2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:11 WITH SIGNAL 9 ***

So I conclude that it is Slurm that kills the MPI job.

At 2014-10-29 17:40:43, "Ralph Castain" <r...@open-mpi.org> wrote:

FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI process abnormally terminates unless told otherwise. See the mpirun man page for options on how to run without termination.

On Oct 29, 2014, at 12:34 AM, Artem Polyakov <artpo...@gmail.com> wrote:

Hello, Steven. As one option, you can give the DMTCP project (http://dmtcp.sourceforge.net/) a try. I am the one who added SLURM support there, and it is relatively stable now (still under development, though). Let me know if you have any problems.

2014-10-29 13:10 GMT+06:00 Steven Chow <wulingaoshou_...@163.com>:

Hi, I am new to Slurm. I have a problem with failure tolerance when running an MPI application on a cluster with Slurm. My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5. I am not using the Checkpoint or Nonstop plugins. I submit the job with the command "salloc -N 10 --no-kill mpirun ./my-mpi-application". While it is running, if one node crashes, the WHOLE job gets killed on all allocated nodes. It seems that the "--no-kill" option doesn't work. I want the job to keep running without being killed, even if some nodes fail or a network connection breaks, because I will handle node failures myself. Can anyone give some suggestions?

Besides, if I want to use the Nonstop plugin to handle failures, then according to http://slurm.schedmd.com/nonstop.html an additional package named smd also needs to be installed. How can I get this package? Thanks!

-Steven Chow

--
Best regards, Artem Y. Polyakov
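For reference on the "how to set an MCA param" question above: Open MPI documents three standard ways to set any MCA parameter (the mpirun command line, the OMPI_MCA_ environment-variable prefix, and the per-user file $HOME/.openmpi/mca-params.conf). The sketch below shows all three mechanisms; "example_param" is a placeholder name, not a real parameter. The actual parameter that controls abort-on-process-failure for a given install should be looked up with `ompi_info --param all all` rather than guessed.

```shell
# Three equivalent ways to set an Open MPI MCA parameter.
# NOTE: "example_param" is a hypothetical placeholder; substitute the real
# parameter name after checking `ompi_info --param all all` on your system.

# (1) On the mpirun command line (shown as a comment only, since mpirun
#     may not be installed where this snippet runs):
#     mpirun --mca <param_name> <value> ./my-mpi-application

# (2) As an environment variable: prefix the parameter name with OMPI_MCA_
export OMPI_MCA_example_param=0
echo "env setting: $OMPI_MCA_example_param"

# (3) In the per-user parameter file, read by every mpirun invocation:
mkdir -p "$HOME/.openmpi"
echo "example_param = 0" >> "$HOME/.openmpi/mca-params.conf"
tail -n 1 "$HOME/.openmpi/mca-params.conf"
```

Command-line settings override environment variables, which in turn override the file, so the file is the right place for a persistent cluster-wide or per-user default.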
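The slurmd.log excerpt in the thread is the key evidence: three step-kill records, all with SIGNAL 9. A quick mechanical check that the step was killed by signal (rather than exiting on its own) is to grep the log; the snippet below simply replays the exact lines quoted above against a scratch file at an assumed path (/tmp/slurmd_excerpt.log).

```shell
# Replay the slurmd.log lines quoted in the thread into a scratch file
# (path is arbitrary) and count the SIGKILL records.
cat > /tmp/slurmd_excerpt.log <<'EOF'
[2014-10-29T09:21:13.000] launch task 6.0 request from 0.0@10.0.3.204 (port 60118)
[2014-10-29T09:21:13.054] Received cpu frequency information for 2 cpus
[2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:11 WITH SIGNAL 9 ***
EOF
grep -c 'WITH SIGNAL 9' /tmp/slurmd_excerpt.log   # prints 3
```

Seeing SIGNAL 9 in slurmd.log only shows who delivered the kill, not who requested it; per Ralph's reply, mpirun asks Slurm to tear the step down when a rank dies abnormally, so the SIGKILL entries are consistent with either party initiating the termination.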