Hmm… perhaps I misunderstood. I thought your application was an MPI application, in which case OMPI would definitely abort the job if a process fails. So I'm puzzled by your observation, because (a) I wrote the logic that responds to that problem, and (b) I verified that it does indeed behave as designed.
Are you setting an MCA param to tell it not to terminate upon bad proc exit?

> On Oct 29, 2014, at 3:55 AM, Steven Chow <[email protected]> wrote:
>
> If I run my MPI application without using Slurm, for example with
> "mpirun --machinefile host_list --pernode ./my_mpi_application",
> and a node crashes while it is running, the application keeps working.
>
> If I use
>> "salloc -N 10 --no-kill mpirun ./my-mpi-application"
>
> then when the whole job is killed, I get the following in slurmctld.log:
>
> [2014-10-29T09:20:58.150] sched: _slurm_rpc_allocate_resources JobId=6 NodeList=vsc-00-03-[00-01],vsc-server-204 usec=8221
> [2014-10-29T09:21:08.103] sched: _slurm_rpc_job_step_create: StepId=6.0 vsc-00-03-[00-01] usec=5004
> [2014-10-29T09:23:05.066] sched: Cancel of StepId=6.0 by UID=0 usec=5054
> [2014-10-29T09:23:05.084] sched: Cancel of StepId=6.0 by UID=0 usec=3904
> [2014-10-29T09:23:07.005] sched: Cancel of StepId=6.0 by UID=0 usec=5114
>
> And in slurmd.log:
>
> [2014-10-29T09:21:13.000] launch task 6.0 request from [email protected] (port 60118)
> [2014-10-29T09:21:13.054] Received cpu frequency information for 2 cpus
> [2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
> [2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
> [2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:11 WITH SIGNAL 9 ***
>
> So I conclude that it is Slurm that kills the MPI job.
>
> At 2014-10-29 17:40:43, "Ralph Castain" <[email protected]> wrote:
> FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI process abnormally terminates unless told otherwise. See the mpirun man page for options on how to run without termination.
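For reference, the kind of invocation being discussed might look like the sketch below. The parameter name `orte_abort_on_non_zero_status` is my assumption for this Open MPI line, not something confirmed in the thread; names vary across versions, so verify it against your own mpirun man page and `ompi_info` output before relying on it.

```shell
# Sketch only: the MCA parameter name below is an assumption; confirm it
# exists in your Open MPI build (see the mpirun man page) before use.
# Setting it to 0 would ask ORTE not to abort the job when a process
# exits with a non-zero status.
salloc -N 10 --no-kill \
  mpirun --mca orte_abort_on_non_zero_status 0 ./my-mpi-application
```

Note that this only changes mpirun's reaction to a failed process; it does not by itself make the surviving MPI processes able to communicate correctly after a node loss.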
>> On Oct 29, 2014, at 12:34 AM, Artem Polyakov <[email protected]> wrote:
>>
>> Hello, Steven.
>>
>> As one option, you can give the DMTCP project (http://dmtcp.sourceforge.net/) a try. I was the one who added SLURM support there, and it is relatively stable now (still under development, though). Let me know if you have any problems.
>>
>> 2014-10-29 13:10 GMT+06:00 Steven Chow <[email protected]>:
>> Hi,
>> I am new to Slurm.
>> I have a problem with fault tolerance when running an MPI application on a cluster with Slurm.
>>
>> My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
>> I didn't use the Checkpoint or Nonstop plugins.
>>
>> I submit the job with the command "salloc -N 10 --no-kill mpirun ./my-mpi-application".
>>
>> While it is running, if one node crashes, the WHOLE job is killed on all allocated nodes.
>> It seems that the "--no-kill" option doesn't work.
>>
>> I want the job to keep running without being killed, even if some nodes fail or the network connection breaks, because I will handle node failures myself.
>>
>> Can anyone give some suggestions?
>>
>> Besides, if I want to use the Nonstop plugin to handle failures, then according to http://slurm.schedmd.com/nonstop.html an additional package named smd also needs to be installed.
>> How can I get this package?
>>
>> Thanks!
>>
>> -Steven Chow
>>
>> --
>> Best regards, Artem Y. Polyakov
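Since the available termination/recovery knobs differ between Open MPI releases, one way to see what a given build actually exposes is to grep the full parameter dump; this is a sketch under the assumption that `ompi_info` is on the PATH of the installed Open MPI:

```shell
# List every MCA parameter the local Open MPI build knows about and
# filter for abort/recovery-related names (output differs by version).
ompi_info --all | grep -i -E 'abort|recover'
```

Whatever parameter names this turns up can then be passed to mpirun with `--mca <name> <value>`, as described in the mpirun man page.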
