Hmm… perhaps I misunderstood. I thought your application was an MPI application, in which case OMPI would definitely abort the job if a process fails. So I'm puzzled by your observation, because (a) I wrote the logic that responds to that problem, and (b) I verified that it does indeed behave as designed.
Are you setting an MCA param to tell it not to terminate upon bad proc exit?

> On Oct 29, 2014, at 3:55 AM, Steven Chow <[email protected]> wrote:
>
> If I run my MPI application without using Slurm, for example with
> "mpirun --machinefile host_list --pernode ./my_mpi_application",
> and a node crashes while it is running, the application keeps working.
>
> If I use
>> "salloc -N 10 --no-kill mpirun ./my-mpi-application"
>
> then when the whole job is killed, I get the following in slurmctld.log:
>
> [2014-10-29T09:20:58.150] sched: _slurm_rpc_allocate_resources JobId=6 NodeList=vsc-00-03-[00-01],vsc-server-204 usec=8221
> [2014-10-29T09:21:08.103] sched: _slurm_rpc_job_step_create: StepId=6.0 vsc-00-03-[00-01] usec=5004
> [2014-10-29T09:23:05.066] sched: Cancel of StepId=6.0 by UID=0 usec=5054
> [2014-10-29T09:23:05.084] sched: Cancel of StepId=6.0 by UID=0 usec=3904
> [2014-10-29T09:23:07.005] sched: Cancel of StepId=6.0 by UID=0 usec=5114
>
> And in slurmd.log:
>
> [2014-10-29T09:21:13.000] launch task 6.0 request from [email protected] (port 60118)
> [2014-10-29T09:21:13.054] Received cpu frequency information for 2 cpus
> [2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
> [2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
> [2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:11 WITH SIGNAL 9 ***
>
> So I conclude that it is Slurm that kills the MPI job.
>
> At 2014-10-29 17:40:43, "Ralph Castain" <[email protected]> wrote:
> FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI process abnormally terminates unless told otherwise. See the mpirun man page for options on how to run without termination.
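For reference, the kind of invocation being discussed might look like the sketch below. The parameter name `orte_abort_on_non_zero_status` is my assumption for this Open MPI line, not something confirmed in the thread; names vary across versions, so verify it against your own mpirun man page and `ompi_info` output before relying on it.

```shell
# Sketch only: the MCA parameter name below is an assumption; confirm it
# exists in your Open MPI build (see the mpirun man page) before use.
# Setting it to 0 would ask ORTE not to abort the job when a process
# exits with a non-zero status.
salloc -N 10 --no-kill \
  mpirun --mca orte_abort_on_non_zero_status 0 ./my-mpi-application
```

Note that this only changes mpirun's reaction to a failed process; it does not by itself make the surviving MPI processes able to communicate correctly after a node loss.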
>> On Oct 29, 2014, at 12:34 AM, Artem Polyakov <[email protected]> wrote:
>>
>> Hello, Steven.
>>
>> As one option, you can give the DMTCP project (http://dmtcp.sourceforge.net/) a try. I was the one who added SLURM support there, and it is relatively stable now (still under development, though). Let me know if you have any problems.
>>
>> 2014-10-29 13:10 GMT+06:00 Steven Chow <[email protected]>:
>> Hi,
>> I am new to Slurm.
>> I have a problem with fault tolerance when running an MPI application on a cluster with Slurm.
>>
>> My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
>> I didn't use the Checkpoint or Nonstop plugins.
>>
>> I submit the job with the command "salloc -N 10 --no-kill mpirun ./my-mpi-application".
>>
>> While it is running, if one node crashes, the WHOLE job is killed on all allocated nodes.
>> It seems that the "--no-kill" option doesn't work.
>>
>> I want the job to keep running without being killed, even if some nodes fail or the network connection breaks, because I will handle node failures myself.
>>
>> Can anyone give some suggestions?
>>
>> Besides, if I want to use the Nonstop plugin to handle failures, then according to http://slurm.schedmd.com/nonstop.html an additional package named smd also needs to be installed.
>> How can I get this package?
>>
>> Thanks!
>>
>> -Steven Chow
>>
>> --
>> Best regards, Artem Y. Polyakov
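Since the available termination/recovery knobs differ between Open MPI releases, one way to see what a given build actually exposes is to grep the full parameter dump; this is a sketch under the assumption that `ompi_info` is on the PATH of the installed Open MPI:

```shell
# List every MCA parameter the local Open MPI build knows about and
# filter for abort/recovery-related names (output differs by version).
ompi_info --all | grep -i -E 'abort|recover'
```

Whatever parameter names this turns up can then be passed to mpirun with `--mca <name> <value>`, as described in the mpirun man page.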
