Not yet, but maybe I can give it a try.
You said "(b) I verified that it does indeed behave as designed." What version of Open MPI did you use when you ran that test? I am now using Open MPI 1.6.5 and Slurm 14.03.6. Also, can you give further guidance on how to set the MCA param? Thank you!

At 2014-10-29 21:57:40, "Ralph Castain" <r...@open-mpi.org> wrote:

Hmm…perhaps I misunderstood. I thought your application was an MPI application, in which case OMPI would definitely abort the job if something fails. So I’m puzzled by your observation, as (a) I wrote the logic that responds to that problem, and (b) I verified that it does indeed behave as designed. Are you setting an MCA param to tell it not to terminate upon bad proc exit?

On Oct 29, 2014, at 3:55 AM, Steven Chow <wulingaoshou_...@163.com> wrote:

If I run my MPI application without Slurm, for example with "mpirun --machinefile host_list --pernode ./my_mpi_application", and a node crashes while it is running, the application keeps working. But when I used "salloc -N 10 --no-kill mpirun ./my-mpi-application" and the whole job was killed, I got the following logs.

slurmctld.log:

[2014-10-29T09:20:58.150] sched: _slurm_rpc_allocate_resources JobId=6 NodeList=vsc-00-03-[00-01],vsc-server-204 usec=8221
[2014-10-29T09:21:08.103] sched: _slurm_rpc_job_step_create: StepId=6.0 vsc-00-03-[00-01] usec=5004
[2014-10-29T09:23:05.066] sched: Cancel of StepId=6.0 by UID=0 usec=5054
[2014-10-29T09:23:05.084] sched: Cancel of StepId=6.0 by UID=0 usec=3904
[2014-10-29T09:23:07.005] sched: Cancel of StepId=6.0 by UID=0 usec=5114

And slurmd.log:

[2014-10-29T09:21:13.000] launch task 6.0 request from 0.0@10.0.3.204 (port 60118)
[2014-10-29T09:21:13.054] Received cpu frequency information for 2 cpus
[2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:11 WITH SIGNAL 9 ***

So I conclude that it is Slurm that kills the MPI job.

At 2014-10-29 17:40:43, "Ralph Castain" <r...@open-mpi.org> wrote:

FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI process abnormally terminates unless told otherwise. See the mpirun man page for options on how to run without termination.

On Oct 29, 2014, at 12:34 AM, Artem Polyakov <artpo...@gmail.com> wrote:

Hello, Steven. As one option, you can give the DMTCP project (http://dmtcp.sourceforge.net/) a try. I am the one who added SLURM support there, and it is relatively stable now (still under development, though). Let me know if you have any problems.

2014-10-29 13:10 GMT+06:00 Steven Chow <wulingaoshou_...@163.com>:

Hi, I am new to Slurm. I have a problem with failure tolerance when running an MPI application on a cluster with Slurm. My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5. I am not using the Checkpoint or Nonstop plugins. I submit the job with the command "salloc -N 10 --no-kill mpirun ./my-mpi-application". While it is running, if one node crashes, the WHOLE job gets killed on all allocated nodes. It seems that the "--no-kill" option doesn't work. I want the job to keep running without being killed, even if some nodes fail or a network connection breaks, because I will handle node failures myself. Can anyone give some suggestions?

Besides, if I want to use the Nonstop plugin to handle failures, then according to http://slurm.schedmd.com/nonstop.html an additional package named smd also needs to be installed. How can I get this package? Thanks!

-Steven Chow

--
Best regards, Artem Y. Polyakov
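For reference on the "how to set an MCA param" question above: Open MPI documents three standard ways to set any MCA parameter (the mpirun command line, the OMPI_MCA_ environment-variable prefix, and the per-user file $HOME/.openmpi/mca-params.conf). The sketch below shows all three mechanisms; "example_param" is a placeholder name, not a real parameter. The actual parameter that controls abort-on-process-failure for a given install should be looked up with `ompi_info --param all all` rather than guessed.

```shell
# Three equivalent ways to set an Open MPI MCA parameter.
# NOTE: "example_param" is a hypothetical placeholder; substitute the real
# parameter name after checking `ompi_info --param all all` on your system.

# (1) On the mpirun command line (shown as a comment only, since mpirun
#     may not be installed where this snippet runs):
#     mpirun --mca <param_name> <value> ./my-mpi-application

# (2) As an environment variable: prefix the parameter name with OMPI_MCA_
export OMPI_MCA_example_param=0
echo "env setting: $OMPI_MCA_example_param"

# (3) In the per-user parameter file, read by every mpirun invocation:
mkdir -p "$HOME/.openmpi"
echo "example_param = 0" >> "$HOME/.openmpi/mca-params.conf"
tail -n 1 "$HOME/.openmpi/mca-params.conf"
```

Command-line settings override environment variables, which in turn override the file, so the file is the right place for a persistent cluster-wide or per-user default.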
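The slurmd.log excerpt in the thread is the key evidence: three step-kill records, all with SIGNAL 9. A quick mechanical check that the step was killed by signal (rather than exiting on its own) is to grep the log; the snippet below simply replays the exact lines quoted above against a scratch file at an assumed path (/tmp/slurmd_excerpt.log).

```shell
# Replay the slurmd.log lines quoted in the thread into a scratch file
# (path is arbitrary) and count the SIGKILL records.
cat > /tmp/slurmd_excerpt.log <<'EOF'
[2014-10-29T09:21:13.000] launch task 6.0 request from 0.0@10.0.3.204 (port 60118)
[2014-10-29T09:21:13.054] Received cpu frequency information for 2 cpus
[2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:09 WITH SIGNAL 9 ***
[2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 2014-10-29T09:23:11 WITH SIGNAL 9 ***
EOF
grep -c 'WITH SIGNAL 9' /tmp/slurmd_excerpt.log   # prints 3
```

Seeing SIGNAL 9 in slurmd.log only shows who delivered the kill, not who requested it; per Ralph's reply, mpirun asks Slurm to tear the step down when a rank dies abnormally, so the SIGKILL entries are consistent with either party initiating the termination.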