[OMPI users] MPI_Abort under slurm
Hi, I noticed that MPI_Abort() does not abort the tasks if the MPI program is started using srun. I call MPI_Abort() from rank 0; this process exits, but the other ranks keep running or waiting for I/O on the other nodes. The only way to kill the job is to use scancel. However, if I use mpirun under a Slurm allocation, then MPI_Abort() works as expected, aborting all tasks. Is this a known issue? Thanks, David
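A minimal sketch of the situation described above: rank 0 calls MPI_Abort() while every other rank blocks in MPI_Recv(). The file name and message tag are illustrative, not taken from the report.

    /* abort_test.c: rank 0 aborts, the other ranks block in MPI_Recv(). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Under plain srun this reportedly kills only rank 0; the
             * ranks below keep waiting. Under mpirun inside a Slurm
             * allocation all tasks are torn down. */
            MPI_Abort(MPI_COMM_WORLD, 1);
        } else {
            int buf;
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }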
Re: [OMPI users] MPI_Abort under slurm
Hi Ralph, thanks for your answer. I am using:

    > mpirun --version
    mpirun (Open MPI) 1.5.4
    Report bugs to http://www.open-mpi.org/community/help/

and Slurm 2.5. Should I try to upgrade to 1.6.5?

/David/Bigagli
www.davidbigagli.com

On Mon, Feb 25, 2013 at 7:38 PM, Bokassa wrote:
> Hi, I noticed that MPI_Abort() does not abort the tasks if the MPI program
> is started using srun. I call MPI_Abort() from rank 0; this process exits,
> but the other ranks keep running or waiting for I/O on the other nodes.
> The only way to kill the job is to use scancel. However, if I use mpirun
> under a Slurm allocation, then MPI_Abort() works as expected, aborting
> all tasks.
>
> Is this a known issue?
>
> Thanks, David
Re: [OMPI users] MPI_Abort under slurm
Thanks Ralph, you were right: I was not aware of srun's --kill-on-bad-exit option and the KillOnBadExit configuration parameter. Setting it to 1 shuts down the entire MPI job when MPI_Abort() is called. I was thinking this MPI protocol message was just transported by Slurm and that each task would then exit on its own. Oh well, I should not guess at the implementation. :-) Thanks again. David
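For reference, a short sketch of the two places this can be set (the option names are real Slurm knobs; the task count and program name are made up for illustration):

    # Per job: ask Slurm to kill every remaining task as soon as one
    # task exits with a non-zero status, as rank 0 does on MPI_Abort().
    srun --kill-on-bad-exit=1 -n 16 ./abort_test

    # Cluster-wide default, set in slurm.conf:
    KillOnBadExit=1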
[OMPI users] High cpu usage
Hi, I notice that a simple MPI program in which rank 0 sends 4 bytes to each rank and receives a reply uses a considerable amount of CPU in system calls:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ------------
     61.10    0.016719           3      5194           gettimeofday
     20.77    0.005683           2      2596           epoll_wait
     18.13    0.004961           2      2595           sched_yield
      0.00    0.000000           0         4           write
      0.00    0.000000           0         4           stat
      0.00    0.000000           0         2           readv
      0.00    0.000000           0         2           writev
    ------ ----------- ----------- --------- --------- ------------
    100.00    0.027363                 10397           total

and

    Process 2512 attached - interrupt to quit
    16:32:17.793039 sched_yield() = 0 <0.000078>
    16:32:17.793276 gettimeofday({1362065537, 793330}, NULL) = 0 <0.000070>
    16:32:17.793460 epoll_wait(4, {}, 32, 0) = 0 <0.000114>
    16:32:17.793712 gettimeofday({1362065537, 793773}, NULL) = 0 <0.000097>
    16:32:17.793914 sched_yield() = 0 <0.000089>
    16:32:17.794107 gettimeofday({1362065537, 794157}, NULL) = 0 <0.000083>
    16:32:17.794292 epoll_wait(4, {}, 32, 0) = 0 <0.000072>
    16:32:17.794457 gettimeofday({1362065537, 794541}, NULL) = 0 <0.000115>
    16:32:17.794695 sched_yield() = 0 <0.000079>
    16:32:17.794877 gettimeofday({1362065537, 794927}, NULL) = 0 <0.000081>
    16:32:17.795062 epoll_wait(4, {}, 32, 0) = 0 <0.000079>
    16:32:17.795244 gettimeofday({1362065537, 795294}, NULL) = 0 <0.000082>
    16:32:17.795432 sched_yield() = 0 <0.000096>
    16:32:17.795761 gettimeofday({1362065537, 795814}, NULL) = 0 <0.000079>
    16:32:17.795940 epoll_wait(4, {}, 32, 0) = 0 <0.000080>
    16:32:17.796123 gettimeofday({1362065537, 796191}, NULL) = 0 <0.000121>
    16:32:17.796388 sched_yield() = 0 <0.000127>
    16:32:17.796635 gettimeofday({1362065537, 796722}, NULL) = 0 <0.000121>
    16:32:17.796951 epoll_wait(4, {}, 32, 0) = 0 <0.000089>

What is the purpose of this behavior?

Thanks, David
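For reference, a minimal sketch of the kind of test described above: rank 0 exchanges a 4-byte message with every other rank. The file name, tag, and payload are assumptions, not taken from the post.

    /* pingpong.c: rank 0 sends 4 bytes to each rank and gets 4 back. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char buf[4] = "ping";   /* exactly 4 bytes, terminator dropped */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (i = 1; i < size; i++) {
                MPI_Send(buf, 4, MPI_CHAR, i, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 4, MPI_CHAR, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Recv(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

While a blocking receive is posted, Open MPI's progress engine polls for completion rather than sleeping in the kernel, which is consistent with the tight epoll_wait/gettimeofday/sched_yield loop in the trace.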
Re: [OMPI users] High cpu usage
Hi, I was wondering if there is any way to reduce the CPU time that Open MPI seems to spend in its busy-wait loop.

Thanks,
/David

On Thu, Feb 28, 2013 at 4:34 PM, Bokassa wrote:
> Hi, I notice that a simple MPI program in which rank 0 sends 4 bytes to
> each rank and receives a reply uses a considerable amount of CPU in
> system calls:
>
>     % time     seconds  usecs/call     calls    errors syscall
>     ------ ----------- ----------- --------- --------- ------------
>      61.10    0.016719           3      5194           gettimeofday
>      20.77    0.005683           2      2596           epoll_wait
>      18.13    0.004961           2      2595           sched_yield
>       0.00    0.000000           0         4           write
>       0.00    0.000000           0         4           stat
>       0.00    0.000000           0         2           readv
>       0.00    0.000000           0         2           writev
>     ------ ----------- ----------- --------- --------- ------------
>     100.00    0.027363                 10397           total
>
> and
>
>     Process 2512 attached - interrupt to quit
>     16:32:17.793039 sched_yield() = 0 <0.000078>
>     16:32:17.793276 gettimeofday({1362065537, 793330}, NULL) = 0 <0.000070>
>     16:32:17.793460 epoll_wait(4, {}, 32, 0) = 0 <0.000114>
>     16:32:17.793712 gettimeofday({1362065537, 793773}, NULL) = 0 <0.000097>
>     16:32:17.793914 sched_yield() = 0 <0.000089>
>     16:32:17.794107 gettimeofday({1362065537, 794157}, NULL) = 0 <0.000083>
>     16:32:17.794292 epoll_wait(4, {}, 32, 0) = 0 <0.000072>
>     16:32:17.794457 gettimeofday({1362065537, 794541}, NULL) = 0 <0.000115>
>     16:32:17.794695 sched_yield() = 0 <0.000079>
>     16:32:17.794877 gettimeofday({1362065537, 794927}, NULL) = 0 <0.000081>
>     16:32:17.795062 epoll_wait(4, {}, 32, 0) = 0 <0.000079>
>     16:32:17.795244 gettimeofday({1362065537, 795294}, NULL) = 0 <0.000082>
>     16:32:17.795432 sched_yield() = 0 <0.000096>
>     16:32:17.795761 gettimeofday({1362065537, 795814}, NULL) = 0 <0.000079>
>     16:32:17.795940 epoll_wait(4, {}, 32, 0) = 0 <0.000080>
>     16:32:17.796123 gettimeofday({1362065537, 796191}, NULL) = 0 <0.000121>
>     16:32:17.796388 sched_yield() = 0 <0.000127>
>     16:32:17.796635 gettimeofday({1362065537, 796722}, NULL) = 0 <0.000121>
>     16:32:17.796951 epoll_wait(4, {}, 32, 0) = 0 <0.000089>
>
> What is the purpose of this behavior?
>
> Thanks, David
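One knob that existed in the Open MPI 1.x series is the mpi_yield_when_idle MCA parameter, which makes the progress loop call sched_yield() between polls so other processes can run; it lowers the priority of the spinning rather than eliminating the polling itself. The sched_yield() calls in the trace above suggest it may already be in effect. The task count and program name below are illustrative.

    # Ask the progress engine to yield the CPU while idle (Open MPI 1.x):
    mpirun --mca mpi_yield_when_idle 1 -n 16 ./pingpong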