[OMPI users] MPI_Abort under slurm

2013-02-25 Thread Bokassa
Hi,
   I noticed that MPI_Abort() does not abort the tasks if the MPI program
is started using srun.
I call MPI_Abort() from rank 0; that process exits, but the other ranks keep
running or waiting for I/O on the other nodes. The only way to kill the job
is to use scancel.
However, if I use mpirun under a Slurm allocation, MPI_Abort() works as
expected and aborts all tasks.
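
A minimal reproducer along these lines shows the behavior (a sketch, not my
exact program):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 gives up immediately; the whole job should die */
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else {
        /* every other rank blocks on a message that never arrives */
        int buf;
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Under srun the non-zero ranks stay stuck in MPI_Recv(); under mpirun they are
all torn down.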

Is this a known issue?

Thanks, David


Re: [OMPI users] MPI_Abort under slurm

2013-02-26 Thread Bokassa
Hi Ralph, thanks for your answer. I am using:

>mpirun --version
mpirun (Open MPI) 1.5.4

Report bugs to http://www.open-mpi.org/community/help/

and Slurm 2.5.

Should I try to upgrade to 1.6.5?



/David Bigagli
www.davidbigagli.com




Re: [OMPI users] MPI_Abort under slurm

2013-02-28 Thread Bokassa
Thanks Ralph, you were right: I was not aware of --kill-on-bad-exit
and KillOnBadExit. Setting it to 1 shuts down the entire MPI job when
MPI_Abort() is called. I had assumed the MPI abort message was simply
transported by Slurm and that each task would then exit on its own.
Oh well, I should not guess at the implementation. :-)
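
For anyone who hits the same thing, the two places to set it (as I
understand the Slurm docs) are the srun command line and slurm.conf:

    # per job, on the srun command line
    srun --kill-on-bad-exit=1 -n 16 ./my_mpi_prog

    # or cluster-wide, in slurm.conf
    KillOnBadExit=1

With either of these, Slurm terminates the remaining tasks as soon as one
task exits with a non-zero status, which is what MPI_Abort() produces.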

Thanks again.

  David


[OMPI users] High cpu usage

2013-02-28 Thread Bokassa
Hi,
   I notice that a simple MPI program in which rank 0 sends 4 bytes to each
rank and receives a reply uses a considerable amount of CPU in system calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.10    0.016719           3      5194           gettimeofday
 20.77    0.005683           2      2596           epoll_wait
 18.13    0.004961           2      2595           sched_yield
  0.00    0.000000           0         4           write
  0.00    0.000000           0         4           stat
  0.00    0.000000           0         2           readv
  0.00    0.000000           0         2           writev
------ ----------- ----------- --------- --------- ----------------
100.00    0.027363                 10397           total

and

Process 2512 attached - interrupt to quit
16:32:17.793039 sched_yield()           = 0 <0.000078>
16:32:17.793276 gettimeofday({1362065537, 793330}, NULL) = 0 <0.000070>
16:32:17.793460 epoll_wait(4, {}, 32, 0) = 0 <0.000114>
16:32:17.793712 gettimeofday({1362065537, 793773}, NULL) = 0 <0.000097>
16:32:17.793914 sched_yield()           = 0 <0.000089>
16:32:17.794107 gettimeofday({1362065537, 794157}, NULL) = 0 <0.000083>
16:32:17.794292 epoll_wait(4, {}, 32, 0) = 0 <0.000072>
16:32:17.794457 gettimeofday({1362065537, 794541}, NULL) = 0 <0.000115>
16:32:17.794695 sched_yield()           = 0 <0.000079>
16:32:17.794877 gettimeofday({1362065537, 794927}, NULL) = 0 <0.000081>
16:32:17.795062 epoll_wait(4, {}, 32, 0) = 0 <0.000079>
16:32:17.795244 gettimeofday({1362065537, 795294}, NULL) = 0 <0.000082>
16:32:17.795432 sched_yield()           = 0 <0.000096>
16:32:17.795761 gettimeofday({1362065537, 795814}, NULL) = 0 <0.000079>
16:32:17.795940 epoll_wait(4, {}, 32, 0) = 0 <0.000080>
16:32:17.796123 gettimeofday({1362065537, 796191}, NULL) = 0 <0.000121>
16:32:17.796388 sched_yield()           = 0 <0.000127>
16:32:17.796635 gettimeofday({1362065537, 796722}, NULL) = 0 <0.000121>
16:32:17.796951 epoll_wait(4, {}, 32, 0) = 0 <0.000089>
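
For context, the program is essentially a round-trip ping from rank 0; a
minimal sketch of what it does (not the exact code) is:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    char buf[4] = {'p', 'i', 'n', 'g'};   /* the 4 bytes mentioned above */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* rank 0 sends 4 bytes to every other rank and waits for the echo */
        for (i = 1; i < size; i++) {
            MPI_Send(buf, 4, MPI_CHAR, i, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 4, MPI_CHAR, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        /* every other rank receives the 4 bytes and echoes them back */
        MPI_Recv(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}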

What is the purpose of this behavior?

Thanks,
David


Re: [OMPI users] High cpu usage

2013-03-05 Thread Bokassa
Hi,
 I was wondering if there is any way to reduce the CPU time that Open MPI
seems to spend in its busy-wait loop.
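
The only knob I have found so far, if I am reading the FAQ correctly, is the
yield-when-idle MCA parameter:

    # ask Open MPI to call sched_yield() when a rank has nothing to do,
    # so it gives up the CPU more readily while it keeps polling
    mpirun --mca mpi_yield_when_idle 1 -np 16 ./a.out

As far as I can tell this only makes the polling more polite; it does not
turn the wait into a blocking one.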
Thanks,

/David

