[OMPI users] openMP and mpi problem

2014-07-02 Thread Timur Ismagilov

Hello!
I have Open MPI 1.9a1r32104 and Open MPI 1.5.5.
I get much better performance with Open MPI 1.5.5 when running OpenMP on 8 cores
in this program:


#define N 1000
int main(int argc, char *argv[]) {
    ...
    MPI_Init(&argc, &argv);
    ...
    for (i = 0; i < N; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

#pragma omp parallel for shared(a, b, c) private(i)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    ...
    MPI_Finalize();
}
On 1 node I get the following results
(launched with: for i in 1 2 4 8 ; do export OMP_NUM_THREADS=$i; sbatch -p test -t 5
--exclusive -N 1 -o hybrid-hello_omp$i.out -e hybrid-hello_omp$i.err
ompi_mxm3.0 ./hybrid-hello; done):
*  Open MPI 1.5.5 (Data for node: node1-128-17 Num slots: 8 Max slots: 0):
*  8 threads  0.014527 sec
*  4 threads  0.016138 sec
*  2 threads  0.018764 sec
*  1 thread   0.029963 sec
*  Open MPI 1.9a1r32104 (node1-128-29: slots=8 max_slots=0 slots_inuse=0 state=UP):
*  8 threads  0.035055 sec
*  4 threads  0.029859 sec
*  2 threads  0.019564 sec (same as Open MPI 1.5.5)
*  1 thread   0.028394 sec (same as Open MPI 1.5.5)
So it looks like Open MPI 1.9 is using only 2 of the 8 cores.

What can I do about this?

$ cat ompi_mxm3.0
#!/bin/sh
[ x"$TMPDIR" == x"" ] && TMPDIR=/tmp
HOSTFILE=${TMPDIR}/hostfile.${SLURM_JOB_ID}
srun hostname -s | sort | uniq -c | awk '{print $2" slots="$1}' > $HOSTFILE || { rm -f $HOSTFILE; exit 255; }
LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so \
    mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --display-allocation \
    --hostfile $HOSTFILE "$@"
rc=$?
rm -f $HOSTFILE
exit $rc

For Open MPI 1.5.5 I remove LD_PRELOAD from the run script.

Re: [OMPI users] openMP and mpi problem

2014-07-02 Thread Ralph Castain
OMPI started binding by default during the 1.7 series. You should add the 
following to your cmd line:

--map-by :pe=$OMP_NUM_THREADS

This will give you a dedicated core for each thread. Alternatively, you could 
instead add

--bind-to socket

OMPI 1.5.5 doesn't bind at all unless directed to do so, which is why you are 
getting the difference in behavior.
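As a rough illustration (not verified on this cluster), those options would go on the
mpirun line inside the ompi_mxm3.0 wrapper shown above, along these lines, written here
with the explicit slot:pe spelling of the mapping option:

  # Sketch: one dedicated core per OpenMP thread for each MPI rank
  # (assumes OMP_NUM_THREADS is already set when the wrapper runs).
  mpirun --map-by slot:pe=$OMP_NUM_THREADS \
         -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off \
         --display-allocation --hostfile $HOSTFILE "$@"

  # Alternative sketch: bind each rank to a whole socket instead.
  mpirun --bind-to socket \
         -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off \
         --display-allocation --hostfile $HOSTFILE "$@"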





Re: [OMPI users] mpi prog fails (big data)

2014-07-02 Thread Ralph Castain
I would suggest having him look at the core file with a debugger and see where 
it fails. Sounds like he has a memory corruption problem.
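A generic sketch of that workflow (command names only; the exact core-file name and
debugger depend on the system and compiler):

  # Allow core files to be written; put this in the job script before mpirun:
  ulimit -c unlimited

  # After the crash, open the core file together with the executable (JeDi is
  # the binary name from the makefile below) and print a backtrace:
  gdb ./JeDi core
  (gdb) bt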


On Jun 24, 2014, at 3:31 AM, Dr.Peer-Joachim Koch  wrote:

> Hi,
> 
> one of our cluster users reported a problem with Open MPI.
> He created a short sample (just a few lines) which starts and
> crashes after a short time.
> We only see "Fatal error in PMPI_Gather: Other MPI error" - no further
> details.
> He is using the Intel Fortran compiler with a self-compiled Open MPI (just
> tested 1.8.1).
> 
> I know nearly nothing about MPI (Open MPI), so I'm asking on this forum.
> Does anybody have an idea?
> 
> Thanks, Peer
> 
> 
> 
> ---makefile--
> OPTIONS=-assume byterecl -fpp -allow nofpp_comments -free
> DEBUG=-g -d-lines -check -debug -debug-parameters -fpe0 -traceback
> 
> all:
>rm -f JeDi globe_mod.mod JeDi.out jedi_restart
>$(SOURCE) ; mpif90 $(OPTIONS) $(DEBUG) -o JeDi globe.f90
> 
> --
> 
> globe.f90-
>  program globe
>  use mpi
>  implicit none
> 
>  integer :: mpinfo  = 0
>  integer :: myworld = 0
>  integer :: mypid   = 0
>  integer :: npro= 1
> 
> ! * The comments give some conditions required to reproduce the problem.
> 
> ! * If the program runs on two hosts, the error message is shown twice
> 
>  integer, parameter :: vv_g_d1 = 2432
>  integer, parameter :: vv_p_d1 = vv_g_d1 / 16  ! requires 16 CPUs
> 
>  integer, parameter :: out_d1  = 2418  ! requires >=2416 (vv_g_d1 - 16)
> 
>  integer, parameter :: d2 = 5001  ! requires >=4282 @ ii=30 / >=6682 @ ii=20 (depends on
>                                   ! number of loops, but this limit can change for unknown reason)
> 
>  integer :: ii, jj
> 
>  real:: vv_p(vv_p_d1,d2)
>  real,allocatable :: vv_g(:,:)
> ! * requires the definition of the variable for write to be defined below vv_g(:,:)
>  real:: out(out_d1,d2)
> 
>  vv_p(:,:) = 0.0
>  out(:,:) = 0.0
> 
>  call mpi_init(mpinfo)
>  myworld = MPI_COMM_WORLD
>  call mpi_comm_size(myworld, npro, mpinfo)
> ! * The problem requires 16 CPUs
>  if (npro .ne. 16) then; write(*,*) "Works only with 16 CPUs"; stop; endif
>  call mpi_comm_rank(myworld, mypid, mpinfo)
> 
>  if (mypid == 0) then
>open(11, FILE='jedi_restart', STATUS='replace', FORM='unformatted')
>  endif
> 
>  write(6,*) "test1",mypid ; flush(6)
> 
>  do ii = 1, 25  ! number of loops depends on field size
>allocate(vv_g(vv_g_d1,d2))
> 
>do jj = 1, d2
>  call mpi_gather(vv_p(1,jj), vv_p_d1, MPI_REAL, vv_g(1,jj), vv_p_d1, &
>                  MPI_REAL, 0, myworld, mpinfo)
>enddo
> 
>if (mypid == 0) then; write(11) out; flush(11); endif
> 
>deallocate(vv_g)
>  enddo
> 
>  write(6,*) "test2",mypid ; flush(6)
> 
>  if (mypid == 0) close(11)
> 
>  call mpi_barrier(myworld, mpinfo)
>  call mpi_finalize(mpinfo)
> 
>  end
> -end globe.f90--
> 
> -- 
> Kind regards
>Peer-Joachim Koch
> _
> Max-Planck-Institut für Biogeochemie
> Dr. Peer-Joachim Koch
> Hans-Knöll Str. 10   Phone: +49 3641 57-6705
> D-07745 Jena         Fax:   +49 3641 57-7705
> 



Re: [OMPI users] openMP and mpi problem

2014-07-02 Thread Ralph Castain
Let's keep this on the user list so others with similar issues can find it.

My guess is that the $OMP_NUM_THREADS syntax wasn't quite right, so it didn't
pick up the actual value there. Since it doesn't hurt to have extra CPUs, just
set it to 8 for your test case and that should be fine. So, to add a little
clarity:

--map-by slot:pe=8

I'm not aware of any slurm utility similar to top, but there is no reason you 
can't just submit this as an interactive job and use top itself, is there?
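A rough Slurm sketch of that (partition and time options copied from the earlier sbatch
line; exact behavior depends on the site configuration):

  # Allocate the node interactively instead of via sbatch, get a shell on it,
  # start the job in the background, and watch the threads with top:
  salloc -p test -t 5 --exclusive -N 1
  srun --pty bash
  export OMP_NUM_THREADS=8
  ./ompi_mxm3.0 ./hybrid-hello &
  top -H        # -H shows individual threads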

As for that sbgp warning - you can probably just ignore it. Not sure why that 
is failing, but it just means that component will disqualify itself. If you 
want to eliminate it, just add

-mca sbgp ^ibnet

to your cmd line
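
Put together, a sketch of the mpirun line inside the wrapper script would then look
roughly like this (not verified here):

  # 8 cores per rank (matching OMP_NUM_THREADS=8), with the ibnet sbgp
  # component excluded to silence the register-function warning:
  mpirun --map-by slot:pe=8 -mca sbgp ^ibnet \
         -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off \
         --display-allocation --hostfile $HOSTFILE "$@"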


On Jul 2, 2014, at 7:29 AM, Timur Ismagilov  wrote:

> Thanks, Ralph!
> 
> With '--map-by :pe=$OMP_NUM_THREADS' I got:
> 
> --
> Your job failed to map. Either no mapper was available, or none
> of the available mappers was able to perform the requested
> mapping operation. This can happen if you request a map type
> (e.g., loadbalance) and the corresponding mapper was not built.
> 
> What does it mean?
> 
> With '--bind-to socket' everything looks better, but performance is still
> worse (though better than it was):
> 
> 1 thread  0.028 sec
> 2 threads 0.018 sec
> 4 threads 0.020 sec
> 8 threads 0.021 sec
> Is there a utility similar to 'top' that I can use with sbatch?
> 
> Also, every time with OMPI 1.9 I get the message:
> mca: base: components_register: component sbgp / ibnet register function failed
> Is it bad?
> 
> Regards, 
> Timur
> 