Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION

2011-05-10 Thread hi
Hi Jeff,

> You didn't answer my prior questions.  :-)
I am observing this crash when using MPI_ALLREDUCE() in a test program
which does not have any memory corruption issue. ;)

> I ran your test program with -np 2 and -np 4 and it seemed to work ok.
Can you please let me know what environment (including OS and compilers)
you are using?

I am able to reproduce the crash using the attached simplified test
program with a 5-element array.
Please note that I am doing these experiments on Windows 7 using the
msys/mingw console; see the attached makefile for more information.

When running this program as "C:\>mpirun mar_f_dp2.exe" it works fine,
but when running it as "C:\>mpirun -np 2 mar_f_dp2.exe" it generates the
following error on the console...

C:\>mpirun -np 2 mar_f_dp2.exe
   0
   0
   0
 size=   2 , rank=   0
 start --
   0
   0
   0
 size=   2 , rank=   1
 start --
forrtl: severe (157): Program Exception - access violation
Image              PC                Routine            Line        Source
[vibgyor:09168] [[28311,0],0]-[[28311,1],0] mca_oob_tcp_msg_recv:
readv failed: Unknown error (108)
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: vibgyor
PID:  512

This process may still be running and/or consuming resources.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 476 on
node vibgyor exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------



Another observation with the attached test program is that it crashes
in MPI_Finalize() when running "C:\>mpirun mar_f_dp2.exe" if we
un-comment the following lines (lines 27 and 35)...

write(*,*) "start --, rcvbuf=", rcvbuf
...
write(*,*) "end --, rcvbuf=", rcvbuf


Thank you in advance.
-Hiral


makefile
Description: Binary data


mar_f_dp2.f
Description: Binary data


[OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-10 Thread Salvatore Podda

Dear all,

	we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3,
with Intel composer XE 2011 (aka 12.0).
However, we found a threshold in the number of cores (depending on
the application: IMB, xhpl or user applications,
and on the number of required cores) above which the application
hangs (a sort of deadlock).
Building openmpi with 'gcc' and 'pgi' does not show the same
limits.
Are there any known incompatibilities of openmpi with this version of
the Intel compilers?


The characteristics of our computational infrastructure are:

Intel processors E7330, E5345, E5530 and E5620

CentOS 5.3, CentOS 5.5.

Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1

Regards

Salvatore Podda

ENEA UTICT-HPC
Department for Computer Science Development and ICT
Facilities Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45
PoBox 65
00044 Frascati (Rome)
Italy

Tel:  +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it


[OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1

2011-05-10 Thread francoise.r...@obs.ujf-grenoble.fr


Hi,

I compile a parallel program with OpenMPI 1.4.1 (compiled with Intel 
compilers 12 from the composerxe package). This program is linked to the MUMPS 
library 4.9.2, compiled with the same compilers and linked with Intel 
MKL.  The OS is Linux Debian.
There is no error in compiling or running the job, but the program freezes inside 
a call to the "zmumps" routine, when the slave processes call the MPI_COMM_DUP 
routine.


The program is executed on 2 nodes of 12 cores each (Westmere 
processors) with the following command:


mpirun -np 24 --machinefile $OAR_NODE_FILE  -mca plm_rsh_agent "oarsh"  
--mca btl self,openib -x LD_LIBRARY_PATH ./prog


We have 12 processes running on each node. We submit the job with the OAR 
batch scheduler (the $OAR_NODE_FILE variable and the "oarsh" command are 
specific to this scheduler and usually work well with openmpi).


Via gdb, on the slaves, we can see that they are blocked in MPI_COMM_DUP:

(gdb) where
#0  0x2b32c1533113 in poll () from /lib/libc.so.6
#1  0x00adf52c in poll_dispatch ()
#2  0x00adcea3 in opal_event_loop ()
#3  0x00ad69f9 in opal_progress ()
#4  0x00a34b4e in mca_pml_ob1_recv ()
#5  0x009b0768 in 
ompi_coll_tuned_allreduce_intra_recursivedoubling ()

#6  0x009ac829 in ompi_coll_tuned_allreduce_intra_dec_fixed ()
#7  0x0097e271 in ompi_comm_allreduce_intra ()
#8  0x0097dd06 in ompi_comm_nextcid ()
#9  0x0097be01 in ompi_comm_dup ()
#10 0x009a0785 in PMPI_Comm_dup ()
#11 0x0097931d in pmpi_comm_dup__ ()
#12 0x00644251 in zmumps (id=...) at zmumps_part1.F:144
#13 0x004c0d03 in sub_pbdirect_init (id=..., matrix_build=...) 
at sub_pbdirect_init.f90:44

#14 0x00628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048


The master waits further on:

(gdb) where
#0  0x2b9dc9f3e113 in poll () from /lib/libc.so.6
#1  0x00adf52c in poll_dispatch ()
#2  0x00adcea3 in opal_event_loop ()
#3  0x00ad69f9 in opal_progress ()
#4  0x0098f294 in ompi_request_default_wait_all ()
#5  0x00a06e56 in ompi_coll_tuned_sendrecv_actual ()
#6  0x009ab8e3 in ompi_coll_tuned_barrier_intra_bruck ()
#7  0x009ac926 in ompi_coll_tuned_barrier_intra_dec_fixed ()
#8  0x009a0b20 in PMPI_Barrier ()
#9  0x00978c93 in pmpi_barrier__ ()
#10 0x004c0dc4 in sub_pbdirect_init (id=..., matrix_build=...) 
at sub_pbdirect_init.f90:62

#11 0x00628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048


Remark:
The same code compiled and ran well with the Intel MPI library, from the 
same Intel package, on the same nodes.


Thanks for any help

Françoise Roch



Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION

2011-05-10 Thread Jeff Squyres
On May 10, 2011, at 2:30 AM, hi wrote:

>> You didn't answer my prior questions.  :-)
> I am observing this crash when using MPI_ALLREDUCE() in a test program
> which does not have any memory corruption issue. ;)

Can you send the info listed on the help page?

>> I ran your test program with -np 2 and -np 4 and it seemed to work ok.
> Can you please let me know what environment (including OS and compilers)
> you are using?

RHEL 5.4, gcc 4.5.

This could be a Windows-specific thing, but I would find that unlikely (but 
heck, I don't know much about Windows...).

> I am able to reproduce the crash using the attached simplified test
> program with a 5-element array.
> Please note that I am doing these experiments on Windows 7 using the
> msys/mingw console; see the attached makefile for more information.
> 
> When running this program as "C:\>mpirun mar_f_dp2.exe" it works fine,
> but when running it as "C:\>mpirun -np 2 mar_f_dp2.exe" it generates the
> following error on the console...
> 
> C:\>mpirun -np 2 mar_f_dp2.exe
>   0
>   0
>   0
> size=   2 , rank=   0
> start --
>   0
>   0
>   0
> size=   2 , rank=   1
> start --
> forrtl: severe (157): Program Exception - access violation
> Image              PC                Routine            Line        Source
> [vibgyor:09168] [[28311,0],0]-[[28311,1],0] mca_oob_tcp_msg_recv:
> readv failed: Unknown error (108)

You forgot ierr in the call to MPI_Finalize.  You also paired DOUBLE_PRECISION 
data with MPI_INTEGER in the call to allreduce.  And you mixed sndbuf and 
rcvbuf in the call to allreduce, meaning that when you print rcvbuf 
afterwards, it'll always still be 0.

I pared your sample program down to the following:

program Test_MPI
use mpi
implicit none

DOUBLE PRECISION rcvbuf(5), sndbuf(5)
INTEGER nproc, rank, ierr, n, i, ret

n = 5
do i = 1, n
sndbuf(i) = 2.0
rcvbuf(i) = 0.0
end do

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
write(*,*) "size=", nproc, ", rank=", rank
write(*,*) "start --, rcvbuf=", rcvbuf
CALL MPI_ALLREDUCE(sndbuf, rcvbuf, n,
 &  MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
write(*,*) "end --, rcvbuf=", rcvbuf

CALL MPI_Finalize(ierr)
end

(you could use "include 'mpif.h'", too -- I tried both)

This program works fine for me.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Issue with Open MPI 1.5.3 Windows binary builds

2011-05-10 Thread Tyler W. Wilson

Good day,

I am new to the Open MPI package, and so am starting at the beginning. I 
have little if any desire to build the binaries, so I was glad to see a 
Windows binary release.


I started with what I think is the minimum program:

#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Finalize();

    return 0;
}

But, when I build and run this (with MS Visual C++ 2010 Express, running 
on Windows 7 x64), I get this error:


[Tyler-Quad:06832] [[2206,0],0] ORTE_ERROR_LOG: Value out of bounds in 
file ..\..\..\openmpi-1.5.3\orte\mca\oob\tcp\oob_tcp.c at line 1193

And it hangs there.

As I mentioned, I am new to this project. Perhaps there is some simple 
configuration I failed to do after the install. Any clues welcome.


Thank you,
Tyler


Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1

2011-05-10 Thread Tim Prince

On 5/10/2011 6:43 AM, francoise.r...@obs.ujf-grenoble.fr wrote:


Hi,

I compile a parallel program with OpenMPI 1.4.1 (compiled with Intel
compilers 12 from the composerxe package). This program is linked to the MUMPS
library 4.9.2, compiled with the same compilers and linked with Intel MKL.
The OS is Linux Debian.
There is no error in compiling or running the job, but the program freezes inside
a call to the "zmumps" routine, when the slave processes call the MPI_COMM_DUP
routine.

The program is executed on 2 nodes of 12 cores each (Westmere
processors) with the following command:

mpirun -np 24 --machinefile $OAR_NODE_FILE -mca plm_rsh_agent "oarsh"
--mca btl self,openib -x LD_LIBRARY_PATH ./prog

We have 12 processes running on each node. We submit the job with the OAR
batch scheduler (the $OAR_NODE_FILE variable and the "oarsh" command are
specific to this scheduler and usually work well with openmpi).

Via gdb, on the slaves, we can see that they are blocked in MPI_COMM_DUP:

(gdb) where
#0 0x2b32c1533113 in poll () from /lib/libc.so.6
#1 0x00adf52c in poll_dispatch ()
#2 0x00adcea3 in opal_event_loop ()
#3 0x00ad69f9 in opal_progress ()
#4 0x00a34b4e in mca_pml_ob1_recv ()
#5 0x009b0768 in
ompi_coll_tuned_allreduce_intra_recursivedoubling ()
#6 0x009ac829 in ompi_coll_tuned_allreduce_intra_dec_fixed ()
#7 0x0097e271 in ompi_comm_allreduce_intra ()
#8 0x0097dd06 in ompi_comm_nextcid ()
#9 0x0097be01 in ompi_comm_dup ()
#10 0x009a0785 in PMPI_Comm_dup ()
#11 0x0097931d in pmpi_comm_dup__ ()
#12 0x00644251 in zmumps (id=...) at zmumps_part1.F:144
#13 0x004c0d03 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:44
#14 0x00628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048


The master waits further on:

(gdb) where
#0 0x2b9dc9f3e113 in poll () from /lib/libc.so.6
#1 0x00adf52c in poll_dispatch ()
#2 0x00adcea3 in opal_event_loop ()
#3 0x00ad69f9 in opal_progress ()
#4 0x0098f294 in ompi_request_default_wait_all ()
#5 0x00a06e56 in ompi_coll_tuned_sendrecv_actual ()
#6 0x009ab8e3 in ompi_coll_tuned_barrier_intra_bruck ()
#7 0x009ac926 in ompi_coll_tuned_barrier_intra_dec_fixed ()
#8 0x009a0b20 in PMPI_Barrier ()
#9 0x00978c93 in pmpi_barrier__ ()
#10 0x004c0dc4 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:62
#11 0x00628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048


Remark:
The same code compiled and ran well with the Intel MPI library, from the
same Intel package, on the same nodes.

Did you try compiling with equivalent options in each compiler?  For 
example, (supposing you had gcc 4.6)

gcc -O3 -funroll-loops --param max-unroll-times=2 -march=corei7
would be equivalent (as closely as I know) to
icc -fp-model source -msse4.2 -ansi-alias

As you should be aware, default settings in icc are more closely 
equivalent to
gcc -O3 -ffast-math -fno-cx-limited-range -funroll-loops --param 
max-unroll-times=2 -fno-strict-aliasing


The options I suggest as an upper limit are probably more aggressive 
than most people have used successfully with OpenMPI.


As to run-time MPI options, Intel MPI has affinity with Westmere 
awareness turned on by default.  I suppose testing without affinity 
settings, particularly when banging against all hyperthreads, is a more 
severe test of your application.   Don't you get better results at 1 
rank per core?

--
Tim Prince


Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1

2011-05-10 Thread George Bosilca

On May 10, 2011, at 08:10 , Tim Prince wrote:

> On 5/10/2011 6:43 AM, francoise.r...@obs.ujf-grenoble.fr wrote:
>> 
>> Hi,
>> 
>> I compile a parallel program with OpenMPI 1.4.1 (compiled with Intel
>> compilers 12 from the composerxe package). This program is linked to the MUMPS
>> library 4.9.2, compiled with the same compilers and linked with Intel MKL.
>> The OS is Linux Debian.
>> There is no error in compiling or running the job, but the program freezes inside
>> a call to the "zmumps" routine, when the slave processes call the MPI_COMM_DUP
>> routine.
>> 
>> The program is executed on 2 nodes of 12 cores each (Westmere
>> processors) with the following command:
>> 
>> mpirun -np 24 --machinefile $OAR_NODE_FILE -mca plm_rsh_agent "oarsh"
>> --mca btl self,openib -x LD_LIBRARY_PATH ./prog
>> 
>> We have 12 processes running on each node. We submit the job with the OAR
>> batch scheduler (the $OAR_NODE_FILE variable and the "oarsh" command are
>> specific to this scheduler and usually work well with openmpi).
>> 
>> Via gdb, on the slaves, we can see that they are blocked in MPI_COMM_DUP:

Francoise,

Based on your traces the workers and the master are not doing the same MPI 
call. The workers are blocked in an MPI_Comm_dup in sub_pbdirect_init.f90:44, 
while the master is blocked in an MPI_Barrier in sub_pbdirect_init.f90:62. Can 
you verify that the slaves and the master are calling the MPI_Barrier and the 
MPI_Comm_dup in the same logical order?

  george.


>> 
>> (gdb) where
>> #0 0x2b32c1533113 in poll () from /lib/libc.so.6
>> #1 0x00adf52c in poll_dispatch ()
>> #2 0x00adcea3 in opal_event_loop ()
>> #3 0x00ad69f9 in opal_progress ()
>> #4 0x00a34b4e in mca_pml_ob1_recv ()
>> #5 0x009b0768 in
>> ompi_coll_tuned_allreduce_intra_recursivedoubling ()
>> #6 0x009ac829 in ompi_coll_tuned_allreduce_intra_dec_fixed ()
>> #7 0x0097e271 in ompi_comm_allreduce_intra ()
>> #8 0x0097dd06 in ompi_comm_nextcid ()
>> #9 0x0097be01 in ompi_comm_dup ()
>> #10 0x009a0785 in PMPI_Comm_dup ()
>> #11 0x0097931d in pmpi_comm_dup__ ()
>> #12 0x00644251 in zmumps (id=...) at zmumps_part1.F:144
>> #13 0x004c0d03 in sub_pbdirect_init (id=..., matrix_build=...)
>> at sub_pbdirect_init.f90:44
>> #14 0x00628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048
>> 
>> 
>> The master waits further on:
>> 
>> (gdb) where
>> #0 0x2b9dc9f3e113 in poll () from /lib/libc.so.6
>> #1 0x00adf52c in poll_dispatch ()
>> #2 0x00adcea3 in opal_event_loop ()
>> #3 0x00ad69f9 in opal_progress ()
>> #4 0x0098f294 in ompi_request_default_wait_all ()
>> #5 0x00a06e56 in ompi_coll_tuned_sendrecv_actual ()
>> #6 0x009ab8e3 in ompi_coll_tuned_barrier_intra_bruck ()
>> #7 0x009ac926 in ompi_coll_tuned_barrier_intra_dec_fixed ()
>> #8 0x009a0b20 in PMPI_Barrier ()
>> #9 0x00978c93 in pmpi_barrier__ ()
>> #10 0x004c0dc4 in sub_pbdirect_init (id=..., matrix_build=...)
>> at sub_pbdirect_init.f90:62
>> #11 0x00628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048
>> 
>> 
>> Remark:
>> The same code compiled and ran well with the Intel MPI library, from the
>> same Intel package, on the same nodes.
>> 
> Did you try compiling with equivalent options in each compiler?  For example, 
> (supposing you had gcc 4.6)
> gcc -O3 -funroll-loops --param max-unroll-times=2 -march=corei7
> would be equivalent (as closely as I know) to
> icc -fp-model source -msse4.2 -ansi-alias
> 
> As you should be aware, default settings in icc are more closely equivalent to
> gcc -O3 -ffast-math -fno-cx-limited-range -funroll-loops --param 
> max-unroll-times=2 -fno-strict-aliasing
> 
> The options I suggest as an upper limit are probably more aggressive than 
> most people have used successfully with OpenMPI.
> 
> As to run-time MPI options, Intel MPI has affinity with Westmere awareness 
> turned on by default.  I suppose testing without affinity settings, 
> particularly when banging against all hyperthreads, is a more severe test of 
> your application.   Don't you get better results at 1 rank per core?
> -- 
> Tim Prince
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

"To preserve the freedom of the human mind then and freedom of the press, every 
spirit should be ready to devote itself to martyrdom; for as long as we may 
think as we will, and speak as we think, the condition of man will proceed in 
improvement."
  -- Thomas Jefferson, 1799




[OMPI users] Trouble with MPI-IO

2011-05-10 Thread Tom Rosmond
I would appreciate it if someone with experience with MPI-IO would look
at the simple Fortran program gzipped and attached to this note.  It is
embedded in a script, so all that is necessary to run it is to do
'testio' from the command line.  The program generates a small 2-D input
array, sets up an MPI-IO environment, and writes a 2-D output array
twice, with the only difference being the displacement arrays used to
construct the indexed datatype.  For the first write, simple
monotonically increasing displacements are used; for the second, the
displacements are 'shuffled' in one dimension.  They are printed during
the run.

For the first case the file is written properly, but for the second the
program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
Although the program is compiled as an MPI program, I am running on a
single processor, which makes the problem more puzzling.
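
For reference, the write pattern described above is roughly the following (a minimal 1-D sketch, not the attached program; the file name, array size and 'shuffled' displacements are made up for illustration):

program indexed_write
  use mpi
  implicit none
  integer, parameter :: n = 4
  integer :: ierr, fh, memtype
  integer :: blocklens(n), displs(n)
  integer(kind=MPI_OFFSET_KIND) :: offset
  double precision :: buf(n)

  call MPI_INIT(ierr)

  blocklens = 1
  buf = (/ 10.d0, 20.d0, 30.d0, 40.d0 /)
  ! 'shuffled' element displacements into buf: write buf(3), buf(1), buf(4), buf(2)
  displs = (/ 2, 0, 3, 1 /)

  ! indexed datatype describing the (shuffled) memory layout
  call MPI_TYPE_INDEXED(n, blocklens, displs, MPI_DOUBLE_PRECISION, memtype, ierr)
  call MPI_TYPE_COMMIT(memtype, ierr)

  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'testio.dat', &
                     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
  offset = 0
  ! collective write of one 'memtype' item starting at file offset 0
  call MPI_FILE_WRITE_AT_ALL(fh, offset, buf, 1, memtype, MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_CLOSE(fh, ierr)

  call MPI_TYPE_FREE(memtype, ierr)
  call MPI_FINALIZE(ierr)
end program indexed_write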

The program should be relatively self-explanatory, but if more
information is needed, please ask.  I am on an 8 core Xeon based Dell
workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
OpenMPI 1.5.3.  I have also attached output from 'ompi_info'.

T. Rosmond




testio.gz
Description: GNU Zip compressed data


info_ompi.gz
Description: GNU Zip compressed data


Re: [OMPI users] is there an equiv of iprove for bcast?

2011-05-10 Thread Randolph Pullen
Thanks,
The messages are small and frequent (they flash metadata across the cluster).  
The current approach works fine for small to medium clusters but I want it to 
be able to go big.  Maybe up to several hundred or even thousands of nodes.

It's these larger deployments that concern me.  The current scheme may see the 
clearinghouse become overloaded in a very large cluster.
From what you have said, a possible strategy may be to combine the listener 
and worker into a single process, using the non-blocking bcast just for that 
group, while each worker scans its own port for an incoming request, which 
it would in turn bcast to its peers.
As you have indicated though, this would depend on the load the non-blocking 
bcast would cause.  At least the load would be fairly even over the cluster.

--- On Mon, 9/5/11, Jeff Squyres  wrote:

From: Jeff Squyres 
Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
To: randolph_pul...@yahoo.com.au
Cc: "Open MPI Users" 
Received: Monday, 9 May, 2011, 11:27 PM

On May 3, 2011, at 8:20 PM, Randolph Pullen wrote:

> Sorry, I meant to say:
> - on each node there is 1 listener and 1 worker.
> - all workers act together when any of the listeners send them a request.
> - currently I must use an extra clearinghouse process to receive from any of 
> the listeners and bcast to workers, this is unfortunate because of the 
> potential scaling issues
> 
> I think you have answered this in that I must wait for MPI-3's non-blocking 
> collectives.

Yes and no.  If each worker starts N non-blocking broadcasts just to be able to 
test for completion of any of them, you might end up consuming a bunch of 
resources for them (I'm *anticipating* that pending non-blocking collective 
requests may be more heavyweight than pending non-blocking point-to-point 
requests).

But then again, if N is small, it might not matter.

> Can anyone suggest another way?  I don't like the serial clearinghouse 
> approach.

If you only have a few workers and/or the broadcast message is small and/or the 
broadcasts aren't frequent, then MPI's built-in broadcast algorithms might not 
offer much more optimization than doing your own with point-to-point 
mechanisms.  I don't usually recommend this, but it may be possible for your 
case.
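
For concreteness, a minimal sketch of that point-to-point approach (untested; the tag and the single-integer payload are made up), which also gives the probe-able behavior the subject line asks about:

subroutine poor_mans_bcast(root, comm, payload, arrived)
  ! root posts one non-blocking send per peer; each worker polls with
  ! MPI_IPROBE and only receives when something has actually arrived
  use mpi
  implicit none
  integer, intent(in)    :: root, comm
  integer, intent(inout) :: payload
  logical, intent(out)   :: arrived
  integer, parameter :: tag = 42
  integer :: rank, nproc, i, nreq, ierr
  integer :: status(MPI_STATUS_SIZE)
  integer, allocatable :: reqs(:)

  call MPI_COMM_RANK(comm, rank, ierr)
  call MPI_COMM_SIZE(comm, nproc, ierr)

  if (rank == root) then
     allocate(reqs(nproc))
     nreq = 0
     do i = 0, nproc - 1
        if (i /= root) then
           nreq = nreq + 1
           call MPI_ISEND(payload, 1, MPI_INTEGER, i, tag, comm, reqs(nreq), ierr)
        end if
     end do
     call MPI_WAITALL(nreq, reqs, MPI_STATUSES_IGNORE, ierr)
     deallocate(reqs)
     arrived = .true.
  else
     ! non-blocking check; the worker can keep doing other work if nothing is there
     call MPI_IPROBE(root, tag, comm, arrived, status, ierr)
     if (arrived) then
        call MPI_RECV(payload, 1, MPI_INTEGER, root, tag, comm, status, ierr)
     end if
  end if
end subroutine poor_mans_bcast

Each worker could call this from its main loop and only pay for a receive when something has actually arrived.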

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/