[OMPI users] has anybody used the Intel Thread Checker w/OpenMPI?

2007-03-23 Thread Curtis Janssen
I'm interested in getting OpenMPI working with a multi-threaded
application (MPI_THREAD_MULTIPLE is required).  I'm trying the trunk
from a couple weeks ago (1.3a1r14001) compiled for multi-threading and
threaded progress, and have had success with some small cases.  Larger
cases with the same algorithms fail (they work with MPICH2 1.0.5/TCP and
other thread-safe MPIs, so I don't think it is an application bug).  I
don't mind doing a little work to track down the problem, so I'm trying
to use the Intel Thread Checker.  I have the thread checker working with
my application when using Intel's MPI, but with OpenMPI it hangs.
OpenMPI is compiled for OFED 1.1, but I'm overriding communications with
"-gmca btl self,tcp" in the hope that OpenMPI won't do anything funky
that would cause the thread checker problems (like RMDA or writes from
other processes into shared memory segments).  Has anybody used the
Intel Thread Checker with OpenMPI successfully?

Thanks,
Curt
-- 
Curtis Janssen, clja...@sandia.gov, +1 925-294-1509
Sandia National Labs, MS 9158, PO Box 969, Livermore, CA 94551, USA
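
For readers following the thread: requesting full thread support and
checking what the library actually grants looks roughly like this (a
minimal C sketch, not Curt's application):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for MPI_THREAD_MULTIPLE and check what the library grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "only got thread support level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... multi-threaded MPI work ... */
    MPI_Finalize();
    return 0;
}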



[OMPI users] error in MPI_Waitall

2007-03-23 Thread Jeffrey Stephen
Hi,
 
I am trying to run an MPICH2 application over 2 processors on a dual
processor x64 Linux box (SuSE 10). I am getting the following error
message:
 
--
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..: MPI_Waitall(count=2,
req_array=0x5bbda70, status_array=0x7fff461d9ce0) failed
MPIDI_CH3_Progress_wait(212)..: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 2  Demeter_18432   caused collective abort of all ranks
  exit status of rank 0: killed by signal 11
--
 
The "cpi" example that comes with MPICH2 executes correctly. I am using
MPICH2-1.0.5p2 which I compiled from source. 
 
Does anyone know what the problem is?
 
cheers
steve
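
For reference, the call pattern behind an MPI_Waitall(count=2, ...) error
stack like the one above is typically a pair of nonblocking operations; a
generic sketch follows (not Steve's code, and it assumes an even number of
ranks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, peer, sendbuf, recvbuf = 0;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = rank ^ 1;          /* pair up ranks 0<->1, 2<->3, ... */
    sendbuf = rank;

    MPI_Irecv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    /* Both request handles and the status array must still be valid
       here; a corrupted req_array/status_array is one common cause of
       the "killed by signal 11" seen above.                          */
    MPI_Waitall(2, reqs, stats);

    printf("rank %d received %d\n", rank, recvbuf);
    MPI_Finalize();
    return 0;
}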






Re: [OMPI users] has anybody used the Intel Thread Checker w/OpenMPI?

2007-03-23 Thread Rainer Keller
Hello Curtis,
yes, done with ompi-trunk:
Apart from --enable-mpi-threads --enable-progress-threads, you need to compile
Open MPI with --enable-mca-no-build=memory-ptmalloc2; and of course the
usual options for debugging (--enable-debug) and the options for 
icc/ifort/icpc:
CFLAGS='-debug all -inline-debug-info -tcheck'
CXXFLAGS='-debug all -inline-debug-info -tcheck'
FFLAGS='-debug all -tcheck'
LDFLAGS='-tcheck'

Then, as you already noted, run the application with --mca btl tcp,sm,self:
mpirun --mca btl tcp,sm,self -np 2 \
 tcheck_cl\
   --reinstrument \
   -u all \
   -c \
   -d '/tmp/hpcraink_$$__tc_cl_cache' \
   -f html\
   -o 'tc_mpi_test_suite_$$.html' \
   -p 'file=tc_mpi_test_suite_%H_%I,  \
   pad=128,   \
   delay=2,   \
   stall=2'   \
   -- \
  ./mpi_test_suite -j 2 -r FULL -t 'Ring Ibsend' -d MPI_INT

-- the --reinstrument option is not really necessary, nor is setting the
padding and delay for thread startup; shortening the delay for stalls to
2 seconds also does not trigger any deadlocks.

This was with icc-9.1 and itt-3.0 23205.
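
Putting Rainer's options together, a full configure invocation would look
something like the following (the flags are the ones he lists; the compiler
variables and install prefix are illustrative assumptions):

./configure CC=icc CXX=icpc F77=ifort FC=ifort \
    --prefix=/opt/openmpi-tcheck \
    --enable-mpi-threads --enable-progress-threads \
    --enable-debug --enable-mca-no-build=memory-ptmalloc2 \
    CFLAGS='-debug all -inline-debug-info -tcheck' \
    CXXFLAGS='-debug all -inline-debug-info -tcheck' \
    FFLAGS='-debug all -tcheck' LDFLAGS='-tcheck'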

Hope this helps,
Rainer

On Friday 23 March 2007 05:22, Curtis Janssen wrote:
> I'm interested in getting OpenMPI working with a multi-threaded
> application (MPI_THREAD_MULTIPLE is required).  I'm trying the trunk
> from a couple weeks ago (1.3a1r14001) compiled for multi-threading and
> threaded progress, and have had success with some small cases.  Larger
> cases with the same algorithms fail (they work with MPICH2 1.0.5/TCP and
> other thread-safe MPIs, so I don't think it is an application bug).  I
> don't mind doing a little work to track down the problem, so I'm trying
> to use the Intel Thread Checker.  I have the thread checker working with
> my application when using Intel's MPI, but with OpenMPI it hangs.
> OpenMPI is compiled for OFED 1.1, but I'm overriding communications with
> "-gmca btl self,tcp" in the hope that OpenMPI won't do anything funky
> that would cause the thread checker problems (like RMDA or writes from
> other processes into shared memory segments).  Has anybody used the
> Intel Thread Checker with OpenMPI successfully?
>
> Thanks,
> Curt

-- 

Dipl.-Inf. Rainer Keller   http://www.hlrs.de/people/keller
 High Performance Computing   Tel: ++49 (0)711-685 6 5858
   Center Stuttgart (HLRS)   Fax: ++49 (0)711-685 6 5832
 POSTAL:Nobelstrasse 19 email: kel...@hlrs.de 
 ACTUAL:Allmandring 30, R.O.030AIM:rusraink
 70550 Stuttgart


Re: [OMPI users] error in MPI_Waitall

2007-03-23 Thread Tim Prins

Steve,

This list is for supporting Open MPI, not MPICH2 (MPICH2 is an  
entirely different software package).  You should probably redirect  
your question to their support lists.


Thanks,

Tim

On Mar 23, 2007, at 12:46 AM, Jeffrey Stephen wrote:


Hi,

I am trying to run an MPICH2 application over 2 processors on a  
dual processor x64 Linux box (SuSE 10). I am getting the following  
error message:


--
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..: MPI_Waitall(count=2,  
req_array=0x5bbda70, status_array=0x7fff461d9ce0) failed
MPIDI_CH3_Progress_wait(212)..: an error occurred while  
handling an event returned by MPIDU_Sock_Wait()

MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..: connection failure  
(set=0,sock=1,errno=104:Connection reset by peer)

rank 0 in job 2  Demeter_18432   caused collective abort of all ranks
  exit status of rank 0: killed by signal 11
--

The "cpi" example that comes with MPICH2 executes correctly. I am  
using MPICH2-1.0.5p2 which I compiled from source.


Does anyone know what the problem is?

cheers
steve








[OMPI users] Problems compiling openmpi 1.2 under AIX 5.2

2007-03-23 Thread Ricardo Fonseca

Hi guys

I'm having problems compiling openmpi 1.2 under AIX 5.2. Here are the  
configure parameters:


./configure  --disable-shared --enable-static \
   CC=xlc CXX=xlc++ F77=xlf FC=xlf95

To get it to work I have to do 2 changes:

diff -r openmpi-1.2/ompi/mpi/cxx/mpicxx.cc openmpi-1.2-aix/ompi/mpi/ 
cxx/mpicxx.cc

34a35,38
> #undef SEEK_SET
> #undef SEEK_CUR
> #undef SEEK_END
>

diff -r openmpi-1.2/orte/mca/pls/poe/pls_poe_module.c openmpi-1.2-aix/ 
orte/mca/pls/poe/pls_poe_module.c

636a637,641
> static int pls_poe_cancel_operation(void)
> {
> return ORTE_ERR_NOT_IMPLEMENTED;
> }

This last change means that when you run Open MPI jobs through POE you
get:


[r1blade003:381130] [0,0,0] ORTE_ERROR_LOG: Not implemented in file  
errmgr_hnp.c at line 90
 
--
mpirun was unable to cleanly terminate the daemons for this job.  
Returned value Not implemented instead of ORTE_SUCCESS.


at the job end.
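
For context on the first change: this looks like the well-known clash
between the stdio SEEK_* macros and the SEEK_* constants declared by the
MPI C++ bindings, rather than anything AIX-specific. Schematically (an
illustrative sketch only):

/* Illustration of the conflict the SEEK_* undefs work around: */
#include <stdio.h>   /* defines SEEK_SET, SEEK_CUR, SEEK_END as macros   */
#include <mpi.h>     /* the MPI C++ bindings declare MPI::SEEK_SET etc.;
                        the stdio macros mangle those declarations unless
                        they are #undef'ed first, as the patch does       */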

Keep up the good work, cheers,

Ricardo


---
Prof. Ricardo Fonseca

GoLP - Grupo de Lasers e Plasmas
Centro de Física dos Plasmas
Instituto Superior Técnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal

tel: +351 21 8419202
fax: +351 21 8464455
web: http://cfp.ist.utl.pt/golp/



Re: [OMPI users] segfault with netpipe & ompi 1.2 + MX (32bit only)

2007-03-23 Thread Nicolas Niclausse
Nicolas Niclausse wrote on 21.03.2007 at 16:45:

> I'm trying to use netpipe with openmpi on my system (rhel 3, dual opteron,
> myrinet 2G with MX drivers).
> 
> Everything is fine when i use a 64bit binary, but it segfaults when i use a
> 32 bit binary :

I rebuilt everything with PGI 6.2 instead of 6.0 and everything is working
as expected now.

-- 
Nicolas NICLAUSSE  Service DREAM
INRIA Sophia Antipolis http://www-sop.inria.fr/


Re: [OMPI users] quadrics

2007-03-23 Thread Ashley Pittman

I can volunteer myself as a beta-tester if that's OK.  If there is
anything specific you want help with either drop me a mail directly or
mail supp...@quadrics.com

We are not aware of any other current project of this nature.

Ashley,

On Mon, 2007-03-19 at 18:48 -0400, George Bosilca wrote:
> UTK is working on Quadrics support. Right now, we have an embryo of  
> Quadrics support. The work is still in progress. I can let you know  
> as soon as we have something that passes most of our tests, and we are  
> confident enough to give it to beta-testers.
> 
>Thanks,
>  george.
> 
> On Mar 18, 2007, at 11:07 PM, Robin Humble wrote:
> 
> >
> > does OpenMPI support Quadrics elan3/4 interconnects?
> >
> > I saw a few hits on google suggesting that support was partial or  
> > maybe
> > planned, but couldn't find much in the openmpi sources to suggest any
> > support at all.
> >
> > cheers,
> > robin



Re: [OMPI users] Cell EIB support for OpenMPI

2007-03-23 Thread Marcus G. Daniels

Marcus G. Daniels wrote:

Mike Houston wrote:
  
The main issue with this, as addressed at the end
of the report, is that the code size is going to be a problem, as data
and code must live in the same 256KB in each SPE.  They mention dynamic 
overlay loading, which is also how we deal with large code size, but 
things get tricky and slow with the potentially needed save and restore 
of registers and LS. 



I did some checking on this.   Apparently the trunk of GCC and the
latest GNU Binutils handle overlays.   Because the SPU compiler knows
its limited address space, the ELF object code sections reflect this, and
the linker can transparently generate stubs to trigger the overlay
loading.   GCC also has options like -ffunction-sections that enable the
linker to optimize for locality.

So even though the OpenMPI shared libraries in total appear to have a 
footprint about four times too big for code alone (don't know about the 
typical stack & heap requirements), perhaps it's still doable without a 
big effort to strip down OpenMPI?
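
A hedged illustration of the locality options mentioned above (hypothetical
flags and file names, not a tested SPU build recipe):

# Put each function in its own ELF section so the linker can group hot
# code together (or into an overlay) and drop unreferenced sections.
spu-gcc -O2 -ffunction-sections -fdata-sections -c mpi_kernel.c
spu-gcc -o mpi_kernel mpi_kernel.o -Wl,--gc-sections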




[OMPI users] Failure to launch on a remote node. SSH problem?

2007-03-23 Thread Walker, David T.
I am presently trying to get OpenMPI up and running on a small cluster
of MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using
the Intel Fortran Compiler (9.1) and gcc.  When I try to launch a job on
a remote node, orted starts on the remote node but then times out.  I am
guessing that the problem is SSH related.  Any thoughts?

Thanks,

Dave

Details:  

I am using SSH, set up as outlined in the FAQ, using ssh-agent to allow
passwordless logins.  The paths for all the libraries appear to be OK.  

A simple MPI code (Hello_World_Fortran) launched on node01 will run OK
for up to four processors (all on node01).  The output is shown here.

node01 1247% mpirun --debug-daemons -hostfile machinefile -np 4
Hello_World_Fortran
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
Fortran version of Hello World, rank2
Rank 0 is present in Fortran version of Hello World.
Fortran version of Hello World, rank3
Fortran version of Hello World, rank1

For five processors mpirun tries to start an additional process on
node03.  Everything launches the same on node01 (four instances of
Hello_World_Fortran are launched).  On node03, orted starts, but times
out after 10 seconds and the output below is generated.   

node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5
Hello_World_Fortran
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
[node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send()
failed with errno=57
[node01.local:21427] ERROR: A daemon on node node03 failed to start as
expected.
[node01.local:21427] ERROR: There may be more information available from
[node01.local:21427] ERROR: the remote shell (see above).
[node01.local:21427] ERROR: The daemon exited unexpectedly with status
255.
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
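
Since the working guess is that this is SSH/environment related, a quick
sanity check is whether a non-interactive remote shell can find the Open
MPI installation at all (the node name is from the output above; the
commands are a generic sketch, not a guaranteed diagnosis):

ssh node03 hostname
ssh node03 which orted
ssh node03 'echo $PATH ; echo $LD_LIBRARY_PATH'

A PATH or LD_LIBRARY_PATH that looks right in an interactive login but not
in the output of these commands is a common reason for orted to fail or
hang on the remote side.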

Here is the ompi info:


node01 1248% ompi_info --all
Open MPI: 1.1.2
   Open MPI SVN revision: r12073
Open RTE: 1.1.2
   Open RTE SVN revision: r12073
OPAL: 1.1.2
   OPAL SVN revision: r12073
  MCA memory: darwin (MCA v1.0, API v1.0, Component v1.1.2)
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component
v1.1.2)
   MCA timer: darwin (MCA v1.0, API v1.0, Component v1.1.2)
   MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
   MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: hierarch (MCA v1.0, API v1.0, Component
v1.1.2)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
  MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
   MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
  MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
 MCA ras: dash_host (MCA v1.0, API v1.0, Component
v1.1.2)
 MCA ras: hostfile (MCA v1.0, API v1.0, Component
v1.1.2)
 MCA ras: localhost (MCA v1.0, API v1.0, Component
v1.1.2)
 MCA ras: xgrid (MCA v1.0, API v1.0, Component v1.1.2)
 MCA rds: hostfile (MCA v1.0, API v1.0, Component
v1.1.2)
 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
   MCA rmaps: round_robin (MCA v1.0, API v1.0, Component
v1.1.2)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
 MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
   

Re: [OMPI users] Cell EIB support for OpenMPI

2007-03-23 Thread Mike Houston



Marcus G. Daniels wrote:

Marcus G. Daniels wrote:
  

Mike Houston wrote:
  

The main issue with this, as addressed at the end
of the report, is that the code size is going to be a problem, as data
and code must live in the same 256KB in each SPE.  They mention dynamic 
overlay loading, which is also how we deal with large code size, but 
things get tricky and slow with the potentially needed save and restore 
of registers and LS. 

  


I did some checking on this.   Apparently the trunk of GCC and the 
latest GNU Binutils handle overlays.   Because the SPU compiler knows
its limited address space, the ELF object code sections reflect this, and
the linker can transparently generate stubs to trigger the overlay
loading.   GCC also has options like -ffunction-sections that enable the
linker to optimize for locality. 

So even though the OpenMPI shared libraries in total appear to have a 
footprint about four times too big for code alone (don't know about the 
typical stack & heap requirements), perhaps it's still doable without a 
big effort to strip down OpenMPI?
  
But loading an overlay can be quite expensive depending on how much 
needs to be loaded and how much user data/code needs to be restored.  If 
the user is trying to use most of the LS for data, which is perfectly 
sane and reasonable, then you might have to load multiple overlays to 
complete a function. We've also been having issues with mixing manual 
overlay loading of our code with the autoloading generated by the compiler.


Regardless, it would be interesting to see if this can even be made to 
work.  If so, it might really help people get apps up on Cell since it 
can be reasonably thought of as a cluster on a chip, backed by a larger 
address space.


-Mike


Re: [OMPI users] MPI processes swapping out

2007-03-23 Thread Rolf Vandevaart


Todd:

I assume the system time is being consumed by
the calls to send and receive data over the TCP sockets.
As the number of processes in the job increases, then more
time is spent waiting for data from one of the other processes.

I did a little experiment on a single node to see the difference
in system time consumed when running over TCP vs when
running over shared memory.   When running on a single
node and using the sm btl, I see almost 100% user time. 
I assume this is because the sm btl handles sending and
receiving its data within a shared memory segment. 
However, when I switch over to TCP, I see my system time
go up.  Note that this is on Solaris.

RUNNING OVER SELF,SM
> mpirun -np 8 -mca btl self,sm hpcc.amd64

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 3505 rolfv100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
 3503 rolfv100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
 3499 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
 3497 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
 3501 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
 3507 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
 3509 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
 3495 rolfv 97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1

RUNNING OVER SELF,TCP
>mpirun -np 8 -mca btl self,tcp hpcc.amd64

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 4316 rolfv 93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
 4328 rolfv 91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
 4324 rolfv 98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
 4320 rolfv 88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
 4322 rolfv 94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
 4318 rolfv 92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
 4326 rolfv 93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
 4314 rolfv 91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1

I also ran HPL over a larger cluster of 6 nodes, and noticed even higher
system times.

And lastly, I ran a simple MPI test over a cluster of 64 nodes, 2 procs
per node using Sun HPC ClusterTools 6, and saw about a 50/50 split between
user and system time.

 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
11525 rolfv 55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0 
maxtrunc_ct6/1
11526 rolfv 54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0 
maxtrunc_ct6/1


Is it possible that everything is working just as it should?

Rolf

Heywood, Todd wrote On 03/22/07 13:30,:


Ralph,

Well, according to the FAQ, aggressive mode can be "forced" so I did try
setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
processor/memory affinity on. Effects were minor. The MPI tasks still cycle
between run and sleep states, driving up system time well over user time.

Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
(depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
sure, I also tried running directly with a hostfile with slots=4 or slots=2.
The same behavior occurs.

This behavior is a function of the size of the job. I.e. As I scale from 200
to 800 tasks the run/sleep cycling increases, so that system time grows from
maybe half the user time to maybe 5 times user time.

This is for TCP/gigE.

Todd


On 3/22/07 12:19 PM, "Ralph Castain"  wrote:

 


Just for clarification: ompi_info only shows the *default* value of the MCA
parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
that value is reset internally if the system sees an "oversubscribed"
condition.

The issue here isn't how many cores are on the node, but rather how many
were specifically allocated to this job. If the allocation wasn't at least 2
(in your example), then we would automatically reset mpi_yield_when_idle to
be non-aggressive, regardless of how many cores are actually on the node.

Ralph


On 3/22/07 7:14 AM, "Heywood, Todd"  wrote:

   


Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
4-core node, the 2 tasks are still cycling between run and sleep, with
higher system time than user time.

Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
so that suggests the tasks aren't swapping out on blocking calls.

Still puzzled.

Thanks,
Todd


On 3/22/07 7:36 AM, "Jeff Squyres"  wrote:

 


Are you using a scheduler on your system?

More specifically, does Open MPI know that you have four process slots
on each node?  If you are using a hostfile and didn't specify
"slots=4" for each host, Open MPI will think that it's
oversubscribing and will therefore call sched_yield() in the depths
of its progress engine.
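
For reference, a hostfile that declares the slots explicitly, as described
above, looks like this (host names are placeholders):

node01 slots=4
node02 slots=4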


On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:


Re: [OMPI users] MPI processes swapping out

2007-03-23 Thread Heywood, Todd
Rolf,

> Is it possible that everything is working just as it should?

That's what I'm afraid of :-). But I did not expect to see such
communication overhead due to blocking from mpiBLAST, which is very
coarse-grained. I then tried HPL, which is computation-heavy, and found the
same thing. Also, the system time seemed to correspond to the MPI processes
cycling between run and sleep (as seen via top), and I thought that setting
the mpi_yield_when_idle parameter to 0 would keep the processes from
entering sleep state when blocking. But it doesn't.

Todd
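
For reference, the setting Todd describes can be forced either through the
environment or directly on the mpirun command line (the hostfile and
application names below are placeholders):

export OMPI_MCA_mpi_yield_when_idle=0
mpirun -np 800 -hostfile hosts ./app

or, equivalently:

mpirun --mca mpi_yield_when_idle 0 -np 800 -hostfile hosts ./app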



On 3/23/07 2:06 PM, "Rolf Vandevaart"  wrote:

> 
> Todd:
> 
> I assume the system time is being consumed by
> the calls to send and receive data over the TCP sockets.
> As the number of processes in the job increases, then more
> time is spent waiting for data from one of the other processes.
> 
> I did a little experiment on a single node to see the difference
> in system time consumed when running over TCP vs when
> running over shared memory.   When running on a single
> node and using the sm btl, I see almost 100% user time.
> I assume this is because the sm btl handles sending and
> receiving its data within a shared memory segment.
> However, when I switch over to TCP, I see my system time
> go up.  Note that this is on Solaris.
> 
> RUNNING OVER SELF,SM
>> mpirun -np 8 -mca btl self,sm hpcc.amd64
> 
>PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>   3505 rolfv100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
>   3503 rolfv100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
>   3499 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
>   3497 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
>   3501 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
>   3507 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
>   3509 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
>   3495 rolfv 97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1
> 
> RUNNING OVER SELF,TCP
>> mpirun -np 8 -mca btl self,tcp hpcc.amd64
> 
>PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>   4316 rolfv 93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
>   4328 rolfv 91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
>   4324 rolfv 98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
>   4320 rolfv 88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
>   4322 rolfv 94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
>   4318 rolfv 92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
>   4326 rolfv 93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
>   4314 rolfv 91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1
> 
> I also ran HPL over a larger cluster of 6 nodes, and noticed even higher
> system times. 
> 
> And lastly, I ran a simple MPI test over a cluster of 64 nodes, 2 procs
> per node
> using Sun HPC ClusterTools 6, and saw about a 50/50 split between user
> and system time.
> 
>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>  11525 rolfv 55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0
> maxtrunc_ct6/1
>  11526 rolfv 54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0
> maxtrunc_ct6/1
> 
> Is it possible that everything is working just as it should?
> 
> Rolf
> 
> Heywood, Todd wrote On 03/22/07 13:30,:
> 
>> Ralph,
>> 
>> Well, according to the FAQ, aggressive mode can be "forced" so I did try
>> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
>> processor/memory affinity on. Effects were minor. The MPI tasks still cycle
>> between run and sleep states, driving up system time well over user time.
>> 
>> Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
>> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
>> sure, I also tried running directly with a hostfile with slots=4 or slots=2.
>> The same behavior occurs.
>> 
>> This behavior is a function of the size of the job. I.e. As I scale from 200
>> to 800 tasks the run/sleep cycling increases, so that system time grows from
>> maybe half the user time to maybe 5 times user time.
>> 
>> This is for TCP/gigE.
>> 
>> Todd
>> 
>> 
>> On 3/22/07 12:19 PM, "Ralph Castain"  wrote:
>> 
>>  
>> 
>>> Just for clarification: ompi_info only shows the *default* value of the MCA
>>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>>> that value is reset internally if the system sees an "oversubscribed"
>>> condition.
>>> 
>>> The issue here isn't how many cores are on the node, but rather how many
>>> were specifically allocated to this job. If the allocation wasn't at least 2
>>> (in your example), then we would automatically reset mpi_yield_when_idle to
>>> be non-aggressive, regardless of how many cores are actually o

[OMPI users] install error

2007-03-23 Thread Dan Dansereau
To ALL

I am getting the following error while attempting to install Open MPI on
a Linux system, as follows:

Linux utahwtm.hydropoint.com 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23
13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux

 

with the Intel compilers (the latest 9.1 versions)

 

this is the ERROR

 

libtool: link: icc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing
-restrict -pthread -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic
-pthread ../../../opal/.libs/libopen-pal.a -lnsl -lutil

../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x1d):
In function `munmap':

: undefined reference to `__munmap'

../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x52):
In function `opal_mem_free_ptmalloc2_munmap':

: undefined reference to `__munmap'

../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x66):
In function `mmap':

: undefined reference to `__mmap'

../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x8d):
In function `opal_mem_free_ptmalloc2_mmap':

: undefined reference to `__mmap'

make[2]: *** [opal_wrapper] Error 1

make[2]: Leaving directory
`/home/dad/model/openmpi-1.2/opal/tools/wrappers'

make[1]: *** [all-recursive] Error 1

make[1]: Leaving directory `/home/dad/model/openmpi-1.2/opal'

make: *** [all-recursive] Error 1

 

the config command was 

./configure CC=icc CXX=icpc F77=ifort FC=ifort --disable-shared
--enable-static --prefix=/model/OPENMP_I

 

and executed with no errors

 

I have attached both the config.log and the compile.log

 

Any help or direction would greatly be appreciated.
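
One thing worth trying, given that the undefined __munmap/__mmap symbols
come from the ptmalloc2 memory hooks: Rainer's note earlier in this digest
builds Open MPI with the ptmalloc2 memory component excluded. A hedged
variant of Dan's configure line (the extra option is the only change;
whether it resolves this particular icc static-link error is an
assumption):

./configure CC=icc CXX=icpc F77=ifort FC=ifort --disable-shared \
    --enable-static --prefix=/model/OPENMP_I \
    --enable-mca-no-build=memory-ptmalloc2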



Re: [OMPI users] MPI processes swapping out

2007-03-23 Thread George Bosilca
So far the described behavior seems as normal as expected. As Open  
MPI never goes into blocking mode, the processes will always spin  
between active and sleep mode. More processes on the same node lead  
to more time in system mode (because of the empty polls). There  
is a trick in the trunk version of Open MPI which will trigger the  
blocking mode if and only if TCP is the only device used. Please try  
adding "--mca btl tcp,self" to your mpirun command line, and check the  
output of vmstat.


  Thanks,
george.
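
Conceptually, the difference between the aggressive and non-aggressive
settings discussed in this thread looks like the following (an
illustrative C sketch, not Open MPI's actual progress engine):

#include <sched.h>

/* Spin on the poll when "aggressive"; otherwise yield the CPU whenever a
   poll comes back empty.  The yielding variant is what shows up as the
   run/sleep cycling and the extra system time reported above.          */
static void progress_loop(int yield_when_idle,
                          int (*poll_once)(void),   /* e.g. check sockets */
                          const volatile int *done)
{
    while (!*done) {
        int made_progress = poll_once();
        if (!made_progress && yield_when_idle)
            sched_yield();
    }
}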

On Mar 23, 2007, at 3:32 PM, Heywood, Todd wrote:


Rolf,


Is it possible that everything is working just as it should?


That's what I'm afraid of :-). But I did not expect to see such
communication overhead due to blocking from mpiBLAST, which is very
coarse-grained. I then tried HPL, which is computation-heavy, and  
found the
same thing. Also, the system time seemed to correspond to the MPI  
processes
cycling between run and sleep (as seen via top), and I thought that  
setting

the mpi_yield_when_idle parameter to 0 would keep the processes from
entering sleep state when blocking. But it doesn't.

Todd



On 3/23/07 2:06 PM, "Rolf Vandevaart"  wrote:



Todd:

I assume the system time is being consumed by
the calls to send and receive data over the TCP sockets.
As the number of processes in the job increases, then more
time is spent waiting for data from one of the other processes.

I did a little experiment on a single node to see the difference
in system time consumed when running over TCP vs when
running over shared memory.   When running on a single
node and using the sm btl, I see almost 100% user time.
I assume this is because the sm btl handles sending and
receiving its data within a shared memory segment.
However, when I switch over to TCP, I see my system time
go up.  Note that this is on Solaris.

RUNNING OVER SELF,SM

mpirun -np 8 -mca btl self,sm hpcc.amd64


   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG  
PROCESS/NLWP
  3505 rolfv100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0  
hpcc.amd64/1
  3503 rolfv100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0  
hpcc.amd64/1
  3499 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0  
hpcc.amd64/1
  3497 rolfv 99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0  
hpcc.amd64/1
  3501 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0  
hpcc.amd64/1
  3507 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0  
hpcc.amd64/1
  3509 rolfv 98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0  
hpcc.amd64/1
  3495 rolfv 97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0  
hpcc.amd64/1


RUNNING OVER SELF,TCP

mpirun -np 8 -mca btl self,tcp hpcc.amd64


   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG  
PROCESS/NLWP
  4316 rolfv 93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0  
hpcc.amd64/1
  4328 rolfv 91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0  
hpcc.amd64/1
  4324 rolfv 98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0  
hpcc.amd64/1
  4320 rolfv 88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0  
hpcc.amd64/1
  4322 rolfv 94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0  
hpcc.amd64/1
  4318 rolfv 92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0  
hpcc.amd64/1
  4326 rolfv 93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0  
hpcc.amd64/1
  4314 rolfv 91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0  
hpcc.amd64/1


I also ran HPL over a larger cluster of 6 nodes, and noticed even  
higher

system times.

And lastly, I ran a simple MPI test over a cluster of 64 nodes, 2  
procs

per node
using Sun HPC ClusterTools 6, and saw about a 50/50 split between  
user

and system time.

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG  
PROCESS/NLWP

 11525 rolfv 55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0
maxtrunc_ct6/1
 11526 rolfv 54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0
maxtrunc_ct6/1

Is it possible that everything is working just as it should?

Rolf

Heywood, Todd wrote On 03/22/07 13:30,:


Ralph,

Well, according to the FAQ, aggressive mode can be "forced" so I  
did try
setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also  
tried turning
processor/memory affinity on. Effects were minor. The MPI tasks  
still cycle
between run and sleep states, driving up system time well over  
user time.


Mpstat shows SGE is indeed giving 4 or 2 slots per node as  
appropriate
(depending on memory) and the MPI tasks are using 4 or 2 cores,  
but to be
sure, I also tried running directly with a hostfile with slots=4  
or slots=2.

The same behavior occurs.

This behavior is a function of the size of the job. I.e. As I  
scale from 200
to 800 tasks the run/sleep cycling increases, so that system time  
grows from

maybe half the user time to maybe 5 times user time.

This is for TCP/gigE.

Todd


On 3/22/07 12:19 PM, "Ralph Castain"  wrote:



Just for clarification: ompi_info only shows the *default* value  
of the MCA
parameter. In thi

Re: [OMPI users] Cell EIB support for OpenMPI

2007-03-23 Thread George Bosilca
The main problem with MPI is the huge number of functions in the API.  
Even if we implement only the 1.0 standard we still have several  
hundred functions around. Moreover, an MPI library is far from  
being a simple self-sufficient library; it requires a way to start  
and monitor processes, interact with the operating system, and so on.  
All in all we end up with a multi-hundred-KB library of which most  
applications will use only about 10%.


We investigated this possibility a few months ago, but backed off in  
front of the task of removing all unnecessary functions from Open MPI  
in order to get something that can fit in the 256 KB of memory on the  
SPU (and of course still leave some empty room for the user). Moreover,  
most of the Cell users we talked with are not interested in having MPI  
between the SPUs. There is only one thing they're looking for:  
removing the last unused SPU cycle from the pipeline! There is no  
room for anything MPI-like at that level.


  george.

On Mar 22, 2007, at 12:30 PM, Marcus G. Daniels wrote:


Hi,

Has anyone investigated adding intra chip Cell EIB messaging to  
OpenMPI?

It seems like it ought to work.  This paper seems pretty convincing:

http://www.cs.fsu.edu/research/reports/TR-061215.pdf




Re: [OMPI users] Cell EIB support for OpenMPI

2007-03-23 Thread Marcus G. Daniels

George Bosilca wrote:
All in all we end up with a multi-hundred-KB library of which most  
applications will use only about 10%.
  
Seems like it ought to be possible to do some coverage analysis for a 
particular application and figure out what parts of the library (and 
user code) to make adjacent in memory.  Then the 10% could be put in the 
same overlay.   Seems like the EIB is quite fast and can take some abuse 
in terms of swapping.
Moreover, most  
of the Cell users we talked with are not interested in having MPI  
between the SPUs. There is only one thing they're looking for:  
removing the last unused SPU cycle from the pipeline! There is no  
room for anything MPI-like at that level.
  
I imagine that OpenMP might be a good option for the Cell, and it even 
sounds like maybe there will be a GCC option:


http://gcc.gnu.org/ml/gcc-patches/2006-05/msg00987.html

...but even so, there are more existing scientific codes for MPI than 
OpenMP.  Even if the thing were a dog initially, and yielded speedups of 
2 instead of 10 compared to typical CPUs, it would still be useful for 
installations with large Cell deployments that could well be risking 
underutilization or hogging due to poor tools support.  

I have not investigated how much of the SPU C library stuff is missing 
to make OpenMPI compile, but that's at least fixable, and an independently 
useful thing to have for Cell users.


Marcus