[OMPI users] implementation of a message logging protocol

2007-03-22 Thread Thomas Ropars

Dear all,

I am currently working on a fault tolerant protocol for message passing 
applications based on message logging.

For my experiments, I want to implement my protocol in an MPI library.
I know that message logging protocols have already been implemented in 
MPICH with MPICH-V.


I'm wondering whether, in the current state of Open MPI, it is possible to
do the same kind of work in this library?

Is there somebody currently working on the same subject ?

Best regards,

Thomas Ropars.



Re: [OMPI users] deadlock on barrier

2007-03-22 Thread Jeff Squyres

Is this a TCP-based cluster?

If so, do you have multiple IP addresses on your frontend machine?   
Check out these two FAQ entries to see if they help:


http://www.open-mpi.org/faq/?category=tcp#tcp-routability
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
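
(The second entry boils down to restricting Open MPI's TCP BTL to the
interfaces that every node can actually reach.  As a sketch -- the
interface name eth0 is just a placeholder for whatever your
cluster-internal interface is called:

  mpirun --mca btl_tcp_if_include eth0 -np 3 -H frontend,compute-0-0,compute-0-1 ./test1

or, equivalently, exclude the public interface with btl_tcp_if_exclude.)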



On Mar 21, 2007, at 4:51 PM, tim gunter wrote:

i am experiencing some issues w/ openmpi 1.2 running on a rocks
4.2.1 cluster (the issues also appear to occur w/ openmpi 1.1.5 and
1.1.4).


when i run my program with the frontend in the list of nodes, they  
deadlock.


when i run my program without the frontend in the list of nodes,  
they run to completion.


the simplest test program that does this (test1.c) does an
"MPI_Init", followed by an "MPI_Barrier", and an "MPI_Finalize".


so the following deadlocks:

/users/gunter $ mpirun -np 3 -H  
frontend,compute-0-0,compute-0-1 ./test1

host:compute-0-1.local made it past the barrier, ret:0
mpirun: killing job...

mpirun noticed that job rank 0 with PID 15384 on node frontend  
exited on signal 15 (Terminated).

2 additional processes aborted (not shown)

this runs to completion:

/users/gunter $ mpirun -np 3 -H  
compute-0-0,compute-0-1,compute-0-2 ./test1

host:compute-0-1.local made it past the barrier, ret:0
host:compute-0-0.local made it past the barrier, ret:0
host:compute-0-2.local made it past the barrier, ret:0

if i have the compute nodes send a message to the frontend prior to  
the barrier, it runs to completion:


/users/gunter $ mpirun -np 3 -H  
frontend,compute-0-0,compute-0-1 ./test2 0

host: frontend.domain node:  0 is the master
host:   compute-0-0.local node:  1 sent:  1 to:0
host:   compute-0-1.local node:  2 sent:  2 to:0
host: frontend.domain node:  0 recv:  1 from:  1
host: frontend.domain node:  0 recv:  2 from:  2
host: frontend.domain made it past the barrier, ret:0
host:   compute-0-1.local made it past the barrier, ret:0
host:   compute-0-0.local made it past the barrier, ret:0

if i have a different node function as the master, it deadlocks:

/users/gunter $ mpirun -np 3 -H  
frontend,compute-0-0,compute-0-1 ./test2 1

host:   compute-0-0.local node:  1 is the master
host:   compute-0-1.local node:  2 sent:  2 to:1
mpirun: killing job...

mpirun noticed that job rank 0 with PID 15411 on node frontend  
exited on signal 15 (Terminated).

2 additional processes aborted (not shown)

how is it that in the first example, one node makes it past the  
barrier, and the rest deadlock?


these programs both run to completion on two other MPI  
implementations.


is there something mis-configured on my cluster? or is this  
potentially an openmpi bug?


what is the best way to debug this?

any help would be appreciated!

--tim


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] MPI processes swapping out

2007-03-22 Thread Jeff Squyres

Are you using a scheduler on your system?

More specifically, does Open MPI know that you have four process slots
on each node?  If you are using a hostfile and didn't specify
"slots=4" for each host, Open MPI will think that it's
oversubscribing and will therefore call sched_yield() in the depths
of its progress engine.
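
For example, a hostfile declaring all four slots on each node might look
like this (hostnames here are placeholders):

  node01 slots=4
  node02 slots=4

and then:

  mpirun -np 8 --hostfile myhosts ./a.out

With the slot counts declared, Open MPI should not think the nodes are
oversubscribed and will stay in its aggressive (non-yielding) mode.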



On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:

P.s. I should have said that this is a pretty coarse-grained
application, and netstat doesn't show much communication going on
(except in stages).



On 3/21/07 4:21 PM, "Heywood, Todd"  wrote:

I noticed that my OpenMPI processes are using larger amounts of
system time than user time (via vmstat, top). I'm running on
dual-core, dual-CPU Opterons, with 4 slots per node, where the
program has the nodes to themselves. A closer look showed that they
are constantly switching between run and sleep states with 4-8 page
faults per second.

Why would this be? It doesn't happen with 4 sequential jobs running
on a node, where I get 99% user time, maybe 1% system time.

The processes have plenty of memory. This behavior occurs whether I
use processor/memory affinity or not (there is no oversubscription).

Thanks,

Todd

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] threading

2007-03-22 Thread Jeff Squyres
Open MPI currently has minimal use of hidden "progress" threads, but  
we will likely be experimenting with more usage of them over time  
(previous MPI implementations have shown that progress threads can be  
a big performance win for large messages, although they do tend to  
add a bit of latency).


To answer your direct question, when you ask Open MPI for N processes
(e.g., "mpirun -np N a.out"), you'll get N Unix processes.  Open
MPI will not create N threads (or split threads across nodes without
oversubscription such that you get a total of N ranks in
MPI_COMM_WORLD).


Previous MPI implementations have tried this kind of scheme
(launching threads as MPI processes), but (IMHO) it violated the Law
of Least Astonishment (see
http://www.canonical.org/~kragen/tao-of-programming.html) in that the
user's MPI application was then subject to the constraints of
multi-threaded programming.


So most (all?) modern MPI implementations that I am aware of deal  
with operating system processes as individual MPI_COMM_WORLD ranks  
(as opposed to threads).
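
If you do want core-level parallelism inside a rank, the usual pattern is
a hybrid one: let mpirun start one process per node (or per socket) and
spawn the threads yourself.  A rough sketch, assuming a "funneled" style
where only the main thread makes MPI calls (thread count and work are
placeholders):

  #include <mpi.h>
  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 8                        /* e.g., one thread per core */

  static void *worker(void *arg)
  {
      /* purely local computation here; no MPI calls from worker threads */
      printf("worker %ld running\n", (long)arg);
      return NULL;
  }

  int main(int argc, char **argv)
  {
      pthread_t th[NTHREADS];
      long i;
      int rank;

      MPI_Init(&argc, &argv);              /* only the main thread uses MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("rank %d starting %d threads\n", rank, NTHREADS);

      for (i = 0; i < NTHREADS; i++)
          pthread_create(&th[i], NULL, worker, (void *)i);
      for (i = 0; i < NTHREADS; i++)
          pthread_join(th[i], NULL);

      MPI_Barrier(MPI_COMM_WORLD);         /* main thread communicates */
      MPI_Finalize();
      return 0;
  }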




On Mar 21, 2007, at 5:29 PM, David Burns wrote:

I have used POSIX threading and Open MPI without problems on our
Opteron 2216 cluster (4 cores per node).  Moving to core-level
parallelization with multi-threading resulted in significant
performance gains.

Sam Adams wrote:

I have been looking, but I haven't really found a good answer about
system level threading.  We are about to get a new cluster of
dual-processor quad-core nodes or 8 cores per node.  Traditionally I
would just tell MPI to launch two processes per dual processor single
core node, but with eight cores on a node, having 8 processes seems
inefficient.



My question is this: does OpenMPI sense that there are multiple cores
on a node and use something like pthreads instead of creating new
processes automatically when I request 8 processes for a node, or
should I run a single process per node and use OpenMP or pthreads
explicitly to get better performance on a per node basis?



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] portability of the executables compiled with OpenMPI

2007-03-22 Thread Jeff Squyres

On Mar 15, 2007, at 12:18 PM, Michael wrote:


I'm having trouble with the portability of executables compiled with
OpenMPI.  I suspect the sysadms on the HPC system I'm using changed
something because I think it worked previously.

Situation: I'm compiling my code locally on a machine with just
ethernet interfaces and OpenMPI 1.1.2 that I built.

When I attempt to run that executable on a HPC machine with OpenMPI
1.1.2 and InfiniBand interfaces I get messages about "can't find
libmosal.so.0.0" -- I'm certain this wasn't happening earlier.

I can compile on this machine and run on it, even though there is no
libmosal.* in my path.

mpif90 --showme on this system gives me:

/opt/compiler/intel/compiler91/x86_64/bin/ifort
  -I/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/include -pthread
  -I/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib
  -L/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib -L/opt/gm/lib64
  -lmpi_f90 -lmpi -lorte -lopal -lgm -lvapi -lmosal -lrt -lnuma -ldl
  -Wl,--export-dynamic -lnsl -lutil -ldl


Based on this output, I assume you have configured OMPI with either
--enable-static or otherwise including all plugins in libmpi.so, right?



I suspect that read access to libmosal.so has been removed and
somehow when I link on this machine I'm getting a static library,
i.e. libmosal.a

Does this make any sense?


This would be consistent with what you described above -- that  
libmosal.so (a VAPI support library) is available on the initial  
machine, so your MPI executable will have a runtime dependency on  
it.  But then on the second machine, libmosal.so is not available, so  
the runtime dependency fails.  But if you compile manually,  
libmosal.a is available and therefore the application can be created  
with compile/link-time resolution (vs. runtime resolution).
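
A quick way to check this on a Linux machine is ldd, which lists the
shared libraries the runtime linker will try to resolve (the executable
name below is made up, and the output is only illustrative):

  ldd ./mycode
  ...
  libmosal.so.0.0 => not found

A binary linked against the static libmosal.a would show no such entry.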



Is there a flag in this compile line that permits linking an
executable even when the person doing the linking does not have
access to all the libraries, i.e. --export-dynamic?


No.  All the same Linux/POSIX linking rules apply for creating an  
executable; we're not doing anything funny in this area.


FYI: --export-dynamic tells the linker that symbols in the libraries
should be available to plugins that are opened later.  It's probably
not relevant for the case where you're not opening any plugins at
runtime, but we don't differentiate between these cases because the
decision whether to open plugins or not is a runtime decision, not a
compile/link-time decision.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] portability of the executables compiled with OpenMPI

2007-03-22 Thread Jeff Squyres

On Mar 15, 2007, at 5:02 PM, Michael wrote:


I would like to hear just how portable an executable compiled against
OpenMPI shared libraries should be.


This is a hard question to answer:

1. We have not done this explicit kind of testing.
2. Open MPI's libmpi.so itself is plain vanilla C.  If you have an  
application that is already portable, linking it against Open MPI  
should not cause it to be less portable.
3. Open MPI, however, can use many support libraries (e.g., libmosal  
in your previous mail).  This myriad of extra libraries may create  
difficulties in creating a truly portable application.


The best practices that I have seen have been:

- start with an application that itself is already portable (without  
MPI)

- compile everything 100% static

But this has drawbacks as well -- consider if you link in libmosal.a  
to your MPI application and then take it to another system that has a  
slightly different version of VAPI (e.g., a kernel interface  
changed).  Although your application will load and start running  
(i.e., no runtime linker resolution failures), it may fail in  
unpredictable ways later because the libmosal.a in your application  
calls down to the kernel in ways that are unsupported by the VAPI/ 
libmosal on the current system.


Make sense?

This is unfortunately not an MPI (or Open MPI) specific issue; it's a  
larger problem of creating truly portable software.  To have a better  
chance of success, you probably want to ensure that all relevant  
points of interaction between your application and the outside system  
are either the same version or "compatible enough".


- high-speed networking support libraries
- resource manager support libraries
- libc
- ...etc.

Specifically, even though you won't be looking for .so's at runtime,  
you need to ensure that the way the .a's compiled into your  
application interact with the system is either the same way or "close  
enough" to how the corresponding support libraries work on the target  
machine.


All this being said, Open MPI did try to take steps in its design to  
be able to effect more portability (e.g., for ISV's).  Theoretically  
-- we have not explicitly tested this -- the following setup may  
provide a better degree of portability:


- have the same version of Open MPI available on each machine,  
compiled against whatever support libraries are relevant on that  
machine (using plugins, not --enable-static).


- compile your application *dynamically* against Open MPI.  Note that
some of the upper-level configuration of Open MPI must be either the
same or "close enough" between machines such that runtime linking
will work properly (e.g., don't use a 32 bit libmpi on one machine
and a 64 bit libmpi on another, etc.  There are more details here, but
you get the general idea).


- ensure that other (non-MPI-related) interaction points in your  
application are the same or "close enough" to be portable


By linking dynamically against Open MPI (which is plain vanilla C),  
the application will only be looking for Open MPI's plain C support  
libraries -- not the other support libraries (such as libmosal),  
because those are linked against OMPI's plugins -- not libmpi.so  
(etc.).  This design effectively takes MPI out of the portability  
equation.


That's the theory, anyway.  :-)

I skipped many nit-picky details, so I'm sure there will be issues to  
figure out.  But *in theory*, it's possible...



I'm compiling on a Debian Linux system with dual 1.3 GHz AMD Opterons
per node and an internal network of dual gigabit ethernet.

I'm planning on a SUSE Linux Enterprise Server 9 system with dual 3.6
GHz Intel Xeon EM64T per node and an internal network using Myrinet.


I can't speak for how portable Myrinet support libraries are...   
Myricom?


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] portability of the executables compiled with OpenMPI

2007-03-22 Thread Michael


On Mar 22, 2007, at 7:55 AM, Jeff Squyres wrote:


On Mar 15, 2007, at 12:18 PM, Michael wrote:


Situation: I'm compiling my code locally on a machine with just
ethernet interfaces and OpenMPI 1.1.2 that I built.

When I attempt to run that executable on a HPC machine with OpenMPI
1.1.2 and InfiniBand interfaces I get messages about "can't find
libmosal.so.0.0" -- I'm certain this wasn't happening earlier.

I can compile on this machine and run on it, even though there is no
libmosal.* in my path.

mpif90 --showme on this system gives me:

/opt/compiler/intel/compiler91/x86_64/bin/ifort -I/opt/mpi/x86_64/
intel/9.1/openmpi-1.1.4/include -pthread -I/opt/mpi/x86_64/intel/9.1/
openmpi-1.1.4/lib -L/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib -L/
opt/gm/lib64 -lmpi_f90 -lmpi -lorte -lopal -lgm -lvapi -lmosal -lrt -
lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -ldl


Based on this output, I assume you have configured OMPI with either --
enable-static or otherwise including all plugins in libmpi.so, right?


No, I did not configure OpenMPI on this machine.  I believe OpenMPI
was not configured static by the installers, based on the messages and
the dependency on the missing libraries.


The issue was that some of the 1000+ nodes on this major HPC machine
were missing libraries needed for OpenMPI, but because of the low
usage of OpenMPI I was the first to discover the problem.  For whatever
reason these libraries are not on the front-end machines that feed
the main system.  It's always nice running OpenMPI on your own
machine, but not everyone can always do that.


The way I read my experience is that OpenMPI's libmpi.so depends on
different libraries on different machines.  This means that if you
don't compile statically, you can compile on a machine that does not
have the libraries for expensive interfaces and run on another machine
with those expensive interfaces -- that's what I am doing now
successfully.


Michael



Re: [OMPI users] portability of the executables compiled with OpenMPI

2007-03-22 Thread Michael

For your reference:

The following cross-compile/run combination with OpenMPI 1.1.4 is
currently working for me:


I'm compiling on a Debian Linux system with dual 1.3 GHz AMD Opterons  
per node and an internal network of dual gigabit ethernet.  With  
OpenMPI compiled with Intel Fortran 9.1.041 and gcc 3.3.5


I'm running on a SUSE Linux Enterprise Server 9 system with dual 3.6  
GHz Intel Xeon EM64T per node and an internal network using Myrinet.   
OpenMPI compiled with Intel Fortran 9.1.041 and Intel icc 9.1.046


There is enough compatibility between the two different libmpi.so's  
that I do not have a problem.


I have to periodically check the second system to see if it has been
updated, in which case I have to update my system.


Michael



Re: [OMPI users] MPI processes swapping out

2007-03-22 Thread Heywood, Todd
Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
4-core node, the 2 tasks are still cycling between run and sleep, with
higher system time than user time.

Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
so that suggests the tasks aren't swapping out on blocking calls.

Still puzzled.

Thanks,
Todd


On 3/22/07 7:36 AM, "Jeff Squyres"  wrote:

> Are you using a scheduler on your system?
> 
> More specifically, does Open MPI know that you have for process slots
> on each node?  If you are using a hostfile and didn't specify
> "slots=4" for each host, Open MPI will think that it's
> oversubscribing and will therefore call sched_yield() in the depths
> of its progress engine.
> 
> 
> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
> 
>> P.s. I should have said this this is a pretty course-grained
>> application,
>> and netstat doesn't show much communication going on (except in
>> stages).
>> 
>> 
>> On 3/21/07 4:21 PM, "Heywood, Todd"  wrote:
>> 
>>> I noticed that my OpenMPI processes are using larger amounts of
>>> system time
>>> than user time (via vmstat, top). I'm running on dual-core, dual-CPU
>>> Opterons, with 4 slots per node, where the program has the nodes to
>>> themselves. A closer look showed that they are constantly
>>> switching between
>>> run and sleep states with 4-8 page faults per second.
>>> 
>>> Why would this be? It doesn't happen with 4 sequential jobs
>>> running on a
>>> node, where I get 99% user time, maybe 1% system time.
>>> 
>>> The processes have plenty of memory. This behavior occurs whether
>>> I use
>>> processor/memory affinity or not (there is no oversubscription).
>>> 
>>> Thanks,
>>> 
>>> Todd
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] Fault Tolerance

2007-03-22 Thread Josh Hursey
LAM/MPI was able to checkpoint/restart an entire MPI job as you  
mention. Open MPI is now able to checkpoint/restart as well. In the  
past week I added to the Open MPI trunk a LAM/MPI-like checkpoint/ 
restart implementation. In Open MPI we revisited many of the design  
decisions from the LAM/MPI development and improved on them quite a  
bit. At the moment there is no documentation on how to use it (egg on  
my face actually). I'm working on developing the documentation, and I  
will send a note to the users list once it is available.


Cheers,
Josh

On Mar 21, 2007, at 1:18 PM, Thomas Spraggins wrote:


To migrate processes, you need to be able to checkpoint them.  I
believe that LAM-MPI is the only MPI implementation that allows this,
although I have never used LAM-MPI.

Good luck.

Tom Spraggins
t...@virginia.edu

On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote:


Hello folks,

I am trying to write some fault-tolerance systems with the
following criteria:
1) Recover from any software/hardware crashes.
2) Dynamically shrink and grow.
3) Migrate processes among machines.

Does anyone have examples of code? What MPI platform is recommended
to accomplish such requirements?

I am using three MPI platforms and each has its own issues:
1) MPICH2 - good multi-threading support, but bad fault-tolerance
mechanisms.
2) OpenMPI - Does not support multi-threading properly and cannot
have it trap exceptions yet.
3) FT-MPI - Old and does not support multi-threading at all.

Any suggestions?
--

Regards,
Mohammad Huwaidi

We can't resolve problems by using the same kind of thinking we used
when we created them.
--Albert Einstein

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/



Re: [OMPI users] deadlock on barrier

2007-03-22 Thread tim gunter

On 3/22/07, Jeff Squyres  wrote:


Is this a TCP-based cluster?



yes

If so, do you have multiple IP addresses on your frontend machine?

Check out these two FAQ entries to see if they help:

http://www.open-mpi.org/faq/?category=tcp#tcp-routability
http://www.open-mpi.org/faq/?category=tcp#tcp-selection



ok, using the internal interfaces only fixed the problem.

it is a little confusing that when this happens, one machine would make
it past the barrier, and the others would not.

thanks Jeff!

--tim


Re: [OMPI users] MPI processes swapping out

2007-03-22 Thread Ralph Castain
Just for clarification: ompi_info only shows the *default* value of the MCA
parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
that value is reset internally if the system sees an "oversubscribed"
condition.

The issue here isn't how many cores are on the node, but rather how many
were specifically allocated to this job. If the allocation wasn't at least 2
(in your example), then we would automatically reset mpi_yield_when_idle to
be non-aggressive, regardless of how many cores are actually on the node.
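
(For reference, the directive can be forced explicitly, either per-run on
the command line or via the environment -- a sketch using the value under
discussion here:

  mpirun --mca mpi_yield_when_idle 0 -np 4 ./a.out
  export OMPI_MCA_mpi_yield_when_idle=0

where 0 means "aggressive", i.e., no sched_yield() in the progress loop.)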

Ralph


On 3/22/07 7:14 AM, "Heywood, Todd"  wrote:

> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
> 4-core node, the 2 tasks are still cycling between run and sleep, with
> higher system time than user time.
> 
> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
> so that suggests the tasks aren't swapping out on bloccking calls.
> 
> Still puzzled.
> 
> Thanks,
> Todd
> 
> 
> On 3/22/07 7:36 AM, "Jeff Squyres"  wrote:
> 
>> Are you using a scheduler on your system?
>> 
>> More specifically, does Open MPI know that you have for process slots
>> on each node?  If you are using a hostfile and didn't specify
>> "slots=4" for each host, Open MPI will think that it's
>> oversubscribing and will therefore call sched_yield() in the depths
>> of its progress engine.
>> 
>> 
>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>> 
>>> P.s. I should have said this this is a pretty course-grained
>>> application,
>>> and netstat doesn't show much communication going on (except in
>>> stages).
>>> 
>>> 
>>> On 3/21/07 4:21 PM, "Heywood, Todd"  wrote:
>>> 
 I noticed that my OpenMPI processes are using larger amounts of
 system time
 than user time (via vmstat, top). I'm running on dual-core, dual-CPU
 Opterons, with 4 slots per node, where the program has the nodes to
 themselves. A closer look showed that they are constantly
 switching between
 run and sleep states with 4-8 page faults per second.
 
 Why would this be? It doesn't happen with 4 sequential jobs
 running on a
 node, where I get 99% user time, maybe 1% system time.
 
 The processes have plenty of memory. This behavior occurs whether
 I use
 processor/memory affinity or not (there is no oversubscription).
 
 Thanks,
 
 Todd
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Cell EIB support for OpenMPI

2007-03-22 Thread Marcus G. Daniels

Hi,

Has anyone investigated adding intra chip Cell EIB messaging to OpenMPI?
It seems like it ought to work.  This paper seems pretty convincing:

http://www.cs.fsu.edu/research/reports/TR-061215.pdf


Re: [OMPI users] Cell EIB support for OpenMPI

2007-03-22 Thread Mike Houston
That's pretty cool.  The main issue with this, as addressed at the end
of the report, is that the code size is going to be a problem, as data
and code must live in the same 256KB in each SPE.  They mention dynamic
overlay loading, which is also how we deal with large code size, but
things get tricky and slow with the potentially needed save and restore
of registers and LS.  It would be interesting to see how much of MPI
could be implemented and how much is really needed.  Maybe it's time to
think about an MPI-ES spec?


-Mike

Marcus G. Daniels wrote:

Hi,

Has anyone investigated adding intra chip Cell EIB messaging to OpenMPI?
It seems like it ought to work.  This paper seems pretty convincing:

http://www.cs.fsu.edu/research/reports/TR-061215.pdf
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

  


Re: [OMPI users] MPI processes swapping out

2007-03-22 Thread Heywood, Todd
Ralph,

Well, according to the FAQ, aggressive mode can be "forced" so I did try
setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
processor/memory affinity on. Effects were minor. The MPI tasks still cycle
between run and sleep states, driving up system time well over user time.

Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
(depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
sure, I also tried running directly with a hostfile with slots=4 or slots=2.
The same behavior occurs.

This behavior is a function of the size of the job, i.e., as I scale from 200
to 800 tasks the run/sleep cycling increases, so that system time grows from
maybe half the user time to maybe 5 times user time.

This is for TCP/gigE.

Todd


On 3/22/07 12:19 PM, "Ralph Castain"  wrote:

> Just for clarification: ompi_info only shows the *default* value of the MCA
> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
> that value is reset internally if the system sees an "oversubscribed"
> condition.
> 
> The issue here isn't how many cores are on the node, but rather how many
> were specifically allocated to this job. If the allocation wasn't at least 2
> (in your example), then we would automatically reset mpi_yield_when_idle to
> be non-aggressive, regardless of how many cores are actually on the node.
> 
> Ralph
> 
> 
> On 3/22/07 7:14 AM, "Heywood, Todd"  wrote:
> 
>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>> higher system time than user time.
>> 
>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
>> so that suggests the tasks aren't swapping out on bloccking calls.
>> 
>> Still puzzled.
>> 
>> Thanks,
>> Todd
>> 
>> 
>> On 3/22/07 7:36 AM, "Jeff Squyres"  wrote:
>> 
>>> Are you using a scheduler on your system?
>>> 
>>> More specifically, does Open MPI know that you have for process slots
>>> on each node?  If you are using a hostfile and didn't specify
>>> "slots=4" for each host, Open MPI will think that it's
>>> oversubscribing and will therefore call sched_yield() in the depths
>>> of its progress engine.
>>> 
>>> 
>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>> 
 P.s. I should have said this this is a pretty course-grained
 application,
 and netstat doesn't show much communication going on (except in
 stages).
 
 
 On 3/21/07 4:21 PM, "Heywood, Todd"  wrote:
 
> I noticed that my OpenMPI processes are using larger amounts of
> system time
> than user time (via vmstat, top). I'm running on dual-core, dual-CPU
> Opterons, with 4 slots per node, where the program has the nodes to
> themselves. A closer look showed that they are constantly
> switching between
> run and sleep states with 4-8 page faults per second.
> 
> Why would this be? It doesn't happen with 4 sequential jobs
> running on a
> node, where I get 99% user time, maybe 1% system time.
> 
> The processes have plenty of memory. This behavior occurs whether
> I use
> processor/memory affinity or not (there is no oversubscription).
> 
> Thanks,
> 
> Todd
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI processes swapping out

2007-03-22 Thread Ralph Castain



On 3/22/07 11:30 AM, "Heywood, Todd"  wrote:

> Ralph,
> 
> Well, according to the FAQ, aggressive mode can be "forced" so I did try
> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
> processor/memory affinity on. Efffects were minor. The MPI tasks still cycle
> bewteen run and sleep states, driving up system time well over user time.

Yes, that's true - and we do (should) respect any such directive.

> 
> Mpstat shows SGE is indeed giving 4 or 2 slots per node as approporiate
> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
> sure, I also tried running directly with a hostfile with slots=4 or slots=2.
> The same behavior occurs.

Okay - thanks for trying that!

> 
> This behavior is a function of the size of the job. I.e. As I scale from 200
> to 800 tasks the run/sleep cycling increases, so that system time grows from
> maybe half the user time to maybe 5 times user time.
> 
> This is for TCP/gigE.

What version of OpenMPI are you using? This sounds like something we need to
investigate.

Thanks for the help!
Ralph

> 
> Todd
> 
> 
> On 3/22/07 12:19 PM, "Ralph Castain"  wrote:
> 
>> Just for clarification: ompi_info only shows the *default* value of the MCA
>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>> that value is reset internally if the system sees an "oversubscribed"
>> condition.
>> 
>> The issue here isn't how many cores are on the node, but rather how many
>> were specifically allocated to this job. If the allocation wasn't at least 2
>> (in your example), then we would automatically reset mpi_yield_when_idle to
>> be non-aggressive, regardless of how many cores are actually on the node.
>> 
>> Ralph
>> 
>> 
>> On 3/22/07 7:14 AM, "Heywood, Todd"  wrote:
>> 
>>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>>> higher system time than user time.
>>> 
>>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
>>> so that suggests the tasks aren't swapping out on bloccking calls.
>>> 
>>> Still puzzled.
>>> 
>>> Thanks,
>>> Todd
>>> 
>>> 
>>> On 3/22/07 7:36 AM, "Jeff Squyres"  wrote:
>>> 
 Are you using a scheduler on your system?
 
 More specifically, does Open MPI know that you have for process slots
 on each node?  If you are using a hostfile and didn't specify
 "slots=4" for each host, Open MPI will think that it's
 oversubscribing and will therefore call sched_yield() in the depths
 of its progress engine.
 
 
 On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
 
> P.s. I should have said this this is a pretty course-grained
> application,
> and netstat doesn't show much communication going on (except in
> stages).
> 
> 
> On 3/21/07 4:21 PM, "Heywood, Todd"  wrote:
> 
>> I noticed that my OpenMPI processes are using larger amounts of
>> system time
>> than user time (via vmstat, top). I'm running on dual-core, dual-CPU
>> Opterons, with 4 slots per node, where the program has the nodes to
>> themselves. A closer look showed that they are constantly
>> switching between
>> run and sleep states with 4-8 page faults per second.
>> 
>> Why would this be? It doesn't happen with 4 sequential jobs
>> running on a
>> node, where I get 99% user time, maybe 1% system time.
>> 
>> The processes have plenty of memory. This behavior occurs whether
>> I use
>> processor/memory affinity or not (there is no oversubscription).
>> 
>> Thanks,
>> 
>> Todd
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI processes swapping out

2007-03-22 Thread Heywood, Todd
Hi,

It is v1.2, default configuration. If it matters: OS is RHEL
(2.6.9-42.0.3.ELsmp) on x86_64.

I have noticed this for 2 apps so far, mpiBLAST and HPL, which are both
coarse-grained.

Thanks,

Todd


On 3/22/07 2:38 PM, "Ralph Castain"  wrote:

> 
> 
> 
> On 3/22/07 11:30 AM, "Heywood, Todd"  wrote:
> 
>> Ralph,
>> 
>> Well, according to the FAQ, aggressive mode can be "forced" so I did try
>> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
>> processor/memory affinity on. Efffects were minor. The MPI tasks still cycle
>> bewteen run and sleep states, driving up system time well over user time.
> 
> Yes, that's true - and we do (should) respect any such directive.
> 
>> 
>> Mpstat shows SGE is indeed giving 4 or 2 slots per node as approporiate
>> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
>> sure, I also tried running directly with a hostfile with slots=4 or slots=2.
>> The same behavior occurs.
> 
> Okay - thanks for trying that!
> 
>> 
>> This behavior is a function of the size of the job. I.e. As I scale from 200
>> to 800 tasks the run/sleep cycling increases, so that system time grows from
>> maybe half the user time to maybe 5 times user time.
>> 
>> This is for TCP/gigE.
> 
> What version of OpenMPI are you using? This sounds like something we need to
> investigate.
> 
> Thanks for the help!
> Ralph
> 
>> 
>> Todd
>> 
>> 
>> On 3/22/07 12:19 PM, "Ralph Castain"  wrote:
>> 
>>> Just for clarification: ompi_info only shows the *default* value of the MCA
>>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>>> that value is reset internally if the system sees an "oversubscribed"
>>> condition.
>>> 
>>> The issue here isn't how many cores are on the node, but rather how many
>>> were specifically allocated to this job. If the allocation wasn't at least 2
>>> (in your example), then we would automatically reset mpi_yield_when_idle to
>>> be non-aggressive, regardless of how many cores are actually on the node.
>>> 
>>> Ralph
>>> 
>>> 
>>> On 3/22/07 7:14 AM, "Heywood, Todd"  wrote:
>>> 
 Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
 4-core node, the 2 tasks are still cycling between run and sleep, with
 higher system time than user time.
 
 Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
 so that suggests the tasks aren't swapping out on bloccking calls.
 
 Still puzzled.
 
 Thanks,
 Todd
 
 
 On 3/22/07 7:36 AM, "Jeff Squyres"  wrote:
 
> Are you using a scheduler on your system?
> 
> More specifically, does Open MPI know that you have for process slots
> on each node?  If you are using a hostfile and didn't specify
> "slots=4" for each host, Open MPI will think that it's
> oversubscribing and will therefore call sched_yield() in the depths
> of its progress engine.
> 
> 
> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
> 
>> P.s. I should have said this this is a pretty course-grained
>> application,
>> and netstat doesn't show much communication going on (except in
>> stages).
>> 
>> 
>> On 3/21/07 4:21 PM, "Heywood, Todd"  wrote:
>> 
>>> I noticed that my OpenMPI processes are using larger amounts of
>>> system time
>>> than user time (via vmstat, top). I'm running on dual-core, dual-CPU
>>> Opterons, with 4 slots per node, where the program has the nodes to
>>> themselves. A closer look showed that they are constantly
>>> switching between
>>> run and sleep states with 4-8 page faults per second.
>>> 
>>> Why would this be? It doesn't happen with 4 sequential jobs
>>> running on a
>>> node, where I get 99% user time, maybe 1% system time.
>>> 
>>> The processes have plenty of memory. This behavior occurs whether
>>> I use
>>> processor/memory affinity or not (there is no oversubscription).
>>> 
>>> Thanks,
>>> 
>>> Todd
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://w

Re: [OMPI users] Cell EIB support for OpenMPI

2007-03-22 Thread Marcus G. Daniels

Mike Houston wrote:
The main issue with this, and addressed at the end 
of the report, is that the code size is going to be a problem as data 
and code must live in the same 256KB in each SPE. 
Just for reference, here are the stripped shared library sizes for
OpenMPI 1.2 as built on a Mercury Cell system.  This is for the PPU, not
the SPU.


-rwxr-xr-x 1 mdaniels world  11216 Mar 22  2007 libmca_common_sm.so.0.0.0
-rwxr-xr-x 1 mdaniels world 191440 Mar 22  2007 libmpi_cxx.so.0.0.0
-rwxr-xr-x 1 mdaniels world 827440 Mar 22  2007 libmpi.so.0.0.0
-rwxr-xr-x 1 mdaniels world 327912 Mar 22  2007 libopen-pal.so.0.0.0
-rwxr-xr-x 1 mdaniels world 556584 Mar 22  2007 libopen-rte.so.0.0.0

Using -Os instead of -O3:

-rwxr-xr-x 1 mdaniels world  11232 Mar 22  2007 libmca_common_sm.so.0.0.0
-rwxr-xr-x 1 mdaniels world 258280 Mar 22  2007 libmpi_cxx.so.0.0.0
-rwxr-xr-x 1 mdaniels world 749688 Mar 22  2007 libmpi.so.0.0.0
-rwxr-xr-x 1 mdaniels world 296648 Mar 22  2007 libopen-pal.so.0.0.0
-rwxr-xr-x 1 mdaniels world 501712 Mar 22  2007 libopen-rte.so.0.0.0



[OMPI users] hostfile syntax

2007-03-22 Thread Geoff Galitz


Does the hostfile understand the syntax:

mybox cpu=4

I have some legacy code and scripts that I'd like to move without  
modifying if possible.  I understand the syntax is supposed to be:


mybox slots=4

but using "cpu" seems to work.  Does that achieve the same thing?

-geoff



[OMPI users] Buffered sends

2007-03-22 Thread Michael

Is there known issue with buffered sends in OpenMPI 1.1.4?

I changed a single send which is called thousands of times from  
MPI_SEND (& MPI_ISEND) to MPI_BSEND (& MPI_IBSEND) and my Fortran 90  
code slowed down by a factor of 10.


I've looked at several references and I can't see where I'm making a
mistake.  The MPI_SEND is for MPI_PACKED data, so its first
parameter is an allocated character array.  I also allocated a
character array for the buffer passed to MPI_BUFFER_ATTACH.
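
For what it's worth, a minimal C sketch of that pattern (my real code is
Fortran 90; buffer sizes and names here are made up) looks like:

  #include <mpi.h>
  #include <stdlib.h>

  void send_packed(double *data, int n, int dest, int tag, MPI_Comm comm)
  {
      int bufsize = 1 << 20;                        /* hypothetical size */
      char *packbuf  = malloc(bufsize);
      char *bsendbuf = malloc(bufsize + MPI_BSEND_OVERHEAD);
      int position = 0;

      MPI_Buffer_attach(bsendbuf, bufsize + MPI_BSEND_OVERHEAD);

      /* pack the data, then buffered-send the packed bytes */
      MPI_Pack(data, n, MPI_DOUBLE, packbuf, bufsize, &position, comm);
      MPI_Bsend(packbuf, position, MPI_PACKED, dest, tag, comm);

      /* detach blocks until the buffered message has actually been sent */
      MPI_Buffer_detach(&bsendbuf, &bufsize);
      free(bsendbuf);
      free(packbuf);
  }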


Looking at the model implementation in a reference, where they give a
model of using MPI_PACKED inside MPI_BSEND, I was wondering if this
could be a problem, i.e., packing packed data?


Michael

ps. I have to use OpenMPI 1.1.4 to maintain compatibility with a  
major HPC center.




Re: [OMPI users] Buffered sends

2007-03-22 Thread George Bosilca
This problem is not related to Open MPI. It is related to the way you
use MPI. In fact there are 2 problems:


1.  Buffered sends will copy the data into the attached buffer. In
your case, I think this only adds one more memcpy operation to the
critical path, which might partially explain the impressive slow-down
(but I don't think this is the main reason). Buffering MPI_PACKED
data seems like a non-optimal solution. You want to keep the critical
path as short as possible and avoid any extra/useless memcpy. Using
a double-buffering technique (which will effectively double the
amount of memory required for your communications) can give you some
benefit.


2. Once the data is buffered, the Bsend (and the Ibsend) return to
the user application without progressing the communication. With few
exceptions (based on the available networks, which is not the case
for TCP or shared memory) the point-to-point communication will only
be progressed on the next MPI call. If you look in the MPI standard
to see what exactly it means to return from a blocking send, you will
realize that the only requirement is that the user can touch the send
buffer. From this perspective, the major difference between an
MPI_Send and an MPI_Bsend operation is that the MPI_Send will return
once the data is delivered to the NIC (which can then complete
the communication asynchronously), while at the end of the MPI_Bsend
the data is still in the application memory. The only way to get any
benefit from the MPI_Bsend is to have a progress thread which takes
care of the pending communications in the background. Such a thread is
not enabled by default in Open MPI.
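
As an illustration of the double-buffering idea (a sketch only, with
hypothetical sizes; nothing Open MPI-specific), the usual approach is to
keep two pack buffers and alternate between them with nonblocking sends,
so packing the next message overlaps with sending the previous one:

  #include <mpi.h>

  #define BUFSIZE (1 << 20)                 /* hypothetical buffer size */

  void send_stream(double *chunks[], int nchunks, int count,
                   int dest, int tag, MPI_Comm comm)
  {
      static char buf[2][BUFSIZE];
      MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };
      int i;

      for (i = 0; i < nchunks; i++) {
          int which = i % 2, position = 0;

          /* make sure the buffer we are about to reuse is free again */
          MPI_Wait(&req[which], MPI_STATUS_IGNORE);

          MPI_Pack(chunks[i], count, MPI_DOUBLE, buf[which], BUFSIZE,
                   &position, comm);
          MPI_Isend(buf[which], position, MPI_PACKED, dest, tag, comm,
                    &req[which]);
      }
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  }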


  Thanks,
george.


On Mar 22, 2007, at 5:18 PM, Michael wrote:


Is there known issue with buffered sends in OpenMPI 1.1.4?

I changed a single send which is called thousands of times from
MPI_SEND (& MPI_ISEND) to MPI_BSEND (& MPI_IBSEND) and my Fortran 90
code slowed down by a factor of 10.

I've looked at several references and I can't see where I'm making a
mistake.  The MPI_SEND is for MPI_PACKED data, so it's first
parameter is an allocated character array.  I also allocated a
character array for the buffer passed to MPI_BUFFER_ATTACH.

Looking at the model implementation in a reference they give a model
of using MPI_PACKED inside MPI_BSEND, I was wondering if this could
be a problem, i.e. packing packed data?

Michael

ps. I have to use OpenMPI 1.1.4 to maintain compatibility with a
major HPC center.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Fwd: [Allinea #6458] message queues

2007-03-22 Thread Brock Palen
We use OpenMPI as our default MPI lib on our clusters.  We are
starting to do some work with parallel debuggers (ddt to be exact)
and were wondering what the timeline for message queue debugging
was.  Just curious!  Thanks.


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


Begin forwarded message:


From: "David Lecomber" 
Date: March 22, 2007 6:44:35 PM GMT-04:00
To: bro...@umich.edu
Cc: jacq...@allinea.com
Subject: Re: [Allinea #6458] message queues
Reply-To: supp...@allinea.com

Hi Brock

This question has just become "frequently asked" -- no-one asked in  
all

of the last 12 months, and I think you're the third person this month,
the second today!

OpenMPI does not (yet?) support message queue debugging  - this means
the interface just isn't there for a debugger to get the information,
sadly.  Open-MPI's own FAQ does mention this lack of support, but I'm
not sure of an ETA or whether they are actively developing it.


Best wishes
David



On Thu, 2007-03-22 at 20:08 +, Brock Palen wrote:

Thu Mar 22 20:08:41 2007: Request 6458 was acted upon.
Transaction: Ticket created by bro...@umich.edu
   Queue: support
 Subject: message queues
   Owner: Nobody
  Requestors: bro...@umich.edu
  Status: new
 Ticket http://swtracker//Ticket/Display.html?id=6458 >


Hello,

According to the manual, if you get the message "unable to load
message queue library" you should look at the FAQ, but I cannot find a FAQ
anyplace.  We are new users and our MPI lib is openmpi-1.0.2

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985




--
David Lecomber
CTO Allinea Software
Tel: +44 1926 623231
Fax: +44 1926 623232





Re: [OMPI users] implementation of a message logging protocol

2007-03-22 Thread George Bosilca

Thomas,

We are working on this topic at the University of Tennessee. In fact,
2 of the MPICH-V guys are now working on the fault tolerance aspects
of Open MPI. With all the expertise we gathered doing MPICH-V, we
decided to take a different approach and to take advantage of the
modular architecture offered by Open MPI. We don't focus on any
specific message logging protocol right now (but we expect to have at
least all those present in MPICH-V3). Instead, what we target is a
generic framework which will allow researchers to implement, in a
simple and straightforward way, any message logging protocol they
want, as well as providing all the tools required to make their lives
easier.


The code is not yet in the Open MPI trunk but it will get there soon.  
We expect to be able to start moving the message logging framework in  
the trunk over the next month.


  Thanks,
george.

On Mar 22, 2007, at 4:48 AM, Thomas Ropars wrote:


Dear all,

I am currently working on a fault tolerant protocol for message  
passing

applications based on message logging.
For my experimentations, I want to implement my protocol in a MPI  
library.

I know that message logging protocols have already been implemented in
MPICH with MPICH-V.

I'm wondering if in the actual state of Open MPI it is possible to do
the same kind of work in this library ?
Is there somebody currently working on the same subject ?

Best regards,

Thomas Ropars.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Fwd: [Allinea #6458] message queues

2007-03-22 Thread George Bosilca
Open MPI has support for one parallel debugger: TotalView. I don't
know how DDT interacts with the MPI library in order to get access to
the message queues, but we provide a library which allows TotalView to
get access to the internal representation of the message queues in Open
MPI. The access to the message queues is complete: you can see all
pending sends and receives, as well as unexpected messages.


  Thanks,
george.

On Mar 22, 2007, at 7:10 PM, Brock Palen wrote:

We use OpenMPI as our default MPI lib on our clusters.  We are  
starting to do some work with parallel debuggers (ddt to be exact)   
and was wondering what the time line for message queue debugging  
was.  Just curious!  Thanks.


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


Begin forwarded message:


From: "David Lecomber" 
Date: March 22, 2007 6:44:35 PM GMT-04:00
To: bro...@umich.edu
Cc: jacq...@allinea.com
Subject: Re: [Allinea #6458] message queues
Reply-To: supp...@allinea.com

Hi Brock

This question has just become "frequently asked" -- no-one asked  
in all
of the last 12 months, and I think you're the third person this  
month,

the second today!

OpenMPI does not (yet?) support message queue debugging  - this means
the interface just isn't there for a debugger to get the information,
sadly.  Open-MPI's own FAQ does mention this lack of support, but I'm
not sure of an ETA or whether they are actively developing it.


Best wishes
David



On Thu, 2007-03-22 at 20:08 +, Brock Palen wrote:

Thu Mar 22 20:08:41 2007: Request 6458 was acted upon.
Transaction: Ticket created by bro...@umich.edu
   Queue: support
 Subject: message queues
   Owner: Nobody
  Requestors: bro...@umich.edu
  Status: new
 Ticket http://swtracker//Ticket/Display.html?id=6458 >


Hello,

According the the manual if you get the message "unable to load
message queue library"  to look at the FAQ, but i can not find a faq
anyplace.  We are new users and our mpi lib is openmpi-1.0.2

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985




--
David Lecomber
CTO Allinea Software
Tel: +44 1926 623231
Fax: +44 1926 623232




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] hostfile syntax

2007-03-22 Thread Tim Prins
Geoff,

'cpu', 'slots', and 'count' all do exactly the same thing.

Tim

On Thursday 22 March 2007 03:03 pm, Geoff Galitz wrote:
> Does the hostfile understand the syntax:
>
> mybox cpu=4
>
> I have some legacy code and scripts that I'd like to move without
> modifying if possible.  I understand the syntax is supposed to be:
>
> mybox slots=4
>
> but using "cpu" seems to work.  Does that achieve the same thing?
>
> -geoff
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users