[O-MPI users] MacResearch.org announces iPod giveaway contest

2006-02-10 Thread Joel Dudley
Help MacResearch.org expand its Script Repository and you could win a  
black 2GB iPod Nano. Eligible contestants must submit a research- 
oriented script that can run natively (no emulators) on Mac OS X 10.3  
or higher without modification before the contest end date. Scripts  
for all scientific domains are welcome including scripts written for  
High Performance Computing (grid, cluster, etc) setup and management.  
If your script does not meet the aforementioned criteria then you  
will not be eligible to win the iPod Nano. Winners will be chosen by  
random drawing. The contest begins 2/8/2006 and ends 2/28/2006.  The  
ultimate goal of this contest, and the script repository in general,  
is to create a valuable community resource that can be used to  
benefit endeavors in research and education. Please don't be shy  
about your coding style or lack of documentation. Your script will  
make someone's life easier. MacResearch.org is the premier, non- 
profit community for scientists using Mac OS X and related hardware  
in their research. To learn more about MacResearch.org and the  
MacResearch.org Script Repository visit http://www.macresearch.org  
and http://www.macresearch.org/script_repository.



For official contest rules see http://www.macresearch.org/ipod_contest


[O-MPI users] Has anyone built (used) OpenMPI with BLCR??

2006-02-10 Thread Alexandre Carissimi


Hi;

I'm trying to use BLCR to checkpoint Open MPI applications,
but I'm having lots of problems. To begin with, I'm not sure that
Open MPI recognizes BLCR. I compiled Open MPI with the
same --with options I used to pass to the LAM versions.

The output of ompi_info doesn't seem to show BLCR support.

Any hints? Has anyone tried to do this?

Thanks in advance.

Alex

--
___
CARISSIMI, Alexandre  Universidade Federal do Rio Grande do Sul
a...@inf.ufrgs.br  Instituto de Informática
Tel: +55.51.33.16.61.69   Caixa Postal 15064
Fax: +55.51.33.16.73.08   CEP:91501-970 Porto Alegre - RS - Brasil
___




[O-MPI users] Strange errors when using open-mpi

2006-02-10 Thread Berend van Wachem

Hi,

I have always used MPICH for my MPI projects, but changed to Open MPI
for its better integration with Eclipse.


First of all, I got an error when compiling the code with gcc 4.x,
but I think this was discussed earlier on the mailing list.


I downgraded gcc and successfully compiled Open MPI. However, when I
run my project with 6 processors, I get the error:


*** glibc detected *** malloc(): memory corruption: 0x00f09190 ***


I have no idea where it comes from or how to debug this. Can anyone 
provide me with information or hints?


I am using openmpi-1.0.2a1 and gcc (GCC) 3.4.4 on amd64 Linux.


Thanks,

Berend.


Re: [O-MPI users] MacResearch.org announces iPod giveaway contest

2006-02-10 Thread Jeff Squyres
This list is for the discussion of Open MPI.  Please do not use it as  
a mechanism for 3rd party announcements.



On Feb 10, 2006, at 2:47 AM, Joel Dudley wrote:


Help MacResearch.org expand its Script Repository and you could win a
black 2GB iPod Nano. Eligible contestants must submit a research-
oriented script that can run natively (no emulators) on Mac OS X 10.3
or higher without modification before the contest end date. Scripts
for all scientific domains are welcome including scripts written for
High Performance Computing (grid, cluster, etc) setup and management.
If your script does not meet the aforementioned criteria then you
will not be eligible to win the iPod Nano. Winners will be chosen by
random drawing. The contest begins 2/8/2006 and ends 2/28/2006.  The
ultimate goal of this contest, and the script repository in general,
is to create a valuable community resource that can be used to
benefit endeavors in research and education. Please don't be shy
about your coding style or lack of documentation. Your script will
make someone's life easier. MacResearch.org is the premier, non-
profit community for scientists using Mac OS X and related hardware
in their research. To learn more about MacResearch.org and the
MacResearch.org Script Repository visit http://www.macresearch.org
and http://www.macresearch.org/script_repository.


For official contest rules see http://www.macresearch.org/ipod_contest



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] Has anyone built (used) OpenMPI with BLCR??

2006-02-10 Thread Josh Hursey

Alex,

Checkpoint/restart is not supported in Open MPI at the moment. The
integration of LAM/MPI-style process fault tolerance using a
single-process checkpointer (e.g., BLCR) is currently under active
development. Unfortunately, I cannot say exactly when you will see it
released, but keep watching the users list for updates.


Cheers,
Josh

On Feb 10, 2006, at 4:25 AM, Alexandre Carissimi wrote:



Hi;

I'm trying to use BLCR to checkpoint Open MPI applications,
but I'm having lots of problems. To begin with, I'm not sure that
Open MPI recognizes BLCR. I compiled Open MPI with the
same --with options I used to pass to the LAM versions.

The output of ompi_info doesn't seem to show BLCR support.

Any hints? Has anyone tried to do this?

Thanks in advance.

Alex

--
___
CARISSIMI, Alexandre  Universidade Federal do Rio Grande do Sul
a...@inf.ufrgs.br  Instituto de Informática
Tel: +55.51.33.16.61.69   Caixa Postal 15064
Fax: +55.51.33.16.73.08   CEP:91501-970 Porto Alegre - RS - Brasil
___





Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/




Re: [O-MPI users] Has anyone built (used) OpenMPI with BLCR??

2006-02-10 Thread Alexandre Carissimi


Josh;

Thanks a lot!!

I was afraid of that :)

I looked at your documentation but was not sure whether it was
up to date or not... so I tried to install/compile Open MPI in the
same way that I used to do with LAM... configure didn't
complain about my --with-xyz options. I started to suspect something
when I saw the output from ompi_info.

I'll continue to be a LAM/MPI user for the moment.

Cheers,

Alex


Josh Hursey wrote:


Alex,

Checkpoint/Restart is not supported in Open MPI, at the moment. The  
integration of LAM/MPI style of process fault tolerance using a  
single process checkpointer (e.g. BLCR) is currently under active  
development. Unfortunately, I cannot say exactly when you will see it  
released, but keep watching the users list for updates.


Cheers,
Josh

On Feb 10, 2006, at 4:25 AM, Alexandre Carissimi wrote:

 


Hi;

I'm trying to use BLCR to checkpoint Open MPI applications,
but I'm having lots of problems. To begin with, I'm not sure that
Open MPI recognizes BLCR. I compiled Open MPI with the
same --with options I used to pass to the LAM versions.

The output of ompi_info doesn't seem to show BLCR support.

Any hints? Has anyone tried to do this?

Thanks in advance.

Alex

--
___
CARISSIMI, Alexandre  Universidade Federal do Rio Grande do Sul
a...@inf.ufrgs.br  Instituto de Informática
Tel: +55.51.33.16.61.69   Caixa Postal 15064
Fax: +55.51.33.16.73.08   CEP:91501-970 Porto Alegre - RS - Brasil
___


   




Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/



 




--
___
CARISSIMI, Alexandre  Universidade Federal do Rio Grande do Sul
a...@inf.ufrgs.br  Instituto de Informática
Tel: +55.51.33.16.61.69   Caixa Postal 15064
Fax: +55.51.33.16.73.08   CEP:91501-970 Porto Alegre - RS - Brasil
___




Re: [O-MPI users] direct openib btl and latency

2006-02-10 Thread Galen M. Shipman


I've been working on the MVAPICH project for around three years. Since
this thread is discussing MVAPICH, I thought I should post to it.
Galen's description of MVAPICH is not accurate. MVAPICH uses RDMA for
short messages to deliver performance benefits to applications.
However, it needs to be designed properly to handle scalability while
delivering the best performance. Since MVAPICH-0.9.6 (released on
December 6th, 2005), MVAPICH has supported a new mode of operation
called ADAPTIVE_RDMA_FAST_PATH (the basic RDMA_FAST_PATH is also
supported).

This new design uses RDMA for short message transfers in an intelligent
and adaptive manner. Using this mode, the memory allocation of MVAPICH
is no longer static; instead it is dynamic. It is an implementation of
short-message RDMA for a limited set of peers (user controllable),
which is what Galen is suggesting. MVAPICH already supports this
feature. This also means that in the paper Galen mentions, the
comparison results in Figures 4 through 7 have to be re-evaluated to
make the paper and the results accurate.



I'm not sure how my results and description are not accurate; I was
comparing against MVAPICH 0.9.5-118, which is detailed in my
experimental setup. This was done before your release of MVAPICH-0.9.6.
I do think it is good that you have addressed these shortcomings,
however, and I appreciate the clarification.


Thanks,

Galen







Hope this helps.

Thanks,
Sayantan.


--
http://www.cse.ohio-state.edu/~surs




Re: [OMPI users] [O-MPI users] problem running Migrate with open-MPI

2006-02-10 Thread Andy Vierstraete

Hi Brian and Peter,

I tried the nightly build like Brian said, and I was able to compile
Migrate without error messages (that was not the case before; as Peter
suggested, I had to set Open MPI in my path).  But it is still not
running: now it can't find "libmpi.so.0", even though the directory
where the file is located is in my path.

If I install Open MPI 1.0.1 again, I get the same error messages as last time.

I'll try it again with LAM/MPI and see if that works for compiling
Migrate correctly and whether it runs on this PC...



avierstr@muscorum:~> migrate-mpi
migrate-mpi: error while loading shared libraries: libmpi.so.0: cannot 
open shared object file: No such file or directory

avierstr@muscorum:~> migrate-n
migrate-n: error while loading shared libraries: libmpi.so.0: cannot 
open shared object file: No such file or directory

avierstr@muscorum:~> echo $PATH
/home/avierstr/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/local/openmpi-1.1a1/bin:/usr/local/Modeltest3.7/source:/usr/local/mrbayes-3.1.2:/usr/local/bin:/usr/local/MrModeltest2.2:/usr/local/paup4b10:/usr/local/mrbayes-3.1.2-mpi:/usr/local/openmpi-1.1a1/lib:/usr/local/migrate-2.1.3-mpi/src:/usr/local/openmpi-1.1a1/bin:/usr/local/Modeltest3.7/source:/usr/local/mrbayes-3.1.2:/usr/local/bin:/usr/local/MrModeltest2.2:/usr/local/paup4b10:/usr/local/mrbayes-3.1.2-mpi:/usr/local/openmpi-1.1a1/lib:/usr/local/migrate-2.1.3-mpi/src:/usr/local/Modeltest3.7/source:/usr/local/mrbayes-3.1.2:/usr/local/bin:/usr/local/MrModeltest2.2:/usr/local/paup4b10:/usr/local/mrbayes-3.1.2-mpi:/usr/local/openmpi-1.1a1/lib:/usr/local/migrate-2.1.3-mpi/src
avierstr@muscorum:~> mpiexec -np 2 migrate-mpi
orted: error while loading shared libraries: liborte.so.0: cannot open 
shared object file: No such file or directory
[muscorum:12220] ERROR: A daemon on node localhost failed to start as 
expected.

[muscorum:12220] ERROR: There may be more information available from
[muscorum:12220] ERROR: the remote shell (see above).
[muscorum:12220] ERROR: The daemon exited unexpectedly with status 127.
avierstr@muscorum:~>


Peter Beerli wrote:


Dear Brian,

The original poster intended to run migrate-n in parallel mode, but the
stdout fragment shows that the program was compiled for a non-MPI  
architecture
(either single CPU or SMP pthreads) [I talked with him list-offline  
and it used pthreads].
A version for parallel runs shows this fact in its first couple of  
lines, like this (<):

  =
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =
  compiled for a PARALLEL COMPUTER ARCHITECTURE
<@

  Version debug 2.1.3   [x]
  Program started at   Wed Feb  8 12:29:35 2006

As far as I am concerned migrate-n compiles and runs on openmpi  
1.0.1. There might be some use in running
the program multiple times completely independently through openmpi  
or lam for simulation purposes, but
that would not be a typical use of the program that can distribute  
multiple genetic loci on multiple nodes and only having
the master handling input and output (when compiled using configure;  
make mpis or configure;make mpi)



Peter

Peter Beerli,
Computational Evolutionary Biology Group
School of Computational Science (SCS)
and Biological Sciences Department
150-T Dirac Science Library
Florida State University
Tallahassee, Florida 32306-4120 USA
Webpage: http://www.csit.fsu.edu/~beerli
Phone: 850.645.1324
Fax: 850.644.0094





On Feb 8, 2006, at 11:24 AM, Brian Barrett wrote:

 


I think we fixed this over this last weekend.  I believe the problem
was our mis-handling of standard input in some cases. I believe I was
able to get the application running (but I could be fooling myself
there...).  Could you download the latest nightly build from the URL
below and see if it works for you?  The fixes are scheduled to be
part of Open MPI 1.0.2, which should be out real soon now.

http://www.open-mpi.org/nightly/trunk/

Thanks,

Brian

On Feb 3, 2006, at 10:23 AM, Andy Vierstraete wrote:

   


Hi,

I have installed Migrate  2.1.2, but it fails to run on open-MPI (it
does run on LAM-MPI : see end of mail)

my system is Suse 10 on Athlon X2

hostfile : localhost slots=2 max_slots=2

I tried different commands :

1. does not start : error message :
**

avierstr@muscorum:~/thomas> mpiexec  -np 2 migrate-mpi
mpiexec noticed that job rank 1 with PID 0 on node "localhost"
exited on
signal 11.
[muscorum:07212] ERROR: A daemon on node localhost failed to start as
expected.
[muscorum:07212] ERROR: There may be more information available from
[muscorum:07212] ERROR: the remote shell (see above).
[muscorum:07212] The daemon received a signal 11.
1 additional process aborted (not shown)



2. s

Re: [OMPI users] [O-MPI users] problem running Migrate with open-MPI

2006-02-10 Thread Andy Vierstraete

Hi Brian and Peter,

It works with LAM/MPI, so probably there is still something wrong with Open MPI?


Greets,

Andy

avierstr@muscorum:~> lamboot hostfile

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

avierstr@muscorum:~> mpiexec migrate-n
migrate-n
migrate-n
 =
 MIGRATION RATE AND POPULATION SIZE ESTIMATION
 using Markov Chain Monte Carlo simulation
 =
 compiled for a PARALLEL COMPUTER ARCHITECTURE
 Version 2.1.3
 Program started at   Fri Feb 10 16:49:55 2006


 Settings for this run:
 D   Data type currently set to: DNA sequence model
 I   Input/Output formats
 P   Parameters  [start, migration model]
 S   Search strategy
 W   Write a parmfile
 Q   Quit the program


 Are the settings correct?
 (Type Y or the letter for one to change)



Peter Beerli wrote:


Dear Brian,

The original poster intended to run migrate-n in parallel mode, but the
stdout fragment shows that the program was compiled for a non-MPI  
architecture
(either single CPU or SMP pthreads) [I talked with him list-offline  
and it used pthreads].
A version for parallel runs shows this fact in its first couple of  
lines, like this (<):

  =
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =
  compiled for a PARALLEL COMPUTER ARCHITECTURE
<@

  Version debug 2.1.3   [x]
  Program started at   Wed Feb  8 12:29:35 2006

As far as I am concerned migrate-n compiles and runs on openmpi  
1.0.1. There might be some use in running
the program multiple times completely independently through openmpi  
or lam for simulation purposes, but
that would not be a typical use of the program that can distribute  
multiple genetic loci on multiple nodes and only having
the master handling input and output (when compiled using configure;  
make mpis or configure;make mpi)



Peter

Peter Beerli,
Computational Evolutionary Biology Group
School of Computational Science (SCS)
and Biological Sciences Department
150-T Dirac Science Library
Florida State University
Tallahassee, Florida 32306-4120 USA
Webpage: http://www.csit.fsu.edu/~beerli
Phone: 850.645.1324
Fax: 850.644.0094





On Feb 8, 2006, at 11:24 AM, Brian Barrett wrote:

 


I think we fixed this over this last weekend.  I believe the problem
was our mis-handling of standard input in some cases. I believe I was
able to get the application running (but I could be fooling myself
there...).  Could you download the latest nightly build from the URL
below and see if it works for you?  The fixes are scheduled to be
part of Open MPI 1.0.2, which should be out real soon now.

http://www.open-mpi.org/nightly/trunk/

Thanks,

Brian

On Feb 3, 2006, at 10:23 AM, Andy Vierstraete wrote:

   


Hi,

I have installed Migrate  2.1.2, but it fails to run on open-MPI (it
does run on LAM-MPI : see end of mail)

my system is Suse 10 on Athlon X2

hostfile : localhost slots=2 max_slots=2

I tried different commands :

1. does not start : error message :
**

avierstr@muscorum:~/thomas> mpiexec  -np 2 migrate-mpi
mpiexec noticed that job rank 1 with PID 0 on node "localhost"
exited on
signal 11.
[muscorum:07212] ERROR: A daemon on node localhost failed to start as
expected.
[muscorum:07212] ERROR: There may be more information available from
[muscorum:07212] ERROR: the remote shell (see above).
[muscorum:07212] The daemon received a signal 11.
1 additional process aborted (not shown)



2. starts a non-ending loop :


avierstr@muscorum:~/thomas> mpirun -np 2 --hostfile ./hostfile
migrate-mpi
migrate-mpi
 =
 MIGRATION RATE AND POPULATION SIZE ESTIMATION
 using Markov Chain Monte Carlo simulation
 =
 Version 2.1.2
 Program started at   Fri Feb  3 15:58:57 2006


 Settings for this run:
 D   Data type currently set to: DNA sequence model
 I   Input/Output formats
 P   Parameters  [start, migration model]
 S   Search strategy
 W   Write a parmfile
 Q   Quit the program


 Are the settings correct?
 (Type Y or the letter for one to change)
 Settings for this run:
 D   Data type currently set to: DNA sequence model
 I   Input/Output formats
 P   Parameters  [start, migration model]
 S   Search strategy
 W   Write a parmfile
 Q   Quit the program


 Are the settings correct?
 (Type Y or the letter for one to change)
 Settings for this run:
 D   Data type currently set to: DNA sequence model
 I   Input/Output formats
 P   Parameters  [start, migration model]
 S   Search strategy
 W   Write a parmfile
 Q   Quit the p

Re: [OMPI users] [O-MPI users] problem running Migrate with open-MPI

2006-02-10 Thread George Bosilca
There are two things that have to be done in order to run an
Open MPI application. First, the runtime environment needs access to
some of the files in the bin directory, so you have to add the Open
MPI bin directory to your PATH. Second, as we use shared
libraries, the OS needs to know where they can be found. This is done
using the LD_LIBRARY_PATH environment variable. So suppose that one
has compiled Open MPI like this:

./configure --prefix=/home/one/opt

Then one has to add the following to one's tcsh startup script (.tcshrc):

setenv PATH "/home/one/opt/bin:${PATH}"
setenv LD_LIBRARY_PATH "/home/one/opt/lib:${LD_LIBRARY_PATH}"

That should fix your problem. Enjoy.

george.

On Feb 10, 2006, at 10:31 AM, Andy Vierstraete wrote:


Hi Brian and Peter,

I tried the nightly build like Brian said, and I was able to
compile Migrate without error messages (that was not the case
before; as Peter suggested, I had to set Open MPI in my path).
But it is still not running: now it can't find "libmpi.so.0", even
though the directory where the file is located is in my path.
If I install Open MPI 1.0.1 again, I get the same error messages as
last time.


I'll try it again with LAM/MPI and see if that works for compiling
Migrate correctly and whether it runs on this PC...



avierstr@muscorum:~> migrate-mpi
migrate-mpi: error while loading shared libraries: libmpi.so.0:  
cannot open shared object file: No such file or directory

avierstr@muscorum:~> migrate-n
migrate-n: error while loading shared libraries: libmpi.so.0:  
cannot open shared object file: No such file or directory

avierstr@muscorum:~> echo $PATH
/home/avierstr/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/ 
games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/ 
sbin:/usr/local/openmpi-1.1a1/bin:/usr/local/Modeltest3.7/source:/ 
usr/local/mrbayes-3.1.2:/usr/local/bin:/usr/local/MrModeltest2.2:/ 
usr/local/paup4b10:/usr/local/mrbayes-3.1.2-mpi:/usr/local/ 
openmpi-1.1a1/lib:/usr/local/migrate-2.1.3-mpi/src:/usr/local/ 
openmpi-1.1a1/bin:/usr/local/Modeltest3.7/source:/usr/local/ 
mrbayes-3.1.2:/usr/local/bin:/usr/local/MrModeltest2.2:/usr/local/ 
paup4b10:/usr/local/mrbayes-3.1.2-mpi:/usr/local/openmpi-1.1a1/lib:/ 
usr/local/migrate-2.1.3-mpi/src:/usr/local/Modeltest3.7/source:/usr/ 
local/mrbayes-3.1.2:/usr/local/bin:/usr/local/MrModeltest2.2:/usr/ 
local/paup4b10:/usr/local/mrbayes-3.1.2-mpi:/usr/local/ 
openmpi-1.1a1/lib:/usr/local/migrate-2.1.3-mpi/src

avierstr@muscorum:~> mpiexec -np 2 migrate-mpi
orted: error while loading shared libraries: liborte.so.0: cannot  
open shared object file: No such file or directory
[muscorum:12220] ERROR: A daemon on node localhost failed to start  
as expected.

[muscorum:12220] ERROR: There may be more information available from
[muscorum:12220] ERROR: the remote shell (see above).
[muscorum:12220] ERROR: The daemon exited unexpectedly with status  
127.

avierstr@muscorum:~>


Peter Beerli wrote:
Dear Brian,

The original poster intended to run migrate-n in parallel mode, but the
stdout fragment shows that the program was compiled for a non-MPI
architecture (either single CPU or SMP pthreads) [I talked with him
list-offline and it used pthreads]. A version for parallel runs shows
this fact in its first couple of lines, like this (<):

  =
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =
  compiled for a PARALLEL COMPUTER ARCHITECTURE   <@

  Version debug 2.1.3   [x]
  Program started at   Wed Feb 8 12:29:35 2006

As far as I am concerned migrate-n compiles and runs on openmpi 1.0.1.
There might be some use in running the program multiple times
completely independently through openmpi or lam for simulation
purposes, but that would not be a typical use of the program that can
distribute multiple genetic loci on multiple nodes and only having the
master handling input and output (when compiled using configure; make
mpis or configure;make mpi)

Peter

Peter Beerli,
Computational Evolutionary Biology Group
School of Computational Science (SCS)
and Biological Sciences Department
150-T Dirac Science Library
Florida State University
Tallahassee, Florida 32306-4120 USA
Webpage: http://www.csit.fsu.edu/~beerli
Phone: 850.645.1324
Fax: 850.644.0094

On Feb 8, 2006, at 11:24 AM, Brian Barrett wrote:

I think we fixed this over this last weekend. I believe the problem was
our mis-handling of standard input in some cases. I believe I was able
to get the application running (but I could be fooling myself
there...). Could you download the latest nightly build from the URL
below and see if it works for you? The fixes are scheduled to be part
of Open MPI 1.0.2, which should be out real soon now.

http://www.open-mpi.org/nightly/trunk/

Thanks,
Brian

On Feb 3, 2006, at 10:23 AM, Andy Vierstraete wrote:

Hi, I have installed M

[OMPI users] Cannonical ring program and Mac OSX 10.4.4

2006-02-10 Thread James Conway

Brian et al,

Original thread was "[O-MPI users] Firewall ports and Mac OS X 10.4.4"

On Feb 9, 2006, at 11:26 PM, Brian Barrett wrote:


Open MPI uses random port numbers for all its communication.
(etc)


Thanks for the explanation. I will live with the open Firewall, and  
look at the ipfw docs for writing a script.


Now I have a more "core" OpenMPI problem, which may be just  
unfamiliarity on my part. I seem to have the environment variables  
set up alright though - the code runs, but doesn't complete.


I have copied the "MPI Tutorial: The cannonical ring program" from the
LAM/MPI tutorials site (http://www.lam-mpi.org/tutorials/). It compiles
and runs fine on the
localhost (one CPU, one or more MPI processes). If I copy it to a  
remotehost, it does one round of passing the 'tag' then stalls. I  
modified the print statements a bit to see where in the code it  
stalls, but the loop hasn't changed. This is what I see happening:
1. Process 0 successfully kicks off the pass-around by sending the  
tag to the next process (1), and then enters the loop where it waits  
for the tag to come back.
2. Process 1 enters the loop, receives the tag and passes it on (back  
to process 0 since this is a ring of 2 players only).
3. Process 0 successfully receives the tag, decrements it, and calls  
the next send (MPI_Send) but it doesn't return from this. I have a  
print statement right after (with fflush) but there is no output. The  
CPU is maxed out on both the local and remote hosts, I assume some  
kind of polling.

4. Needless to say, Process 1 never reports receipt of the tag.

Output (with a little re-ordering to make sense) is:
   mpirun --hostfile my_mpi_hosts --np 2 mpi_test1
   Process rank 0: size = 2
   Process rank 1: size = 2
   Enter the number of times around the ring: 5

   Process 0 doing first send of '4' to 1
   Process 0 finished sending, now entering loop

   Process 0 waiting to receive from 1

   Process 1 waiting to receive from 0
   Process 1 received '4' from 0
   Process 1 sending '4' to 0
   Process 1 finished sending
   Process 1 waiting to receive from 0

   Process 0 received '4' from 1
   >>Process 0 decremented num
   Process 0 sending '3' to 1
   ! nothing more - hangs at 100% cpu until ctrl-
   ! should see "Process 0 finished sending"

Since process 0 succeeds in calling MPI_Send before the loop, and in  
calling MPI_Recv at the start of the loop, the communications appear  
to be working. Likewise, process 1 succeeds in receiving and sending  
within the loop. However, if it's significant, these calls work only
once for each process - the second time MPI_Send is called by process
0, there is a hang.


I am using Mac OS X 10.4.4 and gcc 4.0.1 on both systems, with Open MPI
1.0.1 installed (compiled from sources). The small tutorial code is
below (I hope it's OK to include here), with the few printf mods that
I made.


Any pointers appreciated!

James Conway

--
James Conway, PhD.,
Department of Structural Biology
University of Pittsburgh School of Medicine
Biomedical Science Tower 3, Room 2047
3501 5th Ave
Pittsburgh, PA 15260
U.S.A.
Phone: +1-412-383-9847
Fax:   +1-412-648-8998
Email: jxc...@pitt.edu
Web:    (under construction)
--


/*
 * Open Systems Lab
 * http://www.lam-mpi.org/tutorials/
 * Indiana University
 *
 * MPI Tutorial
 * The cannonical ring program
 *
 * Mail questions regarding tutorial material to m...@lam-mpi.org
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]);


int main(int argc, char *argv[])
{
  MPI_Status status;
  int num, rank, size;

  /* Start up MPI */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

/*
Arbitrarily choose 201 to be our tag.  Calculate the
rank of the next process in the ring.  Use the modulus
operator so that the last process "wraps around" to rank
zero.
*/

  const int tag  = 201;
  const int next = (rank + 1) % size;
  const int from = (rank + size - 1) % size;

  printf("Process rank %d: size = %d\n", rank, size);

/*
If we are the "console" process, get an integer from the user
to specify how many times we want to go around the ring
*/

  if (rank == 0) {
printf("Enter the number of times around the ring: ");
scanf("%d", &num);
--num;

printf("Process %d doing first send of '%d' to %d\n", rank, num,  
next);

MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
printf("Process %d finished sending, now entering loop\n", rank);
fflush(stdout);
  }

/*
Pass the message around the ring.  The exit mechanism works
as follows: the message (a positive integer) is passed
around the ring.  Each time is passes rank 0, it is decremented.
When each processes receives the 0 message, it passes it on
to the next process and then quits.  By passing the 0 first,
every process gets t
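
The listing is truncated at this point in the archive. For completeness,
the remainder of the canonical ring loop would look roughly like the
sketch below; it is reconstructed from the tutorial structure described
above and need not match James's exact printf modifications.

  while (1) {
    MPI_Recv(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);
    printf("Process %d received '%d' from %d\n", rank, num, from);

    if (rank == 0) {
      --num;
      printf("Process %d decremented num\n", rank);
    }

    printf("Process %d sending '%d' to %d\n", rank, num, next);
    MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    printf("Process %d finished sending\n", rank);
    fflush(stdout);

    /* Once the zero message has been passed on, this process is done. */
    if (num == 0)
      break;
  }

  /* Rank 0 absorbs the final zero sent on by the last process. */
  if (rank == 0)
    MPI_Recv(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);

  MPI_Finalize();
  return 0;
}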

[OMPI users] Bug in OMPI 1.0.1 using MPI_Recv with indexed datatypes

2006-02-10 Thread Yvan Fournier
Hello,

I seem to have encountered a bug in Open MPI 1.0 using indexed datatypes
with MPI_Recv (which seems to be of the "off by one" sort). I have
joined a test case, which is briefly explained below (as well as in the
source file). This case should run on two processes. I observed the bug
on 2 different Linux systems (single processor Centrino under Suse 10.0
with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4)
with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.

Here is a summary of the case:

--

Each processor reads a file ("data_p0" or "data_p1") giving a list of
global element ids. Some elements (vertices from a partitioned mesh)
may belong to both processors, so their ids may appear on both
processors: we have 7178 global vertices, 3654 and 3688 of them being
known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each
vertex equal to its global id number for rank 1, and the negative of
that for rank 0 (assigning the same values to x, y, and z). After
finishing the "ordered gather", rank 0 prints the global id and
coordinates of each vertex.

lines should print (for example) as:
  6456 ;   6455.0   6455.0   6456.0
  6457 ;  -6457.0  -6457.0  -6457.0
depending on whether a vertex belongs only to rank 0 (negative
coordinates) or belongs to rank 1 (positive coordinates).

With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on
Debian sarge with gcc 3.4), we have for example for the last vertices:
  7176 ;   7175.0   7175.0   7176.0
  7177 ;   7176.0   7176.0   7177.0
seeming to indicate an "off by one" type bug in datatype handling

When not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
in the gather_test.c file), the bug disappears. Using the indexed
datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug
either, so it does seem to be an Open MPI issue.

--

Best regards,

Yvan Fournier


ompi_datatype_bug.tar.gz
Description: application/compressed-tar
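
The attached test case is not reproduced in the archive. For readers who
want to experiment, the following is a minimal sketch written for this
digest (it is not Yvan's gather_test.c): rank 1 sends a contiguous stream
of {x, y, z} triples and rank 0 scatters them into non-contiguous vertex
slots through an MPI_Type_indexed datatype handed to MPI_Recv, which is
the receive-side pattern the report exercises. Run it on two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, i;
  enum { N_RECV = 4, N_VTX = 8, DIM = 3 };
  int blocklens[N_RECV], displs[N_RECV];
  double coords[N_VTX * DIM];
  MPI_Datatype idx_type;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    /* Rank 1 sends N_RECV contiguous {x,y,z} triples. */
    double buf[N_RECV * DIM];
    for (i = 0; i < N_RECV * DIM; i++)
      buf[i] = (double) (i + 1);
    MPI_Send(buf, N_RECV * DIM, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
  }
  else if (rank == 0) {
    for (i = 0; i < N_VTX * DIM; i++)
      coords[i] = 0.0;

    /* Receive the triples into every other vertex slot of coords. */
    for (i = 0; i < N_RECV; i++) {
      blocklens[i] = DIM;            /* one {x,y,z} triple per block      */
      displs[i]    = 2 * i * DIM;    /* displacements in units of doubles */
    }
    MPI_Type_indexed(N_RECV, blocklens, displs, MPI_DOUBLE, &idx_type);
    MPI_Type_commit(&idx_type);

    MPI_Recv(coords, 1, idx_type, 1, 99, MPI_COMM_WORLD, &status);

    for (i = 0; i < N_VTX; i++)
      printf("vertex %d : %g %g %g\n",
             i, coords[i * DIM], coords[i * DIM + 1], coords[i * DIM + 2]);

    MPI_Type_free(&idx_type);
  }

  MPI_Finalize();
  return 0;
}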


Re: [OMPI users] [O-MPI users] A few benchmarks

2006-02-10 Thread Jeff Squyres

On Feb 6, 2006, at 8:32 PM, Glen Kaukola wrote:

Anyway, here are the times on a few runs I did with Open MPI  
1.1a1r887.

  Basically what I'm seeing, my jobs run ok when they're local to one
machine, but as soon as they're split up between multiple machines
performance can vary:

4 cpu jobs:
2:16:27
2:01:35
1:24:20
1:03:55
1:22:43
1:31:53


Wow -- am I reading this correctly in that you are seeing a delta of
over 1 minute in runs of the same application with the same data?   
That should absolutely not be happening.


If you haven't already (I'm joining this thread late), can you send  
us your input deck so that we can try to reproduce this?


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [OMPI users] Bug in OMPI 1.0.1 using MPI_Recv with indexed datatypes

2006-02-10 Thread George Bosilca

Yvan,

I'm looking into this one. So far I cannot reproduce it with the  
current version from the trunk. I will look into the stable versions.  
Until I figure out what's wrong, can you please use the nightly
builds to run your test? Once the problem gets fixed, it will be
included in the 1.0.2 release.


BTW, which interconnect are you using? Ethernet?

  Thanks,
george.

On Feb 10, 2006, at 5:06 PM, Yvan Fournier wrote:


Hello,

I seem to have encountered a bug in Open MPI 1.0 using indexed  
datatypes

with MPI_Recv (which seems to be of the "off by one" sort). I have
joined a test case, which is briefly explained below (as well as in  
the
source file). This case should run on two processes. I observed the  
bug
on 2 different Linux systems (single processor Centrino under Suse  
10.0

with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4)
with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.

Here is a summary of the case:

--

Each processor reads a file ("data_p0" or "data_p1") giving a list of
global element ids. Some elements (vertices from a partitioned mesh)
may belong to both processors, so their ids may appear on both
processors: we have 7178 global vertices, 3654 and 3688 of them being
known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each
vertex equal to its global id number for rank 1, and the negative of
that for rank 0 (assigning the same values to x, y, and z). After
finishing the "ordered gather", rank 0 prints the global id and
coordinates of each vertex.

lines should print (for example) as:
  6456 ;   6455.0   6455.0   6456.0
  6457 ;  -6457.0  -6457.0  -6457.0
depending on whether a vertex belongs only to rank 0 (negative
coordinates) or belongs to rank 1 (positive coordinates).

With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0  
and on

Debian sarge with gcc 3.4), we have for example for the last vertices:
  7176 ;   7175.0   7175.0   7176.0
  7177 ;   7176.0   7176.0   7177.0
seeming to indicate an "off by one" type bug in datatype handling

When not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
in the gather_test.c file), the bug disappears. Using the indexed
datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug
either, so it does seem to be an Open MPI issue.

--

Best regards,

Yvan Fournier



"Half of what I say is meaningless; but I say it so that the other  
half may reach you"

  Kahlil Gibran




Re: [OMPI users] [O-MPI users] mpirun with make

2006-02-10 Thread Jeff Squyres

On Feb 8, 2006, at 3:29 AM, Andreas Fladischer wrote:


I tested this example with hostname before and it worked well:

the hostfile contains only 2 lines
pc86
pc92

and the user wolf doesn't need a password when connecting to the other
PC. The user wolf has the same uid and gid on both PCs.

I also have another question: is it possible to use MPI to compile
some programs without changing the source code of the program?
programms without changing the source code of the program?


I'm guessing that you're seeing problems because you're trying to use  
a serial application in parallel improperly.  "make" was not designed  
to be invoked in parallel via MPI -- trying to do so will likely
result in any of several different errors.


MPI is designed for explicit parallelism -- applications that are  
written specifically to use MPI (i.e., by invoking the MPI API).  So  
if you have a non-MPI application and try to "parallelize" it by  
running it via MPI, you'll likely either get errors, unexpected  
results, or simply N copies of your application running.
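
For readers new to the distinction, a minimal example (written for this
digest, not taken from the thread) shows what "invoking the MPI API"
means: each of the copies started by mpirun calls into MPI to learn its
rank and size and can then coordinate with the others, which a serial
tool such as "make" never does.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size;

  /* Explicit parallelism: the program itself asks MPI who it is. */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Hello from rank %d of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}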


If you want to use a parallel software building tool, MPI is probably  
not what you are looking for.  Andrew suggested distcc and ccache --  
you might want to look into them.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [OMPI users] [O-MPI users] "alltoall" vs "alltoallv"

2006-02-10 Thread George Bosilca

Konstantin,

The all2all scheduling works only because we know all processes will
send the same amount of data, so the communications will take "nearly"
the same time. Therefore, we can predict how to schedule the
communications to get the best out of the network. But this approach
can lead to worse performance for all2allv in most cases. From a user
perspective, we can imagine that if the processes send roughly the
same sizes they can get some benefit from this approach. But from
inside the MPI library, how can we figure out the amounts that have to
be sent globally? Each of the processes knows only how much data it
has to send and how much data it has to receive ... but unfortunately
it does not have any information about the communications that will
take place between the others ...


Of course we could do an all2all with the sizes before deciding how to
do the all2allv, but the cost would be prohibitive in most cases.
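
To make that cost concrete, here is a small sketch of the kind of
pre-exchange described above; the helper name is invented for
illustration and is not an Open MPI internal. Every rank publishes its
per-peer send counts, so afterwards each rank holds the full size x size
traffic matrix and a schedule could in principle be chosen the way the
all2all code does. The extra collective itself is the overhead called
prohibitive above.

#include <stdlib.h>
#include <mpi.h>

/* Illustrative helper (hypothetical, not an Open MPI routine): returns a
 * freshly malloc'ed size*size matrix in which matrix[r*size + p] is the
 * number of elements rank r intends to send to rank p.  A caller would
 * run this before MPI_Alltoallv and free the result afterwards. */
int *gather_count_matrix(int *sendcounts, MPI_Comm comm)
{
  int size;
  int *matrix;

  MPI_Comm_size(comm, &size);
  matrix = malloc((size_t) size * size * sizeof(int));

  /* One extra collective: every rank publishes its send-count vector. */
  MPI_Allgather(sendcounts, size, MPI_INT, matrix, size, MPI_INT, comm);

  return matrix;
}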


Anyway, we're working on this issue and hopefully we will have a  
solution soon.


  Thanks,
george.

On Feb 7, 2006, at 11:45 AM, Konstantin Kudin wrote:


 Hi all,

 I was wondering if it would be possible to use the same scheduling  
for

"alltoallv" as for "alltoall". If one assumes the messages of roughly
the same size, then "alltoall" would not be an unreasonable
approximation for "alltoallv". As is, it appears that in v1.1
"alltoallv" is done via a bunch of "isend+irecv", while "alltoall"  
is a

bit more clever.

 One could then also have a runtime flag to use this sort of
substitution.


"Half of what I say is meaningless; but I say it so that the other  
half may reach you"

  Kahlil Gibran




Re: [OMPI users] [O-MPI users] A few benchmarks

2006-02-10 Thread Glen Kaukola

Jeff Squyres wrote:

On Feb 6, 2006, at 8:32 PM, Glen Kaukola wrote:

Anyway, here are the times on a few runs I did with Open MPI  
1.1a1r887.

  Basically what I'm seeing, my jobs run ok when they're local to one
machine, but as soon as they're split up between multiple machines
performance can vary:

4 cpu jobs:
2:16:27
2:01:35
1:24:20
1:03:55
1:22:43
1:31:53


Wow -- am I reading this correctly in that you are seeing a delta of
over 1 minute in runs of the same application with the same data?   
That should absolutely not be happening.


If you haven't already (I'm joining this thread late), can you send  
us your input deck so that we can try to reproduce this?



That's in hours actually.  So a delta of over one hour.  And yes, it's 
the same exact setup, same day, same input data and everything for each job.


You're wanting to run the cctm model though?  It would take a little 
doing, but I could whip up a tar file with all the necessary fixings, 
source, scripts, and input data.  Bear in mind that it would probably be 
anywhere from 2 to 5 gigs in size just to do a one day simulation.  I 
think the EPA might put out some sample data though that's a bit more 
manageable, so I could maybe set that up instead, but we'd still be 
talking around 1 gigabyte worth of data I think.  Oh, and you pretty 
much need the Portland Group's Fortran 95 compiler for things to work.



Glen