Re: [OMPI users] Topology functions from MPI 1.1

2008-01-28 Thread Jeff Squyres
It depends on what you are trying to do.  Is your network physically  
wired such that there is no direct link between nodes 1 and 2?  (i.e.,  
node 1 cannot directly send to node 2, such as via opening a socket  
from node 1 to node 2's IP address)


MPI topology communicators do not prohibit one process from sending to  
any other process in the communicator.  They may, however, provide better  
routing for nearest-neighbor communication if the underlying network  
topology supports it.
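
For illustration, here is a rough sketch (mine, not from the original  
message) of how an application typically uses a graph communicator: each  
rank asks MPI for its own neighbor list and restricts its communication  
pattern to those ranks; MPI itself never enforces the edges.  The name  
graph_comm is assumed to come from an MPI_Graph_create call like the one  
quoted below.

/* Sketch: query a graph communicator for this rank's declared neighbors. */
#include <mpi.h>
#include <stdio.h>

void print_my_neighbors(MPI_Comm graph_comm)
{
    int rank, nneighbors;
    MPI_Comm_rank(graph_comm, &rank);
    MPI_Graph_neighbors_count(graph_comm, rank, &nneighbors);

    int neighbors[8];   /* plenty for a small example graph */
    MPI_Graph_neighbors(graph_comm, rank, nneighbors, neighbors);

    for (int i = 0; i < nneighbors; ++i)
        printf("rank %d has neighbor %d\n", rank, neighbors[i]);
}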



On Jan 24, 2008, at 6:42 PM, David Souther wrote:


Hello,

My name is David Souther, and I am a student working on a parallel  
processing research project at Rocky Mountain College. We need to  
attach topology information to our processes, but the assertions we  
have been making about the MPI Topology mechanism seem to be false.


We would like to do something similar to the following:

Node 0 <---> Node 1
  |
  |
  V
Node 2

That is, Node 0 can talk to Node 1 and 2, and Node 1 can talk to  
Node 0, but Node 2 can't talk to anyone. From what I understand, the  
code to do that would look like:

...
MPI_Comm graph_comm;
int nodes = 3;
/* indexes[i] = cumulative neighbor count through node i, so:
   node 0 -> edges[0..1] = {1, 2}, node 1 -> edges[2] = {0}, node 2 -> none */
int indexes[] = {2, 3, 3};
int edges[]   = {1, 2, 0};
MPI_Graph_create(MPI_COMM_WORLD, nodes, indexes, edges, 0 /* no reorder */, &graph_comm);
...

That is how, with my understanding, I would build that graph.

Then, the following pseudocode would work:

if (rank == 0)
    MPI_SEND("Message", To Rank 1, graph_comm)
if (rank == 1)
    MPI_RECV(buffer, From Rank 0, graph_comm)

but this would not (it would throw an error, maybe?):

if (rank == 2)
    MPI_SEND("Message", To Rank 0, graph_comm)
if (rank == 0)
    MPI_RECV(buffer, From Rank 2, graph_comm)


Anyway, the point is that this isn't what actually happens.  The messages  
are simply sent and received as if all the processes were in the global  
communicator (MPI_COMM_WORLD).
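
(For concreteness, a compilable version of that pseudocode could look  
roughly like the following; the tags and buffer sizes are arbitrary, and  
rank comes from MPI_Comm_rank on graph_comm.)

char msg[] = "Message";
char buffer[16];
int rank;
MPI_Comm_rank(graph_comm, &rank);

/* 0 <-> 1 is an edge in the graph, so this is the "expected" case. */
if (rank == 0)
    MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, graph_comm);
if (rank == 1)
    MPI_Recv(buffer, sizeof(buffer), MPI_CHAR, 0, 0, graph_comm, MPI_STATUS_IGNORE);

/* 2 -> 0 is not an edge, yet this transfer also completes normally. */
if (rank == 2)
    MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 1, graph_comm);
if (rank == 0)
    MPI_Recv(buffer, sizeof(buffer), MPI_CHAR, 2, 1, graph_comm, MPI_STATUS_IGNORE);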


So I have two questions: can this work, and if so, how do I make it work?

And if I'm going about this entirely the wrong way, what is a good use  
for the topology mechanism?


Thank you so much for your time!

--
-DS
-
David Souther
1511 Poly Dr
Billings, MT, 59102
(406) 545-9223
http://www.davidsouther.com  



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] SCALAPACK: Segmentation Fault (11) and Signal code: Address not mapped (1)

2008-01-28 Thread Jeff Squyres

Sorry for not replying earlier.

I'm not a SCALAPACK expert, but a common mistake I've seen users make  
is to use the mpif.h from a different MPI implementation when  
compiling their Fortran programs.  Can you verify that you're getting  
the Open MPI mpif.h?


Also, there is a known problem with the Pathscale compiler that they  
have stubbornly refused to comment on for about a year now (meaning: a  
problem was identified many moons ago, and it has not been tracked down  
to be either a Pathscale compiler problem or an Open MPI problem -- we  
did as much as we could and handed off to Pathscale, but with no forward  
progress since then).  So you *may* be running into that issue...?  
FWIW, we only saw the Pathscale problem when running on InfiniBand  
hardware, so YMMV.


Can you run any other MPI programs with Open MPI?



On Jan 22, 2008, at 4:06 PM, Backlund, Daniel wrote:



Hello all, I am using OMPI 1.2.4 on a Linux cluster (Rocks 4.2).  OMPI  
was configured to use the Pathscale Compiler Suite installed in  
/home/PROGRAMS/pathscale (NFS-mounted on the nodes).  I am trying to  
compile and run the example1.f that comes with the ACML package from  
AMD, and I am unable to get it to run.  All nodes have the same Opteron  
processors and 2 GB of RAM per core.  OMPI was configured as below.

export CC=pathcc
export CXX=pathCC
export FC=pathf90
export F77=pathf90

./configure --prefix=/opt/openmpi/1.2.4 --enable-static --without-threads \
    --without-memory-manager --without-libnuma --disable-mpi-threads

The configuration was successful, the install was successful, I can  
even run a sample mpihello.f90 program.  I would eventually like to  
link the ACML SCALAPACK and BLACS libraries to our code, but I need  
some help.  The ACML version is 3.1.0 for pathscale64.  I go into the  
scalapack_examples directory, modify GNUmakefile to the correct values,  
and compile successfully.  I have made openmpi into an rpm and pushed  
it to the nodes, modified LD_LIBRARY_PATH and PATH, and made sure I can  
see it on all nodes.  When I try to run the example1.exe which is  
generated, using /opt/openmpi/1.2.4/bin/mpirun -np 6 example1.exe

I get the following output:

 example1.res 

[XXX:31295] *** Process received signal ***
[XXX:31295] Signal: Segmentation fault (11)
[XXX:31295] Signal code: Address not mapped (1)
[XXX:31295] Failing at address: 0x4470
[XXX:31295] *** End of error message ***
[XXX:31298] *** Process received signal ***
[XXX:31298] Signal: Segmentation fault (11)
[XXX:31298] Signal code: Address not mapped (1)
[XXX:31298] Failing at address: 0x4470
[XXX:31298] *** End of error message ***
[XXX:31299] *** Process received signal ***
[XXX:31299] Signal: Segmentation fault (11)
[XXX:31299] Signal code: Address not mapped (1)
[XXX:31299] Failing at address: 0x4470
[XXX:31299] *** End of error message ***
[XXX:31300] *** Process received signal ***
[XXX:31300] Signal: Segmentation fault (11)
[XXX:31300] Signal code: Address not mapped (1)
[XXX:31300] Failing at address: 0x4470
[XXX:31300] *** End of error message ***
[XXX:31296] *** Process received signal ***
[XXX:31296] Signal: Segmentation fault (11)
[XXX:31296] Signal code: Address not mapped (1)
[XXX:31296] Failing at address: 0x4470
[XXX:31296] *** End of error message ***
[XXX:31297] *** Process received signal ***
[XXX:31297] Signal: Segmentation fault (11)
[XXX:31297] Signal code: Address not mapped (1)
[XXX:31297] Failing at address: 0x4470
[XXX:31297] *** End of error message ***
mpirun noticed that job rank 0 with PID 31295 on node XXX.ourdomain.com exited on signal 11 (Segmentation fault).

5 additional processes aborted (not shown)

 end example1.res 

Here is the result of ldd example1.exe

 ldd example1.exe 
   libmpi_f90.so.0 => /opt/openmpi/1.2.4/lib/libmpi_f90.so.0 (0x002a9557d000)
   libmpi_f77.so.0 => /opt/openmpi/1.2.4/lib/libmpi_f77.so.0 (0x002a95681000)
   libmpi.so.0 => /opt/openmpi/1.2.4/lib/libmpi.so.0 (0x002a957b3000)
   libopen-rte.so.0 => /opt/openmpi/1.2.4/lib/libopen-rte.so.0 (0x002a959fb000)
   libopen-pal.so.0 => /opt/openmpi/1.2.4/lib/libopen-pal.so.0 (0x002a95be7000)
   librt.so.1 => /lib64/tls/librt.so.1 (0x003e7cd0)
   libnsl.so.1 => /lib64/libnsl.so.1 (0x003e7c20)
   libutil.so.1 => /lib64/libutil.so.1 (0x003e79e0)
   libmv.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libmv.so.1 (0x002a95d4d000)
   libmpath.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libmpath.so.1 (0x002a95e76000)
   libm.so.6 => /lib64/tls/libm.so.6 (0x003e77a0)
   libdl.so.2 => /lib64/libdl.so.2 (0x003e77c0)
   libpathfortran.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libpathfortran.so.1 (0x002a95f97000)
   libc.so.6 => /lib64/tls/libc.so.6

Re: [OMPI users] Information about multi-path on IB-based systems

2008-01-28 Thread Jeff Squyres

On Jan 23, 2008, at 3:12 PM, David Gunter wrote:


Do I need to do anything special to enable multi-path routing on
InfiniBand networks?  For example, are there command-line arguments to
mpiexec or the like?



It depends on what you mean by multi-path routing.  For each MPI peer  
pair (e.g., procs A and B), do you want Open MPI to use multiple  
routes for MPI traffic over the IB fabric?


If you set the LMC in your fabric to be greater than 0, your SM should  
issue multiple LIDs for each active port and create routes accordingly.  
I don't know what OpenSM does, but the Cisco SM will naturally try to  
spread those routes across different links.  Open MPI (>= v1.2.x) should  
automatically detect that your LMC is > 0, set up as many connections  
between a pair of MPI peers as the LMC allows, and then round-robin  
packets across all of them.
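
As an aside (not part of the reply above), one way to check what LMC the  
SM has actually assigned to a local port is to ask libibverbs directly.  
A rough sketch follows; the choice of the first device and port 1 is an  
assumption, so adjust for your system.  Compile with -libverbs.

#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "could not open device\n");
        return 1;
    }

    struct ibv_port_attr attr;
    ibv_query_port(ctx, 1, &attr);   /* port numbers start at 1 */

    /* With LMC = n, the port answers to 2^n consecutive LIDs. */
    printf("base LID 0x%x, LMC %d\n", attr.lid, attr.lmc);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}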


Did that make sense?

--
Jeff Squyres
Cisco Systems



[OMPI users] MPI-IO problems

2008-01-28 Thread R C
Hi,
 I compiled a molecular dynamics program, DLPOLY 3.09, on an AMD64 cluster running
Open MPI 1.2.4 with the Portland Group compilers.  The program seems to run all right;
however, each process outputs:

ADIOI_GEN_DELETE (line 22): **io No such file or directory

It was suggested that it was an issue with MPI-IO. I would appreciate any help
in fixing the problem.
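
For reference, a minimal sketch (an assumption on my part, with an  
arbitrary filename) that exercises the same ROMIO delete path by removing  
a file that does not exist:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Operations without a file handle (open/delete) use the error
       handler attached to MPI_FILE_NULL; MPI_ERRORS_RETURN gives us a
       code to inspect instead of aborting. */
    MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);
    int err = MPI_File_delete("no_such_file", MPI_INFO_NULL);

    if (err != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(err, &eclass);
        /* MPI_ERR_NO_SUCH_FILE is the benign "file was not there" case. */
        printf("MPI_File_delete returned error class %d\n", eclass);
    }

    MPI_Finalize();
    return 0;
}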

recif



Re: [OMPI users] (no subject)

2008-01-28 Thread Josh Hursey
I'm unable to reproduce this problem. :( I tried both the svn head  
(r17288) and the tarball that you were using (openmpi-1.3a1r17175) on  
a similar system without problem.


The error you are seeing may be caused by old connectivity information  
in the session directory.  You may want to make sure that /tmp does not  
contain any "openmpi-session*" directories before starting mpirun.


Other than that, you may want to try a clean build of Open MPI just to  
make sure that you are not seeing anything odd resulting from old  
Open MPI install files.


Let me know if that helps.

-- Josh

On Jan 24, 2008, at 12:38 PM, Wong, Wayne wrote:

I'm having some difficulty getting the Open MPI checkpoint/restart  
fault tolerance working.  I have compiled Open MPI with the  
"--with-ft=cr" flag, but when I attempt to run my test program (ring),  
the ompi-checkpoint command fails.  I have verified that the test  
program works fine without the fault tolerance enabled.  Here are  
the details:


 [me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
 [me@dev1 ~]$ ps -efa | grep mpirun
 me        3052  2820  1 08:25 pts/2    00:00:00 mpirun -np 4 -am ft-enable-cr ring

 [me@dev1 ~]$ ompi-checkpoint 3052
 [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
 [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
 --------------------------------------------------------------------------
 It looks like orte_init failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during orte_init; some of which are due to configuration or
 environment problems.  This failure appears to be an internal failure;
 here's some additional information (which may only be relevant to an
 Open MPI developer):

   orte_sds_base_set_name failed
   --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
 --------------------------------------------------------------------------


Any help would be appreciated.  Thanks.
