Re: [OMPI users] Topology functions from MPI 1.1
It depends on what you are trying to do. Is your network physically wired such that there is no direct link between nodes 1 and 2? (i.e., node 1 cannot directly send to node 2, such as by opening a socket from node 1 to node 2's IP address)

MPI topology communicators do not prohibit one process from sending to any other process in the communicator. They might provide more optimal routing for nearest-neighbor communication if the underlying network topology supports it, for example.

On Jan 24, 2008, at 6:42 PM, David Souther wrote:

Hello,

My name is David Souther, and I am a student working on a parallel processing research project at Rocky Mountain College. We need to attach topology information to our processes, but the assumptions we have been making about the MPI topology mechanism seem to be wrong. We would like to do something similar to the following:

Node 0 <---> Node 1
  |
  |
  V
Node 2

That is, Node 0 can talk to Nodes 1 and 2, and Node 1 can talk to Node 0, but Node 2 can't talk to anyone. From what I understand, the code to do that would look like:

...
MPI_Comm graph_comm;
int nodes = 3;
int indexes[] = {2, 3, 3};
int edges[] = {1, 2, 0};
MPI_Graph_create(MPI_COMM_WORLD, nodes, indexes, edges, 0, &graph_comm);
...

That is how, with my understanding, I would build that graph. Then, the following pseudocode would work:

if (rank == 0) MPI_SEND("Message", To Rank 1, graph_comm)
if (rank == 1) MPI_RECV(buffer, From Rank 0, graph_comm)

but this would not (it would throw an error, maybe?):

if (rank == 2) MPI_SEND("Message", To Rank 0, graph_comm)
if (rank == 0) MPI_RECV(buffer, From Rank 2, graph_comm)

Anyway, the point is, that doesn't happen. The messages simply send and receive as if they were all in the global communicator (MPI_COMM_WORLD).

So, I have two questions: could this work, and how do I make it work? And, if I'm going at this entirely the wrong way, what is a good use for the topology mechanism?

Thank you so much for your time!

-DS

--
David Souther
1511 Poly Dr
Billings, MT, 59102
(406) 545-9223
http://www.davidsouther.com

--
Jeff Squyres
Cisco Systems
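As a concrete illustration of the point above, here is a minimal C sketch (not from the original thread, and assuming a run with at least 3 processes): it builds the graph exactly as in the question and then sends from rank 2 to rank 0. The send succeeds because a topology communicator only attaches neighbor information, for querying and possible rank reordering; it never restricts which ranks may communicate.

/* Minimal sketch: build the 3-node graph from the question and show that
 * the topology communicator carries neighbor information but does not
 * restrict point-to-point traffic.  Assumes at least 3 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 3) {
        /* Mirror the question's arrays: node 0 lists neighbors {1,2},
         * node 1 lists {0}, node 2 lists none.  index[] holds cumulative
         * degrees, edges[] the concatenated neighbor lists. */
        int nnodes  = 3;
        int index[] = {2, 3, 3};   /* degrees: 2, 1, 0 */
        int edges[] = {1, 2, 0};
        MPI_Comm graph_comm;

        MPI_Graph_create(MPI_COMM_WORLD, nnodes, index, edges,
                         0 /* no reorder */, &graph_comm);

        if (graph_comm != MPI_COMM_NULL) {
            /* Querying the attached topology is what the graph is for. */
            int nneighbors, neighbors[3];
            MPI_Graph_neighbors_count(graph_comm, rank, &nneighbors);
            MPI_Graph_neighbors(graph_comm, rank, 3, neighbors);
            printf("rank %d has %d neighbor(s)\n", rank, nneighbors);

            /* Point-to-point communication is still unrestricted: rank 2
             * may legally send to rank 0 even though they are not
             * neighbors in the graph. */
            if (rank == 2) {
                int msg = 42;
                MPI_Send(&msg, 1, MPI_INT, 0, 0, graph_comm);
            } else if (rank == 0) {
                int msg;
                MPI_Recv(&msg, 1, MPI_INT, 2, 0, graph_comm,
                         MPI_STATUS_IGNORE);
            }
            MPI_Comm_free(&graph_comm);
        }
    }

    MPI_Finalize();
    return 0;
}

Built and launched with something like "mpicc graph_demo.c -o graph_demo; mpirun -np 3 graph_demo" (file name hypothetical), all three ranks print their neighbor counts and the rank 2 -> rank 0 message is delivered.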
Re: [OMPI users] SCALAPACK: Segmentation Fault (11) and Signal code: Address not mapped (1)
Sorry for not replying earlier. I'm not a SCALAPACK expert, but a common mistake I've seen users make is to use the mpif.h from a different MPI implementation when compiling their Fortran programs. Can you verify that you're getting the Open MPI mpif.h?

Also, there is a known problem with the Pathscale compiler that they have stubbornly refused to comment on for about a year now (meaning: a problem was identified many moons ago, and it has not been tracked down to be either a Pathscale compiler problem or an Open MPI problem -- we did as much as we could and handed it off to Pathscale, but there has been no forward progress since then). So you *may* be running into that issue...? FWIW, we only saw the Pathscale problem when running on InfiniBand hardware, so YMMV.

Can you run any other MPI programs with Open MPI?

On Jan 22, 2008, at 4:06 PM, Backlund, Daniel wrote:

Hello all,

I am using OMPI 1.2.4 on a Linux cluster (Rocks 4.2). OMPI was configured to use the Pathscale Compiler Suite installed in /home/PROGRAMS/pathscale (NFS-mounted on the nodes). I am trying to compile and run the example1.f that comes with the ACML package from AMD, and I am unable to get it to run. All nodes have the same Opteron processors and 2 GB RAM per core. OMPI was configured as below:

export CC=pathcc
export CXX=pathCC
export FC=pathf90
export F77=pathf90
./configure --prefix=/opt/openmpi/1.2.4 --enable-static --without-threads --without-memory-manager \
    --without-libnuma --disable-mpi-threads

The configuration was successful, the install was successful, and I can even run a sample mpihello.f90 program. I would eventually like to link the ACML SCALAPACK and BLACS libraries to our code, but I need some help. The ACML version is 3.1.0 for pathscale64. I go into the scalapack_examples directory, modify GNUmakefile to the correct values, and compile successfully. I have made openmpi into an rpm and pushed it to the nodes, modified LD_LIBRARY_PATH and PATH, and made sure I can see it on all nodes.
When I try to run the example1.exe which is generated, using /opt/openmpi/1.2.4/bin/mpirun -np 6 example1.exe, I get the following output:

example1.res
[XXX:31295] *** Process received signal ***
[XXX:31295] Signal: Segmentation fault (11)
[XXX:31295] Signal code: Address not mapped (1)
[XXX:31295] Failing at address: 0x4470
[XXX:31295] *** End of error message ***
[XXX:31298] *** Process received signal ***
[XXX:31298] Signal: Segmentation fault (11)
[XXX:31298] Signal code: Address not mapped (1)
[XXX:31298] Failing at address: 0x4470
[XXX:31298] *** End of error message ***
[XXX:31299] *** Process received signal ***
[XXX:31299] Signal: Segmentation fault (11)
[XXX:31299] Signal code: Address not mapped (1)
[XXX:31299] Failing at address: 0x4470
[XXX:31299] *** End of error message ***
[XXX:31300] *** Process received signal ***
[XXX:31300] Signal: Segmentation fault (11)
[XXX:31300] Signal code: Address not mapped (1)
[XXX:31300] Failing at address: 0x4470
[XXX:31300] *** End of error message ***
[XXX:31296] *** Process received signal ***
[XXX:31296] Signal: Segmentation fault (11)
[XXX:31296] Signal code: Address not mapped (1)
[XXX:31296] Failing at address: 0x4470
[XXX:31296] *** End of error message ***
[XXX:31297] *** Process received signal ***
[XXX:31297] Signal: Segmentation fault (11)
[XXX:31297] Signal code: Address not mapped (1)
[XXX:31297] Failing at address: 0x4470
[XXX:31297] *** End of error message ***
mpirun noticed that job rank 0 with PID 31295 on node XXX.ourdomain.com exited on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)
end example1.res

Here is the result of ldd example1.exe:

libmpi_f90.so.0 => /opt/openmpi/1.2.4/lib/libmpi_f90.so.0 (0x002a9557d000)
libmpi_f77.so.0 => /opt/openmpi/1.2.4/lib/libmpi_f77.so.0 (0x002a95681000)
libmpi.so.0 => /opt/openmpi/1.2.4/lib/libmpi.so.0 (0x002a957b3000)
libopen-rte.so.0 => /opt/openmpi/1.2.4/lib/libopen-rte.so.0 (0x002a959fb000)
libopen-pal.so.0 => /opt/openmpi/1.2.4/lib/libopen-pal.so.0 (0x002a95be7000)
librt.so.1 => /lib64/tls/librt.so.1 (0x003e7cd0)
libnsl.so.1 => /lib64/libnsl.so.1 (0x003e7c20)
libutil.so.1 => /lib64/libutil.so.1 (0x003e79e0)
libmv.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libmv.so.1 (0x002a95d4d000)
libmpath.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libmpath.so.1 (0x002a95e76000)
libm.so.6 => /lib64/tls/libm.so.6 (0x003e77a0)
libdl.so.2 => /lib64/libdl.so.2 (0x003e77c0)
libpathfortran.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libpathfortran.so.1 (0x002a95f97000)
libc.so.6 => /lib64/tls/libc.so.6
Re: [OMPI users] Information about multi-path on IB-based systems
On Jan 23, 2008, at 3:12 PM, David Gunter wrote:

Do I need to do anything special to enable multi-path routing on InfiniBand networks? For example, are there command-line arguments to mpiexec or the like?

It depends on what you mean by multi-path routing. For each MPI peer pair (e.g., procs A and B), do you want Open MPI to use multiple routes for MPI traffic over the IB fabric?

If you set the LMC in your fabric to be greater than 0, your SM should issue multiple LIDs for each active port and create routes accordingly. I don't know what OpenSM does, but the Cisco SM will naturally try to make the routes go across different links. Open MPI (>= v1.2.x) should automatically detect that your LMC is > 0, set up as many connections between a pair of MPI peers as the LMC allows, and then round-robin packets across all of them.

Did that make sense?

--
Jeff Squyres
Cisco Systems
[OMPI users] MPI-IO problems
Hi,

I compiled a molecular dynamics program, DLPOLY 3.09, on an AMD64 cluster running Open MPI 1.2.4 with Portland Group compilers. The program seems to run alright; however, each processor outputs:

ADIOI_GEN_DELETE (line 22): **io No such file or directory

It was suggested that this is an issue with MPI-IO. I would appreciate any help in fixing the problem.

recif
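For context: ADIOI_GEN_DELETE comes from ROMIO, the MPI-IO layer used by Open MPI, and this particular "No such file or directory" message is usually just a warning that MPI_File_delete() was called for a file that did not exist yet, for example when an application removes any old output file before writing a new one. The minimal C sketch below is not from the thread and uses the hypothetical file name "OUTPUT.000"; it shows how such a call behaves and how the harmless error class can be detected and ignored.

/* Minimal sketch: MPI-IO calls return error codes by default
 * (MPI_ERRORS_RETURN), so deleting a file that is not there can be
 * detected and ignored instead of treated as a real failure.
 * "OUTPUT.000" is a hypothetical placeholder file name. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int err = MPI_File_delete("OUTPUT.000", MPI_INFO_NULL);
        if (err != MPI_SUCCESS) {
            int eclass;
            MPI_Error_class(err, &eclass);
            if (eclass == MPI_ERR_NO_SUCH_FILE) {
                /* Harmless: there was simply no old file to remove. */
                printf("OUTPUT.000 did not exist; nothing to delete\n");
            } else {
                /* Some other I/O problem: report it. */
                char msg[MPI_MAX_ERROR_STRING];
                int len;
                MPI_Error_string(err, msg, &len);
                fprintf(stderr, "MPI_File_delete failed: %s\n", msg);
            }
        }
    }

    MPI_Finalize();
    return 0;
}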
Re: [OMPI users] (no subject)
I'm unable to reproduce this problem. :( I tried both the svn head (r17288) and the tarball that you were using (openmpi-1.3a1r17175) on a similar system without problem.

The error you are seeing may be caused by old connectivity information in the session directory. You may want to make sure that /tmp does not contain any "openmpi-session*" directories before starting mpirun. Other than that, you may want to try a clean build of Open MPI just to make sure that you are not seeing anything odd resulting from old Open MPI install files.

Let me know if that helps.

-- Josh

On Jan 24, 2008, at 12:38 PM, Wong, Wayne wrote:

I'm having some difficulty getting the Open MPI checkpoint/restart fault tolerance working. I have compiled Open MPI with the "--with-ft=cr" flag, but when I attempt to run my test program (ring), the ompi-checkpoint command fails. I have verified that the test program works fine without the fault tolerance enabled. Here are the details:

[me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
[me@dev1 ~]$ ps -efa | grep mpirun
me 3052 2820 1 08:25 pts/2 00:00:00 mpirun -np 4 -am ft-enable-cr ring
[me@dev1 ~]$ ompi-checkpoint 3052
[dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
[dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
--
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_sds_base_set_name failed
  --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
--

Any help would be appreciated. Thanks.