Hi,

I have continued investigating on my own and recompiled Open MPI with only bare-bones functionality, i.e. no support for any interconnects other than Ethernet:

# rpmbuild --rebuild --define="configure_options --prefix=/opt/openmpi/1.1.4" \
    --define="install_in_opt 1" --define="mflags all" openmpi-1.1.4-1.src.rpm
The error detailed in my previous message persisted, which rules out the possibility that support for the other interconnects was interfering with Ethernet support. Here's an excerpt from ompi_info:

# ompi_info
                Open MPI: 1.1.4
   Open MPI SVN revision: r13362
                Open RTE: 1.1.4
   Open RTE SVN revision: r13362
                    OPAL: 1.1.4
       OPAL SVN revision: r13362
                  Prefix: /opt/openmpi/1.1.4
 Configured architecture: x86_64-redhat-linux-gnu
                     ...
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
                     ...
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.4)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.4)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)

Again, to replicate the error, I ran:

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello

In this case you can even omit the runtime MCA parameter specification:

# mpirun -hostfile ~/testdir/hosts ~/testdir/hello

Thanks for reading this. I hope I've provided enough information.

Sincerely,
Alex.

On 2/1/07, Alex Tumanov <atuma...@gmail.com> wrote:
Hello,

I have tried a very basic test on a two-node "cluster" consisting of two Dell boxes. One of them is a dual-CPU Intel(R) Xeon(TM) 2.80GHz machine with 1GB of RAM, and the slave node is a quad-CPU Intel(R) Xeon(TM) 3.40GHz machine with 2GB of RAM. Both have InfiniBand cards and Gig-E, and the slave node is connected directly to the headnode. Open MPI version 1.1.4 was compiled with support for the following BTLs: openib, mx, gm, and mvapi. I got it to work over openib but, ironically, the same trivial hello world job fails over tcp (please see the log below). I found that the same problem was already discussed on this list here:

http://www.open-mpi.org/community/lists/users/2006/06/1347.php

The discussion mentioned that there could be something wrong with the TCP setup of the nodes, but unfortunately it was taken offline. Could someone help me with this?

Thanks,
Alex.

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
Hello from Alex' MPI test program
Process 0 on headnode out of 2
Hello from Alex' MPI test program
Process 1 on compute-0-0.local out of 2
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xdebdf8
[0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
[1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
[2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
[3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a) [0x2a95880d7a]
[4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf) [0x2a9588303f]
[5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
[6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_component_close+0x34f) [0x2a988ee8ef]
[7] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde) [0x2a95872e1e]
[8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close+0xe9) [0x2a955e5159]
[9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9) [0x2a955e5029]
[10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_close+0x25) [0x2a97f4dc55]
[11] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde) [0x2a95872e1e]
[12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close+0x69) [0x2a955ea3e9]
[13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize+0xfe) [0x2a955ab57e]
[14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
[15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d1951c3fb]
[16] func:/root/testdir/hello [0x4008ca]
*** End of error message ***
mpirun noticed that job rank 0 with PID 15573 on node "dr11.local" exited on signal 11.
2 additional processes aborted (not shown)
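
[Editorial note: the source of ~/testdir/hello is not included in this thread. The listing below is only a sketch of what a comparable trivial test program might look like, reconstructed from the output shown above; the variable names and print strings are assumptions. What the backtrace does confirm is that the crash happens inside MPI_Finalize (ompi_mpi_finalize -> mca_btl_tcp_component_close), i.e. after the program's own output has already been printed.]

    /* Hypothetical reconstruction of a trivial MPI "hello world" comparable
     * to ~/testdir/hello (the real source is not shown in this thread). */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char hostname[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(hostname, &namelen);

        printf("Hello from Alex' MPI test program\n");
        printf("Process %d on %s out of %d\n", rank, hostname, size);

        /* Per the backtrace, the segfault occurs during this call, while the
         * tcp BTL component is being closed. */
        MPI_Finalize();
        return 0;
    }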