Troy and I talked about this off-list and resolved that the issue was with the TCP setup on the nodes.
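For reference, the errno=113 in the connect() failures quoted below is EHOSTUNREACH ("No route to host") on Linux, which fits a host-level TCP/routing/firewall problem rather than anything inside Open MPI. A minimal sketch to confirm what that number maps to on one of the nodes (assuming a Linux/glibc system):

    /* Minimal sketch: decode errno 113 on a node (EHOSTUNREACH on Linux/glibc). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* strerror(113) prints "No route to host" on Linux, i.e. the kernel
           could not reach the peer -- a routing/firewall/interface issue. */
        printf("errno 113: %s\n", strerror(113));
        return 0;
    }

If the nodes report "No route to host" here, the failures below point at the network configuration on the nodes rather than at the Open MPI TCP code path.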
But it is worth noting that we had previously fixed a bug in the TCP code in 1.0.2 that caused the SEGVs Troy was seeing -- hence, when he tested the 1.0.3 prerelease tarballs, there were no SEGVs.

> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Troy Telford
> Sent: Thursday, June 01, 2006 4:35 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
>
> On Wed, 31 May 2006 20:17:33 -0600, Brian Barrett
> <brbar...@open-mpi.org> wrote:
>
> > Did you happen to have a chance to try to run the 1.0.3 or 1.1
> > nightly tarballs?  I'm 50/50 on whether we've fixed these issues
> > already.
>
> For Ticket #41:
>
> Using Open MPI 1.0.3 and 1.1:
> For some reason, I can't seem to get TCP to work with any number of
> nodes >1 (which is odd, because I've had it working on *this* system
> before; MPICH works fine, so there's at least *something* right about
> the ethernet config/hardware).
>
> But I do get a different error with the snapshots vs. 1.0.2:
>
> *****Open MPI 1.0.2*****
> [root@zartan1 1.0.2]# mpirun -v -np 6 -prefix $MPIHOME -machinefile
> machines -mca btl tcp,sm,self laten -o 10
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x6
> [0] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libopal.so.0
> [0x2ab8333408ca]
> [1] func:/lib64/libpthread.so.0 [0x2ab83394a380]
> [2] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xbb)
> [0x2ab8364299ab]
> [3] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so
> [0x2ab836427bec]
> [4] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x155)
> [0x2ab836425445]
> *** End of error message ***
> [5] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x26b)
> [0x2ab835da72db]
> [6] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xcc)
> [0x2ab835b8bd5c]
> [7] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libmpi.so.0(ompi_mpi_init+0x590)
> [0x2ab8330b1c90]
> [8] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libmpi.so.0(MPI_Init+0x83)
> [0x2ab83309d2d3]
> [9] func:laten(main+0x6a) [0x4015f2]
> [10] func:/lib64/libc.so.6(__libc_start_main+0xdc) [0x2ab833a6f4cc]
> [11] func:laten [0x4014f9]
>
> *****Open MPI 1.0.3*****
> [root@zartan1 tmp]# mpirun -v -np 4 -prefix $MPIHOME -mca btl tcp,sm,self
> -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
> Processes      Max Latency (us)
> ------------------------------------------
> [0,1,3][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> [0,1,2][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> *****Open MPI 1.1*****
> [root@zartan1 1.1]# mpirun -v -np 4 -prefix $MPIHOME -mca btl tcp
> -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
> Processes      Max Latency (us)
> ------------------------------------------
> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> [0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> If I use -np 2 (i.e. the job doesn't leave the node, it being a
> dual-cpu machine), it works fine.
> --
> Troy Telford