Troy and I talked about this off-list and determined that the issue was
with the TCP setup on the nodes.

But it is worth noting that we had previously fixed the bug in 1.0.2's
TCP code that caused the SEGVs Troy was seeing -- hence, when he tested
the 1.0.3 prerelease tarballs, there were no SEGVs.
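
For the record, errno=113 on Linux is EHOSTUNREACH ("No route to host"),
which usually points at a routing or firewall problem on the nodes rather
than at Open MPI itself.  As a rough sketch of the kind of workaround that
applies in this situation -- assuming, purely for illustration, that the
nodes share a routable private interface named eth1 -- the TCP BTL can be
pinned to that interface with its MCA parameters:

    mpirun -np 4 -machinefile machines \
        -mca btl tcp,sm,self \
        -mca btl_tcp_if_include eth1 \
        laten -o 10

(or, conversely, non-routable interfaces can be excluded with
"-mca btl_tcp_if_exclude lo,eth0").  Verifying that the nodes can actually
reach each other over the chosen interface -- routing table and any
iptables rules -- covers the other common causes of that errno.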


> -----Original Message-----
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Troy Telford
> Sent: Thursday, June 01, 2006 4:35 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
> 
> On Wed, 31 May 2006 20:17:33 -0600, Brian Barrett <brbar...@open-mpi.org> wrote:
> 
> > Did you happen to have a chance to try to run the 1.0.3 or 1.1
> > nightly tarballs?  I'm 50/50 on whether we've fixed these issues
> > already.
> 
> For Ticket #41:
> 
> Using Open MPI 1.0.3 and 1.1:
> For some reason, I can't seem to get TCP to work with any number of
> nodes >1 (which is odd, because I've had it working on *this* system
> before; MPICH works fine, so there's at least *something* right about
> the ethernet config/hardware).
> 
> But I do get a different error with the snapshots vs. 1.0.2:
> 
> *****Open MPI 1.0.2*****
> [root@zartan1 1.0.2]# mpirun -v -np 6 -prefix $MPIHOME -machinefile machines -mca btl tcp,sm,self laten -o 10
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x6
> [0] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libopal.so.0 [0x2ab8333408ca]
> [1] func:/lib64/libpthread.so.0 [0x2ab83394a380]
> [2] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xbb) [0x2ab8364299ab]
> [3] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so [0x2ab836427bec]
> [4] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x155) [0x2ab836425445]
> *** End of error message ***
> [5] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x26b) [0x2ab835da72db]
> [6] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xcc) [0x2ab835b8bd5c]
> [7] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libmpi.so.0(ompi_mpi_init+0x590) [0x2ab8330b1c90]
> [8] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libmpi.so.0(MPI_Init+0x83) [0x2ab83309d2d3]
> [9] func:laten(main+0x6a) [0x4015f2]
> [10] func:/lib64/libc.so.6(__libc_start_main+0xdc) [0x2ab833a6f4cc]
> [11] func:laten [0x4014f9]
> 
> *****Open MPI 1.0.3*****
> [root@zartan1 tmp]# mpirun -v -np 4 -prefix $MPIHOME -mca btl tcp,sm,self -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
>               Processes    Max Latency (us)
> ------------------------------------------
> [0,1,3][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [0,1,2][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> 
> *****Open MPI 1.1*****
> [root@zartan1 1.1]# mpirun -v -np 4 -prefix $MPIHOME -mca btl tcp -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
>               Processes    Max Latency (us)
> ------------------------------------------
> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> 
> If I use -np 2 (i.e., the job doesn't leave the node, it being a
> dual-CPU machine), it works fine.
> -- 
> Troy Telford
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
