Re: [OMPI users] How do I debug this?
Greetings Robert.

Can you send all the information listed here:

    http://www.open-mpi.org/community/help/

Of particular interest will be the version that you are using. We had some bugs with the TCP connection code that were recently fixed. Can you try the latest 1.1.1 beta tarball and see if it fixes your problem?

    http://www.open-mpi.org/software/ompi/v1.1/


On 8/2/06 11:11 AM, "Robert Cummins" wrote:

> I'm trying to run a 64-way MPI benchmark on my system. I *always* get the
> following error and I'm wondering how to debug the problem node. I cannot
> reproduce the problem with a smaller number of nodes.
>
> snip...
> [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect:
> connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect:
> connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect:
> connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect:
> connect() failed with errno=113
> ...
>
> It looks like I have well over 128 lines of similar output. A quick eyeball
> of the output suggests that about half of all nodes are reporting this
> problem.
>
> I have checked the error counters on my IB switch and I have 0 new errors
> during the run.
>
> TIA.
>
> R.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
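An aside for anyone hitting the same message: errno 113 on Linux is EHOSTUNREACH ("No route to host"), which usually points at routing or firewall configuration rather than at Open MPI itself. A minimal sketch that decodes the number (the file name errno113.c is made up for this example):

    /* errno113.c - decode the errno quoted in the Open MPI log message.
     * Standalone helper, not part of Open MPI; build with: gcc errno113.c
     */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int err = 113;  /* value taken from "connect() failed with errno=113" */

        /* On Linux/glibc this prints "No route to host" (EHOSTUNREACH). */
        printf("errno %d: %s\n", err, strerror(err));
        printf("EHOSTUNREACH == %d\n", EHOSTUNREACH);
        return 0;
    }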
Re: [OMPI users] MPI application fails with errno 113
Greetings Gunnar.

Can you provide all the information listed here:

    http://www.open-mpi.org/community/help/

For example, what version are you using? We had some bugs in the TCP connection code dealing with multiple networks that were recently fixed. Could you try the latest 1.1.1 beta and see if that fixes your problem?

    http://www.open-mpi.org/software/ompi/v1.1/


On 7/25/06 11:13 AM, "Gunnar Johansson" wrote:

> Hi all,
>
> We have set up Open MPI over a set of 5 machines (2 dual-CPU) and get this
> error when trying to start an MPI application with more than 5 processes:
>
> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> Anything below 5 processes works great. We have set btl_tcp_if_include and
> oob_tcp_include to the correct interface to use on each machine. Is there
> anything else we can try, or diagnostics to run, to provide more info about
> the problem?
>
> Thanks, Gunnar Johansson

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
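One way to check whether the problem lives below the MPI layer (a suggestion added here, not something proposed in the thread) is to try a bare TCP connect() between the affected machines. If that also fails with errno 113, the network or firewall setup is at fault rather than Open MPI. A minimal sketch; the file name tcptest.c and the choice of port are arbitrary:

    /* tcptest.c - minimal TCP connect test, independent of Open MPI.
     * Usage: ./tcptest <ip-address> <port>
     * Build: gcc tcptest.c -o tcptest
     */
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct sockaddr_in addr;
        int fd;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <ip-address> <port>\n", argv[0]);
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons((unsigned short) atoi(argv[2]));
        if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) {
            fprintf(stderr, "invalid address: %s\n", argv[1]);
            return 1;
        }

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
            /* errno 113 (EHOSTUNREACH) here mirrors the Open MPI message. */
            printf("connect() failed with errno=%d (%s)\n", errno, strerror(errno));
        } else {
            printf("connect() to %s:%s succeeded\n", argv[1], argv[2]);
        }
        close(fd);
        return 0;
    }

Running it from each node toward a peer's IP quickly separates network problems from MPI problems: an unreachable host (no route, or a firewall that rejects with ICMP) typically yields EHOSTUNREACH, while a reachable host with nothing listening on that port yields ECONNREFUSED instead.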
Re: [OMPI users] error in running openmpi on remote node
On 7/24/06 3:47 AM, "Chengwen Chen" wrote:

> I have tried to use v1.1 of Open MPI, but the program (AMBER9) I am using
> can't be compiled correctly with v1.1, so it seems that I have to keep
> using openmpi-1.0.2.

Can you explain what you mean by that? The user interface (i.e., the MPI API) should not have changed between v1.0.2 and v1.1 -- can you show what happened when you tried to compile your application with 1.1?

> I am new to Linux and really have no idea about debuggers. Would you please
> give me some advice on a simple way to try this?

I still have not filled out the OMPI FAQ with information on how to use valgrind, but the information in the LAM/MPI FAQ is pretty relevant (note that OMPI does not have a --with-purify option, so disregard that):

    http://www.lam-mpi.org/faq/

See the "Debugging programs under LAM/MPI" section.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
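Before rebuilding a large package such as AMBER9, a trivial MPI program built with the v1.1 wrapper compiler can confirm that the installation itself is healthy. This is a generic sanity check added here, not something from the original exchange; it assumes mpicc and mpirun from the 1.1 install are first in the PATH:

    /* hello_mpi.c - trivial sanity check for an Open MPI installation.
     * Build: mpicc hello_mpi.c -o hello_mpi
     * Run:   mpirun -np 2 ./hello_mpi
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

If this compiles and runs under 1.1 but AMBER9 does not, the compile errors from AMBER9 are the interesting part to post.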
Re: [OMPI users] MPI application fails with errno 113
Hi,

Thank you very much for your answer, but I'm very embarrassed to say that some machines in our setup still had a firewall enabled, which of course I should have verified in the first place, before posting the question here.

Still, I'm impressed that Open MPI made it through the firewalls to some extent!

Thanks, Gunnar Johansson

On 8/3/06, Jeff Squyres wrote:

> Greetings Gunnar.
>
> Can you provide all the information listed here:
>
>     http://www.open-mpi.org/community/help/
>
> For example, what version are you using? We had some bugs in the TCP
> connection code dealing with multiple networks that were recently fixed.
> Could you try the latest 1.1.1 beta and see if that fixes your problem?
>
>     http://www.open-mpi.org/software/ompi/v1.1/
>
> On 7/25/06 11:13 AM, "Gunnar Johansson" wrote:
>
>> Hi all,
>>
>> We have set up Open MPI over a set of 5 machines (2 dual-CPU) and get this
>> error when trying to start an MPI application with more than 5 processes:
>>
>> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>>
>> Anything below 5 processes works great. We have set btl_tcp_if_include and
>> oob_tcp_include to the correct interface to use on each machine. Is there
>> anything else we can try, or diagnostics to run, to provide more info
>> about the problem?
>>
>> Thanks, Gunnar Johansson
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
Re: [OMPI users] MPI application fails with errno 113
So am I! :-) Glad you got it working.

On 8/3/06 8:04 AM, "Gunnar Johansson" wrote:

> Hi,
>
> Thank you very much for your answer, but I'm very embarrassed to say that
> some machines in our setup still had a firewall enabled, which of course I
> should have verified in the first place, before posting the question here.
>
> Still, I'm impressed that Open MPI made it through the firewalls to some
> extent!
>
> Thanks, Gunnar Johansson
>
> On 8/3/06, Jeff Squyres wrote:
>
>> Greetings Gunnar.
>>
>> Can you provide all the information listed here:
>>
>>     http://www.open-mpi.org/community/help/
>>
>> For example, what version are you using? We had some bugs in the TCP
>> connection code dealing with multiple networks that were recently fixed.
>> Could you try the latest 1.1.1 beta and see if that fixes your problem?
>>
>>     http://www.open-mpi.org/software/ompi/v1.1/
>>
>> On 7/25/06 11:13 AM, "Gunnar Johansson" wrote:
>>
>>> Hi all,
>>>
>>> We have set up Open MPI over a set of 5 machines (2 dual-CPU) and get
>>> this error when trying to start an MPI application with more than 5
>>> processes:
>>>
>>> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=113
>>>
>>> Anything below 5 processes works great. We have set btl_tcp_if_include
>>> and oob_tcp_include to the correct interface to use on each machine. Is
>>> there anything else we can try, or diagnostics to run, to provide more
>>> info about the problem?
>>>
>>> Thanks, Gunnar Johansson
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
[OMPI users] OpenMPI / PBS / TM interaction
Hi all,

I noticed by accident that if I allocate only 1 of 2 CPUs on, say, x nodes with PBS, it is possible to run mpiexec with -np = 2*x, i.e. I can use both CPUs on each SMP node. This is not what I would expect or want. My guess is that this happens because the _one_ orted launched through PBS' TM interface can launch as many processes as it likes, which would not be the case if the MPI processes were launched directly through the TM interface.

How can this behavior be fixed?

Regards,
--
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063
PGP-Public-Key: http://ca.uni-magdeburg.de/certs/pgp0Cschaffoener64D8BEC0.asc
Re: [OMPI users] OpenMPI / PBS / TM interaction
Depending upon what version you are using, this could be resolved fairly simply. Check to see if your version supports the "nooversubscribe" command line option. If it does, then setting that option may (I believe) resolve the problem - at the least, it will only allow you to run one application process on each node in the allocation you described. If you were to give -np = 2*x, you would get an error and the job would not be run.

Ralph

On 8/3/06 7:47 AM, "Martin Schafföner" wrote:

> Hi all,
>
> I noticed by accident that if I allocate only 1 of 2 CPUs on, say, x nodes
> with PBS, it is possible to run mpiexec with -np = 2*x, i.e. I can use both
> CPUs on each SMP node. This is not what I would expect or want. My guess is
> that this happens because the _one_ orted launched through PBS' TM
> interface can launch as many processes as it likes, which would not be the
> case if the MPI processes were launched directly through the TM interface.
>
> How can this behavior be fixed?
>
> Regards,
> --
> Martin Schafföner
> Cognitive Systems Group, Institute of Electronics, Signal Processing and
> Communication Technologies, Department of Electrical Engineering,
> Otto-von-Guericke University Magdeburg
> Phone: +49 391 6720063
> PGP-Public-Key: http://ca.uni-magdeburg.de/certs/pgp0Cschaffoener64D8BEC0.asc
[OMPI users] seg fault in MPI_Comm_size
Hi there,

How are you guys doing? I am getting a seg fault at the very beginning of an application that uses the Aztec library. Please find the attached config.log.gz, the stack trace below, and the ompi_info output.

The following stack trace repeats for each process:

Failing at addr:(nil)
Signal:11 info.si_errno:0(Success) si_code:128()
Failing at addr:(nil)
[0] func:pgeofe [0x606665]
[1] func:/lib64/tls/libpthread.so.0 [0x3f8460c420]
[2] func:pgeofe(MPI_Comm_size+0x4d) [0x549c5d]
[3] func:pgeofe(parallel_info+0x3e) [0x4fcffe]
[4] func:pgeofe(AZ_set_proc_config+0x34) [0x4e8594]
[5] func:pgeofe(MAIN__+0xb2) [0x44ba8a]
[6] func:pgeofe(main+0x32) [0x4420aa]
[7] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3f83b1c4bb]
[8] func:pgeofe [0x441fea]
*** End of error message ***

ompi_info output:

Open MPI: 1.1
Open MPI SVN revision: r10477
Open RTE: 1.1
Open RTE SVN revision: r10477
OPAL: 1.1
OPAL SVN revision: r10477
Prefix: /home/pewang/openmpi
Configured architecture: x86_64-unknown-linux-gnu
Configured by: pewang
Configured on: Thu Aug 3 14:22:22 EDT 2006
Configure host: bl-geol-karst.geology.indiana.edu
Built by: pewang
Built on: Thu Aug 3 14:34:17 EDT 2006
Built host: bl-geol-karst.geology.indiana.edu
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: ifort
Fortran77 compiler abs: /opt/intel/fce/9.0/bin/ifort
Fortran90 compiler: ifort
Fortran90 compiler abs: /opt/intel/fce/9.0/bin/ifort
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: yes, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1)
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.1)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.1)
MCA pls: fork (MCA v1.0, API v1.0, Component v1.1)
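Not a diagnosis of this particular trace, but a general observation: a segfault inside MPI_Comm_size at address (nil) is frequently a symptom of calling it before MPI_Init has completed, or of passing an invalid communicator handle (a mismatched Fortran/C handle in mixed-language code, for example, which is at least conceivable for an ifort-built main program calling a C library). The sketch below only illustrates a defensive call pattern for comparison; it does not reproduce or fix the pgeofe crash, and the file name comm_size_check.c is made up:

    /* comm_size_check.c - defensive use of MPI_Comm_size, for illustration.
     * Build: mpicc comm_size_check.c -o comm_size_check
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int initialized = 0, size = 0, rc;

        /* Calling MPI_Comm_size before MPI_Init is a classic cause of
         * exactly this kind of crash, so make sure MPI is up first. */
        MPI_Initialized(&initialized);
        if (!initialized) {
            MPI_Init(&argc, &argv);
        }

        /* Ask for error codes instead of the default abort-on-error. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Comm_size failed with code %d\n", rc);
        } else {
            printf("communicator size: %d\n", size);
        }

        MPI_Finalize();
        return 0;
    }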