Re: [OMPI users] How do I debug this?

2006-08-03 Thread Jeff Squyres
Greetings Robert.

Can you send all the information listed here:

http://www.open-mpi.org/community/help/

Of particular interest will be the version that you are using.  We had some
bugs with the TCP connection code that were recently fixed.  Can you try the
latest 1.1.1 beta tarball and see if it fixes your problem?

http://www.open-mpi.org/software/ompi/v1.1/
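As a data point, errno=113 on Linux is EHOSTUNREACH ("No route to host"), so
the connect() failures below generally point at a routing or firewall problem
between the nodes rather than at the MPI layer itself. If you want to confirm
the mapping on your own system, a trivial check like this works (a minimal
sketch, not part of Open MPI):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* strerror() turns the numeric errno into its text description */
      printf("errno 113 means: %s\n", strerror(113));
      return 0;
  }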


On 8/2/06 11:11 AM, "Robert Cummins"  wrote:

> I'm trying to run a 64-way MPI benchmark on my system.  I
> *always* get the following error, and I'm wondering how to
> debug the problem node.  I cannot reproduce the problem
> with a smaller number of nodes.
> 
> snip...
> [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> ...
> 
> It looks like I have well over 128 lines of similar output.  A quick
> eyeball of the output suggests that about half of all nodes are reporting
> this problem.
> 
> I have checked the error counters on my IB switch and I
> have 0 new errors during the run.
> 
> TIA.
> 
> 
> R.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] MPI application fails with errno 113

2006-08-03 Thread Jeff Squyres
Greetings Gunnar.

Can you provide all the information listed here:

http://www.open-mpi.org/community/help/

For example, what version are you using?  We had some bugs in the TCP
connection code dealing with multiple networks that were recently fixed.
Could you try the latest 1.1.1 beta and see if that fixes your problem?

http://www.open-mpi.org/software/ompi/v1.1/
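Also, since you mention setting btl_tcp_if_include and oob_tcp_include below:
just to rule out a problem in how the values are passed, the usual form on the
mpirun command line is something like the following (a sketch only; the
interface name, process count, and application name are placeholders for your
actual values):

  mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0 -np 6 ./my_app

If you set these in a parameter file or via environment variables instead,
double-check that every node sees the same values.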


On 7/25/06 11:13 AM, "Gunnar Johansson"  wrote:

> Hi all,
> 
> We have set up Open MPI over a set of 5 machines (2 of them dual-CPU) and get
> this error when trying to start an MPI application with more than 5 processes:
> 
> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> 
> Anything below 5 processes works great. We have set btl_tcp_if_include and
> oob_tcp_include to the correct interface to use on each machine.
> Is there anything else we can try, or diagnostics we can run, to get more
> information about the problem?
> 
> Thanks, Gunnar Johansson
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] error in running openmpi on remote node

2006-08-03 Thread Jeff Squyres
On 7/24/06 3:47 AM, "Chengwen Chen"  wrote:

> I have tried to use Open MPI v1.1, but the program I am using (AMBER9) can't
> be compiled correctly with v1.1. So it seems that I have to keep using
> Open MPI 1.0.2.

Can you explain what you mean by that?  The user interface (i.e., the MPI
API) should not have changed between v1.0.2 and v1.1 -- can you show what
happened when you tried to compile your application with 1.1?

> I am new to Linux and really have no idea about debuggers. Would you please
> give me some simple advice on what to try?

I still have not filled in the OMPI FAQ with information on how to use
valgrind, but the information on the LAM/MPI FAQ is pretty relevant (note
that OMPI does not have a --with-purify option, so disregard that):

http://www.lam-mpi.org/faq/

See the "Debugging programs under LAM/MPI" section.
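If it helps in the meantime: a common way to run an Open MPI job under
valgrind is simply to put valgrind between mpirun and the executable, for
example (a sketch; the process count and ./my_app are placeholders for your
actual job):

  mpirun -np 2 valgrind ./my_app

Each MPI process then runs under its own copy of valgrind, and the reports
show up in the job's stderr output.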

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] MPI application fails with errno 113

2006-08-03 Thread Gunnar Johansson

Hi,

Thank you very much for your answer, but I'm very embarrassed to say
that some of the machines in our setup still had a firewall enabled,
which of course I should have verified in the first place before
posting the question here.

Still, I'm impressed that Open MPI made it through the firewalls
to some extent!

Thanks,  Gunnar Johansson

On 8/3/06, Jeff Squyres  wrote:

Greetings Gunnar.

Can you provide all the information listed here:

http://www.open-mpi.org/community/help/

For example, what version are you using?  We had some bugs in the TCP
connection code dealing with multiple networks that were recently fixed.
Could you try the latest 1.1.1 beta and see if that fixes your problem?

http://www.open-mpi.org/software/ompi/v1.1/


On 7/25/06 11:13 AM, "Gunnar Johansson"  wrote:

> Hi all,
>
> We have set up Open MPI over a set of 5 machines (2 of them dual-CPU) and get
> this error when trying to start an MPI application with more than 5 processes:
>
> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> Anything below 5 processes works great. We have set btl_tcp_if_include and
> oob_tcp_include to the correct interface to use on each machine.
> Is there anything else we can try, or diagnostics we can run, to get more
> information about the problem?
>
> Thanks, Gunnar Johansson
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI application fails with errno 113

2006-08-03 Thread Jeff Squyres
So am I!  :-)

Glad you got it working.


On 8/3/06 8:04 AM, "Gunnar Johansson"  wrote:

> Hi,
> 
> Thank you very much for your answer, but I'm very embarrassed to say
> that some of the machines in our setup still had a firewall enabled,
> which of course I should have verified in the first place before
> posting the question here.
> 
> Still, I'm impressed that Open MPI made it through the firewalls
> to some extent!
> 
> Thanks,  Gunnar Johansson
> 
> On 8/3/06, Jeff Squyres  wrote:
>> Greetings Gunnar.
>> 
>> Can you provide all the information listed here:
>> 
>> http://www.open-mpi.org/community/help/
>> 
>> For example, what version are you using?  We had some bugs in the TCP
>> connection code dealing with multiple networks that were recently fixed.
>> Could you try the latest 1.1.1 beta and see if that fixes your problem?
>> 
>> http://www.open-mpi.org/software/ompi/v1.1/
>> 
>> 
>> On 7/25/06 11:13 AM, "Gunnar Johansson"  wrote:
>> 
>>> Hi all,
>>> 
>>> We have set up Open MPI over a set of 5 machines (2 of them dual-CPU) and get
>>> this error when trying to start an MPI application with more than 5 processes:
>>> 
>>> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=113
>>> 
>>> Anything below 5 processes works great. We have set btl_tcp_if_include and
>>> oob_tcp_include to the correct interface to use on each machine.
>>> Is there anything else we can try, or diagnostics we can run, to get more
>>> information about the problem?
>>> 
>>> Thanks, Gunnar Johansson
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


[OMPI users] OpenMPI / PBS / TM interaction

2006-08-03 Thread Martin Schafföner
Hi all,

I noticed by accident that if I allocate only 1 of 2 CPUs on, say, x nodes
with PBS, it is possible to run mpiexec with -np = 2*x, i.e. I can use both
CPUs on each SMP node. This is not what I would expect or want. I would
guess this is because the _one_ orted launched through PBS' TM interface
can launch as many processes as you like, which would not be the case if
the MPI processes were launched directly through the TM interface. How can
this behavior be fixed?

Regards,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063
PGP-Public-Key: http://ca.uni-magdeburg.de/certs/pgp0Cschaffoener64D8BEC0.asc



Re: [OMPI users] OpenMPI / PBS / TM interaction

2006-08-03 Thread Ralph H Castain

Depending upon what version you are using, this could be resolved fairly
simply. Check to see if your version supports the "nooversubscribe"
command-line option. If it does, then setting that option may (I believe)
resolve the problem; at the least, it will only allow you to run one
application process on each node in the allocation you described. If you were
to give -np = 2*x, you would get an error and the job would not be run.
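For example, with an allocation of 8 nodes at 1 CPU each, an invocation along
these lines (illustrative only; the process count and ./my_app are
placeholders, and the exact spelling of the option may vary between versions,
so check mpiexec --help) would refuse to start 16 processes instead of
silently doubling up on each node:

  mpiexec --nooversubscribe -np 16 ./my_app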

Ralph

On 8/3/06 7:47 AM, "Martin Schafföner"
 wrote:

> Hi all,

I noticed by accident that if I allocate only 1 of 2 CPUs on, say, x
> nodes 
with PBS, it is possible to run mpiexec with -np = 2*x, i.e. I can use
> both 
CPUs on each SMP node. This is not would I would expect or want. I would
> 
guess that this is due to the fact that even if the _one_ orted launched
> 
through PBS' TM interface can launch as many processes as you like which is
> 
not the case if one would launch MPI processes directly through the TM
> 
interface. How can this behavior be fixed?

Regards,
-- 
Martin
> Schafföner

Cognitive Systems Group, Institute of Electronics, Signal
> Processing and 
Communication Technologies, Department of Electrical
> Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391
> 6720063
PGP-Public-Key:
> http://ca.uni-magdeburg.de/certs/pgp0Cschaffoener64D8BEC0.asc

___
> 
users mailing
> list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





[OMPI users] seg fault in MPI_Comm_size

2006-08-03 Thread Peng Wang

Hi there,

How are you guys doing?

I got a seg fault at the beginning of the Aztec library that an application
is using. Please find the attached config.log.gz, the stack trace below, and
the ompi_info output:

The following stack trace repeats for each process:


Failing at addr:(nil)
Signal:11 info.si_errno:0(Success) si_code:128()
Failing at addr:(nil)
[0] func:pgeofe [0x606665]
[1] func:/lib64/tls/libpthread.so.0 [0x3f8460c420]
[2] func:pgeofe(MPI_Comm_size+0x4d) [0x549c5d]
[3] func:pgeofe(parallel_info+0x3e) [0x4fcffe]
[4] func:pgeofe(AZ_set_proc_config+0x34) [0x4e8594]
[5] func:pgeofe(MAIN__+0xb2) [0x44ba8a]
[6] func:pgeofe(main+0x32) [0x4420aa]
[7] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3f83b1c4bb]
[8] func:pgeofe [0x441fea]
*** End of error message ***
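(As a quick sanity check of the Open MPI installation itself, independent of
Aztec, a minimal program that makes the same calls as the failing frames above
can be useful; this is just a sketch, not part of the application:)

  /* minimal check: MPI_Init / MPI_Comm_size outside the application */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int size, rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);  /* the call that faults in the trace above */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }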



ompi_info output:


Open MPI: 1.1
   Open MPI SVN revision: r10477
Open RTE: 1.1
   Open RTE SVN revision: r10477
OPAL: 1.1
   OPAL SVN revision: r10477
  Prefix: /home/pewang/openmpi
 Configured architecture: x86_64-unknown-linux-gnu
   Configured by: pewang
   Configured on: Thu Aug  3 14:22:22 EDT 2006
  Configure host: bl-geol-karst.geology.indiana.edu
Built by: pewang
Built on: Thu Aug  3 14:34:17 EDT 2006
  Built host: bl-geol-karst.geology.indiana.edu
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: ifort
  Fortran77 compiler abs: /opt/intel/fce/9.0/bin/ifort
  Fortran90 compiler: ifort
  Fortran90 compiler abs: /opt/intel/fce/9.0/bin/ifort
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: no
  Thread support: posix (mpi: yes, progress: no)
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: yes
  MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1)
   MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1)
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1)
   MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1)
   MCA timer: linux (MCA v1.0, API v1.0, Component v1.1)
   MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
   MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
   MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
  MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
 MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1)
 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1)
 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1)
   MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1)
 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1)
 MCA pls: fork (MCA v1.0, API v1.0, Component v1.1)