On Jul 6, 2007, at 7:37 AM, SLIM H.A. wrote:

Dear Michael

I have now tried both

mpirun --mca btl mx,sm -np 4 ./cpi

which gives the same error message again, and,

mpirun --mca btl mx,sm,self -np 4 ./cpi_gcc_ompi_mx

actually locks some of the mx ports, but not all 4; this is the output from mx_endpoint_info:

1 Myrinet board installed.
The MX driver is configured to support up to 4 endpoints on 4 boards.
===================================================================
Board #0:
Endpoint         PID             Command                 Info
<raw>           5061            mx_mapper
0               20315           cpi
There are currently 1 regular endpoint open

This is the output from the node:
>mpirun --mca btl mx,sm,self -np 4 ./cpi_gcc_ompi_mx
[node001:20312] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:20314] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:20313] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Thanks

Henk



From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Michael Edwards
Sent: 05 July 2007 18:06
To: Open MPI Users
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy

If the machine is multi-processor you might want to add the sm btl. That cleared up some similar problems for me, though I don't use mx, so your mileage may vary.
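For example, something along these lines (using the same ./cpi test binary, and keeping "self" in the btl list as the error text below suggests):

mpirun --mca btl mx,sm,self -np 4 ./cpi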

On 7/5/07, SLIM H.A. <h.a.s...@durham.ac.uk> wrote:
Hello

I have compiled openmpi-1.2.3 with the --with-mx=<directory>
configuration option and the gcc compiler. When testing with 4-8 slots I get
an error message that the mx ports are busy:

>mpirun --mca btl mx,self -np 4 ./cpi
[node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
--------------------------------------------------------------------------
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
... snipped
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 10071 on node node001 exited on
signal 1 (Hangup).


I would not expect mx messages, as communication should not need to go through
the mx card (this is a twin dual-core, shared-memory node).
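(As a sanity check, a single-node run that leaves the mx btl out entirely, assuming the same ./cpi binary, would be something like

mpirun --mca btl sm,self -np 4 ./cpi

which should show whether intra-node traffic works without touching the card.)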
The same happens when testing on 2 nodes, using a hostfile.
I checked the state of the mx card with mx_endpoint_info and mx_info,
they are healthy and free.
What is missing here?

Thanks

Henk

Henk,

OMPI is successfully opening one endpoint and the other three fail with MX_BUSY (error 20). This might happen if they are all trying to open the same endpoint ID. OMPI normally does not do this. I do not see a hostfile or host parameters specified. What is OMPI using for a machinefile?
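For reference, an Open MPI hostfile is just a plain-text list of node names, one per line, optionally with a slot count; the entries below are only placeholders:

node001 slots=4
node002 slots=4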

Also, could you try creating a host file named "hosts" with the names of your machines and then try:

$ mpirun -np 2 --hostfile hosts ./cpi

and then

$ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi
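(The second command selects the cm PML, which should drive MX through its matching interface rather than the mx BTL, so it helps isolate whether the problem is specific to the BTL path.)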

Scott

