-----Original Message-----
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
Sent: 09 July 2007 16:34
To: Open MPI Users
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
SLIM H.A. wrote:
Dear Tim and Scott
I followed the suggestions made:
So you should either pass '-mca btl mx,sm,self', or just pass nothing at
all. Open MPI is fairly smart at figuring out what components to use, so
you really should not need to specify anything.
Using
node001>mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile ./cpi
connects to some of the mx ports, not all 4, but the program runs:
[node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with status=20
I finally figured out the problem here. Open MPI now has 2 different
network stacks, only one of which can be used at a time: the mtl and the
btl. Both the mx btl and the mx mtl are being opened, and each opens an
endpoint. The mtl is then closed because it will not be used, which
releases its endpoint. In the meantime, however, the endpoints are
exhausted while other processes are still trying to open them.
There are two solutions:
1. Increase the number of available endpoints. According to the Myrinet
documentation, upping the limit to 16 or so should have no performance
impact.
2. Alternatively, you can tell the mx mtl not to run with '-mca mtl ^mx'.
So, you should just be able to run:
mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi
and it should work.
It spawned 4 processes on node001. Passing nothing at all gave the
same problem.
Also, could you try creating a host file named "hosts" with the names of
your machines and then try:
$ mpirun -np 2 --hostfile hosts ./cpi
and then
$ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi
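(For reference, such a "hosts" file is just a list of machine names, one
per line; with the node names that appear elsewhere in this thread it
might look like:
node001
node002
)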
node001>mpirun -np 2 -hostfile ompi_machinefile ./cpi_gcc_ompi_mx
works, but increasing to 4 cores again uses fewer than 4 ports. Finally,
node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
is successful even for -np 4. From here I tried 2 nodes:
node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
This gave:
orted: Command not found.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[node001:04585] ERROR: A daemon on node node002 failed to start as expected.
[node001:04585] ERROR: There may be more information available from
[node001:04585] ERROR: the remote shell (see above).
[node001:04585] ERROR: The daemon exited unexpectedly with status 1.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
The problem is that Open MPI cannot find the 'orted' executable on the
remote node. Is the Open MPI install available on the remote node? Try:
ssh remote_node which orted
This should locate the 'orted' program. If it does not, you may need to
modify your PATH, as described here:
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
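For example, assuming the install prefix reported by ompi_info further
down in this thread (paths are site-specific, so adjust to your setup),
the shell startup file read by non-interactive logins on each node would
need something along these lines:
export PATH=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib:$LD_LIBRARY_PATH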
Hope this helps,
Tim
Apparently orted is not started up properly. Is something missing in the
installation?
Thanks
Henk
-----Original Message-----
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
Sent: 06 July 2007 15:59
To: Open MPI Users
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
Henk,
On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
Dear Tim
I followed the use of "--mca btl mx,self" as suggested in the FAQ
http://www.open-mpi.org/faq/?category=myrinet#myri-btl
Yeah, that FAQ is wrong. I am working right now to fix it up.
It should be updated this afternoon.
When I use your extra mca value I get:
mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
--------------------------------------------------------------------------
WARNING: A user-supplied value attempted to override the read-only MCA
parameter named "btl_mx_shared_mem".
The user-supplied value was ignored.
Oops, on the 1.2 branch this is a read-only parameter. On the current
trunk the user can change it. Sorry for the confusion. Oh well, you should
probably use Open MPI's shared memory support instead anyway.
So you should either pass '-mca btl mx,sm,self', or just pass nothing at
all. Open MPI is fairly smart at figuring out what components to use, so
you really should not need to specify anything.
followed by the same error messages as before.
Note that although I add "self", the error messages complain about it
missing:
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have forgotten a
component (such as "self") in the list of usable components.
I checked the output from mx_info for both the current node and another;
there seems not to be a problem.
I attach the output from ompi_info --all. Also:
ompi_info | grep mx
Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
As a further check, I rebuilt the executable with MPICH and that works
fine on the same node over Myrinet. I wonder whether mx is properly
included in my openmpi build.
Using ldd -v on the MPICH executable gives references to
libmyriexpress.so, which is not the case for the ompi-built executable,
suggesting something is missing?
No, this is expected behavior. The Open MPI executables are not linked to
libmyriexpress.so, only the mx components are. So if you do an ldd on
/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
this should show the Myrinet library.
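For example (the grep is only there to pick out the Myrinet entry):
ldd /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so | grep myriexpress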
I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
and a listing of that directory is
ls /usr/local/Cluster-Apps/mx/mx-1.1.1
bin etc include lib lib32 lib64 sbin
This should be sufficient; I don't need --with-mx-libdir?
Correct.
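(For reference, the corresponding build would have been configured
roughly as follows, using the prefix and MX path quoted in this thread;
any other options that were actually passed are omitted here:
./configure --prefix=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3 \
            --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
make all install
)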
Hope this helps,
Tim
Thanks
Henk
-----Original Message-----
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
Sent: 05 July 2007 18:16
To: Open MPI Users
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
Hi Henk,
By specifying '--mca btl mx,self' you are telling Open MPI not to use
its shared memory support. If you want to use Open MPI's shared memory
support, you must add 'sm' to the list, i.e. '--mca btl mx,sm,self'. If
you would rather use MX's shared memory support, instead use
'--mca btl mx,self --mca btl_mx_shared_mem 1'. However, in most cases I
believe Open MPI's shared memory support is a bit better.
Alternatively, if you don't specify any btls, Open MPI should figure
out what to use automatically.
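(To summarize the alternatives just described, using the '-np 4 ./cpi'
invocation from elsewhere in this thread as the example:
mpirun --mca btl mx,sm,self -np 4 ./cpi                          # Open MPI's shared memory support
mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi   # MX's shared memory support; read-only on the 1.2 branch, as noted earlier in this thread
mpirun -np 4 ./cpi                                               # let Open MPI pick components automatically
)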
Hope this helps,
Tim
SLIM H.A. wrote:
Hello
I have compiled openmpi-1.2.3 with the --with-mx=<directory> configure
option and the gcc compiler. On testing with 4-8 slots I get an error
message that the mx ports are busy:
mpirun --mca btl mx,self -np 4 ./cpi
[node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
--------------------------------------------------------------------------
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have forgotten a
component (such as "self") in the list of usable components.
... snipped
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 10071 on node node001 exited
on signal 1 (Hangup).
I would not expect mx messages, as communication should not go through
the mx card? (This is a twin dual-core shared memory node.) The same
happens when testing on 2 nodes, using a hostfile.
I checked the state of the mx card with mx_endpoint_info and mx_info;
they are healthy and free.
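(Both diagnostic tools can be run directly on the node to confirm the
card and its endpoints look healthy, e.g.:
node001>mx_info
node001>mx_endpoint_info
)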
What is missing here?
Thanks
Henk
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users