I am having problems running Open MPI 1.3 on my cluster and I was wondering if
anyone else is seeing this problem and/or can give hints on how to solve it.

As far as I understand the error, mpiexec resolves host names on the master node
it is run on instead of on each host separately. This works in an environment where
each hostname resolves to the same address on every host (a cluster connected via a
switch) but fails where it resolves to different addresses (ring/star setups, for
example, where each computer is connected directly to all/some of the others).
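
One way to check how a name resolves on a given node is to query it from each host; assuming the standard getent tool is available on the nodes, something like:

    ssh hubert getent hosts fry
    ssh leela  getent hosts fry

If the two commands print different addresses for fry, the resolution really is host-dependent.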

I'm not 100% sure that this is the problem, as I'm seeing success in a single
case where this should probably fail, but it is my best guess from the error
message.

Version 1.2.8 worked fine for the same simple program (a simple hello world that
just communicates the computer name for each process).
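
For reference, the test program is essentially a minimal sketch along these lines (reconstructed, not the exact source; rank 0 acts as the server and collects each worker's processor name):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        if (rank == 0) {
            /* the "server" process prints its own name and collects the rest */
            printf("Hello MPI from the server process of %d on %s!\n", size, name);
            for (i = 1; i < size; i++) {
                char peer[MPI_MAX_PROCESSOR_NAME];
                MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("Hello MPI from process %d of %d on %s!\n", i, size, peer);
            }
        } else {
            /* every other process sends its processor name to the server */
            MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }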

An example output:

mpiexec is run on the master node hubert and is set to run the processes on two
nodes, fry and leela. As I read the error messages, leela tries to connect to
fry on address 192.168.1.2, which is fry's address as resolved on hubert but not
on leela (where it is 192.168.4.1).

This is a four-node cluster, all nodes interconnected:

    192.168.1.1      192.168.1.2
hubert ------------------------ fry
  |    \                    /    | 192.168.4.1
  |       \              /       |
  |          \        /          |
  |             \  /             |
  |             /  \             |
  |          /        \          |
  |       /              \       |
  |    /                     \   | 192.168.4.2
hermes ----------------------- leela
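
The names come from each node's own hosts file, so the entry for fry differs per node; simplified (only the fry lines shown), it is something like:

    # /etc/hosts on hubert -- fry as seen over the hubert-fry link
    192.168.1.2   fry

    # /etc/hosts on leela -- fry as seen over the fry-leela link
    192.168.4.1   fry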

=================================================================
mpiexec -np 8 -H fry,leela test_mpi
Hello MPI from the server process of 8 on fry!
[[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

[[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

[[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

[leela:4436] *** An error occurred in MPI_Send
[leela:4436] *** on communicator MPI_COMM_WORLD
[leela:4436] *** MPI_ERR_INTERN: internal error
[leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 4433 on
node leela exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
=================================================================

This seems to be a directional issue: running the program with -H fry,leela fails
where -H leela,fry works. The behaviour is the same for all scenarios except those
that include the master node (hubert), where the hostname resolves to the external
IP (from an external DNS) instead of the internal IP (from the hosts file); thus
one direction fails (at the moment there is no external connection for any node
but the master) and the other causes a lockup.
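
In other words (same program and host list as above):

    mpiexec -np 8 -H fry,leela test_mpi    # fails as in the log above
    mpiexec -np 8 -H leela,fry test_mpi    # works

(Would something like the btl_tcp_if_include / btl_tcp_if_exclude MCA parameters help here, or do they only restrict which local interfaces the TCP BTL uses?)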

I hope that the explanation is not too convoluted.
