Micha --

(re-digging up this really, really old issue because Manuel just pointed me at 
the Debian bug for the same issue: 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=524553)

Can you confirm that this is still an issue on the latest Open MPI?

If so, it should probably piggyback onto these Open MPI tickets:

    https://svn.open-mpi.org/trac/ompi/ticket/2045
    https://svn.open-mpi.org/trac/ompi/ticket/2383
    https://svn.open-mpi.org/trac/ompi/ticket/1983



On Apr 17, 2009, at 8:45 PM, Micha Feigin wrote:

> I am having problems running Open MPI 1.3 on my cluster, and I was wondering
> if anyone else is seeing this problem and/or can give hints on how to solve it.
> 
> As far as I understand the error, mpiexec resolves host names on the master
> node it is run on, instead of on each host separately. This works in an
> environment where each hostname resolves to the same address on every host
> (a cluster connected via a switch), but fails where it resolves to different
> addresses (ring/star setups, for example, where each computer is connected
> directly to all/some of the others).
> 
> I'm not 100% sure that this is the problem, as I'm seeing success in a single
> case where this should probably fail, but it is my best guess from the error
> message.
> 
> Version 1.2.8 worked fine with the same simple program (a simple hello world
> that just communicates the host name of each process).
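> 
> For reference, a minimal sketch of what such a test program might look like
> (the actual test_mpi source isn't shown here, so this is only a reconstruction
> of a hello world where each process reports its host name to rank 0):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         int rank, size, len, i;
>         char name[MPI_MAX_PROCESSOR_NAME];
>         char buf[MPI_MAX_PROCESSOR_NAME];
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         MPI_Get_processor_name(name, &len);
> 
>         if (rank == 0) {
>             /* rank 0 prints its own host name, then collects the others */
>             printf("Hello MPI from the server process of %d on %s!\n", size, name);
>             for (i = 1; i < size; i++) {
>                 MPI_Recv(buf, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>                 printf("Hello MPI from process %d of %d on %s!\n", i, size, buf);
>             }
>         } else {
>             /* every other rank sends its host name to rank 0 */
>             MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }
> 
> In this sketch, the first output line below ("Hello MPI from the server
> process of 8 on fry!") is what rank 0 prints, and the failing MPI_Send calls
> on leela would correspond to workers trying to deliver their host names to
> rank 0 on fry.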
> 
> An example output:
> 
> mpiexec is run on the master node hubert and is set to run the processes on
> two nodes, fry and leela. As I understand from the error messages, leela tries
> to connect to fry on address 192.168.1.2, which is fry's address as seen from
> hubert but not from leela (where it is 192.168.4.1).
> 
> This is a four-node cluster, all nodes directly interconnected:
> 
>     192.168.1.1      192.168.1.2
> hubert ------------------------ fry
>   |    \                    /    | 192.168.4.1
>   |       \              /       |
>   |          \        /          |
>   |             \  /             |
>   |             /  \             |
>   |          /        \          |
>   |       /              \       |
>   |    /                     \   | 192.168.4.2
> hermes ------------------------ leela
> 
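> For concreteness, the hosts entries would be along these lines (an
> illustrative sketch based on the diagram above, not the actual files):
> 
>     # /etc/hosts on hubert (excerpt)
>     192.168.1.1   hubert
>     192.168.1.2   fry
> 
>     # /etc/hosts on leela (excerpt)
>     192.168.4.1   fry
>     192.168.4.2   leela
> 
> So an address for fry obtained on hubert (192.168.1.2) sits on a link that
> leela has no route to; from leela, fry is only reachable as 192.168.4.1.
> 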
> =================================================================
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> =================================================================
> 
> This seems to be a directional issue: running the program with -H fry,leela
> fails, while -H leela,fry works. The same behaviour holds for all scenarios
> except those that include the master node (hubert), where the name resolves
> to the external IP (from an external DNS) instead of the internal IP (from
> the hosts file). Thus one direction fails (there is no external connection at
> the moment for all nodes but the master) and the other causes a lockup.
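> 
> One way to see the difference is to compare how each node resolves the names,
> for example (illustrative commands only; I have not included their output
> here):
> 
>     # run on each of hubert, fry and leela and compare the answers
>     getent hosts fry
>     getent hosts hubert
> 
> Whether the /etc/hosts entry or the external DNS answer wins for hubert
> depends on the "hosts:" line in /etc/nsswitch.conf on each node.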
> 
> I hope that the explanation is not too convoluted.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

