Micha --

(re-digging up this really, really old issue because Manuel just pointed me at the Debian bug for the same issue: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=524553)

Can you confirm that this is still an issue on the latest Open MPI?  If so, it should probably piggyback onto these Open MPI tickets:

    https://svn.open-mpi.org/trac/ompi/ticket/2045
    https://svn.open-mpi.org/trac/ompi/ticket/2383
    https://svn.open-mpi.org/trac/ompi/ticket/1983
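If I'm reading the diagram in your mail correctly, the root of it is that each node's /etc/hosts gives a different address for the same peer.  Hypothetical excerpts consistent with your diagram (your actual files may differ):

    # on hubert
    192.168.1.2   fry

    # on leela
    192.168.4.1   fry

In the meantime, one way to take name resolution out of the picture is to pin the TCP BTL (and the OOB) to specific interfaces via MCA parameters -- a sketch, where eth0 is a placeholder for whatever interface each node actually uses:

    mpiexec --mca btl_tcp_if_include eth0 \
            --mca oob_tcp_if_include eth0 \
            -np 8 -H fry,leela test_mpi

(Recent versions also accept CIDR subnets, e.g. --mca btl_tcp_if_include 192.168.4.0/24.)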
On Apr 17, 2009, at 8:45 PM, Micha Feigin wrote:

> I am having problems running Open MPI 1.3 on my cluster and I was
> wondering if anyone else is seeing this problem and/or can give hints on
> how to solve it.
>
> As far as I understand the error, mpiexec resolves host names on the
> master node it is run on instead of on each host separately. This works
> in an environment where each hostname resolves to the same address on
> every host (a cluster connected via a switch) but fails where it resolves
> to different addresses (ring/star setups, for example, where each
> computer is connected directly to all/some of the others).
>
> I'm not 100% sure that this is the problem, as I'm seeing success in a
> single case where this should probably fail, but it is my best guess from
> the error message.
>
> Version 1.2.8 worked fine for the same simple program (a simple hello
> world that just communicated the computer name for each process).
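> In outline, the test program does nothing more than the following (a
> minimal sketch, not necessarily the exact code):
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         char name[MPI_MAX_PROCESSOR_NAME];
>         int rank, size, len;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         MPI_Get_processor_name(name, &len);
>
>         if (rank == 0) {
>             /* "server" process: collect and print each peer's host name */
>             printf("Hello MPI from the server process of %d on %s!\n",
>                    size, name);
>             for (int i = 1; i < size; ++i) {
>                 char peer[MPI_MAX_PROCESSOR_NAME];
>                 MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>                 printf("Hello MPI from process %d of %d on %s!\n",
>                        i, size, peer);
>             }
>         } else {
>             /* the TCP connection back to rank 0 is opened lazily at this
>                first send -- which is where the errors below show up */
>             MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0,
>                      MPI_COMM_WORLD);
>         }
>
>         MPI_Finalize();
>         return 0;
>     }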
>
> An example output:
>
> mpiexec is run on the master node hubert and is set to run the processes
> on two nodes, fry and leela. As is understood from the error messages,
> leela tries to connect to fry on address 192.168.1.2, which is its
> address on hubert but not on leela (where it is 192.168.4.1).
>
> This is a four-node cluster, all interconnected:
>
>  192.168.1.1              192.168.1.2
>   hubert ------------------------ fry
>      | \                      / |  192.168.4.1
>      |  \                    /  |
>      |   \                  /   |
>      |    \                /    |
>      |     /              \     |
>      |    /                \    |
>      |   /                  \   |
>      |  /                    \  |  192.168.4.2
>   hermes ----------------------- leela
>
> =================================================================
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt /
> mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> =================================================================
>
> This seems to be a directional issue: running the program with
> -H fry,leela fails where -H leela,fry works. The same behaviour holds for
> all scenarios except those that include the master node (hubert), where
> the external IP (from an external DNS) is resolved instead of the
> internal IP (from the hosts file); thus one direction fails (there is no
> external connection at the moment for all but the master) and the other
> causes a lockup.
>
> I hope that the explanation is not too convoluted.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/