On a CentOS Linux box, I see the following:
> grep 113 /usr/include/asm-i386/errno.h
#define EHOSTUNREACH 113 /* No route to host */
I have also seen folks do this to figure out what an errno value means:
> perl -e 'die$!=113'
No route to host at -e line 1.
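Another quick way to decode it, if you prefer C, is a tiny program built around
strerror(); something like this prints the same text:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* print the system's message for errno 113 */
    printf("errno 113: %s\n", strerror(113));
    return 0;
}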
I am not sure why this is happening, but you could also check the Open
MPI users mailing list archives, where there are other examples of
people running into this error. A search for "113" turned up a few hits.
http://www.open-mpi.org/community/lists/users
Also, I assume you would see this problem with or without the
MPI_Barrier if you add this parameter to your mpirun line:
--mca mpi_preconnect_all 1
The MPI_Barrier is triggering the bad behavior because, by default,
connections are set up lazily. Only when the MPI_Barrier call is made and
the processes start communicating and establishing connections do we see
the communication problems.
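For example, your --host invocation would become something like:
mpirun --mca mpi_preconnect_all 1 -np 3 --host aim-plankton ./MPITest : \
    -np 3 --host aim-fanta4 ./MPITest64
With the connections forced up front, the connect() failure should show up
during MPI_Init rather than only at the barrier.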
Rolf
jody wrote:
Rolf,
I was able to run hostname on the two nodes that way,
and also a simplified version of my test program (without a barrier)
works. Only MPI_Barrier shows bad behaviour.
Do you know what this message means?
[aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Does it give an idea what could be the problem?
Jody
On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
<rolf.vandeva...@sun.com> wrote:
This worked for me although I am not sure how extensive our 32/64
interoperability support is. I tested on Solaris using the TCP
interconnect and a 1.2.5 version of Open MPI. Also, we configure with
the --enable-heterogeneous flag, which may make a difference here. Note that
this did not work for me over the sm btl.
By the way, can you run a simple /bin/hostname across the two nodes?
burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o
simple.32
burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o
simple.64
burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca
btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3
simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
[burl-ct-v20z-4]I am #0/6 before the barrier
[burl-ct-v20z-5]I am #3/6 before the barrier
[burl-ct-v20z-5]I am #4/6 before the barrier
[burl-ct-v20z-4]I am #1/6 before the barrier
[burl-ct-v20z-4]I am #2/6 before the barrier
[burl-ct-v20z-5]I am #5/6 before the barrier
[burl-ct-v20z-5]I am #3/6 after the barrier
[burl-ct-v20z-4]I am #1/6 after the barrier
[burl-ct-v20z-5]I am #5/6 after the barrier
[burl-ct-v20z-5]I am #4/6 after the barrier
[burl-ct-v20z-4]I am #2/6 after the barrier
[burl-ct-v20z-4]I am #0/6 after the barrier
burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
mpirun (Open MPI) 1.2.5r16572
Report bugs to http://www.open-mpi.org/community/help/
burl-ct-v20z-4 65 =>
jody wrote:
> I narrowed it down:
> The majority of processes get stuck in MPI_Barrier.
> My test application looks like this:
>
> #include <stdio.h>
> #include <unistd.h>
> #include "mpi.h"
>
> int main(int iArgC, char *apArgV[]) {
>     int iResult = 0;
>     int iRank1;
>     int iNum1;
>
>     char sName[256];
>     gethostname(sName, 255);
>
>     MPI_Init(&iArgC, &apArgV);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>
>     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
>
>     MPI_Finalize();
>
>     return iResult;
> }
>
>
> If I make this call:
> mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
> ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
> ./run_gdb.sh ./MPITest64
>
> (run_gdb.sh is a script which starts gdb in a xterm for each process)
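> A minimal wrapper that does this, assuming xterm and gdb are installed,
> could be as simple as:
>
> #!/bin/sh
> # minimal sketch of such a wrapper (not necessarily the exact script):
> # run the given program (and its arguments) under gdb, each rank in its
> # own xterm window
> exec xterm -e gdb --args "$@"
>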
> Process 0 (on aim-plankton) passes the barrier and gets stuck in
> PMPI_Finalize,
> all other processes get stuck in PMPI_Barrier,
> Process 1 (on aim-plankton) displays the message
>
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
> Process 2 (on aim-plankton) displays the same message twice.
>
> Any ideas?
>
> Thanks Jody
>
> On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
>> Hi
>> Using a more realistic application than a simple "Hello, world",
>> even the --host version doesn't work correctly.
>> Called this way:
>>
>> mpirun -np 3 --host aim-plankton ./QHGLauncher
>> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
>> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>>
>> the application starts but seems to hang after a while.
>>
>> Running the application in gdb:
>>
>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
>> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
>> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
>> -o bruzlopf -n 12
>> --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>>
>> I can see that the processes on aim-fanta4 have indeed gotten stuck
>> after a few initial outputs,
>> and the processes on aim-plankton all show this message:
>>
>>
>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>>
>> If I use only aim-plankton or only aim-fanta4, everything runs
>> as expected.
>>
>> BTW: I'm using Open MPI 1.2.2
>>
>> Thanks
>> Jody
>>
>>
>> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
>> > Hi
>> > In my network I have some 32-bit machines and some 64-bit machines.
>> > With --host I successfully call my application:
>> > mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
>> > -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>> > (MPITest64 has the same code as MPITest, but was compiled on the 64-bit
>> > machine)
>> >
>> > But when I use hostfiles:
>> > mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
>> > -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>> > all 6 processes are started on the 64-bit machine aim-fanta4.
>> >
>> > hosts32:
>> > aim-plankton slots=3
>> > hosts64:
>> > aim-fanta4 slots
>> >
>> > Is this a bug or a feature? ;)
>> >
>> > Jody
>> >
>>
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users