Rolf,
I was able to run hostname on the two nodes that way,
and a simplified version of my test program (without a barrier)
also works. Only MPI_Barrier shows the bad behaviour.

Do you know what this message means?
[aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Does it give an idea of what the problem could be?
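
For reference: on Linux, errno 113 is normally EHOSTUNREACH ("No route to
host"), which usually points to a firewall or routing problem on the
interface the TCP BTL picked, rather than to MPI itself. One way to confirm
the symbolic name (the header path may differ by distribution):

  grep -w 113 /usr/include/asm-generic/errno.h
  # typically prints: #define EHOSTUNREACH 113 /* No route to host */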

Jody

On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
<rolf.vandeva...@sun.com> wrote:
>
>  This worked for me although I am not sure how extensive our 32/64
>  interoperability support is.  I tested on Solaris using the TCP
>  interconnect and a 1.2.5 version of Open MPI.  Also, we configure with
>  the --enable-heterogeneous flag, which may make a difference here.  Note
>  that this did not work for me over the sm btl.
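
A side note on the --enable-heterogeneous flag: it is a configure-time
option of Open MPI, and ompi_info should show whether an installed build
has it. A minimal sketch, with the install prefix as an example only:

  ompi_info | grep -i heterogeneous      # check the installed build
  ./configure --prefix=/opt/openmpi --enable-heterogeneous
  make all install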
>
>  By the way, can you run a simple /bin/hostname across the two nodes?
>
>
>   burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o
>  simple.32
>   burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o
>  simple.64
>   burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca
>  btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3
>  simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
>  [burl-ct-v20z-4]I am #0/6 before the barrier
>  [burl-ct-v20z-5]I am #3/6 before the barrier
>  [burl-ct-v20z-5]I am #4/6 before the barrier
>  [burl-ct-v20z-4]I am #1/6 before the barrier
>  [burl-ct-v20z-4]I am #2/6 before the barrier
>  [burl-ct-v20z-5]I am #5/6 before the barrier
>  [burl-ct-v20z-5]I am #3/6 after the barrier
>  [burl-ct-v20z-4]I am #1/6 after the barrier
>  [burl-ct-v20z-5]I am #5/6 after the barrier
>  [burl-ct-v20z-5]I am #4/6 after the barrier
>  [burl-ct-v20z-4]I am #2/6 after the barrier
>  [burl-ct-v20z-4]I am #0/6 after the barrier
>   burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
>  mpirun (Open MPI) 1.2.5r16572
>
>  Report bugs to http://www.open-mpi.org/community/help/
>   burl-ct-v20z-4 65 =>
>
>
>
>
>  jody wrote:
>  > I narrowed it down:
>  > The majority of processes get stuck in MPI_Barrier.
>  > My test application looks like this:
>  >
>  > #include <stdio.h>
>  > #include <unistd.h>
>  > #include "mpi.h"
>  >
>  > int main(int iArgC, char *apArgV[]) {
>  >     int iResult = 0;
>  >     int iRank1;
>  >     int iNum1;
>  >
>  >     char sName[256];
>  >     gethostname(sName, 255);
>  >
>  >     MPI_Init(&iArgC, &apArgV);
>  >
>  >     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>  >     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>  >
>  >     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
>  >     MPI_Barrier(MPI_COMM_WORLD);
>  >     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
>  >
>  >     MPI_Finalize();
>  >
>  >     return iResult;
>  > }
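
For reference, the 32- and 64-bit binaries can be built from a single
source file with the compiler's -m32/-m64 switches (as in Rolf's transcript
above); MPITest.c is only an assumed name for the listing here, and -m64
requires a 64-bit-capable toolchain:

  mpicc -m32 MPITest.c -o MPITest      # 32-bit binary (run on aim-plankton)
  mpicc -m64 MPITest.c -o MPITest64    # 64-bit binary (run on aim-fanta4)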
>  >
>  >
>  > If I make this call:
>  > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
>  > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
>  > ./run_gdb.sh ./MPITest64
>  >
>  > (run_gdb.sh is a script that starts gdb in an xterm for each process)
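
The actual run_gdb.sh was not posted; a minimal sketch of such a wrapper,
assuming xterm and gdb are available and DISPLAY is forwarded with
-x DISPLAY as in the command above, might look like this:

  #!/bin/sh
  # Hypothetical wrapper: open one xterm per rank and run the program under gdb.
  prog="$1"
  shift
  exec xterm -e gdb --args "$prog" "$@"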
>  > Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize;
>  > all other processes get stuck in PMPI_Barrier.
>  > Process 1 (on aim-plankton) displays the message
>  > [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>  > connect() failed with errno=113
>  > Process 2 (on aim-plankton) displays the same message twice.
>  >
>  > Any ideas?
>  >
>  >   Thanks Jody
>  >
>  > On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
>  >> Hi
>  >>  Using a more realistic application than a simple "Hello, world",
>  >>  even the --host version doesn't work correctly.
>  >>  Called this way:
>  >>
>  >>  mpirun -np 3 --host aim-plankton ./QHGLauncher
>  >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
>  >>  ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>  >>
>  >>  the application starts but seems to hang after a while.
>  >>
>  >>  Running the application in gdb:
>  >>
>  >>  mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
>  >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
>  >>  -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
>  >>  -o bruzlopf -n 12
>  >>  --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>  >>
>  >>  I can see that the processes on aim-fanta4 have indeed gotten stuck
>  >>  after a few initial outputs,
>  >>  and the processes on aim-plankton all show the message:
>  >>
>  >>  [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>  >>  connect() failed with errno=113
>  >>
>  >>  If I only use aim-plankton alone or aim-fanta4 alone, everything runs
>  >>  as expected.
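
Since each node works on its own but the cross-node TCP connections fail,
one thing worth trying is to pin the TCP BTL to an interface on which the
two nodes can actually reach each other (and to check that no firewall
blocks it). This is only a sketch; eth0 is a placeholder for the real
interface name:

  ping -c 1 aim-fanta4        # basic reachability check from aim-plankton
  mpirun -gmca btl tcp,self -gmca btl_tcp_if_include eth0 \
      -np 3 --host aim-plankton ./MPITest : -np 3 --host aim-fanta4 ./MPITest64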
>  >>
>  >>  BTW: I'm using Open MPI 1.2.2.
>  >>
>  >>  Thanks
>  >>   Jody
>  >>
>  >>
>  >> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
>  >>  > Hi
>  >>  >  In my network I have some 32-bit machines and some 64-bit machines.
>  >>  >  With --host I successfully call my application:
>  >>  >   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
>  >>  >  -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>  >>  >  (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine)
>  >>  >
>  >>  >  But when I use hostfiles:
>  >>  >   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
>  >>  >  -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>  >>  >  all 6 processes are started on the 64-bit machine aim-fanta4.
>  >>  >
>  >>  >  hosts32:
>  >>  >    aim-plankton slots=3
>  >>  >  hosts64:
>  >>  >   aim-fanta4 slots
>  >>  >
>  >>  >  Is this a bug or a feature?  ;)
>  >>  >
>  >>  >  Jody
>  >>  >
>  >>
>
>
>  --
>
>  =========================
>  rolf.vandeva...@sun.com
>  781-442-3043
>  =========================
>
