Rolf, I was able to run hostname on the two nodes that way, and a simplified version of my test program (without a barrier) also works. Only MPI_Barrier shows the bad behaviour.
Do you know what this message means?

  [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
  connect() failed with errno=113

Does it give an idea what could be the problem? (A quick way to decode the
errno value, plus a small MPI-free connectivity check, is sketched at the
very end of this message, after the quoted thread.)

Jody

On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart <rolf.vandeva...@sun.com> wrote:
>
> This worked for me, although I am not sure how extensive our 32/64
> interoperability support is.  I tested on Solaris using the TCP
> interconnect and a 1.2.5 version of Open MPI.  Also, we configure with
> the --enable-heterogeneous flag, which may make a difference here.  Also,
> this did not work for me over the sm btl.
>
> By the way, can you run a simple /bin/hostname across the two nodes?
>
> burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
> burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
> burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1
>   -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32
>   : -host burl-ct-v20z-5 -np 3 simple.64
> [burl-ct-v20z-4]I am #0/6 before the barrier
> [burl-ct-v20z-5]I am #3/6 before the barrier
> [burl-ct-v20z-5]I am #4/6 before the barrier
> [burl-ct-v20z-4]I am #1/6 before the barrier
> [burl-ct-v20z-4]I am #2/6 before the barrier
> [burl-ct-v20z-5]I am #5/6 before the barrier
> [burl-ct-v20z-5]I am #3/6 after the barrier
> [burl-ct-v20z-4]I am #1/6 after the barrier
> [burl-ct-v20z-5]I am #5/6 after the barrier
> [burl-ct-v20z-5]I am #4/6 after the barrier
> [burl-ct-v20z-4]I am #2/6 after the barrier
> [burl-ct-v20z-4]I am #0/6 after the barrier
> burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
> mpirun (Open MPI) 1.2.5r16572
>
> Report bugs to http://www.open-mpi.org/community/help/
> burl-ct-v20z-4 65 =>
>
> jody wrote:
> > I narrowed it down:
> > the majority of processes get stuck in MPI_Barrier.
> > My test application looks like this:
> >
> > #include <stdio.h>
> > #include <unistd.h>
> > #include "mpi.h"
> >
> > int main(int iArgC, char *apArgV[]) {
> >     int iResult = 0;
> >     int iRank1;
> >     int iNum1;
> >
> >     char sName[256];
> >     gethostname(sName, 255);
> >
> >     MPI_Init(&iArgC, &apArgV);
> >
> >     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
> >     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
> >
> >     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
> >     MPI_Barrier(MPI_COMM_WORLD);
> >     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
> >
> >     MPI_Finalize();
> >
> >     return iResult;
> > }
> >
> > If I make this call:
> >
> >   mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32
> >     : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> >
> > (run_gdb.sh is a script which starts gdb in an xterm for each process)
> > process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize,
> > all other processes get stuck in PMPI_Barrier.
> > Process 1 (on aim-plankton) displays the message
> >
> >   [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> >   connect() failed with errno=113
> >
> > Process 2 (on aim-plankton) displays the same message twice.
> >
> > Any ideas?
> >
> > Thanks
> >   Jody
> >
> > On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
> >> Hi
> >> Using a more realistic application than a simple "Hello, world",
> >> even the --host version doesn't work correctly.
> >> Called this way
> >>
> >>   mpirun -np 3 --host aim-plankton ./QHGLauncher
> >>     --read-config=pureveg_new.cfg -o output.txt
> >>     : -np 3 --host aim-fanta4 ./QHGLauncher_64
> >>     --read-config=pureveg_new.cfg -o output.txt
> >>
> >> the application starts but seems to hang after a while.
> >>
> >> Running the application in gdb:
> >>
> >>   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
> >>     --read-config=pureveg_new.cfg -o output.txt
> >>     : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64
> >>     --read-config=pureveg_new.cfg -o bruzlopf -n 12
> >>     --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
> >>
> >> I can see that the processes on aim-fanta4 have indeed gotten stuck
> >> after a few initial outputs, and the processes on aim-plankton all
> >> have the message:
> >>
> >>   [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> >>   connect() failed with errno=113
> >>
> >> If I only use aim-plankton alone or aim-fanta4 alone, everything runs
> >> as expected.
> >>
> >> BTW: I'm using Open MPI 1.2.2.
> >>
> >> Thanks
> >>   Jody
> >>
> >> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
> >> > Hi
> >> > In my network I have some 32-bit machines and some 64-bit machines.
> >> > With --host I successfully call my application:
> >> >
> >> >   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest
> >> >     : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> >> >
> >> > (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine.)
> >> >
> >> > But when I use hostfiles:
> >> >
> >> >   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest
> >> >     : -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
> >> >
> >> > all 6 processes are started on the 64-bit machine aim-fanta4.
> >> >
> >> > hosts32:
> >> >   aim-plankton slots=3
> >> > hosts64:
> >> >   aim-fanta4 slots
> >> >
> >> > Is this a bug or a feature? ;)
> >> >
> >> > Jody
>
> --
> =========================
> rolf.vandeva...@sun.com
> 781-442-3043
> =========================
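
P.S. on the errno value asked about at the top: on Linux, errno 113 is
EHOSTUNREACH ("No route to host"), i.e. the kernel found no route to the
peer address the TCP BTL tried to connect to; that usually points at a
firewall or at an interface (a second NIC, a virtual interface) that the
other node cannot reach. A minimal, MPI-free way to confirm what the number
means on your own systems:

  /* Minimal sketch: ask the C library what errno 113 means on this box.
   * On Linux this prints "No route to host" and EHOSTUNREACH == 113. */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      printf("errno 113    -> %s\n", strerror(113));
      printf("EHOSTUNREACH -> %d\n", EHOSTUNREACH);
      return 0;
  }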
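
And a standalone sketch (plain sockets, not Open MPI code) that mimics what
btl_tcp_endpoint.c is doing when it reports the error: attempt a TCP
connect() from one node to an address of the other and print the errno
text. The address and port below are placeholders; substitute the IP of the
interface the BTL actually uses (cf. btl_tcp_if_include in Rolf's command
line). "No route to host" here means the network/firewall setup between the
chosen interfaces is the problem; "Connection refused" means the host is
reachable and only that particular port had no listener.

  /* Hypothetical connectivity check, independent of MPI.  Build with
   * "gcc conncheck.c -o conncheck" and run e.g. "./conncheck 192.168.0.5 1024"
   * from one node towards an address of the other (both values are
   * placeholders to replace with your own). */
  #include <arpa/inet.h>
  #include <errno.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(int argc, char *argv[]) {
      const char *ip   = (argc > 1) ? argv[1] : "192.168.0.5"; /* placeholder */
      int         port = (argc > 2) ? atoi(argv[2]) : 1024;    /* placeholder */

      int fd = socket(AF_INET, SOCK_STREAM, 0);
      if (fd < 0) { perror("socket"); return 1; }

      struct sockaddr_in addr;
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port   = htons(port);
      if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
          fprintf(stderr, "bad IPv4 address: %s\n", ip);
          return 1;
      }

      if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          /* errno 113 here corresponds to EHOSTUNREACH ("No route to host") */
          printf("connect() to %s:%d failed with errno=%d (%s)\n",
                 ip, port, errno, strerror(errno));
      } else {
          printf("connect() to %s:%d succeeded\n", ip, port);
      }
      close(fd);
      return 0;
  }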