I narrowed it down: the majority of processes get stuck in MPI_Barrier. My test application looks like this:
#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iRank1;
    int iNum1;
    char sName[256];

    /* grab the host name so each rank can report which machine it is on */
    gethostname(sName, 255);

    MPI_Init(&iArgC, &apArgV);
    MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
    MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

    printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

    MPI_Finalize();
    return iResult;
}

If I make this call:

mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in an xterm for each process)

then process 0 (on aim-plankton) passes the barrier but gets stuck in PMPI_Finalize, while all other processes get stuck in PMPI_Barrier. Process 1 (on aim-plankton) displays the message

[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

(errno 113 on Linux is EHOSTUNREACH, "No route to host"), and process 2 (on aim-plankton) displays the same message twice.

Any ideas?

Thanks
  Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
> Hi
> Using a more realistic application than a simple "Hello, world",
> even the --host version doesn't work correctly. Called this way:
>
> mpirun -np 3 --host aim-plankton ./QHGLauncher
> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>
> the application starts but seems to hang after a while.
>
> Running the application in gdb:
>
> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
> -o bruzlopf -n 12
> --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>
> I can see that the processes on aim-fanta4 have indeed gotten stuck
> after a few initial outputs, and the processes on aim-plankton all
> have the message:
>
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> If I only use aim-plankton alone or aim-fanta4 alone, everything runs
> as expected.
>
> BTW: I'm using Open MPI 1.2.2.
>
> Thanks
> Jody
>
>
> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
> > Hi
> > In my network I have some 32-bit machines and some 64-bit machines.
> > With --host I successfully call my application:
> > mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
> > -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> > (MPITest64 has the same code as MPITest, but was compiled on the
> > 64-bit machine)
> >
> > But when I use hostfiles:
> > mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
> > -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
> > all 6 processes are started on the 64-bit machine aim-fanta4.
> >
> > hosts32:
> > aim-plankton slots=3
> > hosts64:
> > aim-fanta4 slots
> >
> > Is this a bug or a feature? ;)
> >
> > Jody
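
For reference: run_gdb.sh is just a small wrapper that opens an xterm running gdb on the given executable, one window per rank. A minimal sketch of such a wrapper (the real script may use different xterm/gdb options) would be:

#!/bin/sh
# run_gdb.sh (sketch): open an xterm that runs gdb on the program given as
# the first argument, passing any remaining arguments through to the program.
PROG="$1"
shift
exec xterm -e gdb --args "$PROG" "$@"

With -x DISPLAY forwarded by mpirun, each rank then pops up its own gdb window on the local display.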