(This may be a duplicate; an earlier post seems to have been lost.) I'm using Open MPI (1.3.2) to run on 3 dual-processor machines (running Linux, Slackware 12.1, gcc 4.4.0). Two are directly on my LAN, while the 3rd is connected to my LAN via VPN and NAT (I can communicate in either direction between any of the LAN machines and the remote machine using its NAT address).
The program I'm trying to run is very simple in terms of MPI. Basically it is:

    main() {
        [snip];
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        [snip];
        if (myrank == 0)
            i = MPI_Reduce(MPI_IN_PLACE, C, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        else
            i = MPI_Reduce(C, MPI_IN_PLACE, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (i != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Reduce (C) fails on processor %d\n", myrank);
            MPI_Finalize();
            exit(1);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        [snip];
    }

I run it by invoking:

    mpirun -v -np ${NPROC} -hostfile ${HOSTFILE} --stdin none $* > /dev/null

If I run on the 4 nodes that are physically on the LAN, it works as expected. When I add the nodes on the remote machine, things don't work properly:

1. If I start with NPROC=6 on one of the LAN machines, all 6 processes start (as shown by running ps) and all reach the MPI_Reduce calls. At that point things hang, and I see no network traffic, which is strange given the size of the array I'm trying to reduce.

2. If I start on the remote machine with NPROC=6, only the mpirun process shows up under ps on the remote, and nothing shows up on the other nodes. Killing the process gives messages like:

    hostname - daemon did not report back when launched

3. If I start on the remote machine with NPROC=2, the 2 processes start on the remote and finish properly.

My suspicion is that there is some bad interaction between NAT and authentication. Any suggestions?

David
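
P.S. In case a compilable test case helps, below is a minimal self-contained sketch of the pattern above. The array size N and the data placed in C are placeholders for illustration; in the real program they come from the snipped code. It builds with mpicc and can be launched with the same mpirun line as above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N 1000000   /* placeholder size, not the real array length */

    int main(int argc, char **argv)
    {
        int myrank, numprocs, i, j;
        double *C;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        /* placeholder data; the real program fills C in the snipped sections */
        C = malloc(N * sizeof(double));
        for (j = 0; j < N; j++)
            C[j] = (double)myrank;

        /* root reduces in place; the recv buffer is ignored on non-root ranks */
        if (myrank == 0)
            i = MPI_Reduce(MPI_IN_PLACE, C, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        else
            i = MPI_Reduce(C, MPI_IN_PLACE, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (i != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Reduce (C) fails on processor %d\n", myrank);
            MPI_Finalize();
            exit(1);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        free(C);
        MPI_Finalize();
        return 0;
    }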