I have an application that appears to function as I expect when compiled with openmpi-1.4.2 on FreeBSD 9.0. But when compiled with openmpi-1.4.3, it appears to hang during communication between nodes. What follows is the long version.
I configure 1.4.2 with

  ./configure --prefix=/usr/local/openmpi-1.4.2 \
      --enable-mpirun-prefix-by-default --disable-shared --enable-static

The Fortran compiler is gfortran 4.5.3. I rebuild my application and launch the app from node10 with

  % /usr/local/openmpi-1.4.2/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
      -np 4 sasmp sas.in

where the machine file is

  % cat mf1
  node10 slots=3
  node11 slots=4

Using top(1) on node10 and node11, I see

node10
    PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
  74158 kargl       1  65    0   302M   293M CPU1    1  57:06 99.12% sasmp
  74160 kargl       1  65    0   306M   298M CPU0    0  57:06 99.07% sasmp
  74159 kargl       1  65    0   306M   298M CPU3    3  57:06 99.02% sasmp

node11
    PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
  13144 kargl       1  48    0   306M   297M CPU3    3  55:55 99.02% sasmp

The above is the expected process information and, more importantly, the application is producing the right answer.

Now, if I repeat everything above for 1.4.3, I get

  ./configure --prefix=/usr/local/openmpi-1.4.3 \
      --enable-mpirun-prefix-by-default --disable-shared --enable-static

Rebuild my application and launch the app from node10 with

  % /usr/local/openmpi-1.4.3/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
      -np 4 sasmp sas.in

node10
    PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
  74460 kargl       1  66    0   302M   291M CPU2    2   3:15 99.03% sasmp
  74462 kargl       1  66    0   302M   291M CPU3    3   3:15 99.03% sasmp
  74461 kargl       1  66    0 14472K  4616K CPU1    1   3:15 99.03% sasmp

node11
    PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
  13298 kargl       1  49    0 14472K  3336K CPU3    3   3:11 99.03% sasmp

I've left the application running for up to 12 minutes, and neither the process on node11 nor one of the processes on node10 (PID 74461) ever reaches the ~300 MB SIZE or ~293 MB RES seen with 1.4.2.

Now, if I reduce -np from 4 to 3, then only 3 processes are started, all on node10, and I get the expected results. So, as soon as I try to send something over tcp, the application stalls.

Any idea on how I might debug this problem?

-- 
Steve