I have an application that appears to function as I expect
when compiled with openmpi-1.4.2 on FreeBSD 9.0.  But, when
compiled with openmpi-1.4.3, it appears to hang during
communication between nodes.  What follows is the long version.

I configure 1.4.2 with 

./configure --prefix=/usr/local/openmpi-1.4.2 \
--enable-mpirun-prefix-by-default --disable-shared --enable-static

The Fortran compiler is gfortran 4.5.3.  I rebuild my application
and launch the app from node10 with

% /usr/local/openmpi-1.4.2/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
  -np 4 sasmp sas.in

where the machine file is

% cat mf1
node10 slots=3
node11 slots=4

Using top(1) on node10 and node11, I see 

node10
  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
74158 kargl         1  65    0   302M   293M CPU1    1  57:06 99.12% sasmp
74160 kargl         1  65    0   306M   298M CPU0    0  57:06 99.07% sasmp
74159 kargl         1  65    0   306M   298M CPU3    3  57:06 99.02% sasmp

node11
  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
13144 kargl         1  48    0   306M   297M CPU3    3  55:55 99.02% sasmp

The above is the expected process information and, more importantly,
the application is producing the right answer.

Now, I repeat everything above for 1.4.3.  I configure with

./configure --prefix=/usr/local/openmpi-1.4.3 \
--enable-mpirun-prefix-by-default --disable-shared --enable-static

I rebuild my application and launch the app from node10 with

% /usr/local/openmpi-1.4.3/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
  -np 4 sasmp sas.in

This time, top(1) on node10 and node11 shows

node10
  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
74460 kargl         1  66    0   302M   291M CPU2    2   3:15 99.03% sasmp
74462 kargl         1  66    0   302M   291M CPU3    3   3:15 99.03% sasmp
74461 kargl         1  66    0 14472K  4616K CPU1    1   3:15 99.03% sasmp

node11
  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
13298 kargl         1  49    0 14472K  3336K CPU3    3   3:11 99.03% sasmp

I've left the application running for up to 12 minutes, and the
process on node11 and one of the processes on node10 never reach
the ~300M SIZE or ~293M RES seen with 1.4.2.

Now, if I reduce -np from 4 to 3, then all 3 processes are started
on node10 (none on node11), and I get the expected results.  So, it
appears that as soon as I try to send something over tcp, the
application stalls.  Any idea on how I might debug this problem?
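
To make the process placement explicit, here is a small Python
sketch.  It assumes mpiexec fills the slots of each host in
machinefile order before moving to the next host.  That matches what
top(1) shows above, but it is my assumption about the mapping
policy, and the function names are mine, not part of Open MPI.

```python
def parse_machinefile(text):
    """Return a list of (host, slots) pairs from machinefile text."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        slots = 1  # assumption: one slot when no slots= field is given
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts.append((fields[0], slots))
    return hosts


def place_by_slot(hosts, np):
    """Fill each host's slots in order; return {host: process count}."""
    placement = {}
    remaining = np
    for host, slots in hosts:
        if remaining == 0:
            break
        count = min(slots, remaining)
        placement[host] = count
        remaining -= count
    if remaining:
        raise ValueError("more processes requested than slots available")
    return placement


# Contents of mf1 as shown above.
MF1 = "node10 slots=3\nnode11 slots=4\n"

hosts = parse_machinefile(MF1)
print(place_by_slot(hosts, 4))  # 3 on node10, 1 on node11
print(place_by_slot(hosts, 3))  # all 3 on node10, nothing crosses nodes
```

With -np 4 one process lands on node11, so traffic has to cross the
tcp btl; with -np 3 everything stays on node10.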

-- 
Steve
