On Apr 16, 2007, at 6:48 PM, Adams, Brian M wrote:

I am attempting to port Sandia's DAKOTA code from MVAPICH to the default
OpenMPI/Intel environment on Sandia's thunderbird cluster.  I can
successfully build DAKOTA in the default tbird software environment, but I'm having runtime problems when DAKOTA attempts to make a system call.
Typical output looks like:

[0,1,1][btl_openib_component.c:897:mca_btl_openib_component_progress]
from an64 to: an64 error polling HP CQ with status LOCAL LENGTH ERROR
status number 1 for wr_id 5714048 opcode 0

Unfortunately, calls to system() or fork() will fail when using the OFED 1.1 stack (such as on thunderbird). The fun part is that the failure is not immediate; calling fork() or system() will cause odd/interesting errors later in your program (such as the one you described above).

The only way around this is to call fork()/system() before the call to MPI_INIT or after the call to MPI_FINALIZE.

The OFED 1.2 stack has proper support for fork()/system(), but I don't know what tbird's plans are for upgrading (I doubt it has been discussed yet since OFED 1.2 is still going through its release process -- it's not final yet).

Note:  Both programs run fine with MVAPICH on tbird,

This is probably luck; I wouldn't count on it happening reliably.

and with OpenMPI or
MPICH on my Linux x86_64 SMP workstation.

There are many environments where fork() and system() work fine (e.g., when only using TCP and shared memory), but the OFED 1.1 stack is unfortunately not one of them.

I wish I had a better answer for you, but I don't.  Sorry!

--
Jeff Squyres
Cisco Systems
