Hello, I have a simple, 1-process test case that gets stuck on the mpi_finalize call. The test case is a dead-simple calculation of pi - 50 lines of Fortran. The process gradually consumes more and more memory until the system becomes unresponsive and needs to be rebooted, unless the job is killed first.
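For context, the test case is structured essentially like the sketch below (reconstructed from memory, not the exact source - the names and the integration scheme are illustrative; the point is that it is a plain init / compute / reduce / finalize program, and mpi_finalize is where it hangs):

program pi_test
  use mpi
  implicit none
  integer, parameter :: nsamples = 50000000
  integer :: ierr, rank, nprocs, i
  double precision :: h, x, local_sum, pi_estimate
  double precision, parameter :: pi_ref = 3.14159265358979323846d0

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Midpoint-rule integration of 4/(1+x^2) on [0,1]; each rank takes a
  ! strided share of the intervals (with -np 1, rank 0 does all of them).
  h = 1.0d0 / dble(nsamples)
  local_sum = 0.0d0
  do i = rank + 1, nsamples, nprocs
     x = h * (dble(i) - 0.5d0)
     local_sum = local_sum + 4.0d0 / (1.0d0 + x*x)
  end do
  local_sum = local_sum * h

  call mpi_reduce(local_sum, pi_estimate, 1, MPI_DOUBLE_PRECISION, &
                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (rank == 0) then
     write(*,*) 'pi is', pi_estimate
     write(*,*) 'Error is', abs(pi_estimate - pi_ref)
  end if
  write(*,*) 'THIS IS THE END.'

  call mpi_finalize(ierr)   ! never returns; memory use grows until the job is killed
end program pi_test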
In the attached output I see the warning message about OpenFabrics being configured to allow registering only part of physical memory. I've tried to chase this down with my administrator, to no avail so far. (I am aware of the relevant FAQ entry; see my note on the kernel module parameters after the configuration details below.) A different installation of MPI on the same system, built with a different compiler, does not produce the OpenFabrics memory registration warning, which seems strange because I thought this was a system configuration issue independent of MPI. Also curious in the output: LSF seems to think there are 7 processes and 11 threads associated with this job.

The particulars of my configuration are attached and detailed below. Does anyone see anything potentially problematic?

Thanks,
Greg

OpenMPI Version: 1.6.5
Compiler: GCC 4.6.1
OS: SuSE Linux Enterprise Server 10, Patchlevel 2
uname -a: Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts
Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")
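Regarding the registered-memory warning: as I understand the FAQ entry, on Mellanox ConnectX-family (mlx4) HCAs the limit is controlled by the mlx4_core module parameters log_num_mtt and log_mtts_per_seg, so the change I have asked the administrator to consider would look roughly like the following. The file path and values here are illustrative only - I have not verified what our HCA and driver actually use:

# /etc/modprobe.d/mlx4_core.conf (location varies by distribution) - illustrative only
# Registerable memory ~= (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
# e.g. 2^24 * 2^3 * 4 KiB = 512 GiB, comfortably more than the 64 GiB of physical RAM
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3

The full LSF job report and program output follow.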
Sender: LSF System <lsfadmin@bl1211>
Subject: Job 900527: <mpirun.lsf pi> Exited

Job <mpirun.lsf pi> was submitted from host <lxlogin2> by user <fischega> in cluster <ec_cluster>.
Job was executed on host(s) <bl1211>, in queue <sles10>, as user <fischega> in cluster <ec_cluster>.
</home/fischega> was used as the home directory.
</data/fischega/petsc_configure/mpi_test> was used as the working directory.
Started at Sat Jan 18 21:47:47 2014
Results reported at Sat Jan 18 21:48:33 2014

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun.lsf pi
------------------------------------------------------------

TERM_OWNER: job killed by owner.
Exited with exit code 1.

Resource usage summary:

    CPU time      :  41.56 sec.
    Max Memory    :  12075 MB
    Max Swap      :  12213 MB
    Max Processes :  7
    Max Threads   :  11

The output (if any) follows:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              bl1211
  Registerable memory:     32768 MiB
  Total memory:            64618 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
MPI process 0 running on node bl1211.
Running 50000000 samples over 1 proc(s).
pi is 3.1415926535895617
Error is 2.31370478331882623E-013
THIS IS THE END.
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29294 on
node bl1211 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Job  /tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper pi

TID    HOST_NAME   COMMAND_LINE      STATUS                   TERMINATION_TIME
=====  ==========  ================  =======================  ===================
00000  bl1211      pi                Killed by PAM (SIGKILL)  01/18/2014 21:48:30
Attachments: config.log.bz2, ompi_info.bz2