Hello, I have a simple, 1-process test case that gets stuck on the mpi_finalize call. The test case is a dead-simple calculation of pi - 50 lines of Fortran. The process gradually consumes more and more memory until the system becomes unresponsive and needs to be rebooted, unless the job is killed first.
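For context, the test case is structured essentially like the sketch below (reconstructed from memory, not the exact source - the names and the integration scheme are illustrative; the point is that it is a plain init / compute / reduce / finalize program, and mpi_finalize is where it hangs):

program pi_test
  use mpi
  implicit none
  integer, parameter :: nsamples = 50000000
  integer :: ierr, rank, nprocs, i
  double precision :: h, x, local_sum, pi_estimate
  double precision, parameter :: pi_ref = 3.14159265358979323846d0

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Midpoint-rule integration of 4/(1+x^2) on [0,1]; each rank takes a
  ! strided share of the intervals (with -np 1, rank 0 does all of them).
  h = 1.0d0 / dble(nsamples)
  local_sum = 0.0d0
  do i = rank + 1, nsamples, nprocs
     x = h * (dble(i) - 0.5d0)
     local_sum = local_sum + 4.0d0 / (1.0d0 + x*x)
  end do
  local_sum = local_sum * h

  call mpi_reduce(local_sum, pi_estimate, 1, MPI_DOUBLE_PRECISION, &
                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (rank == 0) then
     write(*,*) 'pi is', pi_estimate
     write(*,*) 'Error is', abs(pi_estimate - pi_ref)
  end if
  write(*,*) 'THIS IS THE END.'

  call mpi_finalize(ierr)   ! never returns; memory use grows until the job is killed
end program pi_test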
In the attached output I see the warning message about OpenFabrics being configured to allow registering only part of physical memory. I've tried to chase this down with my administrator, to no avail so far. (I am aware of the relevant FAQ entry; see my note on the kernel module parameters after the configuration details below.) A different installation of MPI on the same system, built with a different compiler, does not produce the OpenFabrics memory registration warning, which seems strange because I thought this was a system configuration issue independent of MPI. Also curious in the output: LSF seems to think there are 7 processes and 11 threads associated with this job.

The particulars of my configuration are attached and detailed below. Does anyone see anything potentially problematic?

Thanks,
Greg

OpenMPI Version: 1.6.5
Compiler: GCC 4.6.1
OS: SuSE Linux Enterprise Server 10, Patchlevel 2
uname -a: Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts
Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")
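Regarding the registered-memory warning: as I understand the FAQ entry, on Mellanox ConnectX-family (mlx4) HCAs the limit is controlled by the mlx4_core module parameters log_num_mtt and log_mtts_per_seg, so the change I have asked the administrator to consider would look roughly like the following. The file path and values here are illustrative only - I have not verified what our HCA and driver actually use:

# /etc/modprobe.d/mlx4_core.conf (location varies by distribution) - illustrative only
# Registerable memory ~= (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
# e.g. 2^24 * 2^3 * 4 KiB = 512 GiB, comfortably more than the 64 GiB of physical RAM
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3

The full LSF job report and program output follow.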
Sender: LSF System <lsfadmin@bl1211>
Subject: Job 900527: <mpirun.lsf pi> Exited

Job <mpirun.lsf pi> was submitted from host <lxlogin2> by user <fischega> in cluster <ec_cluster>.
Job was executed on host(s) <bl1211>, in queue <sles10>, as user <fischega> in cluster <ec_cluster>.
</home/fischega> was used as the home directory.
</data/fischega/petsc_configure/mpi_test> was used as the working directory.
Started at Sat Jan 18 21:47:47 2014
Results reported at Sat Jan 18 21:48:33 2014

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun.lsf pi
------------------------------------------------------------

TERM_OWNER: job killed by owner.
Exited with exit code 1.

Resource usage summary:

    CPU time      :  41.56 sec.
    Max Memory    :  12075 MB
    Max Swap      :  12213 MB
    Max Processes :  7
    Max Threads   :  11

The output (if any) follows:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              bl1211
  Registerable memory:     32768 MiB
  Total memory:            64618 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
MPI process 0 running on node bl1211.
Running 50000000 samples over 1 proc(s).
pi is 3.1415926535895617
Error is 2.31370478331882623E-013
THIS IS THE END.
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29294 on
node bl1211 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Job  /tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper pi

TID    HOST_NAME   COMMAND_LINE      STATUS                   TERMINATION_TIME
=====  ==========  ================  =======================  ===================
00000  bl1211      pi                Killed by PAM (SIGKILL)  01/18/2014 21:48:30
Attachments: config.log.bz2, ompi_info.bz2