Hi all. I have been fighting with this problem for weeks now and am getting quite desperate about it. I hope I can get help here, because the local folks couldn't help me.
There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4 (Debian 1:3.3.4-13); some more info at http://www.capca.ucalgary.ca. It has some MPI libraries installed (LAM, I believe), but since they don't support Fortran 90, I compile my own library and install it in my home directory, /home/victor/programs. I configure with the following options:

F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90 --prefix=/home/victor/programs --enable-pretty-print-stacktrace --config-cache --disable-shared --enable-static

It compiles and installs with no errors. But when I run my code with

mpiexec1 -np 4 ./my-executable

(mpiexec1 is a link pointing to /home/victor/programs/bin/mpiexec, to avoid a conflict with the system-wide mpiexec), it dies silently with no errors shown - it just stops and says:

2 additional processes aborted (not shown)

Whether this happens depends on the number of grid points: for some small grid sizes (40x10x10) it runs fine, but the size at which I start getting problems is absurdly small (40x20x10), so it can't be an insufficient-memory issue - the cluster server has 2 GB of memory, and in serial mode I can run my code with grids of at least 200x100x100.

Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to compile the library, but I've tried different compiler combinations (g95 with gcc, ifort with gcc 4.1) - same result every time. As far as I can tell, it's not an error in my code either, because I've done numerous checks, and it also runs fine on my PC, though there I compiled the library with ifort and icc.

And here comes the weirdest part: if I run my code through valgrind in MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it runs fine with grid sizes it fails on without valgrind! It doesn't exit mpiexec, but it does get to the last statement of my code.
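For completeness, here is the exact build-and-run sequence I use, collected in one place (all paths and the mpiexec1 link are specific to my setup):

```shell
# Build Open MPI with Fortran 90 bindings, statically, into my home directory
F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure \
    --enable-mpi-f90 --prefix=/home/victor/programs \
    --enable-pretty-print-stacktrace --config-cache \
    --disable-shared --enable-static
make && make install

# The failing run (mpiexec1 is a symlink to /home/victor/programs/bin/mpiexec)
mpiexec1 -np 4 ./my-executable

# The same run under valgrind, which - oddly - completes
mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable
```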
I am attaching config.log and ompi_info.log. The following is the output of mpiexec -d -np 4 ./model-0.0.9:

[obelix:08876] procdir: (null)
[obelix:08876] jobdir: (null)
[obelix:08876] unidir: /tmp/openmpi-sessions-victor@obelix_0/default-universe
[obelix:08876] top: openmpi-sessions-victor@obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] connect_uni: contact info read
[obelix:08876] connect_uni: connection not allowed
[obelix:08876] [0,0,0] setting up session dir with
[obelix:08876] tmpdir /tmp
[obelix:08876] universe default-universe-8876
[obelix:08876] user victor
[obelix:08876] host obelix
[obelix:08876] jobid 0
[obelix:08876] procid 0
[obelix:08876] procdir: /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0/0
[obelix:08876] jobdir: /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0
[obelix:08876] unidir: /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876
[obelix:08876] top: openmpi-sessions-victor@obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] [0,0,0] contact_file /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/universe-setup.txt
[obelix:08876] [0,0,0] wrote setup file
[obelix:08876] pls:rsh: local csh: 0, local bash: 1
[obelix:08876] pls:rsh: assuming same remote shell as local shell
[obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
[obelix:08876] pls:rsh: final template argv:
[obelix:08876] pls:rsh:     /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe victor@obelix:default-universe-8876 --nsreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --mpi-call-yield 0
[obelix:08876] pls:rsh: launching on node localhost
[obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 4)
[obelix:08876] pls:rsh: localhost is a LOCAL node
[obelix:08876] pls:rsh: changing to directory /home/victor
[obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe victor@obelix:default-universe-8876 --nsreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --mpi-call-yield 1
[obelix:08877] [0,0,1] setting up session dir with
[obelix:08877] universe default-universe-8876
[obelix:08877] user victor
[obelix:08877] host localhost
[obelix:08877] jobid 0
[obelix:08877] procid 1
[obelix:08877] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0/1
[obelix:08877] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0
[obelix:08877] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08877] top: openmpi-sessions-victor@localhost_0
[obelix:08877] tmp: /tmp
[obelix:08878] [0,1,0] setting up session dir with
[obelix:08878] universe default-universe-8876
[obelix:08878] user victor
[obelix:08878] host localhost
[obelix:08878] jobid 1
[obelix:08878] procid 0
[obelix:08878] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/0
[obelix:08878] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08878] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08878] top: openmpi-sessions-victor@localhost_0
[obelix:08878] tmp: /tmp
[obelix:08879] [0,1,1] setting up session dir with
[obelix:08879] universe default-universe-8876
[obelix:08879] user victor
[obelix:08879] host localhost
[obelix:08879] jobid 1
[obelix:08879] procid 1
[obelix:08879] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/1
[obelix:08879] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08879] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08879] top: openmpi-sessions-victor@localhost_0
[obelix:08879] tmp: /tmp
[obelix:08880] [0,1,2] setting up session dir with
[obelix:08880] universe default-universe-8876
[obelix:08880] user victor
[obelix:08880] host localhost
[obelix:08880] jobid 1
[obelix:08880] procid 2
[obelix:08880] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/2
[obelix:08880] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08880] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08880] top: openmpi-sessions-victor@localhost_0
[obelix:08880] tmp: /tmp
[obelix:08881] [0,1,3] setting up session dir with
[obelix:08881] universe default-universe-8876
[obelix:08881] user victor
[obelix:08881] host localhost
[obelix:08881] jobid 1
[obelix:08881] procid 3
[obelix:08881] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/3
[obelix:08881] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08881] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08881] top: openmpi-sessions-victor@localhost_0
[obelix:08881] tmp: /tmp
[obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
[obelix:08876] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
    (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
    (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
    (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
[obelix:08878] [0,1,0] ompi_mpi_init completed
[obelix:08879] [0,1,1] ompi_mpi_init completed
[obelix:08880] [0,1,2] ompi_mpi_init completed
[obelix:08881] [0,1,3] ompi_mpi_init completed
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: found job session dir empty - deleting
[obelix:08877] sess_dir_finalize: univ session dir not empty - leaving

Thank you,
Victor Prosolin.
config.log.tar.gz
                Open MPI: 1.1.2
   Open MPI SVN revision: r12073
                Open RTE: 1.1.2
   Open RTE SVN revision: r12073
                    OPAL: 1.1.2
       OPAL SVN revision: r12073
                  Prefix: /home/victor/programs
 Configured architecture: i686-pc-linux-gnu
           Configured by: victor
           Configured on: Thu Nov 16 13:06:12 MST 2006
          Configure host: obelix
                Built by: victor
                Built on: Thu Nov 16 13:42:40 MST 2006
              Built host: obelix
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: distcc
     C compiler absolute: /home/victor/programs/bin/distcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
              MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
               MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
                MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)