Hi all.
I have been fighting with this problem for weeks now, and I am getting
quite desperate about it. I hope I can get help here, because the local
folks couldn't help me.

There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
(Debian 1:3.3.4-13); some more info is at http://www.capca.ucalgary.ca.
It has an MPI library installed (LAM, I believe), but since that does not
support Fortran 90, I build my own Open MPI and install it in my home
directory under /home/victor/programs. I configure with the following
options:

F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90 \
    --prefix=/home/victor/programs --enable-pretty-print-stacktrace \
    --config-cache --disable-shared --enable-static
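
After that configure, the build and install are just the standard make
targets, and I compile my program against that install with its mpif90
wrapper (the source file name below is only for illustration):

make all
make install
# build my Fortran 90 code against this Open MPI install
/home/victor/programs/bin/mpif90 -o model-0.0.9 my_model.f90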

It compiles and installs with no errors. But when I run my code with

mpiexec1 -np 4 ./my-executable

(mpiexec1 is a symlink to /home/victor/programs/bin/mpiexec - see the
snippet below - to avoid a conflict with the system-wide mpiexec), it
dies silently with no error message - it just stops and says

2 additional processes aborted (not shown)
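
(For completeness, the symlink itself is nothing special; ~/bin here is
just a directory on my PATH, so take the exact location as an example:)

# make my own mpiexec available as mpiexec1 without shadowing the system-wide one
ln -s /home/victor/programs/bin/mpiexec ~/bin/mpiexec1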

The failure depends on the number of grid points: for some small grid
sizes (40x10x10) it runs fine, but the size at which I start getting
problems is ridiculously small (around 40x20x10), so it can't be an
insufficient-memory issue - the cluster server has 2 GB of memory and I
can run my code in serial mode with grids of at least 200x100x100.

Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to
compile the library, but I have tried other compiler combinations
(g95 + gcc, ifort + gcc 4.1) with the same result every time. As far as
I can tell, it is not an error in my code either: I have done numerous
checks, and the code runs fine on my PC, although there I compiled the
library with ifort and icc.
And here comes the weirdest part: if I run my code under valgrind in
MPI mode (mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable), it
runs fine with grid sizes it fails on without valgrind! mpiexec does not
exit, but the program does reach the last statement of my code.
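
To summarize the two invocations (same shell, same executable, same grid size):

# dies silently with "2 additional processes aborted (not shown)" for grids around 40x20x10 and larger
mpiexec1 -np 4 ./my-executable
# the very same case runs through to the last statement of my code
mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable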

I am attaching config.log and ompi_info.log.
The following is the output of mpiexec -d -np 4 ./model-0.0.9:

[obelix:08876] procdir: (null)
[obelix:08876] jobdir: (null)
[obelix:08876] unidir:
/tmp/openmpi-sessions-victor@obelix_0/default-universe
[obelix:08876] top: openmpi-sessions-victor@obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] connect_uni: contact info read
[obelix:08876] connect_uni: connection not allowed
[obelix:08876] [0,0,0] setting up session dir with
[obelix:08876]  tmpdir /tmp
[obelix:08876]  universe default-universe-8876
[obelix:08876]  user victor
[obelix:08876]  host obelix
[obelix:08876]  jobid 0
[obelix:08876]  procid 0
[obelix:08876] procdir:
/tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0/0
[obelix:08876] jobdir:
/tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0
[obelix:08876] unidir:
/tmp/openmpi-sessions-victor@obelix_0/default-universe-8876
[obelix:08876] top: openmpi-sessions-victor@obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] [0,0,0] contact_file
/tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/universe-setup.txt
[obelix:08876] [0,0,0] wrote setup file
[obelix:08876] pls:rsh: local csh: 0, local bash: 1
[obelix:08876] pls:rsh: assuming same remote shell as local shell
[obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
[obelix:08876] pls:rsh: final template argv:
[obelix:08876] pls:rsh:     /usr/bin/ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
<template> --universe victor@obelix:default-universe-8876 --nsreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
--mpi-call-yield 0
[obelix:08876] pls:rsh: launching on node localhost
[obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to
1 (1 4)
[obelix:08876] pls:rsh: localhost is a LOCAL node
[obelix:08876] pls:rsh: changing to directory /home/victor
[obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name
0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
victor@obelix:default-universe-8876 --nsreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
--mpi-call-yield 1
[obelix:08877] [0,0,1] setting up session dir with
[obelix:08877]  universe default-universe-8876
[obelix:08877]  user victor
[obelix:08877]  host localhost
[obelix:08877]  jobid 0
[obelix:08877]  procid 1
[obelix:08877] procdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0/1
[obelix:08877] jobdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0
[obelix:08877] unidir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08877] top: openmpi-sessions-victor@localhost_0
[obelix:08877] tmp: /tmp
[obelix:08878] [0,1,0] setting up session dir with
[obelix:08878]  universe default-universe-8876
[obelix:08878]  user victor
[obelix:08878]  host localhost
[obelix:08878]  jobid 1
[obelix:08878]  procid 0
[obelix:08878] procdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/0
[obelix:08878] jobdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08878] unidir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08878] top: openmpi-sessions-victor@localhost_0
[obelix:08878] tmp: /tmp
[obelix:08879] [0,1,1] setting up session dir with
[obelix:08879]  universe default-universe-8876
[obelix:08879]  user victor
[obelix:08879]  host localhost
[obelix:08879]  jobid 1
[obelix:08879]  procid 1
[obelix:08879] procdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/1
[obelix:08879] jobdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08879] unidir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08879] top: openmpi-sessions-victor@localhost_0
[obelix:08879] tmp: /tmp
[obelix:08880] [0,1,2] setting up session dir with
[obelix:08880]  universe default-universe-8876
[obelix:08880]  user victor
[obelix:08880]  host localhost
[obelix:08880]  jobid 1
[obelix:08880]  procid 2
[obelix:08880] procdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/2
[obelix:08880] jobdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08880] unidir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08880] top: openmpi-sessions-victor@localhost_0
[obelix:08880] tmp: /tmp
[obelix:08881] [0,1,3] setting up session dir with
[obelix:08881]  universe default-universe-8876
[obelix:08881]  user victor
[obelix:08881]  host localhost
[obelix:08881]  jobid 1
[obelix:08881]  procid 3
[obelix:08881] procdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/3
[obelix:08881] jobdir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
[obelix:08881] unidir:
/tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
[obelix:08881] top: openmpi-sessions-victor@localhost_0
[obelix:08881] tmp: /tmp
[obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
[obelix:08876] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
    (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
    (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
    (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
[obelix:08878] [0,1,0] ompi_mpi_init completed
[obelix:08879] [0,1,1] ompi_mpi_init completed
[obelix:08880] [0,1,2] ompi_mpi_init completed
[obelix:08881] [0,1,3] ompi_mpi_init completed
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_TERMINATED)
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: found job session dir empty - deleting
[obelix:08877] sess_dir_finalize: univ session dir not empty - leaving

Thank you,
Victor Prosolin.

Attachment: config.log.tar.gz

The following is the output of ompi_info:

                Open MPI: 1.1.2
   Open MPI SVN revision: r12073
                Open RTE: 1.1.2
   Open RTE SVN revision: r12073
                    OPAL: 1.1.2
       OPAL SVN revision: r12073
                  Prefix: /home/victor/programs
 Configured architecture: i686-pc-linux-gnu
           Configured by: victor
           Configured on: Thu Nov 16 13:06:12 MST 2006
          Configure host: obelix
                Built by: victor
                Built on: Thu Nov 16 13:42:40 MST 2006
              Built host: obelix
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: distcc
     C compiler absolute: /home/victor/programs/bin/distcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
              MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
               MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
                MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)
