From what you sent, it appears that Open MPI thinks your processes called
MPI_Abort (as opposed to segfaulting or hitting some other failure mode). The
system appears to be operating exactly as it should: it believes one or more
of your processes actually called MPI_Abort for some reason, and so it
aborted the job.

Have you tried running your code without valgrind? I'm wondering if the
valgrind interaction may be part of the problem.
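If not, it would be worth trying the same run with valgrind simply dropped
from the command line you showed, e.g. something along the lines of

  mpiexec1 -np 4 ./my-executable

and seeing whether you still get the abort.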

Do you have a code path in your program that would lead to MPI_Abort? I'm
wondering if you have some logic that might abort if it encounters what it
believes is a problem. If so, you might put some output in that path to see
if you are traversing it. Then we would have some idea as to why the code
thinks it *should* abort.
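For example, just as a rough sketch (the condition and variable names here
are hypothetical, and I'm assuming you already have 'use mpi' - or an include
of mpif.h - plus integer declarations for myrank and ierr in scope), you
could print the rank and the reason right before the abort:

   if (something_went_wrong) then   ! whatever your actual error test is
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      write (*,*) 'rank ', myrank, ': calling MPI_Abort because ...'
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
   end if

It may also help to flush or close any open output units just before the
abort so the message isn't lost when the job is torn down.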

Others may also have suggestions. Most of the team is at the Supercomputing
show this week and won't really be available until next week or after
Thanksgiving.

Ralph


On 11/16/06 2:51 PM, "Victor Prosolin" <victor.proso...@gmail.com> wrote:

> Hi all.
> I have been fighting with this problem for weeks now, and I am getting
> quite desperate about it. Hope I can get help here, because local folks
> couldn't help me.
> 
> There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
> (Debian 1:3.3.4-13) (some more info at http://www.capca.ucalgary.ca).
> They have some MPI libraries installed (LAM, I believe), but since those
> don't support Fortran 90, I compile my own library and install it in my
> home directory, /home/victor/programs. I configure with the following
> options:
> 
> F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90
> --prefix=/home/victor/programs --enable-pretty-print-stacktrace
> --config-cache --disable-shared --enable-static
> 
> It compiles and installs with no errors. But when I run my code with
> mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable
> (mpiexec1 is a symlink to /home/victor/programs/bin/mpiexec, to avoid a
> conflict with the system-wide mpiexec)
> 
> it dies silently with no errors shown - just stops and says
> 2 additional processes aborted (not shown)
> 
> The behavior depends on the number of grid points: for some small grid
> sizes (40x10x10) it runs fine, but the size at which I start getting
> problems is absurdly small (around 40x20x10), so it can't be an
> insufficient-memory issue - the cluster server has 2 GB of memory and I
> can run my code in serial mode with grids of at least 200x100x100.
> 
> Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to
> compile the library, but I've tried different compiler combinations
> (g95-gcc, ifort-gcc4.1) with the same result every time. As far as I can
> tell, it's not an error in my code either: I've done numerous checks, and
> it runs fine on my PC, although there I compiled the library with ifort
> and icc.
> And here comes the weirdest part: if I run my code under valgrind in
> MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it
> runs fine with grid sizes it fails on without valgrind! mpiexec doesn't
> exit, but the code does get to the last statement of my program.
> 
> I am attaching config.log and ompi_info.log.
> The following is the output of mpiexec -d -np 4 ./model-0.0.9:
> 
> [obelix:08876] procdir: (null)
> [obelix:08876] jobdir: (null)
> [obelix:08876] unidir:
> /tmp/openmpi-sessions-victor@obelix_0/default-universe
> [obelix:08876] top: openmpi-sessions-victor@obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] connect_uni: contact info read
> [obelix:08876] connect_uni: connection not allowed
> [obelix:08876] [0,0,0] setting up session dir with
> [obelix:08876]  tmpdir /tmp
> [obelix:08876]  universe default-universe-8876
> [obelix:08876]  user victor
> [obelix:08876]  host obelix
> [obelix:08876]  jobid 0
> [obelix:08876]  procid 0
> [obelix:08876] procdir:
> /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0/0
> [obelix:08876] jobdir:
> /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0
> [obelix:08876] unidir:
> /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876
> [obelix:08876] top: openmpi-sessions-victor@obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] [0,0,0] contact_file
> /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/universe-setup.txt
> [obelix:08876] [0,0,0] wrote setup file
> [obelix:08876] pls:rsh: local csh: 0, local bash: 1
> [obelix:08876] pls:rsh: assuming same remote shell as local shell
> [obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
> [obelix:08876] pls:rsh: final template argv:
> [obelix:08876] pls:rsh:     /usr/bin/ssh <template> orted --debug
> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> <template> --universe victor@obelix:default-universe-8876 --nsreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
> --mpi-call-yield 0
> [obelix:08876] pls:rsh: launching on node localhost
> [obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to
> 1 (1 4)
> [obelix:08876] pls:rsh: localhost is a LOCAL node
> [obelix:08876] pls:rsh: changing to directory /home/victor
> [obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name
> 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
> victor@obelix:default-universe-8876 --nsreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
> --mpi-call-yield 1
> [obelix:08877] [0,0,1] setting up session dir with
> [obelix:08877]  universe default-universe-8876
> [obelix:08877]  user victor
> [obelix:08877]  host localhost
> [obelix:08877]  jobid 0
> [obelix:08877]  procid 1
> [obelix:08877] procdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0/1
> [obelix:08877] jobdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0
> [obelix:08877] unidir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08877] top: openmpi-sessions-victor@localhost_0
> [obelix:08877] tmp: /tmp
> [obelix:08878] [0,1,0] setting up session dir with
> [obelix:08878]  universe default-universe-8876
> [obelix:08878]  user victor
> [obelix:08878]  host localhost
> [obelix:08878]  jobid 1
> [obelix:08878]  procid 0
> [obelix:08878] procdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/0
> [obelix:08878] jobdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08878] unidir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08878] top: openmpi-sessions-victor@localhost_0
> [obelix:08878] tmp: /tmp
> [obelix:08879] [0,1,1] setting up session dir with
> [obelix:08879]  universe default-universe-8876
> [obelix:08879]  user victor
> [obelix:08879]  host localhost
> [obelix:08879]  jobid 1
> [obelix:08879]  procid 1
> [obelix:08879] procdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/1
> [obelix:08879] jobdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08879] unidir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08879] top: openmpi-sessions-victor@localhost_0
> [obelix:08879] tmp: /tmp
> [obelix:08880] [0,1,2] setting up session dir with
> [obelix:08880]  universe default-universe-8876
> [obelix:08880]  user victor
> [obelix:08880]  host localhost
> [obelix:08880]  jobid 1
> [obelix:08880]  procid 2
> [obelix:08880] procdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/2
> [obelix:08880] jobdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08880] unidir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08880] top: openmpi-sessions-victor@localhost_0
> [obelix:08880] tmp: /tmp
> [obelix:08881] [0,1,3] setting up session dir with
> [obelix:08881]  universe default-universe-8876
> [obelix:08881]  user victor
> [obelix:08881]  host localhost
> [obelix:08881]  jobid 1
> [obelix:08881]  procid 3
> [obelix:08881] procdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/3
> [obelix:08881] jobdir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08881] unidir:
> /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08881] top: openmpi-sessions-victor@localhost_0
> [obelix:08881] tmp: /tmp
> [obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
> [obelix:08876] Info: Setting up debugger process table for applications
>   MPIR_being_debugged = 0
>   MPIR_debug_gate = 0
>   MPIR_debug_state = 1
>   MPIR_acquired_pre_main = 0
>   MPIR_i_am_starter = 0
>   MPIR_proctable_size = 4
>   MPIR_proctable:
>     (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
>     (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
>     (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
>     (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
> [obelix:08878] [0,1,0] ompi_mpi_init completed
> [obelix:08879] [0,1,1] ompi_mpi_init completed
> [obelix:08880] [0,1,2] ompi_mpi_init completed
> [obelix:08881] [0,1,3] ompi_mpi_init completed
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state =
> ORTE_PROC_STATE_ABORTED)
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state =
> ORTE_PROC_STATE_TERMINATED)
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: found job session dir empty - deleting
> [obelix:08877] sess_dir_finalize: univ session dir not empty - leaving
> 
> Thank you,
> Victor Prosolin.
>                 Open MPI: 1.1.2
>    Open MPI SVN revision: r12073
>                 Open RTE: 1.1.2
>    Open RTE SVN revision: r12073
>                     OPAL: 1.1.2
>        OPAL SVN revision: r12073
>                   Prefix: /home/victor/programs
>  Configured architecture: i686-pc-linux-gnu
>            Configured by: victor
>            Configured on: Thu Nov 16 13:06:12 MST 2006
>           Configure host: obelix
>                 Built by: victor
>                 Built on: Thu Nov 16 13:42:40 MST 2006
>               Built host: obelix
>               C bindings: yes
>             C++ bindings: yes
>       Fortran77 bindings: yes (all)
>       Fortran90 bindings: yes
>  Fortran90 bindings size: small
>               C compiler: distcc
>      C compiler absolute: /home/victor/programs/bin/distcc
>             C++ compiler: g++
>    C++ compiler absolute: /usr/bin/g++
>       Fortran77 compiler: ifort
>   Fortran77 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
>       Fortran90 compiler: ifort
>   Fortran90 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
>              C profiling: yes
>            C++ profiling: yes
>      Fortran77 profiling: yes
>      Fortran90 profiling: yes
>           C++ exceptions: no
>           Thread support: posix (mpi: no, progress: no)
>   Internal debug support: no
>      MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>          libltdl support: yes
>               MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
>            MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
>            MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
>                MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
>            MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>            MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>                 MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
>                 MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
>                 MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
>                 MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
>                 MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
>                   MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
>                MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
>               MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>                 MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
>                  MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
>                   MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>                   MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>                  MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
>                MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
>                 MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>                 MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
>                  MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)

