From what you sent, it appears that Open MPI thinks your processes called MPI_Abort (as opposed to segfaulting or hitting some other failure mode). The system appears to be operating exactly as it should: it believes that one or more of your processes actually called MPI_Abort for some reason.

Have you tried running your code without valgrind? I'm wondering if the valgrind interaction may be part of the problem. Do you have a code path in your program that would lead to MPI_Abort, i.e., some logic that aborts if it encounters what it believes is a problem? If so, you might put some output in that path to see whether you are traversing it. Then we would have some idea as to why the code thinks it *should* abort.
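For example, something along these lines would do it. This is only a hypothetical Fortran 90 sketch, assuming you have a sanity-check routine somewhere; the names check_grid, nx, ny, nz, and error_code are placeholders, not taken from your code:

    ! Hypothetical error-handling path; all names here are placeholders
    ! for whatever your code actually uses.
    subroutine check_grid(nx, ny, nz, error_code)
      use mpi
      implicit none
      integer, intent(in) :: nx, ny, nz, error_code
      integer :: rank, ierr

      if (error_code /= 0) then
         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
         ! Print a diagnostic just before aborting so the reason shows up
         ! in mpiexec's output even when the abort is otherwise silent.
         write(*,*) 'rank ', rank, ': aborting, grid = ', nx, ny, nz, &
                    ', error code = ', error_code
         call MPI_Abort(MPI_COMM_WORLD, error_code, ierr)
      end if
    end subroutine check_grid

If a message like that shows up right before the job dies, you will at least know which rank hit the abort path and with what values.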
Others may also have suggestions. Most of the team is at the Supercomputing show this week and won't really be available until next week or after Thanksgiving.

Ralph


On 11/16/06 2:51 PM, "Victor Prosolin" <victor.proso...@gmail.com> wrote:

> Hi all.
> I have been fighting with this problem for weeks now, and I am getting
> quite desperate about it. I hope I can get help here, because the local
> folks couldn't help me.
>
> There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
> (Debian 1:3.3.4-13); some more info is at http://www.capca.ucalgary.ca.
> They have some MPI libraries installed (LAM, I believe), but since those
> don't support Fortran 90, I compile my own library. I install it in my
> home directory /home/victor/programs and configure with the following
> options:
>
> F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90
> --prefix=/home/victor/programs --enable-pretty-print-stacktrace
> --config-cache --disable-shared --enable-static
>
> It compiles and installs with no errors. But when I run my code using
>
> mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable
>
> (mpiexec1 is a link pointing to /home/victor/programs/bin/mpiexec to
> avoid a conflict with the system-wide mpiexec), it dies silently with no
> errors shown - it just stops and says
>
> 2 additional processes aborted (not shown)
>
> It depends on the number of grid points: for some small grid sizes
> (40x10x10) it runs fine, but the number at which I start getting problems
> is stupidly small (like 40x20x10), so it can't be an insufficient-memory
> issue - the cluster server has 2 GB of memory and I can run my code in
> serial mode with at least 200x100x100.
>
> Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to compile
> the library, but I've tried different compilers (g95-gcc, ifort-gcc4.1) -
> same result every time. As far as I can tell, it's not an error in my code
> either, because I've done numerous checks and it also runs fine on my PC,
> though there I compiled the library with ifort and icc.
> And here comes the weirdest part: if I run my code through valgrind in
> MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it
> runs fine with grid sizes it fails on without valgrind!!! It doesn't exit
> mpiexec, but it does get to the last statement of my code.
>
> I am attaching config.log and ompi_info.log.
> The following is the output of mpiexec -d -np 4 ./model-0.0.9:
>
> [obelix:08876] procdir: (null)
> [obelix:08876] jobdir: (null)
> [obelix:08876] unidir: /tmp/openmpi-sessions-victor@obelix_0/default-universe
> [obelix:08876] top: openmpi-sessions-victor@obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] connect_uni: contact info read
> [obelix:08876] connect_uni: connection not allowed
> [obelix:08876] [0,0,0] setting up session dir with
> [obelix:08876] tmpdir /tmp
> [obelix:08876] universe default-universe-8876
> [obelix:08876] user victor
> [obelix:08876] host obelix
> [obelix:08876] jobid 0
> [obelix:08876] procid 0
> [obelix:08876] procdir: /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0/0
> [obelix:08876] jobdir: /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/0
> [obelix:08876] unidir: /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876
> [obelix:08876] top: openmpi-sessions-victor@obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] [0,0,0] contact_file /tmp/openmpi-sessions-victor@obelix_0/default-universe-8876/universe-setup.txt
> [obelix:08876] [0,0,0] wrote setup file
> [obelix:08876] pls:rsh: local csh: 0, local bash: 1
> [obelix:08876] pls:rsh: assuming same remote shell as local shell
> [obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
> [obelix:08876] pls:rsh: final template argv:
> [obelix:08876] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe victor@obelix:default-universe-8876 --nsreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --mpi-call-yield 0
> [obelix:08876] pls:rsh: launching on node localhost
> [obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 4)
> [obelix:08876] pls:rsh: localhost is a LOCAL node
> [obelix:08876] pls:rsh: changing to directory /home/victor
> [obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe victor@obelix:default-universe-8876 --nsreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --mpi-call-yield 1
> [obelix:08877] [0,0,1] setting up session dir with
> [obelix:08877] universe default-universe-8876
> [obelix:08877] user victor
> [obelix:08877] host localhost
> [obelix:08877] jobid 0
> [obelix:08877] procid 1
> [obelix:08877] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0/1
> [obelix:08877] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/0
> [obelix:08877] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08877] top: openmpi-sessions-victor@localhost_0
> [obelix:08877] tmp: /tmp
> [obelix:08878] [0,1,0] setting up session dir with
> [obelix:08878] universe default-universe-8876
> [obelix:08878] user victor
> [obelix:08878] host localhost
> [obelix:08878] jobid 1
> [obelix:08878] procid 0
> [obelix:08878] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/0
> [obelix:08878] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08878] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08878] top: openmpi-sessions-victor@localhost_0
> [obelix:08878] tmp: /tmp
> [obelix:08879] [0,1,1] setting up session dir with
> [obelix:08879] universe default-universe-8876
> [obelix:08879] user victor
> [obelix:08879] host localhost
> [obelix:08879] jobid 1
> [obelix:08879] procid 1
> [obelix:08879] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/1
> [obelix:08879] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08879] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08879] top: openmpi-sessions-victor@localhost_0
> [obelix:08879] tmp: /tmp
> [obelix:08880] [0,1,2] setting up session dir with
> [obelix:08880] universe default-universe-8876
> [obelix:08880] user victor
> [obelix:08880] host localhost
> [obelix:08880] jobid 1
> [obelix:08880] procid 2
> [obelix:08880] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/2
> [obelix:08880] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08880] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08880] top: openmpi-sessions-victor@localhost_0
> [obelix:08880] tmp: /tmp
> [obelix:08881] [0,1,3] setting up session dir with
> [obelix:08881] universe default-universe-8876
> [obelix:08881] user victor
> [obelix:08881] host localhost
> [obelix:08881] jobid 1
> [obelix:08881] procid 3
> [obelix:08881] procdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1/3
> [obelix:08881] jobdir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876/1
> [obelix:08881] unidir: /tmp/openmpi-sessions-victor@localhost_0/default-universe-8876
> [obelix:08881] top: openmpi-sessions-victor@localhost_0
> [obelix:08881] tmp: /tmp
> [obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
> [obelix:08876] Info: Setting up debugger process table for applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 4
> MPIR_proctable:
> (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
> (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
> (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
> (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
> [obelix:08878] [0,1,0] ompi_mpi_init completed
> [obelix:08879] [0,1,1] ompi_mpi_init completed
> [obelix:08880] [0,1,2] ompi_mpi_init completed
> [obelix:08881] [0,1,3] ompi_mpi_init completed
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: found job session dir empty - deleting
> [obelix:08877] sess_dir_finalize: univ session dir not empty - leaving
>
> Thank you,
> Victor Prosolin.
> Open MPI: 1.1.2
> Open MPI SVN revision: r12073
> Open RTE: 1.1.2
> Open RTE SVN revision: r12073
> OPAL: 1.1.2
> OPAL SVN revision: r12073
> Prefix: /home/victor/programs
> Configured architecture: i686-pc-linux-gnu
> Configured by: victor
> Configured on: Thu Nov 16 13:06:12 MST 2006
> Configure host: obelix
> Built by: victor
> Built on: Thu Nov 16 13:42:40 MST 2006
> Built host: obelix
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: distcc
> C compiler absolute: /home/victor/programs/bin/distcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: ifort
> Fortran77 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
> Fortran90 compiler: ifort
> Fortran90 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users