On Wed, 4 Jan 2006, Jeff Squyres wrote:

> On Jan 4, 2006, at 2:08 PM, Anthony Chan wrote:
>
> >> Either my program quits without writing the logfile (and without
> >> complaining) or it crashes in MPI_Finalize. I get the message
> >> "33 additional processes aborted (not shown)".
> >
> > This is not an MPE error message.  If the logging crashes in
> > MPI_Finalize, it usually means that the merging of logging data from
> > the child nodes failed.  Since you didn't get any MPE error messages,
> > the cause of the crash isn't one that MPE expects.  Does anyone know
> > if "33 additional processes aborted (not shown)" comes from Open MPI?
>
> Yes, it is.  It comes from mpirun, telling you that 33 processes
> aborted in addition to the one whose error message it must have shown
> above.  So I'm guessing that 34 processes aborted in total.
>
> Are you getting corefiles for these processes?  (might need to check
> the limit of your coredumpsize)

Anthony, thanks for your suggestions. I tried the cpilog.c program with
logging, and it also crashes when using more than 33 (!) processes. This
also happens when I run it on a single node, so it is not due to network
settings.
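
For reference, this is roughly how an MPE2-instrumented program like
cpilog.c is put together (just a sketch of the MPE2 calls; the real
cpilog.c does more). The merging Anthony mentioned happens when the
per-process logs are written out at the end:
---
#include <stdio.h>
#include <mpi.h>
#include <mpe.h>

int main(int argc, char *argv[])
{
    int ev_start, ev_end;

    /* With liblmpe linked in, MPI_Init already calls MPE_Init_log /
       CLOG_Local_init -- which is where the 34-process runs below crash. */
    MPI_Init(&argc, &argv);
    MPE_Init_log();   /* not needed when liblmpe is linked */

    ev_start = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "compute", "red");

    MPE_Log_event(ev_start, 0, "start compute");
    /* ... the actual work, e.g. the pi computation in cpilog.c ... */
    MPE_Log_event(ev_end, 0, "end compute");

    /* The per-process logs are merged and written here; with liblmpe,
       MPI_Finalize does this automatically (hence ./cpilog.x.clog2 in
       the 33-process run below). A crash "in MPI_Finalize" usually
       means this merge step failed. */
    MPE_Finish_log("cpilog");

    MPI_Finalize();
    return 0;
}
---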

It actually seems to depend on the Open MPI version I use. With version
1.0.1 it works, and I now have a logfile for 128 CPUs. With the nightly
tarball version 1.1a1r8626 (tuned collectives) it does not work, and I
get no corefile.

For 33 processes I get:
---
ckutzne@wes:~/mpe2test> mpirun -np 33 ./cpilog.x
Process 0 running on wes
Process 31 running on wes
...
Process 30 running on wes
Process 21 running on wes
pi is approximately 3.1415926535898770, Error is 0.0000000000000839
wall clock time = 0.449936
Writing logfile....
Enabling the synchronization of the clocks...
Finished writing logfile ./cpilog.x.clog2.
---

For 34 processes I get something like (slightly shortened):
---
ckutzne@wes:~/mpe2test> mpirun -np 34 ./cpilog.x
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d)
[0x403f376d]
[4]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b)
[0x403f442b]
[5]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30)
[0x403f34c0]
[6]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb)
[0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
... (the same Signal:11 message and a backtrace identical to the one
above are repeated for each of the other crashing ranks) ...
mpirun noticed that job rank 1 with PID 9014 on node "localhost" exited on
signal 11.
...
30 additional processes aborted (not shown)
3 processes killed (possibly by Open MPI)


Looks like the problem is somewhere in the tuned collectives?
Unfortunately I need a logfile with exactly those :(
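
Judging from the backtrace, the crash happens underneath the MPI_Allreduce
that MPE's clock synchronization (CLOG_Sync_init) issues during MPI_Init,
inside the tuned reduce decision code. A bare MPI_Allreduce right after
MPI_Init should exercise the same path without any MPE involved --
an untested sketch:
---
#include <stdio.h>
#include <mpi.h>

/* Minimal test: one MPI_Allreduce right after MPI_Init, similar to what
   CLOG_Sync_init appears to do according to the backtrace. If this also
   segfaults with 34+ processes under the tuned collectives, the problem
   is independent of MPE. */
int main(int argc, char *argv[])
{
    int    rank, nprocs;
    double in, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    in = (double)rank;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allreduce over %d processes OK, max rank = %g\n",
               nprocs, out);

    MPI_Finalize();
    return 0;
}
---
Compiled with mpicc and run like cpilog.x above with -np 34, this should
show whether the tuned reduce path crashes on its own, without MPE in the
picture.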

   Carsten


---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne
