On Wed, 4 Jan 2006, Jeff Squyres wrote:

> On Jan 4, 2006, at 2:08 PM, Anthony Chan wrote:
>
>>> Either my program quits without writing the logfile (and without
>>> complaining), or it crashes in MPI_Finalize. I get the message
>>> "33 additional processes aborted (not shown)".
>>
>> This is not an MPE error message. If the logging crashes in MPI_Finalize,
>> it usually means that the merging of the logging data from the child
>> nodes failed. Since you didn't get any MPE error messages, the cause of
>> the crash is not something MPE expects. Does anyone know if "33 additional
>> processes aborted (not shown)" is from Open MPI?
>
> Yes, it is. It is mpirun telling you that 33 processes aborted, in addition
> to the one whose error message it must have shown above that. So I'm
> guessing that 34 processes aborted in total.
>
> Are you getting corefiles for these processes? (You might need to check
> the limit of your coredumpsize.)
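As a side note on the corefile question: if the shell's coredumpsize limit is what suppresses the cores, the limit can also be raised from inside the test program itself. Below is only a minimal sketch of that idea; the helper name enable_core_dumps is illustrative and not part of cpilog.c or MPE.

---
/* Illustrative sketch only: raise the core file size soft limit to the
 * hard limit so a crashing process can actually write a core file.
 * Call this early, e.g. before MPI_Init(). */
#include <sys/resource.h>

static void enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;    /* raise soft limit up to the hard limit */
        setrlimit(RLIMIT_CORE, &rl);  /* if this fails, we simply get no core */
    }
}
---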
Anthony, thanks for your suggestions. I tried the cpilog.c program with
logging, and it also crashes when using more than 33 (!) processes. This also
happens when I let it run on a single node, so it is not due to some network
setting. It actually seems to depend on the Open MPI version I use: with
version 1.0.1 it works, and I have a logfile for 128 CPUs now; with the
nightly tarball version 1.1a1r8626 (tuned collectives) it does not work, and
I get no corefile.

For 33 processes I get:

---
ckutzne@wes:~/mpe2test> mpirun -np 33 ./cpilog.x
Process 0 running on wes
Process 31 running on wes
...
Process 30 running on wes
Process 21 running on wes
pi is approximately 3.1415926535898770, Error is 0.0000000000000839
wall clock time = 0.449936
Writing logfile....
Enabling the synchronization of the clocks...
Finished writing logfile ./cpilog.x.clog2.
---

For 34 processes I get something like this (slightly shortened):

---
ckutzne@wes:~/mpe2test> mpirun -np 34 ./cpilog.x
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d) [0x403f376d]
[4] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b) [0x403f442b]
[5] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30) [0x403f34c0]
[6] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb) [0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d) [0x403f376d]
[4] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b) [0x403f442b]
[5] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30) [0x403f34c0]
[6] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb) [0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
...
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
mpirun noticed that job rank 1 with PID 9014 on node "localhost" exited on signal 11.
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
...
2[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d) [0x403f376d]
[4] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b) [0x403f442b]
[5] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30) [0x403f34c0]
[6] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb) [0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d) [0x403f376d]
[4] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b) [0x403f442b]
[5] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30) [0x403f34c0]
[6] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb) [0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
...
30 additional processes aborted (not shown)
3 processes killed (possibly by Open MPI)
---

Looks like the problem is somewhere in the tuned collectives? Unfortunately,
I need a logfile with exactly those :(

Carsten

---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne
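P.S. Since the backtrace enters the tuned collectives through PMPI_Allreduce (called from CLOG_Sync_init during MPI_Init), a stripped-down test like the one below, run with -np 34, should show whether the tuned Allreduce already fails when no MPE logging is linked in at all. This is only a minimal sketch; the buffer size, datatype, and reduction operation are arbitrary choices and not taken from cpilog.c or MPE.

---
/* allreduce_test.c -- does a bare MPI_Allreduce over MPI_COMM_WORLD
 * crash with -np 34, without any MPE logging involved?
 *
 * Build:  mpicc allreduce_test.c -o allreduce_test.x
 * Run:    mpirun -np 34 ./allreduce_test.x
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* The failing path (CLOG_Sync_init) ends up in PMPI_Allreduce, so a
     * plain Allreduce right after startup mimics that call. */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allreduce over %d processes gave %g\n", nprocs, out);

    MPI_Finalize();
    return 0;
}
---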