The same run on 32 CPUs almost completes: it starts to write the 32 restart files and then fails with the same problem:

Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:33
/opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
/opt/ompi/lib/libopal.so.0.0.0:0x99df5
/lib/amd64/libc.so.1:0xcb276
/lib/amd64/libc.so.1:0xc0642
/opt/mx/lib/amd64/libmyriexpress.so:0x102c7 [ Signal 11 (SEGV)]
/opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0x3d
/opt/mx/lib/amd64/libmyriexpress.so:mx__test_common+0x22
/opt/mx/lib/amd64/libmyriexpress.so:mx_test+0x37
/opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_send+0x288
/opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_send+0x3fc
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_sendrecv_actual_localcompleted+0x85
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_recursivedoubling+0x1a3
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_dec_fixed+0x44
/opt/ompi/lib/libmpi.so.0.0.0:MPI_Barrier+0x9d
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:restart+0x9a0
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x219
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
*** End of error message ***
mv: cannot access ./restart.20
31 additional processes aborted (not shown)
m2001(27) >

On Thu, 23 Nov 2006, Lydia Heck wrote:

>
> Gadget2 - I cannot attach it because it is not publicly available,
> runs perfectly fine on any number of processes on systems such
> as Solaris 10 - Sun CT6 gigabit, SUN CT5 and myrinet gm, IBM regatta ..
>
> Sorry to be so expansive ...
>
> When I run the code on 32 CPUs on openmpi, mx using the studio11 compilers
> on a solaris x64 system the code works fine, until about the end, when
> it fails to write all the restart files.
>
> When I run the code on 64 CPUs it fails with an error message which is
>
> Topnodes=218193 costlimit=0.0890015 countlimit=428.229
> Before=44417
> After=46281
> NTopleaves= 40496 NTopnodes=46281 (space for 347252)
> desired memory imbalance=2.83425 (limit=100719, needed=114185)
> Note: the domain decomposition is suboptimum because the ceiling for
> memory-imbalance is reached
> work-load balance=1.28529 memory-balance=1.01948
> exchange of 0002589387 particles
> Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
> Failing at addr:5192cbd0
> /opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
> /opt/ompi/lib/libopal.so.0.0.0:0x99df5
> /lib/amd64/libc.so.1:0xcb276
> /lib/amd64/libc.so.1:0xc0642
> /opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)]
> /opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174
> /opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_irecv+0x116
> /opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_irecv+0x27b
> /opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_exchange+0x11b7
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_decompose+0x4da
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_Decomposition+0x467
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x9f
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
> *** End of error message ***
> 63 additional processes aborted (not shown)
> m2001(26) > /opt/ompi/bin/mpirun -np 32 -machinefile ./myh-all -mca pml cm
> ./Gadget2 param.txt
>
> As this is one of our predominant production codes, I need to make sure
> that it is running on any system which I install. Any idea would be welcome.
>
> Lydia
>
>
>
> ------------------------------------------
> Dr E L Heck
>
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
>
> DURHAM, DH1 3LE
> United Kingdom
>
> e-mail: lydia.h...@durham.ac.uk
>
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
>

------------------------------------------
Dr E L Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________