I fear that Jeff and Brian are both out for the holiday, Gus, so we are unlikely to have much info on this until they return.
I'm unaware of any such problems in 1.6.5. It looks like something isn't properly aligned in memory - could be an error on our part, but might be in the program. You might want to build a debug version and see if that segfaults, and then look at the core with gdb to see where it happened.

On Dec 23, 2013, at 3:27 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Dear OMPI experts
>
> I have been using OMPI 1.6.5 built with gcc 4.4.7 and
> PGI pgfortran 11.10 to successfully compile and run
> a large climate modeling program (CESM) in several
> different configurations.
>
> However, today I hit a segmentation fault when running a new model
> configuration.
> [In the climate modeling jargon, a program is called a "model".]
>
> This is somewhat unpleasant because that OMPI build
> is a central piece of the production CESM model setup available
> to all users in our two clusters at this point.
> I have other OMPI 1.6.5 builds, with other compilers, but that one
> was working very well with CESM, until today.
>
> Unless I am misinterpreting it, the error message,
> reproduced below, seems to indicate the problem
> happened inside the OMPI library.
> Or not?
>
> Other details:
>
> Nodes are AMD Opteron 6376 x86_64, interconnect is Infiniband QDR,
> OS is stock CentOS 6.4, kernel 2.6.32-358.2.1.el6.x86_64.
> The program is compiled with the OMPI wrappers (mpicc and mpif90),
> and somewhat conservative optimization flags:
>
> FFLAGS := $(CPPDEFS) -i4 -gopt -Mlist -Mextend -byteswapio
> -Minform=inform -traceback -O2 -Mvect=nosse -Kieee
>
> Is this a known issue?
> Any clues on how to address it?
>
> Thank you for your help,
> Gus Correa
>
> **************** error message *******************
>
> [1,31]<stderr>:[node30:17008] *** Process received signal ***
> [1,31]<stderr>:[node30:17008] Signal: Segmentation fault (11)
> [1,31]<stderr>:[node30:17008] Signal code: Address not mapped (1)
> [1,31]<stderr>:[node30:17008] Failing at address: 0x17
> [1,31]<stderr>:[node30:17008] [ 0] /lib64/libpthread.so.0(+0xf500)
> [0x2b788ef9f500]
> [1,31]<stderr>:[node30:17008] [ 1]
> /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(+0x100ee3)
> [0x2b788e200ee3]
> [1,31]<stderr>:[node30:17008] [ 2]
> /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x111)
> [0x2b788e203771]
> [1,31]<stderr>:[node30:17008] [ 3]
> /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x97)
> [0x2b788e2046d7]
> [1,31]<stderr>:[node30:17008] [ 4]
> /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x8b)
> [0x2b788e2052ab]
> [1,31]<stderr>:[node30:17008] [ 5] ./ccsm.exe(pgf90_auto_alloc+0x73)
> [0xe2c4c3]
> [1,31]<stderr>:[node30:17008] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 31 with PID 17008 on node node30 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
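
P.S. In case it helps, the core-file steps suggested at the top of this reply might look roughly like the sketch below. It is only an outline under assumptions, not a tested recipe for CESM: the process count (NP=32), the core file name (core.17008, taken from the PID in the trace; the real name depends on the kernel's core_pattern setting), and the idea of rebuilding with -g -O0 in place of -gopt -O2 are all guesses on my part. As written it only prints the commands; set RUN to an empty string to actually execute them.

```shell
#!/bin/sh
# Sketch only. EXE, NP and CORE are assumed values; adjust them for your setup.
# Assumption: ccsm.exe has been rebuilt with debug flags (e.g. -g -O0).
EXE=${EXE:-./ccsm.exe}
NP=${NP:-32}
CORE=${CORE:-core.17008}
RUN=${RUN-echo}   # default: print the commands; RUN= runs them for real

# Allow core dumps in this shell (the limit may also need raising on the compute nodes).
ulimit -c unlimited 2>/dev/null || true

# Re-run the failing configuration so the crashing rank leaves a core file.
$RUN mpiexec -np "$NP" "$EXE"

# Print the backtrace from the core; run this on the node where the core was written.
$RUN gdb -batch -ex bt "$EXE" "$CORE"
```

If the backtrace from the debug build still ends inside the malloc routines, the frames above pgf90_auto_alloc should at least show which allocation in the program was in progress.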