Open MPI unfortunately has to play some tricks with the malloc system when using InfiniBand or the Cray interconnects. One other option is to set the environment variable
OMPI_MCA_memory_linux_disable to some non-zero value. That will disable the evil memory hooks, which might help if PGI is doing something unexpected. If not, it will also make it a bit easier to use the standard Linux memory debugging tools.

Brian

On 12/24/13 4:10 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

> I'm *very loosely* checking email. :-)
>
> Agree with what Ralph said: it looks like your program called memalign, and that ended up segv'ing. That could be an OMPI problem, or it could be an application problem. Try also configuring OMPI --with-valgrind and running your app through a memory-checking debugger (although OMPI is not very valgrind-clean in the 1.6 series :-\ -- you'll get a bunch of false positives about reads from unallocated memory and memory being left still-allocated after MPI_FINALIZE).
>
> On Dec 23, 2013, at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I fear that Jeff and Brian are both out for the holiday, Gus, so we are unlikely to have much info on this until they return.
>>
>> I'm unaware of any such problems in 1.6.5. It looks like something isn't properly aligned in memory - could be an error on our part, but might be in the program. You might want to build a debug version and see if that segfaults, and then look at the core with gdb to see where it happened.
>>
>> On Dec 23, 2013, at 3:27 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>>> Dear OMPI experts,
>>>
>>> I have been using OMPI 1.6.5 built with gcc 4.4.7 and PGI pgfortran 11.10 to successfully compile and run a large climate modeling program (CESM) in several different configurations.
>>>
>>> However, today I hit a segmentation fault when running a new model configuration. [In the climate modeling jargon, a program is called a "model".]
>>>
>>> This is somewhat unpleasant because that OMPI build is a central piece of the production CESM model setup available to all users in our two clusters at this point. I have other OMPI 1.6.5 builds, with other compilers, but that one was working very well with CESM until today.
>>>
>>> Unless I am misinterpreting it, the error message, reproduced below, seems to indicate the problem happened inside the OMPI library. Or not?
>>>
>>> Other details:
>>>
>>> Nodes are AMD Opteron 6376 x86_64, the interconnect is InfiniBand QDR, and the OS is stock CentOS 6.4, kernel 2.6.32-358.2.1.el6.x86_64. The program is compiled with the OMPI wrappers (mpicc and mpif90) and somewhat conservative optimization flags:
>>>
>>> FFLAGS := $(CPPDEFS) -i4 -gopt -Mlist -Mextend -byteswapio -Minform=inform -traceback -O2 -Mvect=nosse -Kieee
>>>
>>> Is this a known issue?
>>> Any clues on how to address it?
>>>
>>> Thank you for your help,
>>> Gus Correa
>>>
>>> **************** error message *******************
>>>
>>> [1,31]<stderr>:[node30:17008] *** Process received signal ***
>>> [1,31]<stderr>:[node30:17008] Signal: Segmentation fault (11)
>>> [1,31]<stderr>:[node30:17008] Signal code: Address not mapped (1)
>>> [1,31]<stderr>:[node30:17008] Failing at address: 0x17
>>> [1,31]<stderr>:[node30:17008] [ 0] /lib64/libpthread.so.0(+0xf500) [0x2b788ef9f500]
>>> [1,31]<stderr>:[node30:17008] [ 1] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(+0x100ee3) [0x2b788e200ee3]
>>> [1,31]<stderr>:[node30:17008] [ 2] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x111) [0x2b788e203771]
>>> [1,31]<stderr>:[node30:17008] [ 3] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x97) [0x2b788e2046d7]
>>> [1,31]<stderr>:[node30:17008] [ 4] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x8b) [0x2b788e2052ab]
>>> [1,31]<stderr>:[node30:17008] [ 5] ./ccsm.exe(pgf90_auto_alloc+0x73) [0xe2c4c3]
>>> [1,31]<stderr>:[node30:17008] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 31 with PID 17008 on node node30 exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
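
For anyone hitting the same crash, a minimal sketch of what the suggestions in this thread might look like in a job script follows. The executable name (ccsm.exe), rank count, and core-file name are placeholders taken from Gus's report, not a tested recipe; adapt them to your batch environment.

    # Minimal sketch, assuming a bash job script; names and counts are
    # placeholders based on the report above.

    # Brian's suggestion: disable Open MPI's ptmalloc2 memory hooks.  MCA
    # parameters can be set as OMPI_MCA_<name> environment variables; this
    # one is usually exported in the environment rather than passed with
    # "mpirun --mca", since the hooks are installed very early at load time.
    export OMPI_MCA_memory_linux_disable=1

    # Allow core dumps, then rerun the failing configuration.
    ulimit -c unlimited
    mpiexec -np 32 ./ccsm.exe

    # Ralph's suggestion: if it still segfaults, inspect the core from the
    # failing rank with gdb (a -g/-gopt build gives a usable backtrace).
    gdb --batch -ex bt ./ccsm.exe core.17008   # core-file name is system-dependent

    # Jeff's suggestion: build Open MPI with --with-valgrind, then run the
    # app under a memory checker, e.g.:
    #   mpiexec -np 32 valgrind --track-origins=yes ./ccsm.exe

Disabling the hooks may cost some InfiniBand performance (they exist to support registered-memory caching), but that is usually an acceptable price while debugging.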