Open MPI unfortunately has to play some tricks with the malloc system when
using InfiniBand or the Cray interconnects.  One other option is to set
the environment variable

  OMPI_MCA_memory_linux_disable

to some non-zero value.  That will disable the evil memory hooks, which
might help if PGI is doing something unexpected.  Even if it doesn't, it
will also make it a bit easier to use the standard Linux memory debugging
tools.
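
For example, in a bash-like shell (the rank count below is just a
placeholder for however you normally launch CESM):

  export OMPI_MCA_memory_linux_disable=1
  mpiexec -np 32 ./ccsm.exe

mpiexec forwards OMPI_MCA_* environment variables to the launched
processes, so exporting it before the launch should be enough.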

Brian

On 12/24/13 4:10 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

>I'm *very loosely* checking email.  :-)
>
>Agree with what Ralph said: it looks like your program called memalign,
>and that ended up segv'ing.  That could be an OMPI problem, or it could
>be an application problem.  Try also configuring OMPI --with-valgrind and
>running your app through a memory-checking debugger (although OMPI is not
>very valgrind-clean in the 1.6 series :-\ -- you'll get a bunch of false
>positives about reads from unallocated memory and about memory being
>left still-allocated after MPI_FINALIZE).
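>
>A rough sketch of what that could look like -- the install prefix and
>rank count are placeholders, and if I remember right the suppressions
>file gets installed under share/openmpi in the prefix:
>
>  ./configure --prefix=/sw/openmpi/1.6.5-debug --enable-debug \
>      --with-valgrind CC=gcc CXX=g++ F77=pgfortran FC=pgfortran
>  make install
>
>  PREFIX=/sw/openmpi/1.6.5-debug
>  mpiexec -np 32 valgrind \
>      --suppressions=$PREFIX/share/openmpi/openmpi-valgrind.supp ./ccsm.exe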
>
>
>
>On Dec 23, 2013, at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I fear that Jeff and Brian are both out for the holiday, Gus, so we are
>> unlikely to have much info on this until they return.
>> 
>> I'm unaware of any such problems in 1.6.5. It looks like something
>> isn't properly aligned in memory - could be an error on our part, but
>> might be in the program. You might want to build a debug version and
>> see if that segfaults, and then look at the core with gdb to see where
>> it happened.
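>> 
>> Something along these lines, assuming the crash leaves a core file
>> behind (you may need "ulimit -c unlimited" in the job script; the core
>> file name below is just an example):
>> 
>>   gdb ./ccsm.exe core.17008
>>   (gdb) bt full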
>> 
>> 
>> On Dec 23, 2013, at 3:27 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>> 
>>> Dear OMPI experts
>>> 
>>> I have been using OMPI 1.6.5 built with gcc 4.4.7 and
>>> PGI pgfortran 11.10 to successfully compile and run
>>> a large climate modeling program (CESM) in several
>>> different configurations.
>>> 
>>> However, today I hit a segmentation fault when running a new model
>>> configuration.
>>> [In the climate modeling jargon, a program is called a "model".]
>>> 
>>> This is somewhat unpleasant because that OMPI build
>>> is a central piece of the production CESM model setup available
>>> to all users on our two clusters at this point.
>>> I have other OMPI 1.6.5 builds, with other compilers, but that one
>>> was working very well with CESM until today.
>>> 
>>> Unless I am misinterpreting it, the error message,
>>> reproduced below, seems to indicate the problem
>>> happened inside the OMPI library.
>>> Or not?
>>> 
>>> Other details:
>>> 
>>> Nodes are AMD Opteron 6376 x86_64, interconnect is Infiniband QDR,
>>> OS is stock CentOS 6.4, kernel 2.6.32-358.2.1.el6.x86_64.
>>> The program is compiled with the OMPI wrappers (mpicc and mpif90),
>>> and somewhat conservative optimization flags:
>>> 
>>> FFLAGS       := $(CPPDEFS) -i4 -gopt -Mlist -Mextend -byteswapio -Minform=inform -traceback -O2 -Mvect=nosse -Kieee
>>> 
>>> Is this a known issue?
>>> Any clues on how to address it?
>>> 
>>> Thank you for your help,
>>> Gus Correa
>>> 
>>> **************** error message *******************
>>> 
>>> [1,31]<stderr>:[node30:17008] *** Process received signal ***
>>> [1,31]<stderr>:[node30:17008] Signal: Segmentation fault (11)
>>> [1,31]<stderr>:[node30:17008] Signal code: Address not mapped (1)
>>> [1,31]<stderr>:[node30:17008] Failing at address: 0x17
>>> [1,31]<stderr>:[node30:17008] [ 0] /lib64/libpthread.so.0(+0xf500) [0x2b788ef9f500]
>>> [1,31]<stderr>:[node30:17008] [ 1] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(+0x100ee3) [0x2b788e200ee3]
>>> [1,31]<stderr>:[node30:17008] [ 2] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x111) [0x2b788e203771]
>>> [1,31]<stderr>:[node30:17008] [ 3] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x97) [0x2b788e2046d7]
>>> [1,31]<stderr>:[node30:17008] [ 4] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x8b) [0x2b788e2052ab]
>>> [1,31]<stderr>:[node30:17008] [ 5] ./ccsm.exe(pgf90_auto_alloc+0x73) [0xe2c4c3]
>>> [1,31]<stderr>:[node30:17008] *** End of error message ***
>>> 
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 31 with PID 17008 on node node30 exited on signal 11 (Segmentation fault).
>>> 
>>> --------------------------------------------------------------------------
>>> 
>> 
>
>
>-- 
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories


