Hi Justin,

If you can build the application in debug mode, try inserting valgrind into your MPI command. It's usually very good at tracking down the origin of failing memory allocations.
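For example (just a sketch, reusing the mpirun line from your message below and assuming the binary is ./hacc_tpm; adjust paths and flags as needed):

    mpirun --oversubscribe -np 1 valgrind --leak-check=full --track-origins=yes ./hacc_tpm

It will run much more slowly, but valgrind usually reports the first invalid read/write or double free at its origin rather than the later crash inside the allocator.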
Kind regards,
- Dmitry.

2017-06-20 1:10 GMT+03:00 Sylvain Jeaugey <sjeau...@nvidia.com>:

> Justin, can you try setting mpi_leave_pinned to 0 to disable libptmalloc2
> and confirm this is related to ptmalloc?
>
> Thanks,
> Sylvain
>
> On 06/19/2017 03:05 PM, Justin Luitjens wrote:
>
> I have an application that works on other systems, but on the system I'm
> currently running on I'm seeing the following crash:
>
> [dt04:22457] *** Process received signal ***
> [dt04:22457] Signal: Segmentation fault (11)
> [dt04:22457] Signal code: Address not mapped (1)
> [dt04:22457] Failing at address: 0x55556a1da250
> [dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaab353370]
> [dt04:22457] [ 1] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2aaaacbcf810]
> [dt04:22457] [ 2] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2aaaacbcff3b]
> [dt04:22457] [ 3] ./hacc_tpm[0x42f068]
> [dt04:22457] [ 4] ./hacc_tpm[0x42f231]
> [dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
> [dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac30db35]
> [dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
> [dt04:22457] *** End of error message ***
>
> This app is a CUDA app but doesn't use GPUDirect, so that should be
> irrelevant.
>
> I'm building with gcc/5.3.0, cuda/8.0.44, and openmpi/1.10.7.
>
> I'm running on CentOS 7 and used a vanilla MPI configure line:
> ./configure --prefix=/home/jluitjens/libs/openmpi/
>
> Currently I'm trying to do this with just a single MPI process, but
> multiple MPI processes fail in the same way:
>
> mpirun --oversubscribe -np 1 ./command
>
> What is odd is that the crash occurs around the same spot in the code,
> but not consistently at the same spot. The spot in the code where the
> single thread is at the time of the crash is nowhere near MPI code. The
> code where it is crashing is just using malloc to allocate some memory.
> This makes me think the crash is due to a thread outside of the
> application I'm working on (perhaps in Open MPI itself), or perhaps due
> to Open MPI hijacking malloc/free.
>
> Does anyone have any ideas of what I could try to work around this issue?
>
> Thanks,
> Justin
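As for Sylvain's suggestion quoted above: mpi_leave_pinned is an MCA parameter, so it can be tested without rebuilding anything (a sketch only, again reusing Justin's mpirun line):

    mpirun --oversubscribe -np 1 --mca mpi_leave_pinned 0 ./hacc_tpm

or equivalently via the environment:

    export OMPI_MCA_mpi_leave_pinned=0

If the segfault disappears with it set to 0, that is a strong hint that the ptmalloc2 memory hooks in libopen-pal are involved.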