Re: [OMPI users] Bus Error in ompi_free_list_grow

Jeff Squyres Fri, 14 Nov 2008 13:23:06 -0500

If this is the problem, that's good -- it just means that we need better error detection in the case where we run out of memory, etc. Stay tuned to that thread to see what happens.


On Nov 14, 2008, at 1:14 PM, Peter Cebull wrote:

Jeff Squyres wrote:
Could this issue actually be related to:

   http://www.open-mpi.org/community/lists/devel/2008/11/4882.php

(read through the thread to get to the error handling stuff)
You might be right that this issue is the problem. Our system has diskless nodes, so /tmp uses a ramdisk. It was initially configured so that /tmp could use up to 8 GB of the 16 GB of memory on each node. We didn't notice until recently that something in the upgrade we made to the system dropped the size of /tmp to 48 MB, so maybe that's the cause of the problem. We've increased the size of /tmp again in the compute node image, but I'll have to wait until we get a chance to push out the new image before I can tell if that will fix our problem.
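
A quick way to see whether /tmp is too small on a node is to check its free space before launching a job. The following is a minimal Python sketch, not part of Open MPI. The function name has_enough_space and the 512 MiB default threshold are assumptions chosen for illustration; the thread does not say how much space the shared memory layer actually needs.

```python
import shutil


def has_enough_space(path="/tmp", required_bytes=512 * 1024 * 1024):
    """Return (ok, free_bytes) for the filesystem that holds `path`.

    `required_bytes` is a hypothetical threshold, not a value taken
    from Open MPI; adjust it to your own job size.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes, usage.free
```

Calling has_enough_space("/tmp") on each compute node (for example from a job prologue) would flag a node whose /tmp has shrunk, such as the 48 MB case described above.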
Thanks!
Peter
On Nov 14, 2008, at 7:41 AM, Geraldo Veiga wrote:
Thanks Peter. Blocking the shared memory layer did the trick for our program too.
For the record, we also have SGI Propack 6 installed (sgi-propack-release-6-sgi600r3).
Is the on-node shared memory support completely blocked? What if the MPI process calls a procedure that uses OpenMP threads, for instance?
On Thu, Nov 13, 2008 at 1:44 PM, Peter Cebull <peter.ceb...@inl.gov> wrote:
Geraldo,
The previous message you saw was for our Altix ICE system. Since we started seeing these errors after upgrading to SGI Propack 6, I wonder if there's a bug somewhere in the Propack software or an incompatibility between Open MPI and OFED 1.3 (we had no problems under OFED 1.2). A workaround I stumbled across is to turn off the sm component:
mpirun --mca btl ^sm . . .
That seems to allow our application to run, although I guess at the expense of losing on-node shared memory support.
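
For anyone launching jobs from a script, the workaround can be sketched in Python as an argument list. This is only an illustration: the helper name build_mpirun_argv is hypothetical, and the process count and program name are placeholders. Passing the arguments as a list (without a shell) means no shell gets a chance to interpret the ^ character.

```python
def build_mpirun_argv(num_procs, program_args):
    """Build an mpirun command line that excludes the sm BTL.

    `num_procs` and `program_args` are placeholders supplied by the
    caller; only `--mca btl ^sm` comes from the workaround above.
    """
    return ["mpirun", "--mca", "btl", "^sm", "-np", str(num_procs)] + list(program_args)
```

The resulting list can be handed to subprocess.run(argv) on a machine where mpirun is installed.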
Peter

Geraldo Veiga wrote:
Hi to all,
I am using the same subject as a recent message I found in the archives of this mailing list:
http://www.open-mpi.org/community/lists/users/2008/10/7025.php
There was no follow-up on that one, but I will add this similar report in case a list member can give us an idea of how to correct it. Or whose bug this could be.
My application behaves as expected when I run it on a single host with multiple MPI nodes of our SGI Altix ICE 8200 cluster with InfiniBand. When I try the same with multiple hosts, using the PBS batch system, the program terminates with a segmentation fault:
-------
[r1i0n9:09192] *** Process received signal ***
[r1i0n9:09192] Signal: Bus error (7)
[r1i0n9:09192] Signal code:  (2)
[r1i0n9:09192] Failing at address: 0x2b67ca0c8c20
[r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00]
[r1i0n9:09192] [ 1] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b67bf499b38]
[r1i0n9:09192] [ 2] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b67c3a43e15]
[r1i0n9:09192] [ 3] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b67c34e9041]
[r1i0n9:09192] [ 4] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x540) [0x2b67c34e17ec]
[r1i0n9:09192] [ 5] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f]
[r1i0n9:09192] [ 6] /sw/openmpi_intel/1.2.8/lib/libmpi_f77.so.0(pmpi_isend+0x8f) [0x2b67bf36a03f]
[r1i0n9:09192] [ 7] dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) [0x53e19b]
[r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b) [0x54fda1]
[r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b]
[r1i0n9:09192] [10] dsimpletest(dmumps_244_+0x808) [0x484e38]
[r1i0n9:09192] [11] dsimpletest(dmumps_142_+0x8717) [0x4bf5eb]
[r1i0n9:09192] [12] dsimpletest(dmumps_+0x1554) [0x43a720]
[r1i0n9:09192] [13] dsimpletest(MAIN__+0x50b) [0x41e4c3]
[r1i0n9:09192] [14] dsimpletest(main+0x3c) [0x683d4c]
[r1i0n9:09192] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b67bfeda184]
[r1i0n9:09192] [16] dsimpletest(dtrmv_+0xa1) [0x41df29]
[r1i0n9:09192] *** End of error message ***
----
Most of the software infrastructure is provided by the Intel propack. Any hints of where to look further into this bug?
Thanks in advance.

--
Geraldo Veiga <gve...@gmail.com>
------------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Peter Cebull





--
Geraldo Veiga <gve...@gmail.com>
--
Peter Cebull
Idaho National Laboratory
P.O. Box 1625, MS3605
Idaho Falls, ID 83415
Phone: 208-526-1909
Email: peter.ceb...@inl.gov




--
Jeff Squyres
Cisco Systems
