Thanks Peter. Blocking the shared memory layer did the trick for our program too.
For the record, we also have SGI Propack 6 installed (sgi-propack-release-6-sgi600r3). Is the on-node shared memory support completely blocked? What if the MPI process calls a procedure that uses OpenMP threads, for instance? On Thu, Nov 13, 2008 at 1:44 PM, Peter Cebull <peter.ceb...@inl.gov> wrote: > Geraldo, > > The previous message you saw was for our Altix ICE system. Since we started > seeing these errors after upgrading to SGI Propack 6, I wonder if there's a > bug somewhere in the Propack software or an incompatibility between Open MPI > and OFED 1.3 (we had no problems under OFED 1.2). A workaround I stumbled > across is to turn off the sm component: > > mpirun --mca btl ^sm . . . > > That seems to allow our application to run, although I guess at the expense > of losing on-node shared memory support. > > Peter > > Geraldo Veiga wrote: > >> Hi to all, >> >> I am using the same subject of a recent message I found in the list >> archives of this mailing list: >> >> http://www.open-mpi.org/community/lists/users/2008/10/7025.php >> >> There was no follow-up on that one, but will add this similar report in >> case a list member can give us an idea of how to correct it. Or whose bug >> this could be. >> >> My application behaves as expected when I run it in a single host and >> multiple MPI nodes of our SGI Altix ICE 8200 cluster with in InfiniBand. >> When I try the same with multiple hosts, using the PBS batch system the >> program terminates with a segmentation fault: >> >> ------- >> [r1i0n9:09192] *** Process received signal *** >> [r1i0n9:09192] Signal: Bus error (7) >> [r1i0n9:09192] Signal code: (2) >> [r1i0n9:09192] Failing at address: 0x2b67ca0c8c20 >> [r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00] >> [r1i0n9:09192] [ 1] >> /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(ompi_free_list_grow+0x14a) >> [0x2b67bf499b38] >> [r1i0n9:09192] [ 2] >> /sw/openmpi_intel/1.2.8/lib/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) >> [0x2b67c3a43e15] >> [r1i0n9:09192] [ 3] >> /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) >> [0x2b67c34e9041] >> [r1i0n9:09192] [ 4] >> /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x540) >> [0x2b67c34e17ec] >> [r1i0n9:09192] [ 5] >> /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f] >> [r1i0n9:09192] [ 6] >> /sw/openmpi_intel/1.2.8/lib/libmpi_f77.so.0(pmpi_isend+0x8f) >> [0x2b67bf36a03f] >> [r1i0n9:09192] [ 7] dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) >> [0x53e19b] >> [r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b) >> [0x54fda1] >> [r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b] >> [r1i0n9:09192] [10] dsimpletest(dmumps_244_+0x808) [0x484e38] >> [r1i0n9:09192] [11] dsimpletest(dmumps_142_+0x8717) [0x4bf5eb] >> [r1i0n9:09192] [12] dsimpletest(dmumps_+0x1554) [0x43a720] >> [r1i0n9:09192] [13] dsimpletest(MAIN__+0x50b) [0x41e4c3] >> [r1i0n9:09192] [14] dsimpletest(main+0x3c) [0x683d4c] >> [r1i0n9:09192] [15] /lib64/libc.so.6(__libc_start_main+0xf4) >> [0x2b67bfeda184] >> [r1i0n9:09192] [16] dsimpletest(dtrmv_+0xa1) [0x41df29] >> [r1i0n9:09192] *** End of error message *** >> ---- >> >> Most of the software infrastructure is provided by the Intel propack. Any >> hints of where to look further into this bug? >> >> Thanks in advance. >> >> -- >> Geraldo Veiga <gve...@gmail.com <mailto:gve...@gmail.com>> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > -- > Peter Cebull > > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Geraldo Veiga <gve...@gmail.com>