I recompiled the MPI library with -g, but it didn't solve the problem. Two things have changed: buf in PMPI_Recv no longer has a value of 0, and the backtrace in gdb now shows more functions (e.g. mca_pml_ob1_recv_frag_callback_put as frame #1).

As you recommended, I will try to walk up the stack, but it's not so easy for me to follow this code.
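To start with, this is roughly what I plan to type at the gdb prompt once it stops at the crash. It is only a sketch: the variable names are the argument names from frame #0 of the backtrace below, and req_endpoint is the field from the source line I quoted in my previous mail.

# frame #0 is mca_pml_ob1_send_request_put, where the SIGSEGV is raised
frame 0
# the raw pointer value of the request
print sendreq
# if that address really is not mapped, this should fail with "Cannot access memory"
print *sendreq
# the dereference from pml_ob1_sendreq.c:1231 that was blamed for the earlier crash
print sendreq->req_endpoint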
This is the backtrace I got with -g:

-----
Program received signal SIGSEGV, Segmentation fault.
0x00007f1f1a11e4eb in mca_pml_ob1_send_request_put (sendreq=0x1437b00, btl=0xdae850, hdr=0xeb4870) at pml_ob1_sendreq.c:1231
1231    pml_ob1_sendreq.c: No such file or directory.
        in pml_ob1_sendreq.c
(gdb) bt
#0  0x00007f1f1a11e4eb in mca_pml_ob1_send_request_put (sendreq=0x1437b00, btl=0xdae850, hdr=0xeb4870) at pml_ob1_sendreq.c:1231
#1  0x00007f1f1a1124de in mca_pml_ob1_recv_frag_callback_put (btl=0xdae850, tag=72 'H', des=0x7f1f1ff6bb00, cbdata=0x0) at pml_ob1_recvfrag.c:361
#2  0x00007f1f19660e0f in mca_btl_tcp_endpoint_recv_handler (sd=24, flags=2, user=0xe2ab40) at btl_tcp_endpoint.c:718
#3  0x00007f1f1d74aa5b in event_process_active (base=0xd82af0) at event.c:651
#4  0x00007f1f1d74b087 in opal_event_base_loop (base=0xd82af0, flags=2) at event.c:823
#5  0x00007f1f1d74ac76 in opal_event_loop (flags=2) at event.c:730
#6  0x00007f1f1d73a360 in opal_progress () at runtime/opal_progress.c:189
#7  0x00007f1f1a10c0af in opal_condition_wait (c=0x7f1f1df3a5c0, m=0x7f1f1df3a620) at ../../../../opal/threads/condition.h:99
#8  0x00007f1f1a10bef1 in ompi_request_wait_completion (req=0xe1eb00) at ../../../../ompi/request/request.h:375
#9  0x00007f1f1a10bdb5 in mca_pml_ob1_recv (addr=0x7f1f1a083080, count=1, datatype=0xeb3da0, src=-1, tag=0, comm=0xeb0cd0, status=0xe43f00) at pml_ob1_irecv.c:104
#10 0x00007f1f1dc9e324 in PMPI_Recv (buf=0x7f1f1a083080, count=1, type=0xeb3da0, source=-1, tag=0, comm=0xeb0cd0, status=0xe43f00) at precv.c:75
#11 0x000000000049cc43 in BI_Srecv ()
#12 0x000000000049c555 in BI_IdringBR ()
#13 0x0000000000495ba1 in ilp64_Cdgebr2d ()
#14 0x000000000047ffa0 in Cdgebr2d ()
#15 0x00007f1f1f99c8e1 in PB_CInV2 () from /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
#16 0x00007f1f1f9c489c in PB_CpgemmAB () from /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
#17 0x00007f1f1fa748fd in pdgemm_ () from /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
-----
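To figure out where that sendreq address is coming from, as you suggested, I was also going to look at the frame above it and at the arguments it passed down. Again, this is only a sketch and I don't know yet which of these will print anything useful:

# frame #1 is mca_pml_ob1_recv_frag_callback_put, which handed sendreq down to frame #0
frame 1
info args
# the descriptor that the TCP receive handler (frame #2) passed to the callback
print *des
# back in frame #0, the hdr argument that arrived together with sendreq
frame 0
print *hdr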
Thanks,
Grzegorz Maj


2010/12/7 Terry Dontje <terry.don...@oracle.com>

> I am not sure this has anything to do with your problem, but if you look
> at the stack entry for PMPI_Recv I noticed the buf has a value of 0.
> Shouldn't that be an address?
>
> Does your code fail if the MPI library is built with -g? If it does fail
> the same way, the next step I would do would be to walk up the stack and
> try to figure out where the sendreq address is coming from, because
> supposedly it is that address that is not mapped according to the
> original stack.
>
> --td
>
>
> On 12/07/2010 08:29 AM, Grzegorz Maj wrote:
>
> Some update on this issue. I've attached gdb to the crashing
> application and I got:
>
> -----
> Program received signal SIGSEGV, Segmentation fault.
> mca_pml_ob1_send_request_put (sendreq=0x130c480, btl=0xc49850, hdr=0xd10e60) at pml_ob1_sendreq.c:1231
> 1231    pml_ob1_sendreq.c: No such file or directory.
>         in pml_ob1_sendreq.c
> (gdb) bt
> #0  mca_pml_ob1_send_request_put (sendreq=0x130c480, btl=0xc49850, hdr=0xd10e60) at pml_ob1_sendreq.c:1231
> #1  0x00007fc55bf31693 in mca_btl_tcp_endpoint_recv_handler (sd=<value optimized out>, flags=<value optimized out>, user=<value optimized out>) at btl_tcp_endpoint.c:718
> #2  0x00007fc55fff7de4 in event_process_active (base=0xc1daf0, flags=2) at event.c:651
> #3  opal_event_base_loop (base=0xc1daf0, flags=2) at event.c:823
> #4  0x00007fc55ffe9ff1 in opal_progress () at runtime/opal_progress.c:189
> #5  0x00007fc55c9d7115 in opal_condition_wait (addr=<value optimized out>, count=<value optimized out>, datatype=<value optimized out>, src=<value optimized out>, tag=<value optimized out>, comm=<value optimized out>, status=0xcc6100) at ../../../../opal/threads/condition.h:99
> #6  ompi_request_wait_completion (addr=<value optimized out>, count=<value optimized out>, datatype=<value optimized out>, src=<value optimized out>, tag=<value optimized out>, comm=<value optimized out>, status=0xcc6100) at ../../../../ompi/request/request.h:375
> #7  mca_pml_ob1_recv (addr=<value optimized out>, count=<value optimized out>, datatype=<value optimized out>, src=<value optimized out>, tag=<value optimized out>, comm=<value optimized out>, status=0xcc6100) at pml_ob1_irecv.c:104
> #8  0x00007fc560511260 in PMPI_Recv (buf=0x0, count=12884048, type=0xd10410, source=-1, tag=0, comm=0xd0daa0, status=0xcc6100) at precv.c:75
> #9  0x000000000049cc43 in BI_Srecv ()
> #10 0x000000000049c555 in BI_IdringBR ()
> #11 0x0000000000495ba1 in ilp64_Cdgebr2d ()
> #12 0x000000000047ffa0 in Cdgebr2d ()
> #13 0x00007fc5621da8e1 in PB_CInV2 () from /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
> #14 0x00007fc56220289c in PB_CpgemmAB () from /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
> #15 0x00007fc5622b28fd in pdgemm_ () from /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
> -----
>
> So it looks like the line responsible for the segmentation fault is:
> mca_bml_base_endpoint_t *bml_endpoint = sendreq->req_endpoint;
>
> I repeated it several times: it always crashes on the same line.
>
> I have no idea what to do with this. Again, any help would be appreciated.
>
> Thanks,
> Grzegorz Maj
>
>
> 2010/12/6 Grzegorz Maj <ma...@wp.pl>:
>
> Hi,
> I'm using MKL ScaLAPACK in my project. Recently, I was trying to run
> my application on a new set of nodes. Unfortunately, when I try to
> execute more than about 20 processes, I get a segmentation fault.
>
> [compn7:03552] *** Process received signal ***
> [compn7:03552] Signal: Segmentation fault (11)
> [compn7:03552] Signal code: Address not mapped (1)
> [compn7:03552] Failing at address: 0x20b2e68
> [compn7:03552] [ 0] /lib64/libpthread.so.0(+0xf3c0) [0x7f46e0fc33c0]
> [compn7:03552] [ 1] /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0xd577) [0x7f46dd093577]
> [compn7:03552] [ 2] /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x5b4c) [0x7f46dc5edb4c]
> [compn7:03552] [ 3] /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(+0x1dbe8) [0x7f46e0679be8]
> [compn7:03552] [ 4] /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(opal_progress+0xa1) [0x7f46e066dbf1]
> [compn7:03552] [ 5] /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5945) [0x7f46dd08b945]
> [compn7:03552] [ 6] /home/gmaj/lib/openmpi/lib/libmpi.so.0(MPI_Send+0x6a) [0x7f46e0b4f10a]
> [compn7:03552] [ 7] /home/gmaj/matrix/matrix(BI_Ssend+0x21) [0x49cc11]
> [compn7:03552] [ 8] /home/gmaj/matrix/matrix(BI_IdringBR+0x79) [0x49c579]
> [compn7:03552] [ 9] /home/gmaj/matrix/matrix(ilp64_Cdgebr2d+0x221) [0x495bb1]
> [compn7:03552] [10] /home/gmaj/matrix/matrix(Cdgebr2d+0xd0) [0x47ffb0]
> [compn7:03552] [11] /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so(PB_CInV2+0x1304) [0x7f46e27f5124]
> [compn7:03552] *** End of error message ***
>
> This error appears during some ScaLAPACK computation. My processes do
> some MPI communication before this error appears.
>
> I found out that by modifying the btl_tcp_eager_limit and
> btl_tcp_max_send_size parameters, I can run more processes - the
> smaller those values are, the more processes I can run. Unfortunately,
> with this method I've only managed to run up to 30 processes, which is
> still far too small.
>
> Some clue may be what valgrind says:
>
> ==3894== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==3894==    at 0x82D009B: writev (in /lib64/libc-2.12.90.so)
> ==3894==    by 0xBA2136D: mca_btl_tcp_frag_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894==    by 0xBA203D0: mca_btl_tcp_endpoint_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894==    by 0xB003583: mca_pml_ob1_send_request_start_rdma (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894==    by 0xAFFA7C9: mca_pml_ob1_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894==    by 0x6D4B109: PMPI_Send (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x49CC10: BI_Ssend (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x49C578: BI_IdringBR (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x495BB0: ilp64_Cdgebr2d (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x47FFAF: Cdgebr2d (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x51B38E0: PB_CInV2 (in /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
> ==3894==    by 0x51DB89B: PB_CpgemmAB (in /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
> ==3894==  Address 0xadecdce is 461,886 bytes inside a block of size 527,544 alloc'd
> ==3894==    at 0x4C2615D: malloc (vg_replace_malloc.c:195)
> ==3894==    by 0x6D0BBA3: ompi_free_list_grow (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0xBA1E1A4: mca_btl_tcp_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894==    by 0x6D5C909: mca_btl_base_select (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0xB40E950: mca_bml_r2_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_bml_r2.so)
> ==3894==    by 0x6D5C07E: mca_bml_base_init (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0xAFF8A0E: mca_pml_ob1_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894==    by 0x6D663B2: mca_pml_base_select (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x6D25D20: ompi_mpi_init (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x6D45987: PMPI_Init_thread (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x42490A: MPI::Init_thread(int&, char**&, int) (functions_inln.h:150)
> ==3894==    by 0x41F483: main (matrix.cpp:83)
>
> I've tried to configure Open MPI with the --without-memory-manager
> option, but it didn't help.
>
> I can successfully run exactly the same application on other machines,
> even with more than 800 nodes.
>
> Does anyone have any idea how to further debug this issue? Any help
> would be appreciated.
>
> Thanks,
> Grzegorz Maj
>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com