Dear openmpi developers,
I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built with PGI13.10, as shown below:

[mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
[manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
[manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
[manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
[manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
[manage:23082] *** Process received signal ***
[manage:23082] Signal: Segmentation fault (11)
[manage:23082] Signal code: Address not mapped (1)
[manage:23082] Failing at address: 0x34
[manage:23082] *** End of error message ***
Segmentation fault (core dumped)

[mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
Copyright (C) 2009 Free Software Foundation, Inc.
...
Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
(gdb) where
#0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
#1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
#2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
#3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
#4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
#5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
#6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
(gdb) quit

Line 627 in orte/mca/oob/tcp/oob_tcp.c appears to be unnecessary, and it seems to be what causes the segfault: at that point peer is still NULL, yet the line takes the address of peer->name.

624     /* lookup the corresponding process */
625     peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
626     if (NULL == peer) {
627         ui64 = (uint64_t*)(&peer->name);
628         opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
629                             "%s mca_oob_tcp_recv_connect: connection from new peer",
630                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
632         peer->mod = mod;
633         peer->name = hdr->origin;
634         peer->state = MCA_OOB_TCP_ACCEPTING;
635         ui64 = (uint64_t*)(&peer->name);
636         if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
637             OBJ_RELEASE(peer);
638             return;
639         }

Please fix this mistake in the next release.

Regards,
Tetsuya Mishima
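
P.S. To illustrate the pattern I have in mind, here is a minimal standalone sketch of "look up the peer, and create it only if the lookup fails" in which nothing is accessed through the pointer while it is still NULL. This is not the actual ORTE code: name_t, peer_t, peer_lookup, peer_get_or_create and the fixed-size peers array are simplified stand-ins I made up for illustration, and the real code uses OBJ_NEW and an opal hash table instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real ORTE types (assumptions, not the
 * actual definitions). */
typedef struct { uint32_t jobid; uint32_t vpid; } name_t;
typedef struct { name_t name; int state; } peer_t;

enum { PEER_ACCEPTING = 1 };
#define MAX_PEERS 16

/* Toy replacement for the per-module peers hash table. */
static peer_t *peers[MAX_PEERS];
static size_t npeers = 0;

/* Return the known peer with this name, or NULL if there is none. */
static peer_t *peer_lookup(const name_t *n)
{
    for (size_t i = 0; i < npeers; i++) {
        if (peers[i]->name.jobid == n->jobid &&
            peers[i]->name.vpid == n->vpid) {
            return peers[i];
        }
    }
    return NULL;
}

/* Return the peer for origin, creating and registering it on first contact.
 * Returns NULL if the table is full or allocation fails. */
static peer_t *peer_get_or_create(const name_t *origin)
{
    assert(origin != NULL);

    peer_t *peer = peer_lookup(origin);
    if (NULL == peer) {
        /* peer is NULL in this branch: allocate before touching any
         * member such as peer->name. */
        if (npeers == MAX_PEERS) {
            return NULL;
        }
        peer = calloc(1, sizeof(*peer));
        if (NULL == peer) {
            return NULL;
        }
        peer->name = *origin;
        peer->state = PEER_ACCEPTING;
        peers[npeers++] = peer;
    }
    return peer;
}
```

Calling peer_get_or_create twice with the same name returns the same pointer; the first call allocates the peer and the second one finds it through peer_lookup.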