Done - thanks!

On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Dear openmpi developers,
> 
> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built
> with PGI13.10, as shown below:
> 
> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket
> 0[core 5[hwt 0]]: [././././B/B][./././././.]
> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket
> 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket
> 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> [manage:23082] *** Process received signal ***
> [manage:23082] Signal: Segmentation fault (11)
> [manage:23082] Signal code: Address not mapped (1)
> [manage:23082] Failing at address: 0x34
> [manage:23082] *** End of error message ***
> Segmentation fault (core dumped)
> 
> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> Copyright (C) 2009 Free Software Foundation, Inc.
> ...
> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings
> mPre'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> (gdb) where
> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
> hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
> cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> #2  0x00002b5f848eb06a in event_process_active_single_queue
> (base=0x5f848eb27000007f, activeq=0x848eb27000007fff)
>    at ./event.c:1366
> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
> at ./event.c:1435
> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
> (base=0x4077a000007f, flags=32767) at ./event.c:1645
> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8)
> at ./orterun.c:1030
> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> (gdb) quit
> 
> 
> Line 627 in orte/mca/oob/tcp/oob_tcp.c appears to be unnecessary and causes
> the segfault: it uses peer inside the NULL == peer branch, before a new peer
> is allocated at line 631.
> 
>   624      /* lookup the corresponding process */
>   625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>   626      if (NULL == peer) {
>   627          ui64 = (uint64_t*)(&peer->name);
>   628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>   629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
>   630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>   631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
>   632          peer->mod = mod;
>   633          peer->name = hdr->origin;
>   634          peer->state = MCA_OOB_TCP_ACCEPTING;
>   635          ui64 = (uint64_t*)(&peer->name);
>   636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>   637              OBJ_RELEASE(peer);
>   638              return;
>   639          }
> 
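
The ordering problem in the listing above (touching peer inside the
NULL == peer branch before it has been allocated) can be illustrated with a
small standalone sketch. This is not the Open MPI code: demo_name_t,
demo_peer_t, demo_name_key and demo_get_or_create_peer are hypothetical
stand-ins invented for illustration, and the real ORTE types and hash-table
calls are deliberately left out. The sketch assumes the name fits in exactly
64 bits, and it builds the key with memcpy rather than the pointer cast used
in the listing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the ORTE name and peer types. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_name_t;

typedef struct {
    demo_name_t name;
    int state;
} demo_peer_t;

enum { DEMO_ACCEPTING = 1 };

/* Assumption of this sketch: the name is exactly 64 bits wide. */
_Static_assert(sizeof(demo_name_t) == sizeof(uint64_t),
               "demo_name_t must be 64 bits to serve as a hash key");

/* Derive a 64-bit key from a name by copying its bytes. */
static uint64_t demo_name_key(const demo_name_t *name)
{
    uint64_t key;
    memcpy(&key, name, sizeof key);
    return key;
}

/*
 * 'found' plays the role of the lookup result and may be NULL.
 * When it is NULL, a new peer is allocated and filled in first;
 * the key is computed only after 'peer' is known to be valid.
 * Returns NULL if allocation fails.
 */
static demo_peer_t *demo_get_or_create_peer(demo_peer_t *found,
                                            const demo_name_t *origin,
                                            uint64_t *key_out)
{
    demo_peer_t *peer = found;

    if (NULL == peer) {
        /* Nothing may use 'peer' here until the allocation succeeds. */
        peer = calloc(1, sizeof *peer);
        if (NULL == peer) {
            return NULL;
        }
        peer->name = *origin;
        peer->state = DEMO_ACCEPTING;
    }

    /* Safe: 'peer' is non-NULL on every path that reaches this line. */
    *key_out = demo_name_key(&peer->name);
    return peer;
}
```

A caller would pass the lookup result as 'found' and then store the returned
peer under the key. In terms of the listing, this corresponds to dropping
line 627 and keeping the assignment at line 635.
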
> 
> Please fix this mistake in the next release.
> 
> Regards,
> Tetsuya Mishima
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
