Artur,

In Open MPI, MPI_Comm is an opaque pointer, so strictly speaking, a high value
might not be an issue in itself.
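
For example (just an illustration, not your code), you can print the handle
as the pointer it really is in Open MPI and avoid the int vs unsigned long
long confusion:

    #include <mpi.h>
    #include <stdio.h>

    /* In Open MPI, MPI_Comm is a typedef for struct ompi_communicator_t *,
     * so printing it as a pointer is the least confusing option (note this
     * cast would not compile with MPICH, where MPI_Comm is an int). */
    void print_comm(MPI_Comm comm, int rank)
    {
        printf("rank %d comm %p\n", rank, (void *) comm);
    }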

Can you have your failed processes generate a core file and post the stack
trace?
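
If collecting core files is not convenient on your cluster, an alternative
(just a sketch, glibc specific and not strictly async-signal-safe, so for
debugging only) is to install a SIGSEGV handler that prints a backtrace and
then re-raises the signal so a core is still produced:

    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    static void segv_handler(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        backtrace_symbols_fd(frames, n, STDERR_FILENO);
        signal(sig, SIG_DFL);   /* restore the default action ...       */
        raise(sig);             /* ... and re-raise to still get a core */
    }

    /* call once, e.g. right after MPI_Init */
    void install_segv_handler(void)
    {
        signal(SIGSEGV, segv_handler);
    }
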
Btw, do you call MPI_Send on the intracommunicator created by
MPI_Intercomm_merge?
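
For reference, here is a minimal sketch of the spawn/merge/send pattern I am
asking about (the binary name and message are placeholders, not your actual
code; the child binary is assumed to call MPI_Comm_get_parent and then
MPI_Intercomm_merge with high = 1 on its side):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        char    *cmds[1]     = { "child_binary" };   /* placeholder command */
        int      maxprocs[1] = { 1 };                /* one child in total  */
        MPI_Info infos[1]    = { MPI_INFO_NULL };
        MPI_Comm inter, intra;

        MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                                0, MPI_COMM_WORLD, &inter,
                                MPI_ERRCODES_IGNORE);

        /* parents pass high = 0, so they come first in the merged order */
        MPI_Intercomm_merge(inter, 0, &intra);

        int rank, payload = 42;
        MPI_Comm_rank(intra, &rank);

        /* this is the send in question, posted on the merged
         * intracommunicator (run with at least 2 parent ranks so that
         * ranks 0 and 1 are both parents) */
        if (rank == 0)
            MPI_Send(&payload, 1, MPI_INT, 1, 0, intra);
        else if (rank == 1)
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, intra, MPI_STATUS_IGNORE);

        MPI_Comm_free(&intra);
        MPI_Comm_free(&inter);
        MPI_Finalize();
        return 0;
    }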

What is the minimal config needed to reproduce the issue?
(number of nodes, number of tasks started with mpirun, number of tasks
spawned by MPI_Comm_spawn_multiple, how many different binaries are spawned?)

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski <artur.malinow...@pg.gda.pl>
wrote:

> Hi,
>
> I have a problem with my application, which is based on dynamic process
> management. The scenario related to process creation is as follows:
>   1. All processes call MPI_Comm_spawn_multiple to spawn one additional
> process per node.
>   2. Parent processes call MPI_Intercomm_merge.
>   3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent,
> MPI_Intercomm_merge.
>   4. Some of the parent processes fail at their first MPI_Send with
> SIGSEGV.
> Before and after the above steps, the processes call plenty of other MPI
> routines (so it is hard to extract a minimal example that suffers from the
> problem).
>
> Interesting observation: the MPI_Comm handles obtained with
> MPI_Intercomm_merge by the parent processes that fail with SIGSEGV are
> slightly different. Depending on the type used to print them (I'm not sure
> about the type of MPI_Comm), they are either negative (if printed as int)
> or bigger than the others (if printed as unsigned long long). For instance,
> with the code:
>   printf("%d %d %llu\n", rank, intracomm, intracomm);
> and output:
>   4 -970650128 140564719013360
>   8 14458544 14458544
>   12 15121888 15121888
>   9 38104000 38104000
>   1 14921600 14921600
>   11 31413968 31413968
>   5 27737968 27737968
>   7 -934013376 140023589770816
>   13 24512096 24512096
>   0 31348624 31348624
>   3 -1091084352 139817274269632
>   2 27982528 27982528
>   10 8745056 8745056
>   14 9449856 9449856
>   6 10023360 10023360
> processes 4, 7 and 3 fail. There is no connection between the failed
> processes and a particular node; it usually affects about 20% of the
> processes and occurs both over TCP and IB. Any idea how to find the source
> of the problem? More info is included at the bottom of this message.
>
> Thanks for your help.
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> ----------------------------
>
> openmpi version:
>
> the problem occurs in both 1.10.1 and 1.10.2; older versions untested
>
> ----------------------------
>
> config.log
>
> included in config.log.tar.bz2 attachment
>
> ----------------------------
>
> ompi_info
>
> included in ompi_info.tar.bz2 attachment
>
> ----------------------------
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi
> /path/to/app
>
> ----------------------------
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from the official
> Mellanox page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: built on my own, 3.18.0 with some patches
>
> ----------------------------
>
> ibv_devinfo
>
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.35.5100
>         node_guid:                      0002:c903:009f:5b00
>         sys_image_guid:                 0002:c903:009f:5b03
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x1
>         board_id:                       MT_1090110028
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 4
>                         port_lid:               1
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
> ----------------------------
>
> ifconfig
>
> eth0      Link encap:Ethernet  HWaddr XXXXXXXXXX
>           inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
>           Memory:d0960000-d097ffff
>
