Hi,

I have a problem with my application, which is based on dynamic process management. The process-creation scenario is as follows:
  1. All processes call MPI_Comm_spawn_multiple to spawn one additional process per node.
  2. Parent processes call MPI_Intercomm_merge.
  3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent, MPI_Intercomm_merge.
  4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.
Before and after the above steps, the processes call plenty of other MPI routines (so it is hard to extract a minimal example that suffers from the problem).

Interesting observation: the MPI_Comm values obtained from MPI_Intercomm_merge in the parent processes that fail with SIGSEGV differ slightly from the others. Depending on the type used to print them (I'm not sure about the type of MPI_Comm), they are either negative (if printed as int) or bigger than the others (if printed as unsigned long long). For instance, with the code:
  printf("%d %d %llu\n", rank, intracomm, intracomm);
and output:
  4 -970650128 140564719013360
  8 14458544 14458544
  12 15121888 15121888
  9 38104000 38104000
  1 14921600 14921600
  11 31413968 31413968
  5 27737968 27737968
  7 -934013376 140023589770816
  13 24512096 24512096
  0 31348624 31348624
  3 -1091084352 139817274269632
  2 27982528 27982528
  10 8745056 8745056
  14 9449856 9449856
  6 10023360 10023360
Processes 4, 7 and 3 fail. There is no connection between the failed processes and any particular node; the problem usually affects about 20% of processes and occurs with both tcp and ib. Any idea how to find the source of the problem? More info is included at the bottom of this message.

Thanks for your help.

Regards,
Artur Malinowski
PhD student at Gdansk University of Technology

----------------------------

openmpi version:

The problem occurs in both 1.10.1 and 1.10.2; older versions are untested.

----------------------------

config.log

included in config.log.tar.bz2 attachment

----------------------------

ompi_info

included in ompi_info.tar.bz2 attachment

----------------------------

execution command

/path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi /path/to/app

----------------------------

system info

- OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from the official Mellanox page
- Linux: CentOS release 6.5 (Final) under Rocks cluster
- kernel: built on my own, 3.18.0 with some patches

----------------------------

ibv_devinfo

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.35.5100
        node_guid:                      0002:c903:009f:5b00
        sys_image_guid:                 0002:c903:009f:5b03
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        board_id:                       MT_1090110028
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 4
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand

----------------------------

ifconfig

eth0      Link encap:Ethernet  HWaddr XXXXXXXXXX
          inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
          TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
          Memory:d0960000-d097ffff

Attachment: config.log.tar.bz2
Description: application/bzip

Attachment: ompi_info.tar.bz2
Description: application/bzip
