Hi,

I have a problem with my application that is based on dynamic process management. The scenario related to process creation is as follows:

1. All processes call MPI_Comm_spawn_multiple to spawn a single additional process per node.
2. Parent processes call MPI_Intercomm_merge.
3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent, MPI_Intercomm_merge.
4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.

Before and after the above steps, the processes call plenty of other MPI routines, so it is hard to extract a minimal example that suffers from the problem; a simplified sketch of the sequence is included below.
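A stripped-down sketch of steps 1-3 (the child path, the NNODES value, and the merge ordering are placeholders; the real application passes different arguments and interleaves many other MPI calls):

/* parent side (sketch) */
#include <mpi.h>

#define NNODES 4   /* placeholder: one child per node, placed via --map-by node */

int main(int argc, char **argv)
{
    MPI_Comm intercomm, intracomm;
    char *cmds[1]     = { "/path/to/child" };   /* placeholder path */
    char **argvs[1]   = { MPI_ARGV_NULL };
    int maxprocs[1]   = { NNODES };
    MPI_Info infos[1] = { MPI_INFO_NULL };
    int errcodes[NNODES];

    MPI_Init(&argc, &argv);

    /* step 1: collective spawn over all parents */
    MPI_Comm_spawn_multiple(1, cmds, argvs, maxprocs, infos,
                            0, MPI_COMM_WORLD, &intercomm, errcodes);

    /* step 2: merge parents and children into one intracommunicator */
    MPI_Intercomm_merge(intercomm, 0 /* parents ordered first */, &intracomm);

    /* ... the first MPI_Send on intracomm is where some ranks SIGSEGV ... */

    MPI_Finalize();
    return 0;
}

/* child side (sketch) */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intracomm;

    MPI_Init(&argc, &argv);   /* step 3: MPI_Init_pmem in the real application */
    MPI_Comm_get_parent(&parent);
    MPI_Intercomm_merge(parent, 1 /* children ordered after parents */, &intracomm);

    MPI_Finalize();
    return 0;
}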
Interesting observation: the MPI_Comm handles obtained with MPI_Intercomm_merge by the parent processes that fail with SIGSEGV are slightly different. Depending on the type used to print them (I'm not sure about the actual type of MPI_Comm), they are either negative (if printed as int) or much larger than the others (if printed as unsigned long long). For instance, with the code:
printf("%d %d %llu\n", rank, intracomm, intracomm);

the output is (rank, handle printed as int, handle printed as unsigned long long):

4 -970650128 140564719013360
8 14458544 14458544
12 15121888 15121888
9 38104000 38104000
1 14921600 14921600
11 31413968 31413968
5 27737968 27737968
7 -934013376 140023589770816
13 24512096 24512096
0 31348624 31348624
3 -1091084352 139817274269632
2 27982528 27982528
10 8745056 8745056
14 9449856 9449856
6 10023360 10023360

Processes 4, 7 and 3 fail. There is no connection between the failed processes and a particular node; it usually affects about 20% of the processes and occurs both for tcp and ib. Any idea how to find the source of the problem? More info is included at the bottom of this message.
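A side note on the printed values: in Open MPI, MPI_Comm is a pointer type, so casting the handle to int truncates it on a 64-bit system, which is exactly what produces the negative numbers (e.g. the low 32 bits of 140564719013360 read as a signed int are -970650128). To localize the corruption, a sanity check could be run on the merged communicator right before the first MPI_Send; a minimal sketch, assuming a bad handle would already fail on simple queries (check_comm is an illustrative helper, not part of my application):

#include <mpi.h>
#include <stdio.h>

/* Sanity-check a communicator right after MPI_Intercomm_merge. */
static void check_comm(MPI_Comm comm, int world_rank)
{
    int rank, size, err;

    /* ask MPI to return error codes instead of aborting */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    err = MPI_Comm_rank(comm, &rank);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "[%d] MPI_Comm_rank failed (%d)\n", world_rank, err);
        return;
    }
    err = MPI_Comm_size(comm, &size);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "[%d] MPI_Comm_size failed (%d)\n", world_rank, err);
        return;
    }

    /* print the handle portably via its Fortran integer form
       rather than casting the opaque C handle */
    printf("[%d] merged comm OK: rank %d of %d, f-handle %d\n",
           world_rank, rank, size, (int)MPI_Comm_c2f(comm));
}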
Thanks for your help.

Regards,
Artur Malinowski
PhD student at Gdansk University of Technology

----------------------------
openmpi version: problem occurs both in 1.10.1 and 1.10.2, older untested
----------------------------
config.log included in config.log.tar.bz2 attachment
----------------------------
ompi_info included in ompi_info.tar.bz2 attachment
----------------------------
execution command:
/path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi /path/to/app
----------------------------
system info
- OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from the official Mellanox page
- Linux: CentOS release 6.5 (Final) under Rocks cluster
- kernel: built on my own, 3.18.0 with some patches
----------------------------
ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.35.5100
        node_guid:              0002:c903:009f:5b00
        sys_image_guid:         0002:c903:009f:5b03
        vendor_id:              0x02c9
        vendor_part_id:         4099
        hw_ver:                 0x1
        board_id:               MT_1090110028
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         4
                        port_lid:       1
                        port_lmc:       0x00
                        link_layer:     InfiniBand
                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     InfiniBand
----------------------------
ifconfig
eth0      Link encap:Ethernet  HWaddr XXXXXXXXXX
          inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
          TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
          Memory:d0960000-d097ffff