Artur, do you check all the error codes returned by MPI_Comm_spawn_multiple? (so you can confirm the requested number of tasks was spawned)
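For instance, something along these lines (an untested sketch; count, commands, argvs, maxprocs and infos stand for whatever you already pass, the point is not to pass MPI_ERRCODES_IGNORE; the return code check also assumes you set MPI_ERRORS_RETURN on the communicator, since the default MPI_ERRORS_ARE_FATAL handler would abort before you can check anything):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* spawn, then verify that every requested task actually started */
MPI_Comm spawn_and_check(int count, char *commands[], char **argvs[],
                         int maxprocs[], MPI_Info infos[])
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += maxprocs[i];           /* total number of tasks requested */

    int *errcodes = malloc(total * sizeof(int));
    MPI_Comm intercomm;
    int rc = MPI_Comm_spawn_multiple(count, commands, argvs, maxprocs, infos,
                                     0 /* root */, MPI_COMM_WORLD,
                                     &intercomm, errcodes);
    if (rc != MPI_SUCCESS)
        fprintf(stderr, "MPI_Comm_spawn_multiple returned %d\n", rc);

    /* one error code per requested task */
    for (int i = 0; i < total; i++)
        if (errcodes[i] != MPI_SUCCESS)
            fprintf(stderr, "spawned task %d failed (error code %d)\n",
                    i, errcodes[i]);

    free(errcodes);
    return intercomm;
}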
Since the error occurs only on the first MPI_Send, you might want to retrieve the rank and size and print them right before MPI_Send, just to make sure the communicator is valid (e.g. no memory corruption occurred earlier); see the sketches below the quoted message.

Out of curiosity, did you try your application with another MPI library (such as MPICH or one of its derivatives)?

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski <artur.malinow...@pg.gda.pl> wrote:
> Hi,
>
> I have a problem with my application that is based on dynamic process
> management. The scenario related to process creation is as follows:
> 1. All processes call MPI_Comm_spawn_multiple to spawn one additional
> process per node.
> 2. Parent processes call MPI_Intercomm_merge.
> 3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent and
> MPI_Intercomm_merge.
> 4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.
> Before and after the above steps, the processes call plenty of other MPI
> routines (so it is hard to extract a minimal example that suffers from the
> problem).
>
> An interesting observation: the MPI_Comm handles obtained from
> MPI_Intercomm_merge by the parent processes that fail with SIGSEGV look
> slightly different. Depending on the type used to print them (I'm not sure
> about the type of MPI_Comm), they are either negative (if printed as int)
> or bigger than the others (if printed as unsigned long long). For
> instance, with the code:
> printf("%d %d %llu\n", rank, intracomm, intracomm);
> the output is:
> 4 -970650128 140564719013360
> 8 14458544 14458544
> 12 15121888 15121888
> 9 38104000 38104000
> 1 14921600 14921600
> 11 31413968 31413968
> 5 27737968 27737968
> 7 -934013376 140023589770816
> 13 24512096 24512096
> 0 31348624 31348624
> 3 -1091084352 139817274269632
> 2 27982528 27982528
> 10 8745056 8745056
> 14 9449856 9449856
> 6 10023360 10023360
> and processes 4, 7 and 3 fail. There is no connection between the failed
> processes and any particular node; it usually affects about 20% of the
> processes and occurs both over tcp and ib. Any idea how to find the source
> of the problem? More info is included at the bottom of this message.
>
> Thanks for your help.
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> ----------------------------
>
> openmpi version:
>
> problem occurs both in 1.10.1 and 1.10.2, older versions untested
>
> ----------------------------
>
> config.log
>
> included in the config.log.tar.bz2 attachment
>
> ----------------------------
>
> ompi_info
>
> included in the ompi_info.tar.bz2 attachment
>
> ----------------------------
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi /path/to/app
>
> ----------------------------
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from the official
>   Mellanox page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: built on my own, 3.18.0 with some patches
>
> ----------------------------
>
> ibv_devinfo
>
> hca_id: mlx4_0
>     transport:       InfiniBand (0)
>     fw_ver:          2.35.5100
>     node_guid:       0002:c903:009f:5b00
>     sys_image_guid:  0002:c903:009f:5b03
>     vendor_id:       0x02c9
>     vendor_part_id:  4099
>     hw_ver:          0x1
>     board_id:        MT_1090110028
>     phys_port_cnt:   2
>     port: 1
>         state:       PORT_ACTIVE (4)
>         max_mtu:     4096 (5)
>         active_mtu:  4096 (5)
>         sm_lid:      4
>         port_lid:    1
>         port_lmc:    0x00
>         link_layer:  InfiniBand
>
>     port: 2
>         state:       PORT_DOWN (1)
>         max_mtu:     4096 (5)
>         active_mtu:  4096 (5)
>         sm_lid:      0
>         port_lid:    0
>         port_lmc:    0x00
>         link_layer:  InfiniBand
>
> ----------------------------
>
> ifconfig
>
> eth0    Link encap:Ethernet  HWaddr XXXXXXXXXX
>         inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
>         inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>         RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:1000
>         RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
>         Memory:d0960000-d097ffff
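Here are the sketches I mentioned above. First, the sanity check I would add right before the failing MPI_Send. This is untested; 'merged' stands for the intracommunicator returned by MPI_Intercomm_merge, and buf/count/dest/tag for whatever you already send. Note that MPI_Comm is an opaque handle (a pointer in Open MPI, a plain int in MPICH and its derivatives), so printing it with %d or %llu is not meaningful; casting to void * and printing it with %p is safer:

#include <stdio.h>
#include <mpi.h>

/* 'merged' is the intracommunicator returned by MPI_Intercomm_merge */
void sanity_check(MPI_Comm merged)
{
    int rank, size;
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    /* the handle is opaque; in Open MPI it is a pointer, hence %p */
    printf("rank %d / size %d, comm handle %p\n", rank, size, (void *)merged);
    fflush(stdout);
}

and right before the send:

    sanity_check(merged);
    MPI_Send(buf, count, MPI_INT, dest, tag, merged);

If MPI_Comm_rank/MPI_Comm_size already crash or print garbage, the communicator was corrupted before the MPI_Send was ever reached.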
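Second, just to double check I am reading steps 1 to 3 of your scenario correctly, the merge should look roughly like this on the two sides (again a sketch; the parents pass high = 0 so they keep the low ranks, and the children pass high = 1):

    /* parent side, after MPI_Comm_spawn_multiple succeeded */
    MPI_Comm merged;
    MPI_Intercomm_merge(intercomm, /* high = */ 0, &merged);

    /* child side, after your MPI_Init_pmem */
    MPI_Comm parent, merged;
    MPI_Comm_get_parent(&parent);   /* inter-communicator back to the parents */
    MPI_Intercomm_merge(parent, /* high = */ 1, &merged);

Within each group, every process should pass the same high value, otherwise the rank ordering of the merged communicator is not well defined.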