Artur,

Your email does not contain enough information to pinpoint the problem. However, there are several hints that tend to indicate a problem in your application.
1. In the collective communication that succeeds, the MPI_Intercomm_merge, the processes perform [at least] one MPI_Allreduce followed by one MPI_Allgatherv, two collective communications that force the establishment of most of the connections between processes. As all the communications involved in this step succeed, I see no reason for a subsequent MPI_Send to fail if all the call parameters are correct.

2. The fact that the communication fails for both TCP and IB suggests that either the buffer your datatype + count points to is not correctly allocated, or that the combination of count and datatype identifies the wrong memory pattern. In both cases, the faulty process will segfault during the pack operation. Can you check the stack on the processes where the fault occurs? (One way to capture that stack, and a skeleton for a reduced reproducer, are sketched after the quoted message below.)

George.

On Fri, Feb 19, 2016 at 6:23 PM, Artur Malinowski <artur.malinow...@pg.gda.pl> wrote:
> Hi,
>
> I have a problem with my application that is based on dynamic process
> management. The scenario related to process creation is as follows:
> 1. All processes call MPI_Comm_spawn_multiple to spawn one additional
> process per node.
> 2. Parent processes call MPI_Intercomm_merge.
> 3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent and
> MPI_Intercomm_merge.
> 4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.
> Before and after the above steps, the processes call plenty of other MPI
> routines (so it is hard to extract a minimal example that suffers from
> the problem).
>
> An interesting observation: the MPI_Comm handles obtained with
> MPI_Intercomm_merge by the parent processes that fail with SIGSEGV are
> slightly different. Depending on the type used to print them (I'm not
> sure about the type of MPI_Comm), they are either negative (if printed
> as int) or bigger than the others (if printed as unsigned long long).
> For instance, with the code:
> printf("%d %d %llu\n", rank, intracomm, intracomm);
> the output is:
> 4 -970650128 140564719013360
> 8 14458544 14458544
> 12 15121888 15121888
> 9 38104000 38104000
> 1 14921600 14921600
> 11 31413968 31413968
> 5 27737968 27737968
> 7 -934013376 140023589770816
> 13 24512096 24512096
> 0 31348624 31348624
> 3 -1091084352 139817274269632
> 2 27982528 27982528
> 10 8745056 8745056
> 14 9449856 9449856
> 6 10023360 10023360
> and processes 4, 7 and 3 fail. There is no connection between the failed
> processes and a particular node; it usually affects about 20% of the
> processes and occurs both for TCP and IB. Any idea how to find the source
> of the problem? More info is included at the bottom of this message.
>
> Thanks for your help.
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> ----------------------------
>
> openmpi version:
>
> problem occurs both in 1.10.1 and 1.10.2, older untested
>
> ----------------------------
>
> config.log
>
> included in config.log.tar.bz2 attachment
>
> ----------------------------
>
> ompi_info
>
> included in ompi_info.tar.bz2 attachment
>
> ----------------------------
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi /path/to/app
>
> ----------------------------
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from the Mellanox official page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: built on my own, 3.18.0 with some patches
>
> ----------------------------
>
> ibv_devinfo
>
> hca_id: mlx4_0
>         transport:       InfiniBand (0)
>         fw_ver:          2.35.5100
>         node_guid:       0002:c903:009f:5b00
>         sys_image_guid:  0002:c903:009f:5b03
>         vendor_id:       0x02c9
>         vendor_part_id:  4099
>         hw_ver:          0x1
>         board_id:        MT_1090110028
>         phys_port_cnt:   2
>                 port: 1
>                         state:       PORT_ACTIVE (4)
>                         max_mtu:     4096 (5)
>                         active_mtu:  4096 (5)
>                         sm_lid:      4
>                         port_lid:    1
>                         port_lmc:    0x00
>                         link_layer:  InfiniBand
>
>                 port: 2
>                         state:       PORT_DOWN (1)
>                         max_mtu:     4096 (5)
>                         active_mtu:  4096 (5)
>                         sm_lid:      0
>                         port_lid:    0
>                         port_lmc:    0x00
>                         link_layer:  InfiniBand
>
> ----------------------------
>
> ifconfig
>
> eth0      Link encap:Ethernet  HWaddr XXXXXXXXXX
>           inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
>           Memory:d0960000-d097ffff
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28555.php
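For reference, a minimal sketch of one way to capture the stack asked about in point 2, if attaching a debugger to every rank is inconvenient: install a SIGSEGV handler that dumps a backtrace via glibc's backtrace facility. This is only an illustrative sketch (the helper name install_crash_handler is made up here); attaching gdb to the faulty rank, or enabling core dumps and inspecting the core file, works just as well.

    /* Crash-time backtrace helper (sketch). Call install_crash_handler()
       once near the top of main(), e.g. right after MPI_Init. */
    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    static void segv_handler(int sig)
    {
        void *frames[64];
        int nframes = backtrace(frames, 64);
        /* backtrace_symbols_fd writes straight to the file descriptor and
           avoids malloc, so it is reasonable to call from a signal handler */
        backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
        _exit(128 + sig);
    }

    void install_crash_handler(void)
    {
        signal(SIGSEGV, segv_handler);
    }

If the guess in point 2 is right, the top frames should point into Open MPI's datatype pack/convertor code rather than into the application itself.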
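And since the original message mentions that extracting a minimal example is hard, the skeleton below isolates just the spawn/merge/send pattern from steps 1 through 4 and might serve as a starting point for a reproducer. It is an untested sketch with simplifying assumptions: plain MPI_Init and MPI_Comm_spawn stand in for MPI_Init_pmem and MPI_Comm_spawn_multiple, a fixed number of children (2) replaces one child per node, and a single int is sent on the merged communicator.

    /* Skeleton reproducer for the spawn -> merge -> send pattern (sketch).
       The same binary acts as parent and child, distinguished by
       MPI_Comm_get_parent. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, inter, merged;
        int rank, size, buf = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* parent side: spawn the children collectively, then merge the
               resulting inter-communicator; high = 0 puts parents first */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
            MPI_Intercomm_merge(inter, 0, &merged);
        } else {
            /* child side: merge with high = 1 so children get the high ranks */
            MPI_Intercomm_merge(parent, 1, &merged);
        }

        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);

        /* first point-to-point traffic on the merged communicator:
           rank 0 (a parent) sends to the last rank (a child) */
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, size - 1, 0, merged);
        } else if (rank == size - 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, merged, MPI_STATUS_IGNORE);
            printf("rank %d received %d\n", rank, buf);
        }

        MPI_Comm_free(&merged);
        MPI_Finalize();
        return 0;
    }

If this skeleton runs cleanly with the same mpirun command line on the same nodes, the problem is more likely in the buffer/count/datatype handed to the real MPI_Send, as suggested in point 2 above.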