Artur, in OpenMPI, MPI_Comm is an opaque pointer, so strictly speaking, a high value might not be an issue in itself.
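for what it is worth, since the handle is a pointer in Open MPI, printing it with %p is less confusing than reinterpreting it as an int or an unsigned long long. A minimal sketch (the helper name is just a placeholder):

#include <stdio.h>
#include <mpi.h>

/* in Open MPI, MPI_Comm is a pointer to an ompi_communicator_t, so print
 * the handle as a pointer instead of forcing it into an integer type;
 * other MPI implementations define MPI_Comm differently (MPICH uses an
 * int handle, for example) */
static void print_comm_handle(int rank, MPI_Comm comm)
{
    printf("rank %d: comm handle %p\n", rank, (void *)comm);
}

in your output, the negative %d values and the large %llu values are two views of the same 64 bit pointer: for rank 4, -970650128 is simply the low 32 bits of 140564719013360 interpreted as a signed int. So the "strange" values alone do not prove the handle is corrupted.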
can you have your failed processes generate a core and post the stack trace ?

btw, do you call MPI_Send on the intra communicator created by MPI_Intercomm_merge ?

what is the minimal config needed to reproduce the issue ? (number of nodes, number of tasks started with mpirun, number of tasks spawned by MPI_Comm_spawn_multiple, how many different binaries are spawned ?)

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski <artur.malinow...@pg.gda.pl> wrote:
> Hi,
>
> I have a problem with my application, which is based on dynamic process
> management. The scenario related to process creation is as follows:
> 1. All processes call MPI_Comm_spawn_multiple to spawn one additional
> process per node.
> 2. Parent processes call MPI_Intercomm_merge.
> 3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent, and
> MPI_Intercomm_merge.
> 4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.
> Before and after the above steps, the processes call plenty of other MPI
> routines (so it is hard to extract a minimal example that suffers from
> the problem).
>
> Interesting observation: the MPI_Comm handles obtained from
> MPI_Intercomm_merge by the parent processes that fail with SIGSEGV look
> slightly different. Depending on the type used to print them (I'm not
> sure about the underlying type of MPI_Comm), they are either negative
> (if printed as int) or bigger than the others (if printed as unsigned
> long long). For instance, with the code:
> printf("%d %d %llu\n", rank, intracomm, intracomm);
> the output is:
> 4 -970650128 140564719013360
> 8 14458544 14458544
> 12 15121888 15121888
> 9 38104000 38104000
> 1 14921600 14921600
> 11 31413968 31413968
> 5 27737968 27737968
> 7 -934013376 140023589770816
> 13 24512096 24512096
> 0 31348624 31348624
> 3 -1091084352 139817274269632
> 2 27982528 27982528
> 10 8745056 8745056
> 14 9449856 9449856
> 6 10023360 10023360
> and processes 4, 7, and 3 fail. There is no connection between the
> failed processes and any particular node; the problem usually affects
> about 20% of the processes and occurs both over tcp and ib. Any idea
> how to find the source of the problem? More info is included at the
> bottom of this message.
>
> Thanks for your help.
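fwiw, below is a minimal sketch of the pattern described in steps 1 to 4 above, just so we agree on what a reproducer would contain. The binary name, process counts, destination rank and tag are placeholders, plain MPI_Init stands in for whatever initialization the children really perform, and error checking is omitted:

/* parent.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char    *cmds[1]     = { "./child" };       /* placeholder binary */
    int      maxprocs[1] = { 1 };               /* placeholder count  */
    MPI_Info infos[1]    = { MPI_INFO_NULL };
    MPI_Comm intercomm, intracomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1. all parents collectively spawn the extra processes */
    MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &intercomm,
                            MPI_ERRCODES_IGNORE);

    /* 2. merge parents and children into one intra-communicator,
     *    parents ordered first (high = 0) */
    MPI_Intercomm_merge(intercomm, 0, &intracomm);
    MPI_Comm_size(intracomm, &size);

    /* 4. first send on the merged communicator; the last rank is a
     *    child because the children merge with high = 1 */
    MPI_Send(&rank, 1, MPI_INT, size - 1, 0, intracomm);

    MPI_Finalize();
    return 0;
}

/* child.c */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, nparents, i, buf;
    MPI_Comm parent, intracomm;

    MPI_Init(&argc, &argv);            /* stands in for MPI_Init_pmem */

    /* 3. children retrieve the inter-communicator to the parents and
     *    join the same merge, ordered after the parents (high = 1) */
    MPI_Comm_get_parent(&parent);
    MPI_Comm_remote_size(parent, &nparents);
    MPI_Intercomm_merge(parent, 1, &intracomm);

    MPI_Comm_rank(intracomm, &rank);
    MPI_Comm_size(intracomm, &size);

    /* the last merged rank collects one message from every parent */
    if (rank == size - 1)
        for (i = 0; i < nparents; i++)
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, intracomm,
                     MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

if something this small already SIGSEGVs when launched with your mpirun command line, that would be a perfect reproducer to attach.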
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> ----------------------------
>
> openmpi version:
>
> the problem occurs in both 1.10.1 and 1.10.2; older versions untested
>
> ----------------------------
>
> config.log
>
> included in the config.log.tar.bz2 attachment
>
> ----------------------------
>
> ompi_info
>
> included in the ompi_info.tar.bz2 attachment
>
> ----------------------------
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi /path/to/app
>
> ----------------------------
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from the Mellanox official page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: built on my own, 3.18.0 with some patches
>
> ----------------------------
>
> ibv_devinfo
>
> hca_id: mlx4_0
>     transport:        InfiniBand (0)
>     fw_ver:           2.35.5100
>     node_guid:        0002:c903:009f:5b00
>     sys_image_guid:   0002:c903:009f:5b03
>     vendor_id:        0x02c9
>     vendor_part_id:   4099
>     hw_ver:           0x1
>     board_id:         MT_1090110028
>     phys_port_cnt:    2
>         port: 1
>             state:        PORT_ACTIVE (4)
>             max_mtu:      4096 (5)
>             active_mtu:   4096 (5)
>             sm_lid:       4
>             port_lid:     1
>             port_lmc:     0x00
>             link_layer:   InfiniBand
>
>         port: 2
>             state:        PORT_DOWN (1)
>             max_mtu:      4096 (5)
>             active_mtu:   4096 (5)
>             sm_lid:       0
>             port_lid:     0
>             port_lmc:     0x00
>             link_layer:   InfiniBand
>
> ----------------------------
>
> ifconfig
>
> eth0      Link encap:Ethernet  HWaddr XXXXXXXXXX
>           inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
>           Memory:d0960000-d097ffff
>