Re: [OMPI users] mpirun hanging after MPI_Abort
Best options for debugging something like this are:

    -mca odls_base_verbose 5 -mca errmgr_base_verbose 5

It'll generate a fair amount of output, so try to do it with a small job if you can. You'll need a build configured with --enable-debug to get the output.

> On Feb 18, 2016, at 8:29 PM, Ben Menadue wrote:
>
> Hi,
>
> I'm investigating an issue with mpirun *sometimes* hanging after programs
> call MPI_Abort... all of the MPI processes have terminated, however the
> mpirun is still there. This happens with 1.8.8 and 1.10.2. There look to be
> two threads, one in this path:
>
> #0 0x7fa09c3143b3 in select () from /lib64/libc.so.6
> #1 0x7fa09b001e2c in listen_thread (obj=0x7fa09b2109e8) at
>    ../../../../../../../../orte/mca/oob/tcp/oob_tcp_listener.c:685
> #2 0x7fa09c5ceaa1 in start_thread () from /lib64/libpthread.so.0
> #3 0x7fa09c31b93d in clone () from /lib64/libc.so.6
>
> and the other in this:
>
> #0 0x7fa09c312113 in poll () from /lib64/libc.so.6
> #1 0x7fa09d318e7d in poll_dispatch (base=0x1568a80, tv=0x0) at
>    ../../../../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
> #2 0x7fa09d30d96c in opal_libevent2021_event_base_loop (base=0x1568a80,
>    flags=1) at
>    ../../../../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
> #3 0x004056fc in orterun (argc=2, argv=0x7ffe70248078) at
>    ../../../../../../../orte/tools/orterun/orterun.c:1142
> #4 0x00403614 in main (argc=2, argv=0x7ffe70248078) at
>    ../../../../../../../orte/tools/orterun/main.c:13
>
> But since this is in mpirun itself, I'm not sure how to delve deeper - is
> there an MCA *_base_verbose parameter (or equivalent) that works on the
> mpirun?
>
> Cheers,
> Ben
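For anyone who wants to reproduce this with a deliberately small job, here is a minimal sketch (not Ben's application) in which rank 0 calls MPI_Abort while the other ranks sit in a barrier waiting to be killed:

    /* abort_test.c - minimal reproducer sketch (not the original application):
     * rank 0 aborts, every other rank blocks in a barrier until mpirun kills it. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            fprintf(stderr, "rank 0 calling MPI_Abort\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* the surviving ranks wait here to be terminated */
        MPI_Finalize();
        return 0;
    }

Run it with something like: mpirun -np 4 -mca odls_base_verbose 5 -mca errmgr_base_verbose 5 ./abort_test, against a build configured with --enable-debug as noted above, to see what the ODLS and errmgr frameworks report when the abort comes in.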
[OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris
Hi,

yesterday I tried to build openmpi-dev-3498-gdc4d3ed on my machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was successful on my Linux machine, but I got the following errors on both Solaris platforms.

Sun C 5.13:
===========

  CC       base/ess_base_std_tool.lo
"../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 116: syntax error before or at: &
"../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 116: warning: syntax requires ";" after last struct/union member
"../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 121: cannot recover from previous errors
cc: acomp failed for ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c
make[2]: *** [base/ess_base_std_tool.lo] Error 1
make[2]: Leaving directory `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte/mca/ess'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte'
make: *** [all-recursive] Error 1
tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc 50

GCC-5.2.0:
==========

  CC       base/ess_base_std_tool.lo
In file included from /usr/include/stdio.h:66:0,
                 from ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c:29:
../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h:116:22: error: expected identifier or '(' before '&' token
   orte_iof_sink_t *stdin;
                    ^
make[2]: *** [base/ess_base_std_tool.lo] Error 1
make[2]: Leaving directory `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte/mca/ess'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte'
make: *** [all-recursive] Error 1
tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc 50

I would be grateful if somebody can fix the problem. Thank you very much for any help in advance.

Kind regards

Siegmar
Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris
I'm afraid I have no idea what Solaris is complaining about here.

> On Feb 19, 2016, at 6:52 AM, Siegmar Gross wrote:
>
> yesterday I tried to build openmpi-dev-3498-gdc4d3ed on my
> machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux
> 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was successful on
> my Linux machine, but I got the following errors on both Solaris
> platforms.
>
> [...]
>
> ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h:116:22: error:
> expected identifier or '(' before '&' token
>    orte_iof_sink_t *stdin;
>                     ^
>
> [...]
Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris
A field of orte_iof_proc_t is named "stdin". Could stdin be #defined under the hood on Solaris? If so, renaming this field should do the trick. I will double-check that on Monday.

Cheers,

Gilles

On Saturday, February 20, 2016, Ralph Castain wrote:

> I'm afraid I have no idea what Solaris is complaining about here.
>
> [...]
Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris
Gilles Gouaillardet writes:

> a field from orte_iof_proc_t is named "stdin"
> could stdin be #defined under the hood in Solaris ?

It's defined as "(&__iob[0])" on Solaris 10; it's just #defined differently by glibc. See stdio.h(7posix).
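For readers hitting the same thing, a stripped-down illustration of the clash (the struct name is made up; this is not the ORTE header): Solaris's <stdio.h> expands stdin to (&__iob[0]), so a struct member named stdin turns into invalid syntax, whereas glibc's definition effectively expands back to the plain identifier and compiles cleanly.

    /* stdin_clash.c - illustration only; "my_proc_t" is a made-up name,
     * not the real ORTE structure. Compiles with glibc, fails on Solaris. */
    #include <stdio.h>

    typedef struct {
        FILE *stdin;   /* on Solaris this preprocesses to: FILE *(&__iob[0]);  -> syntax error */
        /* renaming the member, e.g. "FILE *proc_stdin;", avoids the macro entirely */
    } my_proc_t;

    int main(void)
    {
        my_proc_t p;
        p.stdin = stdin;   /* ordinary use of the real stdin; fine with glibc */
        return p.stdin == NULL;
    }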
Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris
Just pushed a change that renamed the field - hopefully fixed now.

Thanks!

> On Feb 19, 2016, at 9:54 AM, Dave Love wrote:
>
> It's defined as "(&__iob[0])" on Solaris 10; it's just #defined
> differently by glibc. See stdio.h(7posix).
[OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes
Hi,

I have a problem with my application that is based on dynamic process management. The scenario related to process creation is as follows:

1. All processes call MPI_Comm_spawn_multiple to spawn one additional process per node.
2. Parent processes call MPI_Intercomm_merge.
3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent, and MPI_Intercomm_merge.
4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.

Before and after the above steps, the processes call plenty of other MPI routines, so it is hard to extract a minimal example that suffers from the problem.

Interesting observation: the MPI_Comm handles obtained with MPI_Intercomm_merge by the parent processes that fail with SIGSEGV look slightly different. Depending on the type used to print them (I'm not sure about the type of MPI_Comm), they are either negative (if printed as int) or bigger than the others (if printed as unsigned long long). For instance, with this code:

    printf("%d %d %llu %\n", rank, intracomm, intracomm);

the output is:

    4 -970650128 140564719013360
    8 14458544 14458544
    12 15121888 15121888
    9 38104000 38104000
    1 14921600 14921600
    11 31413968 31413968
    5 27737968 27737968
    7 -934013376 140023589770816
    13 24512096 24512096
    0 31348624 31348624
    3 -1091084352 139817274269632
    2 27982528 27982528
    10 8745056 8745056
    14 9449856 9449856
    6 10023360 10023360

and processes 4, 7 and 3 fail. There is no connection between failed processes and a particular node; it usually affects about 20% of the processes and occurs both over tcp and ib. Any idea how to find the source of the problem? More info is included at the bottom of this message.

Thanks for your help.

Regards,
Artur Malinowski
PhD student at Gdansk University of Technology


openmpi version:
problem occurs both in 1.10.1 and 1.10.2, older untested

config.log:
included in config.log.tar.bz2 attachment

ompi_info:
included in ompi_info.tar.bz2 attachment

execution command:
/path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi /path/to/app

system info:
- OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from mellanox official page
- Linux: CentOS release 6.5 (Final) under Rocks cluster
- kernel: built on my own, 3.18.0 with some patches

ibv_devinfo:
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.35.5100
        node_guid:              0002:c903:009f:5b00
        sys_image_guid:         0002:c903:009f:5b03
        vendor_id:              0x02c9
        vendor_part_id:         4099
        hw_ver:                 0x1
        board_id:               MT_1090110028
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         4
                        port_lid:       1
                        port_lmc:       0x00
                        link_layer:     InfiniBand

                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     InfiniBand

ifconfig:
eth0    Link encap:Ethernet  HWaddr XX
        inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
        inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
        TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:1000
        RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
        Memory:d096-d097

(attachments: config.log.tar.bz2, ompi_info.tar.bz2)
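For context, here is a minimal sketch of the spawn/merge pattern from steps 1-3 above. It is not the poster's application: it uses plain MPI_Init instead of MPI_Init_pmem, a made-up spawn count, and a single self-spawning binary, but it exercises the same MPI_Comm_spawn_multiple / MPI_Intercomm_merge / MPI_Send sequence and can help check whether the basic pattern works on a given installation.

    /* spawn_merge.c - sketch only; names, counts, and the send pattern are
     * illustrative, not taken from the application discussed in this thread. */
    #include <mpi.h>
    #include <stdio.h>

    #define NSPAWN 2   /* assumed: one extra process on each of two nodes */

    int main(int argc, char **argv)
    {
        MPI_Comm parent, inter, intra;
        int errcodes[NSPAWN];

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* parent side: spawn NSPAWN copies of this binary, one per entry */
            char *cmds[NSPAWN]     = { argv[0], argv[0] };
            int maxprocs[NSPAWN]   = { 1, 1 };
            MPI_Info infos[NSPAWN] = { MPI_INFO_NULL, MPI_INFO_NULL };

            MPI_Comm_spawn_multiple(NSPAWN, cmds, MPI_ARGVS_NULL, maxprocs,
                                    infos, 0, MPI_COMM_WORLD, &inter, errcodes);
            MPI_Intercomm_merge(inter, 0 /* parents ranked first */, &intra);
        } else {
            /* child side: merge with the parents via the parent intercommunicator */
            MPI_Intercomm_merge(parent, 1 /* children ranked last */, &intra);
        }

        int rank, size;
        MPI_Comm_rank(intra, &rank);
        MPI_Comm_size(intra, &size);

        /* first point-to-point message on the merged communicator */
        int token = rank;
        if (rank == 0)
            MPI_Send(&token, 1, MPI_INT, size - 1, 0, intra);
        else if (rank == size - 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, intra, MPI_STATUS_IGNORE);

        MPI_Comm_free(&intra);
        MPI_Finalize();
        return 0;
    }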
Re: [OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes
Artur,

In Open MPI, MPI_Comm is an opaque pointer, so strictly speaking a high value is not necessarily an issue.

Can you have your failed processes generate a core and post the stack trace?

By the way, do you MPI_Send on the intracommunicator created by MPI_Intercomm_merge?

What is the minimal configuration needed to reproduce the issue? (Number of nodes, number of tasks started with mpirun, number of tasks spawned by MPI_Comm_spawn_multiple, how many different binaries are spawned?)

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski wrote:

> I have a problem with my application that is based on dynamic process
> management. [...]
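One way (a sketch, not from the thread itself) to make sure the ranks actually produce core files on SIGSEGV is to raise the core-file soft limit to the hard limit at the top of main(), before MPI_Init; a `ulimit -c unlimited` typed in the interactive shell may not reach remotely launched ranks, depending on how they are started, so doing it inside the program is more reliable.

    /* enable_cores.c - sketch: raise RLIMIT_CORE so a SIGSEGV leaves a core
     * file that can then be opened with a debugger to get the stack trace. */
    #include <sys/resource.h>
    #include <stdio.h>

    static void enable_core_dumps(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_CORE, &rl) == 0) {
            rl.rlim_cur = rl.rlim_max;           /* soft limit -> hard limit */
            if (setrlimit(RLIMIT_CORE, &rl) != 0)
                perror("setrlimit(RLIMIT_CORE)");
        }
    }

    int main(void)
    {
        enable_core_dumps();
        /* ... MPI_Init() and the rest of the application would follow ... */
        return 0;
    }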
Re: [OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes
Artur,

Do you check all the error codes returned by MPI_Comm_spawn_multiple? (So you can confirm the requested number of tasks was spawned.)

Since the error occurs only on the first MPI_Send, you might want to retrieve the rank and size and print them right before MPI_Send, just to make sure the communicator is valid (e.g. no memory corruption occurred before).

Out of curiosity, did you try your application with another MPI library (such as MPICH or its derivatives)?

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski wrote:

> I have a problem with my application that is based on dynamic process
> management. [...]
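The two checks suggested above could look roughly like the helper below (a sketch with made-up names, meant to be dropped into the application right after MPI_Comm_spawn_multiple and MPI_Intercomm_merge and called once before the first MPI_Send):

    /* check_spawn.c - sketch of the suggested sanity checks */
    #include <mpi.h>
    #include <stdio.h>

    void check_spawn_and_comm(const int *errcodes, int nspawn, MPI_Comm merged)
    {
        /* 1. every entry of the errcodes array should be MPI_SUCCESS */
        for (int i = 0; i < nspawn; i++)
            if (errcodes[i] != MPI_SUCCESS)
                fprintf(stderr, "spawn entry %d failed, error code %d\n",
                        i, errcodes[i]);

        /* 2. the merged communicator should yield sane rank/size right
         *    before the first MPI_Send */
        int rank = -1, size = -1;
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        fprintf(stderr, "about to send: rank %d of %d on merged communicator\n",
                rank, size);
    }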
Re: [OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes
Artur,

Your email does not contain enough information to pinpoint the problem. However, there are several hints that tend to indicate a problem in your application.

1. In the collective communication that succeeds, the MPI_Intercomm_merge, the processes do [at least] one MPI_Allreduce followed by one MPI_Allgatherv, two collective communications that force the establishment of most of the connections between processes. As all the communications involved in this step succeed, I see no reason for a subsequent MPI_Send to fail if all the call parameters are correct.

2. The communication failing for both TCP and IB suggests that either the buffer your datatype + count point to is not correctly allocated, or that the combination of count and datatype identifies the wrong memory pattern. In both cases, the faulty process will segfault during the pack operation.

Can you check the stack on the processes where the fault occurs?

George.

On Fri, Feb 19, 2016 at 6:23 PM, Artur Malinowski <artur.malinow...@pg.gda.pl> wrote:

> I have a problem with my application that is based on dynamic process
> management. [...]
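To test the second hypothesis without a debugger, one option (a sketch, not from George's message, assuming the datatype has a zero lower bound and a mostly contiguous layout) is to read every byte that the count/datatype pair claims to cover just before the MPI_Send; if this walk segfaults, the fault is in the application's buffer management rather than in Open MPI's pack engine.

    /* touch_send_region.c - crude readability check for the memory region
     * that (buf, count, datatype) describe. Gaps in a non-contiguous type
     * are read as well, which is harmless for this purpose as long as they
     * lie inside the application's allocation. */
    #include <mpi.h>
    #include <stdio.h>

    void touch_send_region(const void *buf, int count, MPI_Datatype dtype)
    {
        MPI_Aint lb, extent;
        MPI_Type_get_extent(dtype, &lb, &extent);

        const volatile char *p = (const volatile char *)buf + lb;
        volatile char sink = 0;
        for (MPI_Aint i = 0; i < (MPI_Aint)count * extent; i++)
            sink ^= p[i];            /* a fault here means bad buffer/count/type */
        (void)sink;

        fprintf(stderr, "send region of %ld bytes is readable\n",
                (long)((MPI_Aint)count * extent));
    }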