Re: [OMPI users] mpirun hanging after MPI_Abort

2016-02-19 Thread Ralph Castain
Best options for debugging something like this are: -mca odls_base_verbose 5 
-mca errmgr_base_verbose 5

It’ll generate a fair amount of output, so try to do it with a small job if you 
can. You’ll need a build configured with --enable-debug to get the output.
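
For example, on a small test case (the process count and application name
below are placeholders, not from the original post):

  mpirun -np 2 -mca odls_base_verbose 5 -mca errmgr_base_verbose 5 ./my_app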


> On Feb 18, 2016, at 8:29 PM, Ben Menadue  wrote:
> 
> Hi,
> 
> I'm investigating an issue with mpirun *sometimes* hanging after programs
> call MPI_Abort... all of the MPI processes have terminated, but the mpirun
> process is still there. This happens with 1.8.8 and 1.10.2. There appear to be
> two threads, one in this path:
> 
> #0  0x7fa09c3143b3 in select () from /lib64/libc.so.6
> #1  0x7fa09b001e2c in listen_thread (obj=0x7fa09b2109e8) at
> ../../../../../../../../orte/mca/oob/tcp/oob_tcp_listener.c:685
> #2  0x7fa09c5ceaa1 in start_thread () from /lib64/libpthread.so.0
> #3  0x7fa09c31b93d in clone () from /lib64/libc.so.6
> 
> and the other in this:
> 
> #0  0x7fa09c312113 in poll () from /lib64/libc.so.6
> #1  0x7fa09d318e7d in poll_dispatch (base=0x1568a80, tv=0x0) at
> ../../../../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
> #2  0x7fa09d30d96c in opal_libevent2021_event_base_loop (base=0x1568a80,
> flags=1) at
> ../../../../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
> #3  0x004056fc in orterun (argc=2, argv=0x7ffe70248078) at
> ../../../../../../../orte/tools/orterun/orterun.c:1142
> #4  0x00403614 in main (argc=2, argv=0x7ffe70248078) at
> ../../../../../../../orte/tools/orterun/main.c:13
> 
> But since this is in mpirun itself, I'm not sure how to delve deeper - is
> there an MCA *_base_verbose parameter (or equivalent) that works on the
> mpirun?
> 
> Cheers,
> Ben
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28548.php



[OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris

2016-02-19 Thread Siegmar Gross

Hi,

yesterday I tried to build openmpi-dev-3498-gdc4d3ed on my
machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux
12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was successful on
my Linux machine, but I got the following errors on both Solaris
platforms.


Sun C 5.13:
===

  CC   base/ess_base_std_tool.lo
"../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 116: syntax 
error before or at: &
"../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 116: warning: 
syntax requires ";" after last struct/union member
"../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 121: 
cannot recover from previous errors
cc: acomp failed for 
../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c
make[2]: *** [base/ess_base_std_tool.lo] Error 1
make[2]: Leaving directory 
`/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte/mca/ess'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte'
make: *** [all-recursive] Error 1
tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc 50


GCC-5.2.0:
==

  CC   base/ess_base_std_tool.lo
In file included from /usr/include/stdio.h:66:0,
 from 
../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c:29:
../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h:116:22: error: 
expected identifier or '(' before '&' token
 orte_iof_sink_t *stdin;
  ^
make[2]: *** [base/ess_base_std_tool.lo] Error 1
make[2]: Leaving directory 
`/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte/mca/ess'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte'
make: *** [all-recursive] Error 1
tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc 50


I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.


Kind regards

Siegmar


Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris

2016-02-19 Thread Ralph Castain
I’m afraid I have no idea what Solaris is complaining about here.

> On Feb 19, 2016, at 6:52 AM, Siegmar Gross 
>  wrote:
> 
> Hi,
> 
> yesterday I tried to build openmpi-dev-3498-gdc4d3ed on my
> machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux
> 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was successful on
> my Linux machine, but I got the following errors on both Solaris
> platforms.
> 
> 
> Sun C 5.13:
> ===
> 
>  CC   base/ess_base_std_tool.lo
> "../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 116: 
> syntax error before or at: &
> "../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 116: 
> warning: syntax requires ";" after last struct/union member
> "../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line 121: 
> cannot recover from previous errors
> cc: acomp failed for 
> ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c
> make[2]: *** [base/ess_base_std_tool.lo] Error 1
> make[2]: Leaving directory 
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte/mca/ess'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory 
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte'
> make: *** [all-recursive] Error 1
> tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc 50
> 
> 
> GCC-5.2.0:
> ==
> 
>  CC   base/ess_base_std_tool.lo
> In file included from /usr/include/stdio.h:66:0,
> from 
> ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c:29:
> ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h:116:22: error: 
> expected identifier or '(' before '&' token
> orte_iof_sink_t *stdin;
>  ^
> make[2]: *** [base/ess_base_std_tool.lo] Error 1
> make[2]: Leaving directory 
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte/mca/ess'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory 
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte'
> make: *** [all-recursive] Error 1
> tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc 50
> 
> 
> I would be grateful if somebody can fix the problem. Thank you
> very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28550.php



Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris

2016-02-19 Thread Gilles Gouaillardet
A field of orte_iof_proc_t is named "stdin".
Could stdin be #defined under the hood on Solaris?
If so, then renaming this field should do the trick.

I will double-check that on Monday.

Cheers,

Gilles

On Saturday, February 20, 2016, Ralph Castain  wrote:

> I’m afraid I have no idea what Solaris is complaining about here.
>
> > On Feb 19, 2016, at 6:52 AM, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de > wrote:
> >
> > Hi,
> >
> > yesterday I tried to build openmpi-dev-3498-gdc4d3ed on my
> > machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux
> > 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I was successful on
> > my Linux machine, but I got the following errors on both Solaris
> > platforms.
> >
> >
> > Sun C 5.13:
> > ===
> >
> >  CC   base/ess_base_std_tool.lo
> > "../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line
> 116: syntax error before or at: &
> > "../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line
> 116: warning: syntax requires ";" after last struct/union member
> > "../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h", line
> 121: cannot recover from previous errors
> > cc: acomp failed for
> ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c
> > make[2]: *** [base/ess_base_std_tool.lo] Error 1
> > make[2]: Leaving directory
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte/mca/ess'
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc/orte'
> > make: *** [all-recursive] Error 1
> > tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_cc 50
> >
> >
> > GCC-5.2.0:
> > ==
> >
> >  CC   base/ess_base_std_tool.lo
> > In file included from /usr/include/stdio.h:66:0,
> > from
> ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/ess/base/ess_base_std_tool.c:29:
> > ../../../../openmpi-dev-3498-gdc4d3ed/orte/mca/iof/base/base.h:116:22:
> error: expected identifier or '(' before '&' token
> > orte_iof_sink_t *stdin;
> >  ^
> > make[2]: *** [base/ess_base_std_tool.lo] Error 1
> > make[2]: Leaving directory
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte/mca/ess'
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory
> `/export2/src/openmpi-master/openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc/orte'
> > make: *** [all-recursive] Error 1
> > tyr openmpi-dev-3498-gdc4d3ed-SunOS.sparc.64_gcc 50
> >
> >
> > I would be grateful if somebody can fix the problem. Thank you
> > very much for any help in advance.
> >
> >
> > Kind regards
> >
> > Siegmar
> > ___
> > users mailing list
> > us...@open-mpi.org 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28550.php
>
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28551.php


Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris

2016-02-19 Thread Dave Love
Gilles Gouaillardet  writes:

> a field from orte_iof_proc_t is named "stdin"
> could stdin be #defined under the hood in Solaris ?

It's defined as "(&__iob[0])" on Solaris 10; it's just #defined
differently by glibc.  See stdio.h(7posix).
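
For what it's worth, here is a minimal, hypothetical C sketch of the failure
mode (the struct below is illustrative, not the real orte_iof_proc_t):

#include <stdio.h>

/* The failing pattern, not compiled here:
 *     struct proc_io { FILE *stdin; };
 * On Solaris 10 the preprocessor turns the member into
 *     FILE *(&__iob[0]);
 * which is the "syntax error before or at: &" reported above.
 * Renaming the member sidesteps the macro entirely: */
struct proc_io {
    FILE *stdin_target;   /* renamed field, no clash with the stdin macro */
};

int main(void)
{
    struct proc_io p;
    p.stdin_target = stdin;   /* the stdin macro still works as a value */
    return (p.stdin_target == NULL);
}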



Re: [OMPI users] Error building openmpi-dev-3498-gdc4d3ed on Solaris

2016-02-19 Thread Ralph Castain
Just pushed a change that renamed the field - hopefully fixed now

Thanks!

> On Feb 19, 2016, at 9:54 AM, Dave Love  wrote:
> 
> Gilles Gouaillardet  writes:
> 
>> a field from orte_iof_proc_t is named "stdin"
>> could stdin be #defined under the hood in Solaris ?
> 
> It's defined as "(&__iob[0])" on Solaris 10; it's just #defined
> differently by glibc.  See stdio.h(7posix).
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28553.php



[OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes

2016-02-19 Thread Artur Malinowski

Hi,

I have a problem with my application that is based on dynamic process 
management. The scenario related to process creation is as follows:
  1. All processes call MPI_Comm_spawn_multiple to spawn one additional 
process per node.
  2. Parent processes call MPI_Intercomm_merge.
  3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent, 
MPI_Intercomm_merge.
  4. Some of the parent processes fail at their first MPI_Send with SIGSEGV.
Before and after the above steps, the processes call plenty of other MPI 
routines (so it is hard to extract a minimal example that suffers from the 
problem).
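
A minimal sketch of the spawn-and-merge pattern described above follows.
Assumptions (mine, not the actual application code): the same binary is
respawned, two extra processes are requested, per-node placement info is
omitted, and plain MPI_Init is used instead of MPI_Init_pmem.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* original (parent) processes: collectively spawn extra copies */
        char *cmds[1]     = { argv[0] };
        int maxprocs[1]   = { 2 };
        MPI_Info infos[1] = { MPI_INFO_NULL };
        int errcodes[2];

        MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                                0, MPI_COMM_WORLD, &intercomm, errcodes);
        MPI_Intercomm_merge(intercomm, 0, &merged);   /* parents: low ranks  */
    } else {
        /* spawned (child) processes */
        MPI_Intercomm_merge(parent, 1, &merged);      /* children: high ranks */
    }

    /* ... the first point-to-point traffic on "merged" happens here ... */

    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}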


Interesting observation: the MPI_Comm handles obtained with MPI_Intercomm_merge 
by the parent processes that fail with SIGSEGV are slightly different. 
Depending on the type used to print them (I'm not sure about the type of 
MPI_Comm), they are either negative (if printed as int) or bigger than the 
others (if printed as unsigned long long). For instance, with the code:

  printf("%d %d %llu\n", rank, intracomm, intracomm);
and output:
  4 -970650128 140564719013360
  8 14458544 14458544
  12 15121888 15121888
  9 38104000 38104000
  1 14921600 14921600
  11 31413968 31413968
  5 27737968 27737968
  7 -934013376 140023589770816
  13 24512096 24512096
  0 31348624 31348624
  3 -1091084352 139817274269632
  2 27982528 27982528
  10 8745056 8745056
  14 9449856 9449856
  6 10023360 10023360
Processes 4, 7, and 3 fail. There is no connection between the failed 
processes and any particular node; it usually affects about 20% of the 
processes and occurs both for tcp and ib. Any idea how to find the source of 
the problem? More info is included at the bottom of this message.


Thanks for your help.

Regards,
Artur Malinowski
PhD student at Gdansk University of Technology



openmpi version:

problem occurs both in 1.10.1 and 1.10.2, older untested



config.log

included in config.log.tar.bz2 attachment



ompi_info

included in ompi_info.tar.bz2 attachment



execution command

/path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi 
/path/to/app




system info

- OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from mellanox 
official page

- Linux: CentOS release 6.5 (Final) under Rocks cluster
- kernel: built on my own, 3.18.0 with some patches



ibv_devinfo

hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.35.5100
node_guid:  0002:c903:009f:5b00
sys_image_guid: 0002:c903:009f:5b03
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id:   MT_1090110028
phys_port_cnt:  2
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 4
port_lid:   1
port_lmc:   0x00
link_layer: InfiniBand

port:   2
state:  PORT_DOWN (1)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid:   0
port_lmc:   0x00
link_layer: InfiniBand



ifconfig

eth0  Link encap:Ethernet  HWaddr XX
  inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
  inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
  TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:63945289429 (59.5 GiB)  TX bytes:68561418011 (63.8 GiB)
  Memory:d096-d097


config.log.tar.bz2
Description: application/bzip


ompi_info.tar.bz2
Description: application/bzip


Re: [OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes

2016-02-19 Thread Gilles Gouaillardet
Artur,

In Open MPI, MPI_Comm is an opaque pointer, so strictly speaking, a high value
might not be an issue.

Can you have your failed processes generate a core and post the stack trace?
BTW, do you call MPI_Send on the intracommunicator created by
MPI_Intercomm_merge?

What is the minimal config needed to reproduce the issue?
(number of nodes, number of tasks started with mpirun, number of tasks
spawned by MPI_Comm_spawn_multiple, how many different binaries are spawned?)
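
If core dumps happen to be disabled on the compute nodes ("ulimit -c unlimited"
is the usual alternative), one hedged way to get that stack trace is a small
SIGSEGV handler installed after MPI_Init; the helpers below are illustrative,
not part of Open MPI:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

/* Write a backtrace to stderr, then re-raise the signal so the process
 * still aborts with the default action. */
static void segv_backtrace(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);

    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    signal(sig, SIG_DFL);
    raise(sig);
}

static void install_segv_backtrace(void)
{
    signal(SIGSEGV, segv_backtrace);
}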

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski 
wrote:

> Hi,
>
> I have a problem with my application that is based on dynamic process
> management. The scenario related to process creation is as follows:
>   1. All processes call MPI_Comm_spawn_multiple to spawn additional single
> process per each node.
>   2. Parent processes call MPI_Intercomm_merge.
>   3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent,
> MPI_Intercomm_merge.
>   4. Some of parent processes fail at their first MPI_Send with SIGSEGV.
> Before and after above steps, processes call plenty of other MPI routines
> (so it is hard to extract minimal example that suffer from the problem).
>
> Interesting observation: the MPI_Comm obtained with MPI_Intercomm_merge
> for parent processes that fail with SIGSEGV are slightly different.
> Depending on type used to print it (I'm not sure about the type of
> MPI_Comm), they are either negative (if printed as int), or bigger than
> others (if printed as unsigned long long). For instance, with code:
>   printf("%d %d %llu %\n", rank, intracomm, intracomm);
> and output:
>   4 -970650128 140564719013360
>   8 14458544 14458544
>   12 15121888 15121888
>   9 38104000 38104000
>   1 14921600 14921600
>   11 31413968 31413968
>   5 27737968 27737968
>   7 -934013376 140023589770816
>   13 24512096 24512096
>   0 31348624 31348624
>   3 -1091084352 139817274269632
>   2 27982528 27982528
>   10 8745056 8745056
>   14 9449856 9449856
>   6 10023360 10023360
> processes: 4, 7 and 3 fail. There is no connection between failed
> processes and particular node, it usually affects about 20% of processes
> and occurs both for tcp and ib. Any idea how to find source of the problem?
> More info included at the bottom of this message.
>
> Thanks for your help.
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> 
>
> openmpi version:
>
> problem occurs both in 1.10.1 and 1.10.2, older untested
>
> 
>
> config.log
>
> included in config.log.tar.bz2 attachment
>
> 
>
> ompi_info
>
> included in ompi_info.tar.bz2 attachment
>
> 
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi
> /path/to/app
>
> 
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from mellanox
> official page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: build on my own, 3.18.0 with some patches
>
> 
>
> ibv_devinfo
>
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.35.5100
> node_guid:  0002:c903:009f:5b00
> sys_image_guid: 0002:c903:009f:5b03
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x1
> board_id:   MT_1090110028
> phys_port_cnt:  2
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 4
> port_lid:   1
> port_lmc:   0x00
> link_layer: InfiniBand
>
> port:   2
> state:  PORT_DOWN (1)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 0
> port_lid:   0
> port_lmc:   0x00
> link_layer: InfiniBand
>
> 
>
> ifconfig
>
> eth0  Link encap:Ethernet  HWaddr XX
>   inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
>   inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:63945289429 (59.5 GiB)  TX byt

Re: [OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes

2016-02-19 Thread Gilles Gouaillardet
Artur,

Do you check all the error codes returned by MPI_Comm_spawn_multiple?
(That way you can confirm the requested number of tasks was actually spawned.)

Since the error occurs only on the first MPI_Send, you might want to
retrieve the rank and size and print them right before MPI_Send, just to make
sure the communicator is valid
(e.g., no memory corruption occurred beforehand); see the sketch below.
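
A hedged sketch of that check (the helper and variable names are illustrative,
not the author's code): verify the spawn error codes and query the merged
communicator right before the first MPI_Send on it.

#include <mpi.h>
#include <stdio.h>

static void check_before_first_send(MPI_Comm merged, const int *errcodes,
                                    int nspawned)
{
    int i, rank, size;

    for (i = 0; i < nspawned; i++)
        if (errcodes[i] != MPI_SUCCESS)
            fprintf(stderr, "spawn slot %d failed with code %d\n",
                    i, errcodes[i]);

    MPI_Comm_rank(merged, &rank);   /* a corrupted handle usually fails or  */
    MPI_Comm_size(merged, &size);   /* aborts here, earlier than MPI_Send   */
    fprintf(stderr, "merged comm looks usable: rank %d of %d\n", rank, size);
}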

Out of curiosity, did you try your application with another MPI library
(such as MPICH or one of its derivatives)?

Cheers,

Gilles

On Saturday, February 20, 2016, Artur Malinowski 
wrote:

> Hi,
>
> I have a problem with my application that is based on dynamic process
> management. The scenario related to process creation is as follows:
>   1. All processes call MPI_Comm_spawn_multiple to spawn additional single
> process per each node.
>   2. Parent processes call MPI_Intercomm_merge.
>   3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent,
> MPI_Intercomm_merge.
>   4. Some of parent processes fail at their first MPI_Send with SIGSEGV.
> Before and after above steps, processes call plenty of other MPI routines
> (so it is hard to extract minimal example that suffer from the problem).
>
> Interesting observation: the MPI_Comm obtained with MPI_Intercomm_merge
> for parent processes that fail with SIGSEGV are slightly different.
> Depending on type used to print it (I'm not sure about the type of
> MPI_Comm), they are either negative (if printed as int), or bigger than
> others (if printed as unsigned long long). For instance, with code:
>   printf("%d %d %llu %\n", rank, intracomm, intracomm);
> and output:
>   4 -970650128 140564719013360
>   8 14458544 14458544
>   12 15121888 15121888
>   9 38104000 38104000
>   1 14921600 14921600
>   11 31413968 31413968
>   5 27737968 27737968
>   7 -934013376 140023589770816
>   13 24512096 24512096
>   0 31348624 31348624
>   3 -1091084352 139817274269632
>   2 27982528 27982528
>   10 8745056 8745056
>   14 9449856 9449856
>   6 10023360 10023360
> processes: 4, 7 and 3 fail. There is no connection between failed
> processes and particular node, it usually affects about 20% of processes
> and occurs both for tcp and ib. Any idea how to find source of the problem?
> More info included at the bottom of this message.
>
> Thanks for your help.
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> 
>
> openmpi version:
>
> problem occurs both in 1.10.1 and 1.10.2, older untested
>
> 
>
> config.log
>
> included in config.log.tar.bz2 attachment
>
> 
>
> ompi_info
>
> included in ompi_info.tar.bz2 attachment
>
> 
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi
> /path/to/app
>
> 
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from mellanox
> official page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: build on my own, 3.18.0 with some patches
>
> 
>
> ibv_devinfo
>
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.35.5100
> node_guid:  0002:c903:009f:5b00
> sys_image_guid: 0002:c903:009f:5b03
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x1
> board_id:   MT_1090110028
> phys_port_cnt:  2
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 4
> port_lid:   1
> port_lmc:   0x00
> link_layer: InfiniBand
>
> port:   2
> state:  PORT_DOWN (1)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 0
> port_lid:   0
> port_lmc:   0x00
> link_layer: InfiniBand
>
> 
>
> ifconfig
>
> eth0  Link encap:Ethernet  HWaddr XX
>   inet addr:10.1.255.248  Bcast:10.1.255.255  Mask:255.255.0.0
>   inet6 addr: fe80::21e:67ff:feb9:5ca/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:138132137 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:160269713 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:63945289429 (59.5 GiB)  TX bytes:6856141

Re: [OMPI users] Nondeterministic SIGSEGV in MPI_Send to dynamically created processes

2016-02-19 Thread George Bosilca
Arthur,

Your email does not contain enough information to pinpoint the problem.
However, there are several hints that tend to indicate a problem in your
application.

1. In the collective communication that succeeds, the MPI_Intercomm_merge,
the processes do [at least] one MPI_Allreduce followed by one
MPI_Allgatherv, two collective communications that force the establishment
of most of the connections between processes. As all the communications
involved in this step succeed, I see no reason for a subsequent MPI_Send to
fail if all the call parameters are correct.

2. The fact that the communication fails for both TCP and IB suggests that
either the buffer your datatype + count points to is not correctly allocated,
or the combination of count and datatype identifies the wrong memory
pattern. In either case, the faulty process will segfault during the pack
operation. Can you check the stack on the processes where the fault occurs?
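
One hedged way to separate the two cases is to read every byte that the
(buf, count, datatype) triple claims to describe right before MPI_Send, so a
bad buffer faults with an obvious stack outside the pack engine. The helper
below is illustrative and assumes a contiguous datatype (size == extent):

#include <mpi.h>

static void touch_send_buffer(const void *buf, int count, MPI_Datatype dtype)
{
    int type_size;
    long i;
    volatile char sink = 0;

    MPI_Type_size(dtype, &type_size);
    for (i = 0; i < (long)count * type_size; i++)
        sink ^= ((const char *)buf)[i];   /* faults here if buf is bad */
    (void)sink;
}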

  George.


On Fri, Feb 19, 2016 at 6:23 PM, Artur Malinowski <
artur.malinow...@pg.gda.pl> wrote:

> Hi,
>
> I have a problem with my application that is based on dynamic process
> management. The scenario related to process creation is as follows:
>   1. All processes call MPI_Comm_spawn_multiple to spawn additional single
> process per each node.
>   2. Parent processes call MPI_Intercomm_merge.
>   3. Child processes call MPI_Init_pmem, MPI_Comm_get_parent,
> MPI_Intercomm_merge.
>   4. Some of parent processes fail at their first MPI_Send with SIGSEGV.
> Before and after above steps, processes call plenty of other MPI routines
> (so it is hard to extract minimal example that suffer from the problem).
>
> Interesting observation: the MPI_Comm obtained with MPI_Intercomm_merge
> for parent processes that fail with SIGSEGV are slightly different.
> Depending on type used to print it (I'm not sure about the type of
> MPI_Comm), they are either negative (if printed as int), or bigger than
> others (if printed as unsigned long long). For instance, with code:
>   printf("%d %d %llu %\n", rank, intracomm, intracomm);
> and output:
>   4 -970650128 140564719013360
>   8 14458544 14458544
>   12 15121888 15121888
>   9 38104000 38104000
>   1 14921600 14921600
>   11 31413968 31413968
>   5 27737968 27737968
>   7 -934013376 140023589770816
>   13 24512096 24512096
>   0 31348624 31348624
>   3 -1091084352 139817274269632
>   2 27982528 27982528
>   10 8745056 8745056
>   14 9449856 9449856
>   6 10023360 10023360
> processes: 4, 7 and 3 fail. There is no connection between failed
> processes and particular node, it usually affects about 20% of processes
> and occurs both for tcp and ib. Any idea how to find source of the problem?
> More info included at the bottom of this message.
>
> Thanks for your help.
>
> Regards,
> Artur Malinowski
> PhD student at Gdansk University of Technology
>
> 
>
> openmpi version:
>
> problem occurs both in 1.10.1 and 1.10.2, older untested
>
> 
>
> config.log
>
> included in config.log.tar.bz2 attachment
>
> 
>
> ompi_info
>
> included in ompi_info.tar.bz2 attachment
>
> 
>
> execution command
>
> /path/to/openmpi/bin/mpirun --map-by node --prefix /path/to/openmpi
> /path/to/app
>
> 
>
> system info
>
> - OpenFabrics: MLNX_OFED_LINUX-3.1-1.0.3-rhel6.5-x86_64 from mellanox
> official page
> - Linux: CentOS release 6.5 (Final) under Rocks cluster
> - kernel: build on my own, 3.18.0 with some patches
>
> 
>
> ibv_devinfo
>
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.35.5100
> node_guid:  0002:c903:009f:5b00
> sys_image_guid: 0002:c903:009f:5b03
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x1
> board_id:   MT_1090110028
> phys_port_cnt:  2
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 4
> port_lid:   1
> port_lmc:   0x00
> link_layer: InfiniBand
>
> port:   2
> state:  PORT_DOWN (1)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 0
> port_lid:   0
> port_lmc:   0x00
> link_layer: InfiniBand
>
> ---