Re: [OMPI users] Problems with mpirun in openmpi-1.8.1 and -2.0.0

2016-08-22 Thread Juan A. Cordero Varelaq

Dear Ralph,

The existence of the two versions does not seem to be the source of the 
problems, since they are in different locations. I uninstalled the most 
recent version and tried again with no luck, getting the same 
warnings/errors. However, after a deep search I found a couple of hints 
and executed this:


mpirun *-mca btl ^openib -mca btl_sm_use_knem 0* -np 5 myscript

and got only a fraction of the previous errors (before, I had run the 
same command but without the arguments in bold), related to OpenFabrics:


Open MPI failed to open an OpenFabrics device.  This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully.  This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.

All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.

  Hostname:MYMACHINE
  Device name: mlx4_0
  Errror (22): Invalid argument
--
--
[[52062,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: MYMACHINE

Another transport will be used instead, although this may result in
lower performance.
--

Can you guess why this could happen?


Thanks a lot

On 19/08/16 17:11, r...@open-mpi.org wrote:
The rdma error sounds like something isn’t right with your machine’s 
Infiniband installation.


The cross-version problem sounds like you installed both OMPI versions 
into the same location - did you do that?? If so, then that might be 
the root cause of both problems. You need to install them in totally 
different locations. Then you need to _prefix_ your PATH and 
LD_LIBRARY_PATH with the location of the version you want to use.


HTH
Ralph

On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq 
<bioinformatica-i...@us.es> wrote:


Dear users,

I am totally stuck using openmpi. I have two versions on my machine: 
1.8.1 and 2.0.0, and none of them work. When I use the mpirun *1.8.1 
version*, I get the following error:


librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
--
Open MPI failed to open the /dev/knem device due to a local error.
Please check with your system administrator to get the problem fixed,
or set the btl_sm_use_knem MCA parameter to 0 to run without /dev/knem
support.

  Local host: MYMACHINE
  Errno:  2 (No such file or directory)
--
--
Open MPI failed to open an OpenFabrics device.  This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully.  This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.

All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.

  Hostname:MYMACHINE
  Device name: mlx4_0
  Errror (22): Invalid argument
--
--
[[60527,1],4]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: MYMACHINE

When I use the *2.0.0 version*, I get something strange; it seems 
openmpi-2.0.0 looks for the openmpi-1.8.1 libraries?:


A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  MYMACHINE
Framework: ess
Component: pmi
--
[MYMACHINE:126820] *** Process received signal ***
[MYMACHINE:126820] Signal: Segmentation fault (11)
[MYMACHINE:126820] Signal code: Address not mapped (1)
[MYMACHINE:126820] Failing at address: 0x1c0
[MYMACHINE:126820] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f39b2ec4cb0]
[MYMACHINE:126820] [ 1] 
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f39b23e7430]
[MYMACHINE:126820] [ 2] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f39b2676a57]
[MYMACHINE:126820] [ 3] 
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f39b2676fb7]
[MYMACHINE:126820

Re: [OMPI users] Problems with mpirun in openmpi-1.8.1 and -2.0.0

2016-08-22 Thread Gilles Gouaillardet
Juan,

can you try to
mpirun --mca btl ^openib,usnic --mca pml ob1 ...

note this simply disables native infiniband. From a performance point of
view, you should have your sysadmin fix the infiniband fabric.
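
for example, combined with the command line from your earlier post, that
would be something like this (just a sketch; adjust -np and the program
name to your case):

# disable the openib and usnic BTLs and force the ob1 pml (no MXM)
mpirun --mca btl ^openib,usnic --mca pml ob1 -np 5 myscript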

about the version mismatch, please double check your environment
(e.g. $PATH and $LD_LIBRARY_PATH); it is likely v2.0 is in your environment
when you are using v1.8, or the other way around.
also, make sure the orted daemons are launched with the right environment.
if you are using ssh, then
ssh node env
should not contain any reference to the version you are not using.
(that typically occurs when the environment is set in the .bashrc, directly
or via modules)

last but not least, you can
strings /.../bin/orted
strings /.../lib/libmpi.so
and check they do not reference the wrong version
(that can happen if a library was built and then moved)
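
put together, a quick sanity check could look like this (just a sketch;
replace /opt/openmpi-X.Y.Z and "node" with your actual install prefix and
hostname):

# which install is picked up locally
which mpirun && mpirun --version
echo $PATH
echo $LD_LIBRARY_PATH
# which environment a remote orted would inherit over ssh
ssh node env | grep -E 'PATH|LD_LIBRARY_PATH'
# look for hard-coded references to the other version inside the binaries
strings /opt/openmpi-X.Y.Z/bin/orted | grep openmpi-
strings /opt/openmpi-X.Y.Z/lib/libmpi.so | grep openmpi-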


Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq <
bioinformatica-i...@us.es> wrote:

> Dear Ralph,
>
> The existence of the two versions does not seem to be the source of
> problems, since they are in different locations. I uninstalled the most
> recent version and try again with no luck, getting the same
> warnings/errors. However, after a deep search I found a couple of hints,
> and executed this:
>
> mpirun *-mca btl ^openib -mca btl_sm_use_knem 0* -np 5 myscript
>
> and got only a fraction of the previous errors (before I had run the same
> but without the arguments in bold), related to OpenFabrics:
>
> Open MPI failed to open an OpenFabrics device.  This is an unusual
> error; the system reported the OpenFabrics device as being present,
> but then later failed to access it successfully.  This usually
> indicates either a misconfiguration or a failed OpenFabrics hardware
> device.
>
> All OpenFabrics support has been disabled in this MPI process; your
> job may or may not continue.
>
>   Hostname:MYMACHINE
>   Device name: mlx4_0
>   Errror (22): Invalid argument
> --
> --
> [[52062,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: usNIC
>   Host: MYMACHINE
>
> Another transport will be used instead, although this may result in
> lower performance.
> --
>
> Do you guess why it could happen?
>
>
> Thanks a lot
> On 19/08/16 17:11, r...@open-mpi.org wrote:
>
> The rdma error sounds like something isn’t right with your machine’s
> Infiniband installation.
>
> The cross-version problem sounds like you installed both OMPI versions
> into the same location - did you do that?? If so, then that might be the
> root cause of both problems. You need to install them in totally different
> locations. Then you need to _prefix_ your PATH and LD_LIBRARY_PATH with the
> location of the version you want to use.
>
> HTH
> Ralph
>
> On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq <
> bioinformatica-i...@us.es> wrote:
>
> Dear users,
>
> I am totally stuck using openmpi. I have two versions on my machine: 1.8.1
> and 2.0.0, and none of them work. When use the mpirun *1.8.1 version*, I
> get the following error:
>
> librdmacm: Fatal: unable to open RDMA device
> librdmacm: Fatal: unable to open RDMA device
> librdmacm: Fatal: unable to open RDMA device
> librdmacm: Fatal: unable to open RDMA device
> librdmacm: Fatal: unable to open RDMA device
> --
> Open MPI failed to open the /dev/knem device due to a local error.
> Please check with your system administrator to get the problem fixed,
> or set the btl_sm_use_knem MCA parameter to 0 to run without /dev/knem
> support.
>
>   Local host: MYMACHINE
>   Errno:  2 (No such file or directory)
> --
> --
> Open MPI failed to open an OpenFabrics device.  This is an unusual
> error; the system reported the OpenFabrics device as being present,
> but then later failed to access it successfully.  This usually
> indicates either a misconfiguration or a failed OpenFabrics hardware
> device.
>
> All OpenFabrics support has been disabled in this MPI process; your
> job may or may not continue.
>
>   Hostname:MYMACHINE
>   Device name: mlx4_0
>   Errror (22): Invalid argument
> --
> --
> [[60527,1],4]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: usNIC
>   Host: MYMACHINE
>
> When I use the *2.0.0 version*, I get something strange, it seems
> openmpi-2.0.0 looks for openmpi-1.8.1 libraries?:

Re: [OMPI users] Problems with mpirun in openmpi-1.8.1 and -2.0.0

2016-08-22 Thread Juan A. Cordero Varelaq

Hi Gilles,

adding *,usnic* made it work :) --mca pml ob1 would then not be needed.

Does it make MPI very slow if infiniband is disabled (and what does --mca 
pml ob1 do)?


Regarding the version mismatch, everything seems to be right. When only 
one version is loaded, I see the PATH and the LD_LIBRARY_PATH for only 
one version, and with strings, everything seems to reference the right 
version.


Thanks a lot for the quick answers!
On 22/08/16 13:09, Gilles Gouaillardet wrote:

Juan,

can you try to
mpirun --mca btl ^openib,usnic --mca pml ob1 ...

note this simply disable native infiniband. from a performance point 
of view, you should have your sysadmin fix the infiniband fabric.


about the version mismatch, please double check your environment
(e.g. $PATH and $LD_LIBRARY_PATH), it is likely v2.0 is in your 
environment when you are using v1.8, or the other way around)

also, make sure orted are launched with the right environment.
if you are using ssh, then
ssh node env
should not contain reference to the version you are not using.
(that typically occurs when the environment is set in the .bashrc, 
directly or via modules)


last but not least, you can
strings /.../bin/orted
strings /.../lib/libmpi.so
and check they do not reference the wrong version
(that can happen if a library was built and then moved)


Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq 
<bioinformatica-i...@us.es> wrote:


Dear Ralph,

The existence of the two versions does not seem to be the source
of problems, since they are in different locations. I uninstalled
the most recent version and try again with no luck, getting the
same warnings/errors. However, after a deep search I found a
couple of hints, and executed this:

mpirun *-mca btl ^openib -mca btl_sm_use_knem 0* -np 5 myscript

and got only a fraction of the previous errors (before I had run
the same but without the arguments in bold), related to OpenFabrics:

Open MPI failed to open an OpenFabrics device.  This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully. This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.

All OpenFabrics support has been disabled in this MPI process; your
job may or may not continue.

  Hostname:MYMACHINE
  Device name: mlx4_0
  Errror (22): Invalid argument
--
--
[[52062,1],0]: A high-performance Open MPI point-to-point
messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: MYMACHINE

Another transport will be used instead, although this may result in
lower performance.
--

Do you guess why it could happen?


Thanks a lot

On 19/08/16 17:11, r...@open-mpi.org wrote:

The rdma error sounds like something isn’t right with your
machine’s Infiniband installation.

The cross-version problem sounds like you installed both OMPI
versions into the same location - did you do that?? If so, then
that might be the root cause of both problems. You need to
install them in totally different locations. Then you need to
_prefix_ your PATH and LD_LIBRARY_PATH with the location of the
version you want to use.

HTH
Ralph


On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq
<bioinformatica-i...@us.es> wrote:

Dear users,

I am totally stuck using openmpi. I have two versions on my
machine: 1.8.1 and 2.0.0, and none of them work. When use the
mpirun *1.8.1 version*, I get the following error:

librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
--
Open MPI failed to open the /dev/knem device due to a local error.
Please check with your system administrator to get the problem
fixed,
or set the btl_sm_use_knem MCA parameter to 0 to run without
/dev/knem
support.

  Local host: MYMACHINE
  Errno:  2 (No such file or directory)
--
--
Open MPI failed to open an OpenFabrics device.  This is an unusual
error; the system reported the OpenFabrics device as being present,
but then later failed to access it successfully.  This usually
indicates either a misconfiguration or a failed OpenFabrics hardware
device.

All OpenFabrics support has been 

Re: [OMPI users] Problems with mpirun in openmpi-1.8.1 and -2.0.0

2016-08-22 Thread Gilles Gouaillardet
Juan,

to keep things simple, --mca pml ob1 ensures you are not using mxm
(yet another way to use infiniband)

IPoIB is unlikely to be working on your system now, so for inter-node
communications you will use tcp over the interconnect you have (GbE, or 10
GbE if you are lucky).
in terms of performance, GbE (Gigabit Ethernet) is an order (or two) of
magnitude slower than infiniband, both in terms of bandwidth and latency.

if you still face issues with the version mismatch, you can run
mpirun --tag-output ldd a.out
in your script and double check that all nodes are using the correct libs
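
for instance (just a sketch; a.out stands for whatever MPI binary your
script launches, and the grep is only there to shorten the output):

# print, per rank, the libmpi each node actually resolves
mpirun --tag-output ldd a.out | grep libmpi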

Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq <
bioinformatica-i...@us.es> wrote:

> Hi Gilles,
> adding *,usnic* made it work :) --mca pml ob1 would not be then needed.
>
> Does it render mpi very slow if infiniband is disabled (what does --mca
> pml pb1?)?
>
> Regarding the version mismatch, everything seems to be right. When only
> one version is loaded, I see the PATH and the LD_LIBRARY_PATH for only one
> version, and with strings, everything seems to reference the right version.
>
> Thanks a lot for the quick answers!
> On 22/08/16 13:09, Gilles Gouaillardet wrote:
>
> Juan,
>
> can you try to
> mpirun --mca btl ^openib,usnic --mca pml ob1 ...
>
> note this simply disable native infiniband. from a performance point of
> view, you should have your sysadmin fix the infiniband fabric.
>
> about the version mismatch, please double check your environment
> (e.g. $PATH and $LD_LIBRARY_PATH), it is likely v2.0 is in your
> environment when you are using v1.8, or the other way around)
> also, make sure orted are launched with the right environment.
> if you are using ssh, then
> ssh node env
> should not contain reference to the version you are not using.
> (that typically occurs when the environment is set in the .bashrc,
> directly or via modules)
>
> last but not least, you can
> strings /.../bin/orted
> strings /.../lib/libmpi.so
> and check they do not reference the wrong version
> (that can happen if a library was built and then moved)
>
>
> Cheers,
>
> Gilles
>
> On Monday, August 22, 2016, Juan A. Cordero Varelaq <
> bioinformatica-i...@us.es> wrote:
>
>> Dear Ralph,
>>
>> The existence of the two versions does not seem to be the source of
>> problems, since they are in different locations. I uninstalled the most
>> recent version and try again with no luck, getting the same
>> warnings/errors. However, after a deep search I found a couple of hints,
>> and executed this:
>>
>> mpirun *-mca btl ^openib -mca btl_sm_use_knem 0* -np 5 myscript
>>
>> and got only a fraction of the previous errors (before I had run the same
>> but without the arguments in bold), related to OpenFabrics:
>>
>> Open MPI failed to open an OpenFabrics device.  This is an unusual
>> error; the system reported the OpenFabrics device as being present,
>> but then later failed to access it successfully.  This usually
>> indicates either a misconfiguration or a failed OpenFabrics hardware
>> device.
>>
>> All OpenFabrics support has been disabled in this MPI process; your
>> job may or may not continue.
>>
>>   Hostname:MYMACHINE
>>   Device name: mlx4_0
>>   Errror (22): Invalid argument
>> 
>> --
>> 
>> --
>> [[52062,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: usNIC
>>   Host: MYMACHINE
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> 
>> --
>>
>> Do you guess why it could happen?
>>
>>
>> Thanks a lot
>> On 19/08/16 17:11, r...@open-mpi.org wrote:
>>
>> The rdma error sounds like something isn’t right with your machine’s
>> Infiniband installation.
>>
>> The cross-version problem sounds like you installed both OMPI versions
>> into the same location - did you do that?? If so, then that might be the
>> root cause of both problems. You need to install them in totally different
>> locations. Then you need to _prefix_ your PATH and LD_LIBRARY_PATH with the
>> location of the version you want to use.
>>
>> HTH
>> Ralph
>>
>> On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq <
>> bioinformatica-i...@us.es> wrote:
>>
>> Dear users,
>>
>> I am totally stuck using openmpi. I have two versions on my machine:
>> 1.8.1 and 2.0.0, and none of them work. When use the mpirun *1.8.1
>> version*, I get the following error:
>>
>> librdmacm: Fatal: unable to open RDMA device
>> librdmacm: Fatal: unable to open RDMA device
>> librdmacm: Fatal: unable to open RDMA device
>> librdmacm: Fatal: unable to open RDMA device
>> librdmacm: Fatal: unable to open RDMA device
>> 
>> --
>> Open MPI failed to open the /dev/knem device due to 

Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-22 Thread Audet, Martin
Hi Devendar,

Thanks again for your answer.

I searched a little bit and found that UD stands for "Unreliable Datagram"
while RC is for "Reliable Connected" transport mechanism. I found another
called DC for "Dynamically Connected" which is not supported on our HCA.

Do you know what the basic difference between them is?

I didn't find any information about this.

Which one is used by btl=openib (ibverbs)? Is it RC?

Also, are they all standard, or are some of them supported only by Mellanox?

I will try to convince the admin of the system I'm using to increase the
maximum shared segment size (SHMMAX). I guess what we have (e.g. 32 MB) is the
default. But I didn't find any document suggesting that we should increase
SHMMAX to help MXM. This is a bit odd; if it's important, it should at least
be mentioned in the Mellanox documentation.
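
If it helps, the current limit can usually be inspected and raised on Linux
with something like this (just a sketch; the 512 MB value is only an example
and changing it requires root):

# show the current maximum shared segment size, in bytes
cat /proc/sys/kernel/shmmax
# raise it temporarily, e.g. to 512 MB
sysctl -w kernel.shmmax=536870912
# make it permanent by adding "kernel.shmmax = 536870912" to /etc/sysctl.conf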

I will check the messaging rate benchmark osu_mbw_mr for sure to see if its
results are improved by MXM.
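
Probably something along these lines (just a sketch; the path to the
osu_mbw_mr binary and the MXM/pml selection depend on how the benchmarks and
Open MPI were installed):

# one pair of ranks, one per node, with MXM switched to the RC transport
mpirun -np 2 --map-by node -x MXM_TLS=rc,shm,self ./osu_mbw_mr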

After looking at the MPI performance results published on your URL (e.g.
latencies around 1 us in native mode), I'm more and more convinced that our
results are suboptimal.

And after seeing the impact of SR-IOV published on your URL, I suspect more
and more that our mediocre latency is caused by this mechanism.

But our cluster is different: SR-IOV is not used in the context of Virtual
Machines running under a host VMM. SR-IOV is used with Linux LXC containers.


Martin Audet


> Hi Martin
>
> MXM default transport is UD (MXM_TLS=*ud*,shm,self), which is scalable when
> running with large applications.  RC(MXM_TLS=*rc,*shm,self)  is recommended
> for microbenchmarks and very small scale applications,
>
> yes, max seg size setting is too small.
>
> Did you check any message rate benchmarks(like osu_mbw_mr) with MXM?
>
> virtualization env will have some overhead.  see some perf comparision here
> with mvapich
> http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/ .



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread Jingchao Zhang
Hi all,


We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them have 
odd behaviors when trying to read from standard input.


For example, if we start the application lammps across 4 nodes, each node 16 
cores, connected by Intel QDR Infiniband, mpirun works fine the 1st time, 
but always gets stuck within a few seconds thereafter.

Command:

mpirun ./lmp_ompi_g++ < in.snr

in.snr is the Lammps input file; the compiler is gcc/6.1.


Instead, if we use

mpirun ./lmp_ompi_g++ -in in.snr

it works 100%.


Some odd behaviors we gathered so far.

1. For 1 node job, stdin always works.

2. For multiple nodes, stdin works unreliably when the number of cores per node 
is relatively small. For example, for 2/3/4 nodes with 8 cores each, mpirun 
works most of the time. But with >8 cores per node, mpirun works the 1st 
time, then always gets stuck. There seems to be a magic number at which it stops working.

3. We tested Quantum Espresso with compiler intel/13 and had the same issue.


We used gdb to debug and found that when mpirun was stuck, the rest of the processes 
were all waiting on an MPI broadcast from the master rank. The lammps binary, 
input file and gdb core files (example.tar.bz2) can be downloaded from this 
link https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc


Extra information:

1. Job scheduler is slurm.

2. configure setup:

./configure --prefix=$PREFIX \
--with-hwloc=internal \
--enable-mpirun-prefix-by-default \
--with-slurm \
--with-verbs \
--with-psm \
--disable-openib-connectx-xrc \
--with-knem=/opt/knem-1.1.2.90mlnx1 \
--with-cma
3. openmpi-mca-params.conf file
orte_hetero_nodes=1
hwloc_base_binding_policy=core
rmaps_base_mapping_policy=core
opal_cuda_support=0
btl_openib_use_eager_rdma=0
btl_openib_max_eager_rdma=0
btl_openib_flags=1

Thanks,
Jingchao


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread r...@open-mpi.org
Hmmm...perhaps we can break this out a bit? The stdin will be going to your 
rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?

Can you first verify that the input is being correctly delivered to rank=0? 
This will help us isolate if the problem is in the IO forwarding, or in the 
subsequent Bcast.
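
One quick, application-independent way to check the forwarding might be
something like this (just a sketch, reusing the in.snr file from the report;
every rank runs cat, but stdin is only forwarded to rank 0, so the two byte
counts should match):

mpirun -np 4 cat < in.snr | wc -c
wc -c < in.snr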

> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang  wrote:
> 
> Hi all,
> 
> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them have 
> odd behaviors when trying to read from standard input.
> 
> For example, if we start the application lammps across 4 nodes, each node 16 
> cores, connected by Intel QDR Infiniband, mpirun works fine for the 1st time, 
> but always stuck in a few seconds thereafter.
> Command:
> mpirun ./lmp_ompi_g++ < in.snr
> in.snr is the Lammps input file. compiler is gcc/6.1.
> 
> Instead, if we use
> mpirun ./lmp_ompi_g++ -in in.snr
> it works 100%.
> 
> Some odd behaviors we gathered so far. 
> 1. For 1 node job, stdin always works.
> 2. For multiple nodes, stdin works unstably when the number of cores per node 
> are relatively small. For example, for 2/3/4 nodes, each node 8 cores, mpirun 
> works most of the time. But for each node with >8 cores, mpirun works the 1st 
> time, then always stuck. There seems to be a magic number when it stops 
> working.
> 3. We tested Quantum Expresso with compiler intel/13 and had the same issue. 
> 
> We used gdb to debug and found when mpirun was stuck, the rest of the 
> processes were all waiting on mpi broadcast from the master thread. The 
> lammps binary, input file and gdb core files (example.tar.bz2) can be 
> downloaded from this link 
> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc 
> 
> 
> Extra information:
> 1. Job scheduler is slurm.
> 2. configure setup:
> ./configure --prefix=$PREFIX \
> --with-hwloc=internal \
> --enable-mpirun-prefix-by-default \
> --with-slurm \
> --with-verbs \
> --with-psm \
> --disable-openib-connectx-xrc \
> --with-knem=/opt/knem-1.1.2.90mlnx1 \
> --with-cma
> 3. openmpi-mca-params.conf file 
> orte_hetero_nodes=1
> hwloc_base_binding_policy=core
> rmaps_base_mapping_policy=core
> opal_cuda_support=0
> btl_openib_use_eager_rdma=0
> btl_openib_max_eager_rdma=0
> btl_openib_flags=1
> 
> Thanks,
> Jingchao 
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> ___
> users mailing list
> users@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users 
> 
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread Jeff Hammond
On Monday, August 22, 2016, Jingchao Zhang  wrote:

> Hi all,
>
>
> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them
> have odd behaviors when trying to read from standard input.
>
>
> For example, if we start the application lammps across 4 nodes, each node
> 16 cores, connected by Intel QDR Infiniband, mpirun works fine for the
> 1st time, but always stuck in a few seconds thereafter.
>
> Command:
>
> mpirun ./lmp_ompi_g++ < in.snr
>
> in.snr is the Lammps input file. compiler is gcc/6.1.
>
>
Using stdin with MPI codes is at best brittle. It is generally
discouraged. Furthermore, it is straight up impossible on some
supercomputer architectures.

> Instead, if we use
>
> mpirun ./lmp_ompi_g++ -in in.snr
>
> it works 100%.
>
Just do this all the time. AFAIK Quantum Espresso has the same option. I
never need stdin to run MiniDFT (i.e. QE-lite).

Since both codes you name already have the correct workaround for stdin, I
would not waste any time debugging this. Just do the right thing from now
on and enjoy having your applications work.

Jeff



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread Jingchao Zhang
Here you can find the source code for lammps input 
https://github.com/lammps/lammps/blob/r13864/src/input.cpp


Based on the gdb output, rank 0 is stuck at line 167

if (fgets(&line[m],maxline-m,infile) == NULL)

and the rest of the ranks are stuck at line 203

MPI_Bcast(&n,1,MPI_INT,0,world);


So rank 0 possibly hangs on the fgets() function.
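
(For reference, backtraces like the ones below can be collected by attaching
gdb to a hung process, roughly as follows; <pid> is a placeholder for the PID
of a lammps rank on the node.)

# attach to a hung rank and dump its call stack non-interactively
gdb -p <pid> -batch -ex bt > master.backtrace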


Here is the full backtrace information:

$ cat master.backtrace worker.backtrace
#0  0x003c37cdb68d in read () from /lib64/libc.so.6
#1  0x003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
#4  0x003c37c66ce9 in fgets () from /lib64/libc.so.6
#5  0x005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
#6  0x005d4236 in main () at ../main.cpp:31
#0  0x2b1635d2ace2 in poll_dispatch () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#1  0x2b1635d1fa71 in opal_libevent2022_event_base_loop ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#2  0x2b1635ce4634 in opal_progress () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#3  0x2b16351b8fad in ompi_request_default_wait () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#4  0x2b16351fcb40 in ompi_coll_base_bcast_intra_generic ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#5  0x2b16351fd0c2 in ompi_coll_base_bcast_intra_binomial ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#6  0x2b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
#7  0x2b16351cb4fb in PMPI_Bcast () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#8  0x005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
#9  0x005d4236 in main () at ../main.cpp:31

Thanks,


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400

From: users  on behalf of r...@open-mpi.org 

Sent: Monday, August 22, 2016 2:17:10 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

Hmmm...perhaps we can break this out a bit? The stdin will be going to your 
rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?

Can you first verify that the input is being correctly delivered to rank=0? 
This will help us isolate if the problem is in the IO forwarding, or in the 
subsequent Bcast.

On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:

Hi all,

We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them have 
odd behaviors when trying to read from standard input.

For example, if we start the application lammps across 4 nodes, each node 16 
cores, connected by Intel QDR Infiniband, mpirun works fine for the 1st time, 
but always stuck in a few seconds thereafter.
Command:
mpirun ./lmp_ompi_g++ < in.snr
in.snr is the Lammps input file. compiler is gcc/6.1.

Instead, if we use
mpirun ./lmp_ompi_g++ -in in.snr
it works 100%.

Some odd behaviors we gathered so far.
1. For 1 node job, stdin always works.
2. For multiple nodes, stdin works unstably when the number of cores per node 
are relatively small. For example, for 2/3/4 nodes, each node 8 cores, mpirun 
works most of the time. But for each node with >8 cores, mpirun works the 1st 
time, then always stuck. There seems to be a magic number when it stops working.
3. We tested Quantum Expresso with compiler intel/13 and had the same issue.

We used gdb to debug and found when mpirun was stuck, the rest of the processes 
were all waiting on mpi broadcast from the master thread. The lammps binary, 
input file and gdb core files (example.tar.bz2) can be downloaded from this 
link https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc

Extra information:
1. Job scheduler is slurm.
2. configure setup:

./configure --prefix=$PREFIX \
--with-hwloc=internal \
--enable-mpirun-prefix-by-default \
--with-slurm \
--with-verbs \
--with-psm \
--disable-openib-connectx-xrc \
--with-knem=/opt/knem-1.1.2.90mlnx1 \
--with-cma
3. openmpi-mca-params.conf file
orte_hetero_nodes=1
hwloc_base_binding_policy=core
rmaps_base_mapping_policy=core
opal_cuda_support=0
btl_openib_use_eager_rdma=0
btl_openib_max_eager_rdma=0
btl_openib_flags=1

Thanks,
Jingchao

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/us

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread r...@open-mpi.org
Well, I can try to find time to take a look. However, I will reiterate what 
Jeff H said - it is very unwise to rely on IO forwarding. Much better to just 
directly read the file unless that file is simply unavailable on the node where 
rank=0 is running.

> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang  wrote:
> 
> Here you can find the source code for lammps input 
> https://github.com/lammps/lammps/blob/r13864/src/input.cpp 
> 
> Based on the gdb output, rank 0 stuck at line 167
> if (fgets(&line[m],maxline-m,infile) == NULL)
> and the rest threads stuck at line 203
> MPI_Bcast(&n,1,MPI_INT,0,world);
> 
> So rank 0 possibly hangs on the fgets() function.
> 
> Here are the whole backtrace information:
> $ cat master.backtrace worker.backtrace
> #0  0x003c37cdb68d in read () from /lib64/libc.so.6
> #1  0x003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
> #2  0x003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
> #3  0x003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
> #4  0x003c37c66ce9 in fgets () from /lib64/libc.so.6
> #5  0x005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
> #6  0x005d4236 in main () at ../main.cpp:31
> #0  0x2b1635d2ace2 in poll_dispatch () from 
> /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #1  0x2b1635d1fa71 in opal_libevent2022_event_base_loop ()
>from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #2  0x2b1635ce4634 in opal_progress () from 
> /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #3  0x2b16351b8fad in ompi_request_default_wait () from 
> /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #4  0x2b16351fcb40 in ompi_coll_base_bcast_intra_generic ()
>from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #5  0x2b16351fd0c2 in ompi_coll_base_bcast_intra_binomial ()
>from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #6  0x2b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed ()
>from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
> #7  0x2b16351cb4fb in PMPI_Bcast () from 
> /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #8  0x005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
> #9  0x005d4236 in main () at ../main.cpp:31
> 
> Thanks,
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> From: users on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Monday, August 22, 2016 2:17:10 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>  
> Hmmm...perhaps we can break this out a bit? The stdin will be going to your 
> rank=0 proc. It sounds like you have some subsequent step that calls 
> MPI_Bcast?
> 
> Can you first verify that the input is being correctly delivered to rank=0? 
> This will help us isolate if the problem is in the IO forwarding, or in the 
> subsequent Bcast.
> 
>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> Hi all,
>> 
>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them have 
>> odd behaviors when trying to read from standard input.
>> 
>> For example, if we start the application lammps across 4 nodes, each node 16 
>> cores, connected by Intel QDR Infiniband, mpirun works fine for the 1st 
>> time, but always stuck in a few seconds thereafter.
>> Command:
>> mpirun ./lmp_ompi_g++ < in.snr
>> in.snr is the Lammps input file. compiler is gcc/6.1.
>> 
>> Instead, if we use
>> mpirun ./lmp_ompi_g++ -in in.snr
>> it works 100%.
>> 
>> Some odd behaviors we gathered so far. 
>> 1. For 1 node job, stdin always works.
>> 2. For multiple nodes, stdin works unstably when the number of cores per 
>> node are relatively small. For example, for 2/3/4 nodes, each node 8 cores, 
>> mpirun works most of the time. But for each node with >8 cores, mpirun works 
>> the 1st time, then always stuck. There seems to be a magic number when it 
>> stops working.
>> 3. We tested Quantum Expresso with compiler intel/13 and had the same issue. 
>> 
>> We used gdb to debug and found when mpirun was stuck, the rest of the 
>> processes were all waiting on mpi broadcast from the master thread. The 
>> lammps binary, input file and gdb core files (example.tar.bz2) can be 
>> downloaded from this link 
>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc 
>> 
>> 
>> Extra information:
>> 1. Job scheduler is slurm.
>> 2. configure setup:
>> ./configure --prefix=$PREFIX \
>> --with-hwloc=internal \
>> --enable-mpirun-prefix-by-default \
>> --with-slurm \
>> --with-verbs \
>> --with-psm \
>> --dis

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread Jingchao Zhang
This might be a thin argument, but we have had many users running mpirun this way 
for years with no problem until this recent upgrade. And some home-brewed mpi 
codes do not even have a standard way to read input files. Last time I 
checked, the openmpi manual still claims it supports stdin 
(https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). Maybe I missed it, 
but the v2.0 release notes did not mention any changes to the behavior of 
stdin either.


We can tell our users to run mpirun in the suggested way, but I do hope someone 
can look into the issue and fix it.


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400

From: users  on behalf of r...@open-mpi.org 

Sent: Monday, August 22, 2016 3:04:50 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

Well, I can try to find time to take a look. However, I will reiterate what 
Jeff H said - it is very unwise to rely on IO forwarding. Much better to just 
directly read the file unless that file is simply unavailable on the node where 
rank=0 is running.

On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:

Here you can find the source code for lammps input 
https://github.com/lammps/lammps/blob/r13864/src/input.cpp

Based on the gdb output, rank 0 stuck at line 167
if (fgets(&line[m],maxline-m,infile) == NULL)
and the rest threads stuck at line 203
MPI_Bcast(&n,1,MPI_INT,0,world);

So rank 0 possibly hangs on the fgets() function.

Here are the whole backtrace information:

$ cat master.backtrace worker.backtrace
#0  0x003c37cdb68d in read () from /lib64/libc.so.6
#1  0x003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
#4  0x003c37c66ce9 in fgets () from /lib64/libc.so.6
#5  0x005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
#6  0x005d4236 in main () at ../main.cpp:31
#0  0x2b1635d2ace2 in poll_dispatch () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#1  0x2b1635d1fa71 in opal_libevent2022_event_base_loop ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#2  0x2b1635ce4634 in opal_progress () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#3  0x2b16351b8fad in ompi_request_default_wait () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#4  0x2b16351fcb40 in ompi_coll_base_bcast_intra_generic ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#5  0x2b16351fd0c2 in ompi_coll_base_bcast_intra_binomial ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#6  0x2b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
#7  0x2b16351cb4fb in PMPI_Bcast () from 
/util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#8  0x005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
#9  0x005d4236 in main () at ../main.cpp:31

Thanks,

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400

From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
<r...@open-mpi.org>
Sent: Monday, August 22, 2016 2:17:10 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

Hmmm...perhaps we can break this out a bit? The stdin will be going to your 
rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?

Can you first verify that the input is being correctly delivered to rank=0? 
This will help us isolate if the problem is in the IO forwarding, or in the 
subsequent Bcast.

On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:

Hi all,

We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them have 
odd behaviors when trying to read from standard input.

For example, if we start the application lammps across 4 nodes, each node 16 
cores, connected by Intel QDR Infiniband, mpirun works fine for the 1st time, 
but always stuck in a few seconds thereafter.
Command:
mpirun ./lmp_ompi_g++ < in.snr
in.snr is the Lammps input file. compiler is gcc/6.1.

Instead, if we use
mpirun ./lmp_ompi_g++ -in in.snr
it works 100%.

Some odd behaviors we gathered so far.
1. For 1 node job, stdin always works.
2. For multiple nodes, stdin works unstably when the number of cores per node 
are relatively small. For example, for 2/3/4 nodes, each node 8 cores, mpirun 
works most of the time. But for each node with >8 cores, mpirun works the 1st 
time, then always stuck. There seems to be a magic number when it stops working.
3. We tested Quantum Expresso with compiler intel/13 and had the same issue.

We used gdb to debug and found when mpirun was stuck, the rest of the p

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-22 Thread r...@open-mpi.org
FWIW: I just tested forwarding up to 100MBytes via stdin using the simple test 
shown below with OMPI v2.0.1rc1, and it worked fine. So I’d suggest upgrading 
when the official release comes out, or going ahead and at least testing 
2.0.1rc1 on your machine. Or you can test this program with some input file and 
let me know if it works for you.

Ralph

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <unistd.h>
#include <mpi.h>

#define ORTE_IOF_BASE_MSG_MAX   2048

int main(int argc, char *argv[])
{
int i, rank, size, next, prev, tag = 201;
int pos, msgsize, nbytes;
bool done;
char *msg;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);

next = (rank + 1) % size;
prev = (rank + size - 1) % size;
msg = malloc(ORTE_IOF_BASE_MSG_MAX);
pos = 0;
nbytes = 0;

if (0 == rank) {
while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
if (msgsize > 0) {
MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, 
MPI_COMM_WORLD);
}
++pos;
nbytes += msgsize;
}
fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
} else {
while (1) {
MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
++pos;
done = true;
for (i=0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
if (0 != msg[i]) {
done = false;
break;
}
}
if (done) {
break;
}
}
fprintf(stderr, "Rank %d: recv done\n", rank);
MPI_Barrier(MPI_COMM_WORLD);
}

fprintf(stderr, "Rank %d has completed bcast\n", rank);
MPI_Finalize();
return 0;
}
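
To try it with some input (just a sketch; the source file name is arbitrary
and in.snr is simply the input file mentioned earlier in the thread):

mpicc -o stdin_test stdin_test.c
mpirun -np 4 ./stdin_test < in.snr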



> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang  wrote:
> 
> This might be a thin argument but we have many users running mpirun in this 
> way for years with no problem until this recent upgrade. And some home-brewed 
> mpi codes do not even have a standard way to read the input files. Last time 
> I checked, the openmpi manual still claims it supports stdin 
> (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). Maybe I missed 
> it but the v2.0 release notes did not mention any changes to the behaviors of 
> stdin as well.
> 
> We can tell our users to run mpirun in the suggested way, but I do hope 
> someone can look into the issue and fix it.
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> From: users on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Monday, August 22, 2016 3:04:50 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>  
> Well, I can try to find time to take a look. However, I will reiterate what 
> Jeff H said - it is very unwise to rely on IO forwarding. Much better to just 
> directly read the file unless that file is simply unavailable on the node 
> where rank=0 is running.
> 
>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> Here you can find the source code for lammps input 
>> https://github.com/lammps/lammps/blob/r13864/src/input.cpp 
>> 
>> Based on the gdb output, rank 0 stuck at line 167
>> if (fgets(&line[m],maxline-m,infile) == NULL)
>> and the rest threads stuck at line 203
>> MPI_Bcast(&n,1,MPI_INT,0,world);
>> 
>> So rank 0 possibly hangs on the fgets() function.
>> 
>> Here are the whole backtrace information:
>> $ cat master.backtrace worker.backtrace
>> #0  0x003c37cdb68d in read () from /lib64/libc.so.6
>> #1  0x003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>> #2  0x003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>> #3  0x003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>> #4  0x003c37c66ce9 in fgets () from /lib64/libc.so.6
>> #5  0x005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>> #6  0x005d4236 in main () at ../main.cpp:31
>> #0  0x2b1635d2ace2 in poll_dispatch () from 
>> /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>> #1  0x2b1635d1fa71 in opal_libevent2022_event_base_loop ()
>>from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>> #2  0x2b1635ce4634 in opal_progress () from 
>> /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>> #3  0x