Re: [OMPI users] Building vs packaging

2016-05-16 Thread David Shrader

Hey Rob,

I don't know if this is what is going on, but in general, when a package 
is installed via a distro's package management system, it ends up in 
system locations such as /usr/bin and /usr/lib that are automatically 
searched when looking for executables and libraries. So, it isn't 
necessarily that the package maintainers did anything different when 
putting together the package; instead, they may have put files in 
locations that are more accessible from a system-tool point of view. For 
example, the runtime linker knows to search several system-defined 
directories such as /usr/lib. This might explain why everything worked 
after installing openmpi-bin: the binaries and libraries all ended up in 
system locations that are automatically part of the environment on the 
remote node, so remote execution could find everything.
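
If you want to verify this on your end, a couple of quick checks along 
these lines should help (I'm assuming a Debian-style system since you 
used apt-get; adjust names as needed):

$> ldconfig -p | grep libopen-rte   # is the ORTE library in the runtime linker's cache?
$> ldd $(which mpirun)              # which libraries mpirun resolves, and from where
$> dpkg -L openmpi-bin | head       # the files the openmpi-bin package actually installed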


Thanks,

David


On 05/14/2016 05:37 AM, Rob Malpass wrote:


Hi all

I posted about a fortnight ago to this list as I was having some 
trouble getting my nodes to be controlled by my master node.   
Perceived wisdom at the time was to compile with the 
--enable-orterun-prefix-by-default option.


For some time I’d been getting a "cannot open libopen-rte.so.7" error, which 
points to a problem with LD_LIBRARY_PATH. I had been able to run it 
on nodes 3 and 4 even though (from the head node)

ssh node4 'echo $LD_LIBRARY_PATH'

returns a blank line. However, as I say, it’s working on nodes 3 and 4.

I had been hacking for ages on nodes 1 and 2, getting the same error, 
but still with LD_LIBRARY_PATH apparently not set for an interactive 
login.


Almost in desperation, I cheated:

sudo  apt-get install openmpi-bin

and hey presto.   I can now do (from head node)

mpirun -H node2,node3,node4 -n 10 foo

and it works fine. So clearly apt-get install has set something that 
I’d not done (and it’s seemingly not LD_LIBRARY_PATH), as ssh node2 
'echo $LD_LIBRARY_PATH' still returns a blank line.


Can anyone tell me what might be in the install script so I can get a 
clue?


Thanks





--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



[OMPI users] what was the rationale behind rank mapping by socket?

2016-09-29 Thread David Shrader

Hello All,

Would anyone know why the default mapping scheme is socket for jobs with 
more than 2 ranks? Could someone please take some time to explain the 
reasoning? Please note I am not railing against the decision, but rather 
trying to gather as much information about it as I can so as to be able 
to better work with my users, who are just now starting to ask questions 
about it. The FAQ pretty much pushes folks to the man pages, and the 
mpirun man page doesn't go into the reasoning.


Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



[OMPI users] how to tell if pmi or pmi2 is being used?

2016-10-13 Thread David Shrader

Hello All,

I'm using Open MPI 1.10.3 with Slurm and would like to ask how to find 
out whether pmi1 or pmi2 was used for process launching. The Slurm 
installation is supposed to support both pmi1 and pmi2, but I would 
really like to know which one I fall into. I tried using '-mca 
plm_base_verbose 100' on the mpirun line, but it didn't mention pmi 
specifically. Instead, all I could really find was that it was using the 
slurm component. Is there something else I can look at in the output 
that would have that detail?
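
In case it helps anyone else digging into the same question, here are 
two checks that should narrow it down (I can't promise they are the 
definitive diagnostics, but they are quick):

$> ompi_info | grep -i pmi   # lists pmi-related components if Open MPI was built with PMI support
$> srun --mpi=list           # asks Slurm which MPI/PMI plugin types it offers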


Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] how to tell if pmi or pmi2 is being used?

2016-10-13 Thread David Shrader

That is really good to know. Thanks!
David

On 10/13/2016 12:27 PM, r...@open-mpi.org wrote:

If you are using mpirun, then neither PMI1 nor PMI2 is involved at all. ORTE 
has its own internal mechanism for handling wireup.
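
(In other words, PMI only comes into play when processes are launched 
directly by the resource manager rather than via mpirun. With Slurm, 
that would be a direct launch along these lines:

$> srun --mpi=pmi2 -n 16 ./a.out   # Slurm's PMI2 plugin handles wireup instead of ORTE

where srun's --mpi flag selects the PMI plugin.)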



On Oct 13, 2016, at 10:43 AM, David Shrader  wrote:

Hello All,

I'm using Open MPI 1.10.3 with Slurm and would like to ask how to find out whether 
pmi1 or pmi2 was used for process launching. The Slurm installation is supposed 
to support both pmi1 and pmi2, but I would really like to know which one I fall 
into. I tried using '-mca plm_base_verbose 100' on the mpirun line, but it 
didn't mention pmi specifically. Instead, all I could really find was that it 
was using the slurm component. Is there something else I can look at in the 
output that would have that detail?

Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



[OMPI users] question about "--rank-by slot" behavior

2016-11-30 Thread David Shrader

Hello All,

The man page for mpirun says that the default ranking procedure is 
round-robin by slot. It doesn't seem to be that straightforward to me, 
though, and I wanted to ask about the behavior.


To help illustrate my confusion, here are a few examples where the 
ranking behavior changed based on the mapping behavior, which doesn't 
make sense to me, yet. First, here is a simple map by core (using 4 
nodes of 32 cpu cores each):


$> mpirun -n 128 --map-by core --report-bindings true
[gr0649.localdomain:119614] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 1 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 2 bound to socket 0[core 2[hwt 0]]: 
[././B/././././././././././././././.][./././././././././././././././././.]

...output snipped...

Things look as I would expect: ranking happens round-robin through the 
cpu cores. Now, here's a map by socket example:


$> mpirun -n 128 --map-by socket --report-bindings true
[gr0649.localdomain:119926] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 1 bound to socket 1[core 18[hwt 0]]: 
[./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././././././.][./././././././././././././././././.]

...output snipped...

Why is rank 1 on a different socket? I know I am mapping by socket in 
this example, but, fundamentally, nothing should really be different in 
terms of ranking, correct? The same number of processes are available on 
each host as in the first example, and available in the same locations. 
How is "slot" different in this case? If I use "--rank-by core," I 
recover the output from the first example.


I thought that maybe "--rank-by slot" might be following something laid 
down by "--map-by", but the following example shows that isn't 
completely correct, either:


$> mpirun -n 128 --map-by socket:span --report-bindings true
[gr0649.localdomain:119319] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 1 bound to socket 1[core 18[hwt 0]]: 
[./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././././././.][./././././././././././././././././.]

...output snipped...

If ranking by slot were somehow following something left over by 
mapping, I would have expected rank 2 to end up on a different host. So, 
now I don't know what to expect from using "--rank-by slot." Does anyone 
have any pointers?


Thank you for the help!
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] question about "--rank-by slot" behavior

2016-11-30 Thread David Shrader

Hello Ralph,

I do understand that "slot" is an abstract term and isn't tied down to 
any particular piece of hardware. What I am trying to understand is how 
"slot" came to be equivalent to "socket" in my second and third examples, 
but "core" in my first example. As far as I can tell, MPI ranks should 
have been assigned the same in all three examples. Why weren't they?


You mentioned that, when using "--rank-by slot", the ranks are assigned 
round-robin by scheduler entry; does this mean that the scheduler 
entries change based on the mapping algorithm (the only thing I changed 
in my examples) and this results in ranks being assigned differently?


Thanks again,
David

On 11/30/2016 01:23 PM, r...@open-mpi.org wrote:

I think you have confused “slot” with a physical “core”. The two have 
absolutely nothing to do with each other.

A “slot” is nothing more than a scheduling entry in which a process can be 
placed. So when you --rank-by slot, the ranks are assigned round-robin by 
scheduler entry - i.e., you assign all the ranks on the first node, then assign 
all the ranks on the next node, etc.

It doesn’t matter where those ranks are placed, or what core or socket they are 
running on. We just blindly go thru and assign numbers.

If you rank-by core, then we cycle across the procs by looking at the core 
number they are bound to, assigning all the procs on a node before moving to 
the next node. If you rank-by socket, then you cycle across the procs on a node 
by round-robin of sockets, assigning all procs on the node before moving to the 
next node. If you then added “span” to that directive, we’d round-robin by 
socket across all nodes before circling around to the next proc on this node.
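
As a concrete check of the above, the ranker can be forced independently 
of the mapper with the standard mpirun options (same 4 nodes of 32 cores 
as in the examples in this thread):

$> mpirun -n 128 --map-by socket --rank-by core --report-bindings true

which, as noted elsewhere in the thread, recovers the rank ordering of 
the plain --map-by core run.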

HTH
Ralph



On Nov 30, 2016, at 11:26 AM, David Shrader  wrote:

Hello All,

The man page for mpirun says that the default ranking procedure is round-robin 
by slot. It doesn't seem to be that straightforward to me, though, and I 
wanted to ask about the behavior.

To help illustrate my confusion, here are a few examples where the ranking 
behavior changed based on the mapping behavior, which doesn't make sense to me, 
yet. First, here is a simple map by core (using 4 nodes of 32 cpu cores each):

$> mpirun -n 128 --map-by core --report-bindings true
[gr0649.localdomain:119614] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 1 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 2 bound to socket 0[core 2[hwt 0]]: 
[././B/././././././././././././././.][./././././././././././././././././.]
...output snipped...

Things look as I would expect: ranking happens round-robin through the cpu 
cores. Now, here's a map by socket example:

$> mpirun -n 128 --map-by socket --report-bindings true
[gr0649.localdomain:119926] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 1 bound to socket 1[core 18[hwt 0]]: 
[./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././././././.][./././././././././././././././././.]
...output snipped...

Why is rank 1 on a different socket? I know I am mapping by socket in this example, but, 
fundamentally, nothing should really be different in terms of ranking, correct? The same number of 
processes are available on each host as in the first example, and available in the same locations. 
How is "slot" different in this case? If I use "--rank-by core," I recover the 
output from the first example.

I thought that maybe "--rank-by slot" might be following something laid down by 
"--map-by", but the following example shows that isn't completely correct, either:

$> mpirun -n 128 --map-by socket:span --report-bindings true
[gr0649.localdomain:119319] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 1 bound to socket 1[core 18[hwt 0]]: 
[./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././././././.][./././././././././././././././././.]
...output snipped...

If ranking by slot were somehow following something left over by mapping, I would have 
expected rank 2 to end up on a different host. So, now I don't know what to expect from 
using "--rank-by slot." Does anyone have any pointers?

Thank you for the help!
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov

Re: [OMPI users] question about "--rank-by slot" behavior

2016-11-30 Thread David Shrader
Thank you for the explanation! I understand what is going on now: there 
is a process list for each node whose order is dependent on the mapping 
policy, and the ranker, when using "slot," walks through that list. 
Makes sense.


Thank you again!
David

On 11/30/2016 04:46 PM, r...@open-mpi.org wrote:

“slot” never became equivalent to “socket”, or to “core”. Here is what happened:

* For your first example: the mapper assigns the first process to the first node 
because there is a free core there, and you said to map-by core. It goes on to 
assign the second process to the second core, and the third process to the 
third core, etc., until we reach the defined #procs for that node (i.e., the 
number of assigned “slots” for that node). When it goes to rank the procs, the 
ranker starts with the first process assigned on the first node - this process 
occupies the first “slot”, and so it gets rank 0. The ranker then assigns rank 
1 to the second process it assigned to the first node, as that process occupies 
the second “slot”. Etc.

* For your second example: the mapper assigns the first process to the first socket of 
the first node, the second process to the second socket of the first node, and 
the third process to the first socket of the first node, until all the “slots” 
for that node have been filled. The ranker then starts with the first process 
that was assigned to the first node, and gives it rank 0. The ranker then 
assigns rank 1 to the second process that was assigned to the node - that would 
be the first proc mapped to the second socket. The ranker then assigns rank 2 
to the third proc assigned to the node - that would be the 2nd proc assigned to 
the first socket.

* For your third example: the mapper assigns the first process to the first socket of 
the first node, the second process to the second socket of the first node, and 
the third process to the first socket of the second node, continuing around 
until all procs have been mapped. The ranker then starts with the first proc 
assigned to the first node, and gives it rank 0. The ranker then assigns rank 1 
to the second process assigned to the first node (because we are ranking by 
slot!), which corresponds to the first proc mapped to the second socket. The 
ranker then assigns rank 2 to the third process assigned to the first node, 
which corresponds to the second proc mapped to the first socket of that node.

So you can see that you will indeed get the same relative ranking, even though 
the mapping was done using a different algorithm.

HTH
Ralph


On Nov 30, 2016, at 2:16 PM, David Shrader  wrote:

Hello Ralph,

I do understand that "slot" is an abstract term and isn't tied down to any particular piece of hardware. What 
I am trying to understand is how "slot" came to be equivalent to "socket" in my second and third 
examples, but "core" in my first example. As far as I can tell, MPI ranks should have been assigned the same 
in all three examples. Why weren't they?

You mentioned that, when using "--rank-by slot", the ranks are assigned 
round-robin by scheduler entry; does this mean that the scheduler entries change based on 
the mapping algorithm (the only thing I changed in my examples) and this results in ranks 
being assigned differently?

Thanks again,
David

On 11/30/2016 01:23 PM, r...@open-mpi.org wrote:

I think you have confused “slot” with a physical “core”. The two have 
absolutely nothing to do with each other.

A “slot” is nothing more than a scheduling entry in which a process can be 
placed. So when you --rank-by slot, the ranks are assigned round-robin by 
scheduler entry - i.e., you assign all the ranks on the first node, then assign 
all the ranks on the next node, etc.

It doesn’t matter where those ranks are placed, or what core or socket they are 
running on. We just blindly go thru and assign numbers.

If you rank-by core, then we cycle across the procs by looking at the core 
number they are bound to, assigning all the procs on a node before moving to 
the next node. If you rank-by socket, then you cycle across the procs on a node 
by round-robin of sockets, assigning all procs on the node before moving to the 
next node. If you then added “span” to that directive, we’d round-robin by 
socket across all nodes before circling around to the next proc on this node.

HTH
Ralph



On Nov 30, 2016, at 11:26 AM, David Shrader  wrote:

Hello All,

The man page for mpirun says that the default ranking procedure is round-robin 
by slot. It doesn't seem to be that straightforward to me, though, and I 
wanted to ask about the behavior.

To help illustrate my confusion, here are a few examples where the ranking 
behavior changed based on the mapping behavior, which doesn't make sense to me, 
yet. First, here is a simple map by core (using 4 nodes of 32 cpu cores each):

$> mpirun -n 128 --map-by core --report-bindings true

[OMPI users] what was ompi configured with?

2015-05-05 Thread David Shrader

Hello,

Is there a way to tell what configure line was used in building Open MPI 
from the installation itself? That is, not from config.log but from 
issuing some command like 'mpicc --version'. I'm wondering if a 
particular installation of Open MPI has anything that "remembers" how it 
was configured.


Thank you very much for your time,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] what was ompi configured with?

2015-05-05 Thread David Shrader

That is pretty much what I am looking for. Thank you!
David

On 05/05/2015 12:58 PM, Jeff Squyres (jsquyres) wrote:

We can't capture the exact configure command line, but you can look at the 
output from ompi_info to check specific characteristics of your Open MPI 
installation.

ompi_info with no CLI options tells you a bunch of stuff; "ompi_info --all" 
tells you (a lot) more.
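
For example (the labels below are as printed by ompi_info in the 1.8-era 
releases; they may vary slightly between versions):

$> ompi_info | grep -E 'Configured|Built'   # who, when, and where the build was configured and built
$> ompi_info --all | grep -i cflags         # compiler flags recorded at build time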



On May 5, 2015, at 2:54 PM, David Shrader  wrote:

Hello,

Is there a way to tell what configure line was used in building Open MPI from the 
installation itself? That is, not from config.log but from issuing some command like 
'mpicc --version'. I'm wondering if a particular installation of Open MPI has anything 
that "remembers" how it was configured.

Thank you very much for your time,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov





--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



[OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-22 Thread David Shrader

Hello,

I'm getting a spurious '-L' flag when I have mxm installed in 
system-space (/usr/lib64/libmxm.so) which is causing an error at link 
time during make:


...output snipped...
/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99 -O3 
-DNDEBUG -I/opt/panfs/include -finline-functions -fno-strict-aliasing 
-pthread -module -avoid-version   -o libmca_mtl_mxm.la  mtl_mxm.lo 
mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo 
mtl_mxm_probe.lo mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm -lutil

libtool: link: require no space between `-L' and `-lrt'
make[2]: *** [libmca_mtl_mxm.la] Error 1
make[2]: Leaving directory 
`/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'

make: *** [all-recursive] Error 1

If I use --with-mxm=no, then this error doesn't occur (as expected, as 
the mxm component isn't touched). Has anyone run into this before?


Here is my configure line:

./configure --disable-silent-rules 
--with-platform=contrib/platform/lanl/toss/optimized-panasas --prefix=...


I wonder if, somewhere in configure, there is an empty variable that 
should contain the directory libmxm is in, which then gets paired with a 
"-L" since no directory is passed to --with-mxm. I think I'll go through 
the configure script while waiting to see if anyone else has run into this.


Thank you for any and all help,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread David Shrader

Hello Mike,

This particular instance of mxm was installed using rpms that were 
re-rolled by our admins. I'm not 100% sure where they got them (HPCx or 
somewhere else). I myself am not using HPCx. Is there any particular 
reason why mxm shouldn't be in system space? If there is, I'll share it 
with our admins and try to get the install location corrected.


As for what is causing the extra -L, it does look like an empty variable 
is used without checking that it is empty in configure. Line 246117 in 
the configure script provided by the openmpi-1.8.5.tar.bz2 tarball has this:


ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"

By invoking configure with '/bin/sh -x ./configure ...' and changing PS4 
to output line numbers, I saw that line 246117 was setting 
ompi_check_mxm_extra_libs to just "-L". It turns out that configure does 
this in three separate locations. I put a check around all three 
instances like this:


if test ! -z "$ompi_check_mxm_libdir"; then
  ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
fi

And the spurious '-L' disappeared from the linking commands and make 
completed fine.
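
For reference, the tracing invocation described above looks roughly like 
this (the PS4 value is just one way to get line numbers into the trace):

$> PS4='+ ${LINENO}: ' /bin/sh -x ./configure --disable-silent-rules ... 2>&1 | tee configure.trace
$> grep ompi_check_mxm_extra_libs configure.trace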


So, it looks like there are two solutions: move the install location of 
mxm to not be in system-space or modify configure. Which one would be 
the better one for me to pursue?


Thanks,
David

On 05/23/2015 12:05 AM, Mike Dubman wrote:

Hi,

How was mxm installed? By copying?

The rpm based installation places mxm into /opt/mellanox/mxm and not 
into /usr/lib64/libmxm.so.


Do you use HPCX (a pack of OMPI, MXM, and FCA)?
You can download HPCX, extract it anywhere, and compile OMPI pointing 
to the mxm location under HPCX.


Also, HPCX contains rpms for mxm and fca.


M

On Sat, May 23, 2015 at 1:07 AM, David Shrader wrote:


Hello,

I'm getting a spurious '-L' flag when I have mxm installed in
system-space (/usr/lib64/libmxm.so) which is causing an error at
link time during make:

...output snipped...
/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99
-O3 -DNDEBUG -I/opt/panfs/include -finline-functions
-fno-strict-aliasing -pthread -module -avoid-version   -o
libmca_mtl_mxm.la mtl_mxm.lo
mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo
mtl_mxm_probe.lo mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm
-lutil
libtool: link: require no space between `-L' and `-lrt'
make[2]: *** [libmca_mtl_mxm.la] Error 1
make[2]: Leaving directory
`/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'
make: *** [all-recursive] Error 1

If I use --with-mxm=no, then this error doesn't occur (as
expected, as the mxm component isn't touched). Has anyone run into
this before?

Here is my configure line:

./configure --disable-silent-rules
--with-platform=contrib/platform/lanl/toss/optimized-panasas
--prefix=...

I wonder if, somewhere in configure, there is an empty variable
that should contain the directory libmxm is in, which then gets
paired with a "-L" since no directory is passed to --with-mxm. I
think I'll go through the configure script while waiting to see if
anyone else has run into this.

Thank you for any and all help,
David

-- 
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov





--

Kind Regards,

M.




--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread David Shrader

Hello Mike,

I'm glad that I could be of help.

Just as an FYI, right now our admins are still hosting the fca libraries 
in /opt, but they would like to have it in system-space just as they 
have done with mxm. I haven't worked my way through all of the 
fca-related logic in configure yet, so I don't know if putting the fca 
libraries in system-space will expose any issues as of yet. It might be 
a good idea to check out that logic while looking at the mxm-related logic.


Thank you again!
David

On 05/26/2015 09:41 AM, Mike Dubman wrote:

Hello David,
Thanks for info and patch - will fix ompi configure logic with your patch.

mxm can be installed in the system and user spaces - both are valid 
and supported.


M

On Tue, May 26, 2015 at 5:50 PM, David Shrader wrote:


Hello Mike,

This particular instance of mxm was installed using rpms that were
re-rolled by our admins. I'm not 100% sure where they got them
(HPCx or somewhere else). I myself am not using HPCx. Is there any
particular reason why mxm shouldn't be in system space? If there
is, I'll share it with our admins and try to get the install
location corrected.

As for what is causing the extra -L, it does look like an empty
variable is used without checking that it is empty in configure.
Line 246117 in the configure script provided by the
openmpi-1.8.5.tar.bz2 tarball has this:

ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"

By invoking configure with '/bin/sh -x ./configure ...' and
changing PS4 to output line numbers, I saw that line 246117 was
setting ompi_check_mxm_extra_libs to just "-L". It turns out that
configure does this in three separate locations. I put a check
around all three instances like this:

if test ! -z "$ompi_check_mxm_libdir"; then
  ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
fi

And the spurious '-L' disappeared from the linking commands and
make completed fine.

So, it looks like there are two solutions: move the install
location of mxm to not be in system-space or modify configure.
Which one would be the better one for me to pursue?

Thanks,
David


On 05/23/2015 12:05 AM, Mike Dubman wrote:

Hi,

How was mxm installed? By copying?

The rpm based installation places mxm into /opt/mellanox/mxm and
not into /usr/lib64/libmxm.so.

Do you use HPCX (a pack of OMPI, MXM, and FCA)?
You can download HPCX, extract it anywhere, and compile OMPI
pointing to the mxm location under HPCX.

Also, HPCX contains rpms for mxm and fca.


M

On Sat, May 23, 2015 at 1:07 AM, David Shrader wrote:

Hello,

I'm getting a spurious '-L' flag when I have mxm installed in
system-space (/usr/lib64/libmxm.so) which is causing an error
at link time during make:

...output snipped...
/bin/sh ../../../../libtool  --tag=CC  --mode=link gcc
-std=gnu99 -O3 -DNDEBUG -I/opt/panfs/include
-finline-functions -fno-strict-aliasing -pthread -module
-avoid-version   -o libmca_mtl_mxm.la mtl_mxm.lo mtl_mxm_cancel.lo
mtl_mxm_component.lo mtl_mxm_endpoint.lo mtl_mxm_probe.lo
mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm -lutil
libtool: link: require no space between `-L' and `-lrt'
make[2]: *** [libmca_mtl_mxm.la] Error 1
make[2]: Leaving directory
`/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'
make: *** [all-recursive] Error 1

If I use --with-mxm=no, then this error doesn't occur (as
expected, as the mxm component isn't touched). Has anyone run
into this before?

Here is my configure line:

./configure --disable-silent-rules
--with-platform=contrib/platform/lanl/toss/optimized-panasas
--prefix=...

I wonder if, somewhere in configure, there is an empty variable
that should contain the directory libmxm is in, which then gets
paired with a "-L" since no directory is passed to --with-mxm. I
think I'll go through the configure script while waiting to see
if anyone else has run into this.

Thank you for any and all help,
David

-- 
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov


Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread David Shrader

Hello Mike,

I'm still working on getting you my config.log, but I thought I would 
chime in about that line 36. In my case, that code path is not executed 
because with_mxm is empty (I don't use --with-mxm on the configure line 
since libmxm.so is in system space and configure picks up on it 
automatically). Thus, ompi_check_mxm_libdir never gets assigned which 
results in just "-L" getting used on line 41. The same behavior could be 
found by using '--with-mxm=yes'.


Thanks,
David

On 05/26/2015 11:28 AM, Mike Dubman wrote:

Thanks Jeff!

but in this line:

https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L36

ompi_check_mxm_libdir gets value if with_mxm was passed



On Tue, May 26, 2015 at 6:59 PM, Jeff Squyres (jsquyres) wrote:


This line:

https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L41

doesn't check to see if $ompi_check_mxm_libdir is empty.


> On May 26, 2015, at 11:50 AM, Mike Dubman wrote:
>
> David,
> Could you please send me your config.log file?
>
> Looking into config/ompi_check_mxm.m4 macro I don`t understand
how it could happen.
>
> Thanks a lot.
>
> On Tue, May 26, 2015 at 6:41 PM, Mike Dubman wrote:
> Hello David,
> Thanks for info and patch - will fix ompi configure logic with
your patch.
>
> mxm can be installed in the system and user spaces - both are
    valid and supported logic.
>
> M
>
> On Tue, May 26, 2015 at 5:50 PM, David Shrader wrote:
> Hello Mike,
>
> This particular instance of mxm was installed using rpms that
were re-rolled by our admins. I'm not 100% sure where they got
them (HPCx or somewhere else). I myself am not using HPCx. Is
there any particular reason why mxm shouldn't be in system space?
If there is, I'll share it with our admins and try to get the
install location corrected.
>
> As for what is causing the extra -L, it does look like an empty
variable is used without checking that it is empty in configure.
Line 246117 in the configure script provided by the
openmpi-1.8.5.tar.bz2 tarball has this:
>
> ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
>
> By invoking configure with '/bin/sh -x ./configure ...' and
changing PS4 to output line numbers, I saw that line 246117 was
setting ompi_check_mxm_extra_libs to just "-L". It turns out that
configure does this in three separate locations. I put a check
around all three instances like this:
>
> if test ! -z "$ompi_check_mxm_libdir"; then
>  ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
> fi
>
> And the spurious '-L' disappeared from the linking commands and
make completed fine.
>
> So, it looks like there are two solutions: move the install
location of mxm to not be in system-space or modify configure.
Which one would be the better one for me to pursue?
>
> Thanks,
> David
>
>
> On 05/23/2015 12:05 AM, Mike Dubman wrote:
>> Hi,
>>
>> How mxm was installed? by copying?
>>
>> The rpm based installation places mxm into /opt/mellanox/mxm
and not into /usr/lib64/libmxm.so.
>>
    >> Do you use HPCx (pack of OMPI and MXM and FCA)?
>> You can download HPCX, extract it anywhere and compile OMPI
pointing to mxm location under HPCX.
>>
>> Also, HPCx contains rpms for mxm and fca.
>>
>>
>> M
>>
>> On Sat, May 23, 2015 at 1:07 AM, David Shrader wrote:
>> Hello,
>>
>> I'm getting a spurious '-L' flag when I have mxm installed in
system-space (/usr/lib64/libmxm.so) which is causing an error at
link time during make:
>>
>> ...output snipped...
>> /bin/sh ../../../../libtool  --tag=CC  --mode=link gcc
-std=gnu99 -O3 -DNDEBUG -I/opt/panfs/include -finline-functions
-fno-strict-aliasing -pthread -module -avoid-version  -o
libmca_mtl_mxm.la mtl_mxm.lo
mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo
mtl_mxm_probe.lo mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm
-lutil
>> libtool: link: require no space between `-L' and `-lrt'
>> make[2]: *** [libmca_mtl_mxm.la] Error 1
>> make[2]: Leaving directory

`/turquoise/usr/projec

Re: [OMPI users] Openmpi compilation errors

2015-05-27 Thread David Shrader

Looking at the config.log, I see this:

pgi-cc-lin64: LICENSE MANAGER PROBLEM: No such feature exists.
Feature:   pgi-cc-lin64

It looks like there is a problem with the PGI license. Does the compiler work 
with a regular source file (e.g., a hello_world program)? If it does, how do 
you get it to work (env variables, license file, etc.)?
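
PGI's licensing is FlexLM-based, so if a plain compile fails the same 
way, the license environment variables are the usual suspects. A typical 
setup looks something like this (the path is only an example; a 
port@host value pointing at a license server also works):

$> export LM_LICENSE_FILE=/opt/pgi/license.dat
$> pgcc hello.c && ./a.out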


Thanks,
David

On 05/27/2015 10:25 AM, Bruno Queiros wrote:

Hello

I'm trying to compile openmpi-1.8.5 with Portland Fortran 10.4, 64-bit, 
on a 64-bit CentOS 7 system.


This is the output I get:

./configure CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 F90=pgf90 
--prefix=/opt/openmpi-1.8.5_pgf90



== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... pgcc
checking whether the C compiler works... no
configure: error: in `/root/TransferArea/openmpi-1.8.5':
configure: error: C compiler cannot create executables
See `config.log' for more details

The config.log goes as an attachment




--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Openmpi compilation errors

2015-05-27 Thread David Shrader
Yes, exactly like that. Given your configure line, all of the Portland 
Group's compilers need to work:


$> pgf90 hello.f90
$> pgcc hello.c
$> pgCC hello.cpp

Which of those commands work for you?
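
If you need minimal test sources, a generic smoke test like this will do 
(the file name and contents are just examples):

$> cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { printf("hello from C\n"); return 0; }
EOF
$> pgcc hello.c && ./a.out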

Thanks,
David

On 05/27/2015 11:01 AM, Bruno Queiros wrote:

David

Do you mean checking whether the Portland Fortran compiler works? Like 'pgf90 hello.f'?

Bruno


On Wed, May 27, 2015, at 17:40, David Shrader wrote:


Looking at the config.log, I see this:

pgi-cc-lin64: LICENSE MANAGER PROBLEM: No such feature exists.
Feature:   pgi-cc-lin64

It looks like there is a problem with the PGI license. Does it
work with a regular file (e.g., hello_world)? If it does, how do
you get it to work (env variables, license file, etc.)?

Thanks,
David


On 05/27/2015 10:25 AM, Bruno Queiros wrote:

Hello

I'm trying to compile openmpi-1.8.5 with Portland Fortran 10.4,
64-bit, on a 64-bit CentOS 7 system.

This is the output I get:

./configure CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 F90=pgf90
--prefix=/opt/openmpi-1.8.5_pgf90


== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... pgcc
checking whether the C compiler works... no
configure: error: in `/root/TransferArea/openmpi-1.8.5':
configure: error: C compiler cannot create executables
See `config.log' for more details

The config.log goes as an attachment




-- 
David Shrader

HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov






--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



[OMPI users] orte-clean hang in 1.8.5

2015-06-08 Thread David Shrader

Hello All,

I had a user report that orte-clean is hanging on him with Open MPI 
1.8.5. Here are the steps I used to reproduce what he reported:


%> which orte-clean
/usr/projects/hpcsoft/toss2/moonlight/openmpi/1.6.5-gcc-4.4/bin/orte-clean
%> mpirun -n 1 
/usr/projects/hpcsoft/toss2/moonlight/openmpi/1.6.5-gcc-4.4/bin/orte-clean

Reported: 1 (out of 1) daemons - 1 (out of 1) procs
[hangs]

I have found that the same behavior does not happen using 1.6.5. That 
is, I get a command prompt after running orte-clean.


Is this behavior expected? I am not familiar with orte-clean, so I am 
not sure if it hanging when used in this fashion is an actual problem 
with orte-clean. If it is unexpected behavior, I'll dig some more.


Thank you very much for your time,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov






[OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-10 Thread David Shrader

Hello All,

I'm having some trouble getting Open MPI 1.8.8 to configure correctly 
when hcoll is installed in system space. That is, hcoll is installed to 
/usr/lib64 and /usr/include/hcoll. I get an error during configure:


$> ./configure --with-hcoll
...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: static
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

I have also tried using "--with-hcoll=yes" and gotten the same behavior. 
Has anyone else gotten the hcoll component to build when hcoll itself is 
in system space? I am using hcoll-3.2.748.


I did take a look at configure, and it looks like there is a test on 
"with_hcoll" on line 220072 to see if it is not empty and not "yes". In my 
case, this test fails, so the else clause gets invoked. The else clause 
is several hundred lines below, on line 220822, and simply sets 
ompi_check_hcoll_happy="no". Configure doesn't try to do anything to 
figure out if hcoll is usable, but it does quit soon after with the 
above error because ompi_check_hcoll_happy isn't "yes".
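
In shell terms, the logic appears to boil down to something like this (a 
simplified sketch of what I saw, not the literal configure code):

if test -n "$with_hcoll" && test "$with_hcoll" != "yes"; then
    # a prefix was given: probe $with_hcoll/include and $with_hcoll/lib,
    # then set ompi_check_hcoll_happy from the header/library checks
    :
else
    # unspecified or plain --with-hcoll: no probing happens at all
    ompi_check_hcoll_happy="no"
fi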


In case it helps, here is the output from config.log for that area:

...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: dso
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

## ---------------- ##
## Cache variables. ##
## ---------------- ##
...output snipped...

Have I missed something in specifying --with-hcoll? I would prefer not 
to use "--with-hcoll=/usr" as I am pretty sure that spurious linker 
flags to that area will work their way in when they shouldn't.


Thanks,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread David Shrader
I have cloned Gilles' topic/hcoll_config branch and, after running 
autogen.pl, have found that './configure --with-hcoll' does indeed work 
now. I used Gilles' branch as I wasn't sure how best to get the pull 
request changes into my own clone of master. It looks like the proper 
checks are happening, too:


--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I don't 
have much experience running off of the latest source. For now, I think 
I will try to generate a patch to the 1.8.8 configure script and see if 
that works as expected.


Thanks,
David

On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:

On Aug 11, 2015, at 1:39 AM, Åke Sandgren  wrote:

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile flags 
is broken.

+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
comments to it.



--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader

Hello Gilles,

Thank you very much for the patch! It is much more complete than mine. 
Using that patch and re-running autogen.pl, I am able to build 1.8.8 
with './configure --with-hcoll' without errors.


I do have issues when it comes to running 1.8.8 with hcoll built in, 
however. In my quick sanity test of running a basic parallel hello world 
C program, I get the following:


[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x000f776e mca_coll_base_comm_select()  ??:0
12 0x00074ee4 ompi_mpi_init()  ??:0
13 0x00093dc0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x000f776e mca_coll_base_comm_select()  ??:0
12 0x00074ee4 ompi_mpi_init()  ??:0
13 0x00093dc0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited 
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I do not get this message with only 1 process.

I am using hcoll 3.2.748. Could this be an issue with hcoll itself or 
something with my ompi build?


Thanks,
David

On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:

Thanks David,

i made a PR for the v1.8 branch at 
https://github.com/open-mpi/ompi-release/pull/492


the patch is attached (it required some back-porting)

Cheers,

Gilles

On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch and, after running 
autogen.pl, have found that './configure --with-hcoll' does indeed 
work now. I used Gilles' branch as I wasn't sure how best to get the 
pull request changes into my own clone of master. It looks like the 
proper checks are happening, too:

--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I 
don't have much experience running off of the latest source. For now, 
I think I will try to generate a patch to the 1.8.8 configure script 
and see if that works as expected.

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited 
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks,
David

On 08/12/2015 10:42 AM, Deva wrote:

Hi David,

This issue is from the hcoll library. It could be because of a symbol 
conflict with the ml module. This was fixed recently in HCOLL. Can you 
try with "-mca coll ^ml" and see if this workaround works in your setup?


-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader wrote:


Hello Gilles,

Thank you very much for the patch! It is much more complete than
mine. Using that patch and re-running autogen.pl
<http://autogen.pl>, I am able to build 1.8.8 with './configure
--with-hcoll' without errors.

I do have issues when it comes to running 1.8.8 with hcoll built
in, however. In my quick sanity test of running a basic parallel
hello world C program, I get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader
The admin that rolled the hcoll rpm that we're using (and got it in 
system space) said that she got it from 
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.


Thanks,
David

On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib? MOFED or HPCX? What version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader wrote:


Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks,
David

On 08/12/2015 10:42 AM, Deva wrote:

Hi David,

This issue is from the hcoll library. It could be because of a symbol
conflict with the ml module. This was fixed recently in HCOLL. Can
you try with "-mca coll ^ml" and see if this workaround works in
your setup?

-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:

Hello Gilles,

Thank you very much for the patch! It is much more complete
than mine. Using that patch and re-running autogen.pl, I am
able to build 1.8.8 with './configure --with-hcoll' without errors.

I do have issues when it comes to running 1.8.8 with hcoll
built in, however. In my quick sanity test of running a basic
parallel hello world C program, I get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader
I remember seeing those, but forgot about them. I am curious, though, 
why using '-mca coll ^ml' wouldn't work for me.


We'll watch for the next HPCX release. Is there an ETA on when that 
release may happen? Thank you for the help!

David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of an hcoll symbol conflict with the ml coll module inside 
OMPI. HCOLL is derived from the ml module. This issue is fixed in the hcoll 
library and the fix will be available in the next HPCX release.


Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:


Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead
of /usr/lib64/libhcoll.so, but I am not sure how that could be...
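
One way to confirm which copy the loader actually picks is to have the glibc
dynamic loader trace its own resolution (a quick diagnostic sketch, not
something I've run here; a.out is the hello world test from above):

# LD_DEBUG=libs makes ld.so print where it searches and which file it
# finally maps for every library, including ones pulled in later via dlopen
LD_DEBUG=libs mpirun -n 2 -mca coll ^ml ./a.out 2>&1 | grep -i hcoll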

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not reproduce
the issue.  Can you do one more quick test with setting LD_PRELOAD
to the hcoll lib?

$LD_PRELOAD= mpirun -n 2 -mca coll ^ml ./a.out

    -Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:

The admin that rolled the hcoll rpm that we're using (and got
it in system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib?  MOFED or HPCX? what
    version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
...backtrace snipped...

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread David Shrader
I don't have that option on the configure command line, but my platform 
file is using "enable_dlopen=no." I imagine that is getting the same 
result. Thank you for the pointer!


Thanks,
David

On 08/12/2015 05:04 PM, Deva wrote:
Do you have "--disable-dlopen" in your configure options? This might 
force coll_ml to be loaded first even with -mca coll ^ml.


The next HPCX is expected to be released by the end of August.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader <dshra...@lanl.gov> wrote:


I remember seeing those, but forgot about them. I am curious,
though, why using '-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when
that release may happen? Thank you for the help!
David


On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of an hcoll symbol conflict with the ml coll module
inside OMPI. HCOLL is derived from the ml module. This issue is fixed
in the hcoll library and the fix will be available in the next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

    -Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:

Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export
LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used
instead of /usr/lib64/libhcoll.so, but I am not sure how that
could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not
reproduce the issue.  Can you do one more quick test with
setting LD_PRELOAD to the hcoll lib?

$LD_PRELOAD= mpirun -n 2 -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:

The admin that rolled the hcoll rpm that we're using
(and got it in system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib? MOFED or HPCX?
what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
...backtrace snipped...

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread David Shrader

Hey Jeff,

I'm actually not able to find coll_ml related files at that location. 
All I see are the following files:


[dshrader@zo-fe1 openmpi]$ ls 
/usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/

libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

In this particular build, I am using platform files instead of the 
stripped-down debug builds I was doing before. Could something in the 
platform files have moved the coll_ml-related files, or combined them 
with something else?


Thanks,
David

On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:

Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI 
as normal, and then rm the following files after installation:

rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, 
and therefore it won't/can't be used (or interfere with the hcoll plugin).
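
If you go the rm route, ompi_info (installed alongside mpirun) gives a quick
sanity check that the plugin is really gone:

# coll:ml should no longer appear in the component list after the rm
ompi_info | grep "MCA coll"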



On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:

David,

I guess you do not want to use the ml coll module at all in openmpi 1.8.8

you can simply do
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so the ml component is not even built

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:

I remember seeing those, but forgot about them. I am curious, though, why using 
'-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when that release may 
happen? Thank you for the help!
David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of an hcoll symbol conflict with the ml coll module inside OMPI. 
HCOLL is derived from the ml module. This issue is fixed in the hcoll library and 
the fix will be available in the next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead of 
/usr/lib64/libhcoll.so, but I am not sure how that could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the issue.  Can 
you do one more quick test with setting LD_PRELOAD to the hcoll lib?

$LD_PRELOAD=  mpirun -n 2  -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  wrote:
The admin that rolled the hcoll rpm that we're using (and got it in system 
space) said that she got it from 
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

 From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  wrote:
Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
...backtrace snipped...

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread David Shrader
Interestingly enough, I have found that using --disable-dlopen causes 
the seg fault whether or not --enable-mca-no-build=coll-ml is used. That 
is, the following configure line generates a build of Open MPI that will 
*not* seg fault when running a simple hello world program:


./configure --prefix=/tmp/dshrader-ompi-1.8.8-install 
--enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll


While the following configure line will produce a build of Open MPI that 
*will* seg fault with the same error I mentioned before:


./configure --prefix=/tmp/dshrader-ompi-1.8.8-install 
--enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll --disable-dlopen


I'm not sure why this would be.
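
One check I can think of, though I have not verified it: with --disable-dlopen 
all built components are linked into libmpi.so itself, so the dynamic symbol 
table should show whether coll/ml code still made it into the build:

# any coll_ml symbols here would mean the component was compiled in anyway
# (install path is the one from the configure lines above)
nm -D /tmp/dshrader-ompi-1.8.8-install/lib/libmpi.so | grep -i coll_ml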

Thanks,
David

On 08/13/2015 11:19 AM, Jeff Squyres (jsquyres) wrote:

Ah, if you're using --disable-dlopen, then you won't find individual plugin DSOs.

Instead, you can configure this way:

 ./configure --enable-mca-no-build=coll-ml ...

This will disable the build of the coll/ml component altogether.

 




On Aug 13, 2015, at 11:23 AM, David Shrader  wrote:

Hey Jeff,

I'm actually not able to find coll_ml related files at that location. All I see 
are the following files:

[dshrader@zo-fe1 openmpi]$ ls 
/usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

In this particular build, I am using platform files instead of the stripped-down 
debug builds I was doing before. Could something in the platform files have 
moved the coll_ml-related files, or combined them with something else?

Thanks,
David

On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:

Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI 
as normal, and then rm the following files after installation:

rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, 
and therefore it won't/can't be used (or interfere with the hcoll plugin).



On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:

David,

I guess you do not want to use the ml coll module at all in openmpi 1.8.8

you can simply do
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so the ml component is not even built

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:

I remember seeing those, but forgot about them. I am curious, though, why using 
'-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when that release may 
happen? Thank you for the help!
David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of an hcoll symbol conflict with the ml coll module inside OMPI. 
HCOLL is derived from the ml module. This issue is fixed in the hcoll library and 
the fix will be available in the next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead of 
/usr/lib64/libhcoll.so, but I am not sure how that could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the issue.  Can 
you do one more quick test with setting LD_PRELOAD to the hcoll lib?

$LD_PRELOAD=  mpirun -n 2  -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  wrote:
The admin that rolled the hcoll rpm that we're using (and got it in system 
space) said that she got it from 
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

 From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  wrote:
Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.

[OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread David Shrader

Hello All,

I'm currently trying to install 1.10.0 with hcoll and mxm, and am 
getting an error during configure:


--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... static
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library in lib
checking for library containing hcoll_get_version... no
looking for library in lib64
checking for library containing hcoll_get_version... no
configure: error: HCOLL support requested but not found.  Aborting

The configure line I used:

./configure --with-mxm=/opt/mellanox/mxm 
--with-hcoll=/opt/mellanox/hcoll 
--with-platform=contrib/platform/lanl/toss/optimized-panasas


Here are the corresponding lines from config.log:

configure:217014: gcc -std=gnu99 -o conftest -O3 -DNDEBUG 
-I/opt/panfs/include -finline-functions -fno-strict-aliasing -pthread 
-I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/hwloc/hwloc191/hwloc/include 
-I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/event/libevent2021/libevent 
-I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/event/libevent2021/libevent/include 
-I/opt/mellanox/hcoll/include   -L/opt/mellanox/hcoll/lib conftest.c 
-lhcoll  -lrt -lm -lutil   >&5
/usr/bin/ld: warning: libmxm.so.2, needed by /opt/mellanox/hcoll/lib/libhcoll.so, not found (try using -rpath or -rpath-link)

/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_req_recv'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_create'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_config_free_context_opts'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_destroy'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_config_free_ep_opts'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_progress'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_config_read_opts'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_disconnect'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_mq_destroy'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_mq_create'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_cleanup'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_req_send'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_connect'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_init'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_get_address'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_error_string'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_mem_unmap'
collect2: ld returned 1 exit status

An ldd on /opt/mellanox/hcoll/lib/libhcoll.so shows a dependency on 
libmxm.so, so the above error makes sense. I am using hcoll version 
3.3.768 and mxm version 3.4.3065 (reported by rpm).


So, my question: is there a way to take care of this other than putting 
'-L/opt/mellanox/lib -lmxm' into LDFLAGS/LIBS? Using LDFLAGS/LIBS will 
link mxm into everything, which I would prefer not to do.
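
The only other idea I have (untested) is -rpath-link, which as I read the GNU 
ld man page only affects how the link-time linker resolves secondary 
dependencies such as libmxm.so.2, and would add no -lmxm and embed no runtime 
path:

./configure LDFLAGS='-Wl,-rpath-link,/opt/mellanox/mxm/lib' \
    --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
    --with-platform=contrib/platform/lanl/toss/optimized-panasas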


Thanks in advance!
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread David Shrader
I should probably point out that libhcoll.so does not know where 
libmxm.so is:


[dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
linux-vdso.so.1 =>  (0x7fffb2f1f000)
libibnetdisc.so.5 => /usr/lib64/libibnetdisc.so.5 (0x7fe31bd0b000)
libmxm.so.2 => not found
libz.so.1 => /lib64/libz.so.1 (0x7fe31baf4000)
libdl.so.2 => /lib64/libdl.so.2 (0x7fe31b8f)
libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x7fe31b6e2000)
libocoms.so.0 => /opt/mellanox/hcoll/lib/libocoms.so.0 (0x7fe31b499000)
libm.so.6 => /lib64/libm.so.6 (0x7fe31b215000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x7fe31b009000)
libalog.so.0 => /opt/mellanox/hcoll/lib/libalog.so.0 (0x7fe31adfe000)
librt.so.1 => /lib64/librt.so.1 (0x7fe31abf6000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x7fe31a9ee000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x7fe31a7d9000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7fe31a5c7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe31a3a9000)
libc.so.6 => /lib64/libc.so.6 (0x7fe31a015000)
libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x7fe319cfe000)
libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x7fe319ae3000)
/lib64/ld-linux-x86-64.so.2 (0x7fe31c2d3000)
libwrap.so.0 => /lib64/libwrap.so.0 (0x7fe3198d8000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe3196c2000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe3194a8000)
libutil.so.1 => /lib64/libutil.so.1 (0x7fe3192a5000)
libnl.so.1 => /lib64/libnl.so.1 (0x7fe319052000)

Both hcoll and mxm were installed using the rpms provided by Mellanox.

Thanks again,
David

On 10/21/2015 09:34 AM, David Shrader wrote:

Hello All,

I'm currently trying to install 1.10.0 with hcoll and mxm, and am 
getting an error during configure:


--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... static
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library in lib
checking for library containing hcoll_get_version... no
looking for library in lib64
checking for library containing hcoll_get_version... no
configure: error: HCOLL support requested but not found.  Aborting

The configure line I used:

./configure --with-mxm=/opt/mellanox/mxm 
--with-hcoll=/opt/mellanox/hcoll 
--with-platform=contrib/platform/lanl/toss/optimized-panasas


Here are the corresponding lines from config.log:
...output snipped...

Re: [OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread David Shrader
We're using TOSS, which is based on Red Hat; the current version we're 
running is based on Red Hat 6.6. I'm actually not sure what mofed 
version we're using right now based on what I can find on the system, 
and the admins who manage it are out. I'll get back to you on that as 
soon as I know.


Using LD_LIBRARY_PATH before configure got it to work, which I didn't 
expect. Thanks for the tip! I didn't realize that resolving the shared-library 
dependencies of a library named on the link line fell under the runtime 
portion of linking, and so could be affected by LD_LIBRARY_PATH.
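
For the archives: the GNU ld man page describes this under -rpath-link. When 
resolving the DT_NEEDED entries of a shared library named on the link line, a 
natively run linker also searches LD_LIBRARY_PATH, which is why this one-liner 
was enough:

LD_LIBRARY_PATH=/opt/mellanox/mxm/lib ./configure --with-mxm=/opt/mellanox/mxm \
    --with-hcoll=/opt/mellanox/hcoll \
    --with-platform=contrib/platform/lanl/toss/optimized-panasas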


Thanks!
David

On 10/21/2015 09:59 AM, Mike Dubman wrote:

Hi David,
What Linux distro do you use (and what mofed version)?
Do you have an /etc/ld.so.conf.d/mxm.conf file?
Can you please try adding LD_LIBRARY_PATH=/opt/mellanox/mxm/lib 
./configure ...?



Thanks

On Wed, Oct 21, 2015 at 6:40 PM, David Shrader <dshra...@lanl.gov> wrote:


I should probably point out that libhcoll.so does not know where
libmxm.so is:

[dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
...output snipped...
libmxm.so.2 => not found
...output snipped...

Both hcoll and mxm were installed using the rpms provided by
Mellanox.

Thanks again,
David


On 10/21/2015 09:34 AM, David Shrader wrote:

Hello All,

I'm currently trying to install 1.10.0 with hcoll and mxm, and
am getting an error during configure:

--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... static
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library in lib
checking for library containing hcoll_get_version... no
looking for library in lib64
checking for library containing hcoll_get_version... no
configure: error: HCOLL support requested but not found.  Aborting

The configure line I used:

./configure --with-mxm=/opt/mellanox/mxm
--with-hcoll=/opt/mellanox/hcoll
--with-platform=contrib/platform/lanl/toss/optimized-panasas

Here are the corresponding lines from config.log:
...output snipped...

Re: [OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread David Shrader
I'm sorry I missed reporting on that. I do not have 
/etc/ld.so.conf.d/mxm.conf.


Interestingly enough, the rpm reports that it does include that file, 
but it isn't there:


[dshrader@zo-fe1 serial]$ rpm -qa | grep mxm
mxm-3.4.3065-1.x86_64
[dshrader@zo-fe1 serial]$ rpm -ql mxm-3.4.3065-1.x86_64
/etc/ld.so.conf.d/mxm.conf
...output snipped...
[dshrader@zo-fe1 serial]$ ll /etc/ld.so.conf.d/mxm.conf
ls: cannot access /etc/ld.so.conf.d/mxm.conf: No such file or directory

I'll follow up with the admin who installed the rpm.

Thanks,
David

On 10/21/2015 11:37 AM, Mike Dubman wrote:
Could you please check if you have the file /etc/ld.so.conf.d/mxm.conf on 
your system?
It will help us understand why hcoll did not detect libmxm.so on the 
first attempt.


Thanks

On Wed, Oct 21, 2015 at 7:19 PM, David Shrader <dshra...@lanl.gov> wrote:


We're using TOSS, which is based on Red Hat; the current version
we're running is based on Red Hat 6.6. I'm actually not sure what
mofed version we're using right now based on what I can find on
the system, and the admins who manage it are out. I'll get back to
you on that as soon as I know.

Using LD_LIBRARY_PATH before configure got it to work, which I
didn't expect. Thanks for the tip! I didn't realize that resolving
the shared-library dependencies of a library named on the link line
fell under the runtime portion of linking, and so could be affected
by LD_LIBRARY_PATH.

Thanks!
David


On 10/21/2015 09:59 AM, Mike Dubman wrote:

Hi David,
What Linux distro do you use (and what mofed version)?
Do you have an /etc/ld.so.conf.d/mxm.conf file?
Can you please try adding LD_LIBRARY_PATH=/opt/mellanox/mxm/lib
./configure ...?


    Thanks

On Wed, Oct 21, 2015 at 6:40 PM, David Shrader <dshra...@lanl.gov> wrote:

I should probably point out that libhcoll.so does not know
where libmxm.so is:

[dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
...output snipped...
libmxm.so.2 => not found
...output snipped...

Both hcoll and mxm were installed using the rpms provided
by Mellanox.

Thanks again,
David


On 10/21/2015 09:34 AM, David Shrader wrote:

Hello All,

I'm currently trying to install 1.10.0 with hcoll and
mxm, and am getting an error during configure:

...output snipped...

Re: [OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread David Shrader
It turns out that stuff in /etc is in RAM, so the mxm.conf wasn't there 
because that area hadn't been refreshed yet, either by the admin 
manually pushing it out or by rebooting. The admins pushed it out, and 
now ldd on libhcoll.so resolves the libmxm.so dependency. And, configure 
works without having to specify LD_LIBRARY_PATH.
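
For anyone who hits the same thing, these are the checks that confirmed the
fix on our end (I'm assuming mxm.conf contains just the mxm lib directory):

cat /etc/ld.so.conf.d/mxm.conf       # should list /opt/mellanox/mxm/lib
ldconfig -p | grep libmxm            # the rebuilt cache should know libmxm.so.2
ldd /opt/mellanox/hcoll/lib/libhcoll.so | grep mxm   # no more "not found"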


So, not an Open MPI issue, but I am very grateful for all the help!
David

On 10/21/2015 12:00 PM, David Shrader wrote:
I'm sorry I missed reporting on that. I do not have 
/etc/ld.so.conf.d/mxm.conf.


Interestingly enough, the rpm reports that it does include that file, 
but it isn't there:


[dshrader@zo-fe1 serial]$ rpm -qa | grep mxm
mxm-3.4.3065-1.x86_64
[dshrader@zo-fe1 serial]$ rpm -ql mxm-3.4.3065-1.x86_64
/etc/ld.so.conf.d/mxm.conf
...output snipped...
[dshrader@zo-fe1 serial]$ ll /etc/ld.so.conf.d/mxm.conf
ls: cannot access /etc/ld.so.conf.d/mxm.conf: No such file or directory

I'll follow up with the admin who installed the rpm.

Thanks,
David

On 10/21/2015 11:37 AM, Mike Dubman wrote:
Could you please check if you have the file /etc/ld.so.conf.d/mxm.conf on 
your system?
It will help us understand why hcoll did not detect libmxm.so on the 
first attempt.


Thanks

On Wed, Oct 21, 2015 at 7:19 PM, David Shrader <dshra...@lanl.gov> wrote:


We're using TOSS, which is based on Red Hat; the current version
we're running is based on Red Hat 6.6. I'm actually not sure what
mofed version we're using right now based on what I can find on
the system, and the admins who manage it are out. I'll get back to
you on that as soon as I know.

Using LD_LIBRARY_PATH before configure got it to work, which I
didn't expect. Thanks for the tip! I didn't realize that resolving
the shared-library dependencies of a library named on the link line
fell under the runtime portion of linking, and so could be affected
by LD_LIBRARY_PATH.

Thanks!
David


On 10/21/2015 09:59 AM, Mike Dubman wrote:

Hi David,
What Linux distro do you use (and what mofed version)?
Do you have an /etc/ld.so.conf.d/mxm.conf file?
Can you please try adding LD_LIBRARY_PATH=/opt/mellanox/mxm/lib
./configure ...?


Thanks

On Wed, Oct 21, 2015 at 6:40 PM, David Shrader wrote:

I should probably point out that libhcoll.so does not know
where libmxm.so is:

[dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
...output snipped...
libmxm.so.2 => not found
...output snipped...

Both hcoll and mxm were installed using the rpms provided
by Mellanox.

Thanks again,
David


On 10/21/2015 09:34 AM, David Shrader wrote:

Hello All,

I'm currently trying to install 1.10.0 with hcoll and
mxm, and am getting an error during configure:

...output snipped...

[OMPI users] a single build of Open MPI that can be used with multiple GCC versions

2016-02-10 Thread David Shrader

Hello,

Is it possible to use a single build of Open MPI against multiple 
versions of GCC if the versions of GCC are from the same release series? 
I was under the assumption that as long as a binary-compatible compiler 
was used, it was possible to "swap out" the compiler from underneath 
Open MPI.


That is the general question I have, but here is the specific scenario 
that prompted it:


 * built Open MPI 1.10.1 against GCC 5.2.0 with a directory name of
   openmpi-1.10.1-gcc-5
 * installed GCC 5.3.0
 * removed GCC 5.2.0

I now have users who are getting errors like the following when using 
mpicxx:


/bin/grep: 
/usr/projects/hpcsoft/toss2/common/gcc/5.2.0/lib/../lib64/libstdc++.la: 
No such file or directory


I can see several references to my previous GCC 5.2.0 installation in 
the <install dir>/lib/*.la files, including a reference to 
/usr/projects/hpcsoft/toss2/common/gcc/5.2.0/lib/../lib64/libstdc++.la.


This is all disconcerting: users of GCC 5.3.0 were using 5.3.0's 
binaries but were getting some 5.2.0 library configs before I removed 
5.2.0, and no one knew it.


If it should be possible to use a single build of Open MPI with multiple 
binary-compatible compilers, is there a way to fix my above situation or 
prevent it from happening at build time?
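
For reference, this is how I have been finding the stale references (a sketch;
OMPI_PREFIX is a hypothetical stand-in for the actual 1.10.1 install tree):

OMPI_PREFIX=/path/to/openmpi-1.10.1-gcc-5    # hypothetical install prefix
# list every libtool archive that still points at the removed 5.2.0 tree
grep -l '/usr/projects/hpcsoft/toss2/common/gcc/5.2.0' $OMPI_PREFIX/lib/*.la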


Thanks,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] a single build of Open MPI that can be used with multiple GCC versions

2016-02-10 Thread David Shrader

A bit of an update:

I was mistaken when I said users were reporting 1.10.1 was throwing an 
error. The error occurs for 1.6.5 (which I still have to keep on my 
production systems). Users report that they do not see the error with 
1.10.1.


That being said, I do see references to my GCC 5.2.0 installation in the 
1.10.1 <install dir>/lib/*.la files and would like to ask whether I need to 
worry at all. It seems the way files were named and organized in 
<install dir>/lib changed in 1.7, which may be why 1.10.1 is working.
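
If anyone else needs to patch an existing install in place rather than
rebuild, one workaround I am considering (assuming the 5.3.0 tree mirrors the
5.2.0 layout) is rewriting the stale paths in the libtool archives:

OMPI_PREFIX=/path/to/openmpi-1.10.1-gcc-5    # hypothetical install prefix
# keep a .bak copy, then point each .la file at the 5.3.0 installation
sed -i.bak 's|gcc/5.2.0|gcc/5.3.0|g' $OMPI_PREFIX/lib/*.la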


Thank you very much for your time,
David

On 02/10/2016 10:58 AM, David Shrader wrote:

Hello,

Is it possible to use a single build of Open MPI against multiple 
versions of GCC if the versions of GCC are from the same release 
series? I was under the assumption that as long as a binary-compatible 
compiler was used, it was possible to "swap out" the compiler from 
underneath Open MPI.


That is the general question I have, but here is the specific scenario 
that prompted it:


  * built Open MPI 1.10.1 against GCC 5.2.0 with a directory name of
openmpi-1.10.1-gcc-5
  * installed GCC 5.3.0
  * removed GCC 5.2.0

I now have users who are getting errors like the following when using 
mpicxx:


/bin/grep: 
/usr/projects/hpcsoft/toss2/common/gcc/5.2.0/lib/../lib64/libstdc++.la: No 
such file or directory


I can see several references to my previous GCC 5.2.0 installation in 
the <install dir>/lib/*.la files, including a reference to 
/usr/projects/hpcsoft/toss2/common/gcc/5.2.0/lib/../lib64/libstdc++.la.


This is all disconcerting: users of GCC 5.3.0 were using 5.3.0's 
binaries but were getting some 5.2.0 library configs before I removed 
5.2.0, and no one knew it.


If it should be possible to use a single build of Open MPI with 
multiple binary-compatible compilers, is there a way to fix my above 
situation or prevent it from happening at build time?


Thanks,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov


--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Question on OpenMPI backwards compatibility

2016-02-26 Thread David Shrader

Hey Edwin,

The versioning scheme changed with 2.x. Prior to 2.x the "Minor" version 
had a different definition and did not mention backwards compatibility 
at all (at least in my 1.6.x tarballs). As it turned out for 1.8.x and 
1.6.x, 1.8.x was not backwards compatible with 1.6.x, so the behavior 
you saw in your test of 1.6.x-compiled code running against 1.8.x is 
expected. In practice, 1.x was never backwards compatible with 1.y where 
x>y, even though the versioning documentation at the time didn't 
specifically mention it.


There is a note in the versioning documentation 
(https://www.open-mpi.org/software/ompi/versions/) that does warn of 
this change in the versioning scheme:


NOTE: The version numbering conventions were changed with the release
  of v1.10.0.  Most notably, Open MPI no longer uses an "odd/even"
  release schedule to indicate feature development vs. stable
  releases.  See the README in releases prior to v1.10.0 for more
  information (e.g.,
https://github.com/open-mpi/ompi-release/blob/v1.8/README#L1392-L1475).

There is also a CAVEAT underneath the "Major" section of the versioning 
documentation that says that 1.10.x is not backwards compatible with 
other 1.x releases and that the same rule applies to anything before 
1.10.0. Perhaps another CAVEAT could be placed after the "Minor" section 
since the information on backwards compatibility in the "Minor" section 
only applies to 2.x and beyond.


The developers are still in the midst of the version scheme transition 
(developing on both 1.10.x and 2.x), so the FAQ entries might be a bit 
out-dated for the new numbering scheme for a while.


Thanks,
David

On 02/26/2016 09:20 AM, Blosch, Edwin L wrote:

I am confused about backwards-compatibility.

FAQ #111 says:
Open MPI reserves the right to break ABI compatibility at new feature release 
series. ... MPI applications compiled/linked against Open MPI 1.6.x will not 
be ABI compatible with Open MPI 1.7.x

But the versioning documentation says:
   * Minor: The minor number is the second integer in the version string.   
 Backwards compatibility will still be preserved with prior releases that have 
the same major version number (e.g., v2.5.3 is backwards compatible with 
v2.3.1).

These two examples and statements appear inconsistent to me:

Can I use OpenMPI 1.7.x run-time and options to execute codes built with 
OpenMPI 1.6.x?   No (FAQ #111)

Can I use OpenMPI 2.5.x run-time and options to execute codes built with 
OpenMPI 2.3.x?   Yes (s/w versioning documentation)

Can I use OpenMPI 1.8.x run-time and options to execute codes built with 
OpenMPI 1.6.x?   Who knows?!  I tested this once, and it failed.  I made the 
assumption that 1.8.x wouldn't run a 1.6.x code, and I moved on.  But I realize 
now that I could have made a mistake.  The test I performed could have failed 
for some other reason.

Can anyone shed some light?




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/02/28590.php


--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Question on OpenMPI backwards compatibility

2016-02-26 Thread David Shrader
I forgot to include a link to the official announcement of the change, 
and that info might be helpful in navigating the different versions and 
backwards compatibility:


https://www.open-mpi.org/community/lists/announce/2015/06/0069.php

Thanks,
David

On 02/26/2016 10:43 AM, David Shrader wrote:
...quoted text snipped...

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov