Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Gilles Gouaillardet

Thanks David,

i made a PR for the v1.8 branch at 
https://github.com/open-mpi/ompi-release/pull/492


the patch is attached (it required some back-porting)

Cheers,

Gilles

On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch and, after running 
autogen.pl, have found that './configure --with-hcoll' does indeed 
work now. I used Gilles' branch as I wasn't sure how best to get the 
pull request changes in to my own clone of master. It looks like the 
proper checks are happening, too:


--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I 
don't have much experience running off of the latest source. For now, 
I think I will try to generate a patch to the 1.8.8 configure script 
and see if that works as expected.


Thanks,
David

On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:

On Aug 11, 2015, at 1:39 AM, Åke Sandgren  wrote:

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile flags 
is broken.

+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
comments to it.



--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27432.php


diff --git a/config/ompi_check_libfca.m4 b/config/ompi_check_libfca.m4
index 715b0c7..62f697f 100644
--- a/config/ompi_check_libfca.m4
+++ b/config/ompi_check_libfca.m4
@@ -24,41 +24,37 @@ AC_DEFUN([OMPI_CHECK_FCA],[
 OMPI_CHECK_WITHDIR([fca], [$with_fca], [lib/libfca.so])

 AS_IF([test "$with_fca" != "no"],
-  [AS_IF([test ! -z "$with_fca" -a "$with_fca" != "yes"],
- [ompi_check_fca_dir=$with_fca
-  ompi_check_fca_libdir="$ompi_check_fca_dir/lib"
-  ompi_check_fca_incdir="$ompi_check_fca_dir/include"
-  ompi_check_fca_libs=fca
+  [ompi_check_fca_libs=fca
+   AS_IF([test ! -z "$with_fca" && test "$with_fca" != "yes"],
+ [ompi_check_fca_dir=$with_fca
+  ompi_check_fca_libdir="$ompi_check_fca_dir/lib"
+  ompi_check_fca_incdir="$ompi_check_fca_dir/include"
+  AC_SUBST([coll_fca_HOME], "$ompi_check_fca_dir")],
+ [AC_SUBST([coll_fca_HOME], "/")])

-  coll_fca_extra_CPPFLAGS="-I$ompi_check_fca_incdir/fca -I$ompi_check_fca_incdir/fca_core"
-  AC_SUBST([coll_fca_extra_CPPFLAGS])
-  AC_SUBST([coll_fca_HOME], "$ompi_check_fca_dir")
+   CPPFLAGS_save=$CPPFLAGS
+   LDFLAGS_save=$LDFLAGS
+   LIBS_save=$LIBS

-  CPPFLAGS_save=$CPPFLAGS
-  LDFLAGS_save=$LDFLAGS
-  LIBS_save=$LIBS
-  CPPFLAGS="$CPPFLAGS $coll_fca_extra_CPPFLAGS"

+   OPAL_LOG_MSG([$1_CPPFLAGS : $$1_CPPFLAGS], 1)
+   OPAL_LOG_MSG([$1_LDFLAGS  : $$1_LDFLAGS], 1)
+   OPAL_LOG_MSG([$1_LIBS : $$1_LIBS], 1)

-  OPAL_LOG_MSG([$1_CPPFLAGS : $$1_CPPFLAGS], 1)
-  OPAL_LOG_MSG([$1_LDFLAGS  : $$1_LDFLAGS], 1)
-  OPAL_LOG_MSG([$1_LIBS : $$1_LIBS], 1)
+   OMPI_CHECK_PACKAGE([$1],
+  [fca/fca_api.h],
+  [$ompi_check_fca_libs],
+  [fca_get_version],
+  [],
+  [$ompi_check_fca_dir],
+  [$ompi_check_fca_libdir],
+  [ompi_check_fca_happy="yes"],
+  [ompi_check_fca_happy="no"])

-  OMPI_CHECK_PACKAGE([$1],
-  [fca_api.h],
-  [$ompi_check_fca_libs],
-  [fca_get_version],
-  [],
-  [$ompi_check_fca_dir],
-  [$ompi_check_fca_libdir],
-  [ompi_check_fca_happy="yes"],
-  [ompi_check_fca_happy="no"])
-
-  CPPFLAGS=$CPPFLAGS_save
-   
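
(For anyone wanting to try this on a 1.8.8 tarball before the PR is merged, the
steps are roughly the following -- the patch file name is only illustrative, and
running autogen.pl requires reasonably recent autotools to be installed:

   cd openmpi-1.8.8
   patch -p1 < ../ompi-release-492.patch
   ./autogen.pl
   ./configure --with-hcoll
   make all install

patch -p1 assumes a git-style diff with a/ and b/ prefixes, which is what the
header above shows; re-running autogen.pl and './configure --with-hcoll' are the
steps already described earlier in the thread.)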

Re: [OMPI users] What Red Hat Enterprise/CentOS NUMA libraries are recommended/required for OpenMPI?

2015-08-12 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> I think Dave's point is that numactl-devel (and numactl) is only needed for 
> *building* Open MPI.  Users only need numactl to *run* Open MPI.

Yes.  However, I guess the basic problem is that the component fails to
load for want of libhwloc, either because (the right soname of) it isn't
present or there's a problem with numactl or another of its
dependencies.  (That won't happen if you use a packaged version, of
course.)

There are three instances of "libnumactl and libnumactl-devel" in the
release .txt files which I think are wrong.  I don't know on what system
such package names exist (not Debian or Red Hat-ish) and numactl isn't
even relevant to all Linux-based systems (e.g. not s390 or arm in
Fedora).  I'd replace the message with one saying that a compatible
version of libhwloc and its dependencies needs to be available, assuming
I've got that right.  The -devel/-dev package is surely not required
anyway.
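
(A quick sanity check for the runtime side is to ask ompi_info which hwloc is in
use and to run ldd against the installed libraries -- the paths below are only
examples and depend on how Open MPI and hwloc were built and packaged:

   ompi_info | grep -i hwloc
   ldd /usr/lib64/libhwloc.so.5 | grep -E 'libnuma|not found'
   ldd <prefix>/lib/libopen-pal.so | grep -E 'hwloc|libnuma|not found'

Any "not found" entry is the missing dependency that the current libnumactl
message fails to name.)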

[In the context of a recent version of SGE, I don't know how the support
can be missing; the execd and shepherd should be built with hwloc, and
there's a test in the rc script that the shepherd will actually load.]


Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 1.8.7

2015-08-12 Thread Dave Love
"Lane, William"  writes:

> I can successfully run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine but 
> not via qrsh. We're
> using CentOS 6.3 and a heterogeneous cluster of hyperthreaded and 
> non-hyperthreaded blades
> and x3550 chassis. OpenMPI 1.8.7 has been built w/the debug switch as well.

I think you want to explain exactly why you need this world of pain.  It
seems unlikely that MPI programs will run efficiently in it.  Our Intel
nodes mostly have hyperthreading on in BIOS -- or what passes for BIOS
on them -- but disabled at startup, and we only run MPI across identical
nodes in the heterogeneous system.

> Here's my latest errors:
> qrsh -V -now yes -pe mpi 209 mpirun -np 209 -display-devel-map --prefix 
> /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes --bind-to core 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3

[What does --hetero-nodes do?  It's undocumented as far as I can tell.]

> error: executing task of job 211298 failed: execution daemon on host 
> "csclprd3-0-4" didn't accept task
> error: executing task of job 211298 failed: execution daemon on host 
> "csclprd3-4-1" didn't accept task

So you need to find out why that was (probably lack of slots on the exec
host, which might be explained in the execd messages).
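
(Concretely, something like the following shows how many slots the exec host
advertises and what its execd logged when it refused the task -- the spool path
is the usual Son-of-Gridengine default and may differ on your install:

   qconf -se csclprd3-0-4
   qhost -h csclprd3-0-4
   tail -50 $SGE_ROOT/default/spool/csclprd3-0-4/messages
)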

> [...]

> NOTE: the hosts that "didn't accept task" were different in two different 
> runs but the errors were the same.
>
> Here's the definition of the mpi Parallel Environment on our 
> Son-of-Gridengine cluster:
>
> pe_namempi
> slots  
> user_lists NONE
> xuser_listsNONE
> start_proc_args/opt/sge/mpi/startmpi.sh $pe_hostfile
> stop_proc_args /opt/sge/mpi/stopmpi.sh

Why are those two not NONE? 

> allocation_rule$fill_up

As I said, that doesn't seem wise (unless you use -l exclusive).

> control_slaves FALSE
> job_is_first_task  TRUE
> urgency_slots  min
> accounting_summary TRUE
> qsort_args NONE
>
> Qsort_args is set to NONE, but it's supposed to be set to TRUE right?

No; see sge_pe(5).  (I think the text I supplied for the FAQ is accurate,
but reuti might confirm if he's reading this.)

> -Bill L.
>
> If I can run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine w/no issues 
> it has to be Son-of-Gridengine that's
> the issue right?

I don't see any evidence of an SGE bug, if that's what you mean, but
clearly you have a problem if execds won't accept the jobs, and this
isn't the place to discuss it.  I asked about SGE core binding, and it's
presumably also relevant how slots are defined on the compute nodes, but
I'd just say "Don't do that" without a pressing reason.


Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 1.8.7

2015-08-12 Thread Gilles Gouaillardet
basically, without --hetero-nodes, ompi assumes all nodes have the same
topology (fast startup)
with --hetero-nodes, ompi does not assume anything and requests the node
topology (slower startup)

I am not sure if this is still 100% true on all versions.
iirc, at least on master, a hwloc signature is checked and ompi
transparently falls back to --hetero-nodes if needed

bottom line, on a heterogeneous cluster, it is required or safer to use the
--hetero-nodes option
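
(A minimal sketch of both spellings, assuming the MCA parameter behind the
option is named orte_hetero_nodes -- worth verifying with ompi_info against the
exact 1.8.x release in use:

   mpirun --hetero-nodes -np 209 ./ProcessColors3
   mpirun --mca orte_hetero_nodes 1 -np 209 ./ProcessColors3
)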


Cheers,

Gilles

On Wednesday, August 12, 2015, Dave Love  wrote:

> "Lane, William" > writes:
>
> > I can successfully run my OpenMPI 1.8.7 jobs outside of
> Son-of-Gridengine but not via qrsh. We're
> > using CentOS 6.3 and a heterogeneous cluster of hyperthreaded and
> non-hyperthreaded blades
> > and x3550 chassis. OpenMPI 1.8.7 has been built w/the debug switch as
> well.
>
> I think you want to explain exactly why you need this world of pain.  It
> seems unlikely that MPI programs will run efficiently in it.  Our Intel
> nodes mostly have hyperthreading on in BIOS -- or what passes for BIOS
> on them -- but disabled at startup, and we only run MPI across identical
> nodes in the heterogeneous system.
>
> > Here's my latest errors:
> > qrsh -V -now yes -pe mpi 209 mpirun -np 209 -display-devel-map --prefix
> /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes --bind-to core
> /hpc/home/lanew/mpi/openmpi/ProcessColors3
>
> [What does --hetero-nodes do?  It's undocumented as far as I can tell.]
>
> > error: executing task of job 211298 failed: execution daemon on host
> "csclprd3-0-4" didn't accept task
> > error: executing task of job 211298 failed: execution daemon on host
> "csclprd3-4-1" didn't accept task
>
> So you need to find out why that was (probably lack of slots on the exec
> host, which might be explained in the execd messages).
>
> > [...]
>
> > NOTE: the hosts that "didn't accept task" were different in two
> different runs but the errors were the same.
> >
> > Here's the definition of the mpi Parallel Environment on our
> Son-of-Gridengine cluster:
> >
> > pe_namempi
> > slots  
> > user_lists NONE
> > xuser_listsNONE
> > start_proc_args/opt/sge/mpi/startmpi.sh $pe_hostfile
> > stop_proc_args /opt/sge/mpi/stopmpi.sh
>
> Why are those two not NONE?
>
> > allocation_rule$fill_up
>
> As I said, that doesn't seem wise (unless you use -l exclusive).
>
> > control_slaves FALSE
> > job_is_first_task  TRUE
> > urgency_slots  min
> > accounting_summary TRUE
> > qsort_args NONE
> >
> > Qsort_args is set to NONE, but it's supposed to be set to TRUE right?
>
> No; see sge_pe(5).  (I think the text I supplied for the FAQ is accurate,
> but reuti might confirm if he's reading this.)
>
> > -Bill L.
> >
> > If I can run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine w/no
> issues it has to be Son-of-Gridengine that's
> > the issue right?
>
> I don't see any evidence of an SGE bug, if that's what you mean, but
> clearly you have a problem if execds won't accept the jobs, and this
> isn't the place to discuss it.  I asked about SGE core binding, and it's
> presumably also relevant how slots are defined on the compute nodes, but
> I'd just say "Don't do that" without a pressing reason.
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27436.php
>


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader

Hello Gilles,

Thank you very much for the patch! It is much more complete than mine. 
Using that patch and re-running autogen.pl, I am able to build 1.8.8 
with './configure --with-hcoll' without errors.


I do have issues when it comes to running 1.8.8 with hcoll built in, 
however. In my quick sanity test of running a basic parallel hello world 
C program, I get the following:


[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641 

3 0x00056e4c mxm_error_signal_handler() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616 


4 0x000326a0 killpg()  ??:0
5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery() 
 coll_ml_module.c:0

8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x000f776e mca_coll_base_comm_select()  ??:0
12 0x00074ee4 ompi_mpi_init()  ??:0
13 0x00093dc0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641 

3 0x00056e4c mxm_error_signal_handler() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616 


4 0x000326a0 killpg()  ??:0
5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery() 
 coll_ml_module.c:0

8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x000f776e mca_coll_base_comm_select()  ??:0
12 0x00074ee4 ompi_mpi_init()  ??:0
13 0x00093dc0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
--
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited 
on signal 11 (Segmentation fault).

--

I do not get this message with only 1 process.

I am using hcoll 3.2.748. Could this be an issue with hcoll itself or 
something with my ompi build?


Thanks,
David

On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:

Thanks David,

i made a PR for the v1.8 branch at 
https://github.com/open-mpi/ompi-release/pull/492


the patch is attached (it required some back-porting)

Cheers,

Gilles

On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch and, after running 
autogen.pl, have found that './configure --with-hcoll' does indeed 
work now. I used Gilles' branch as I wasn't sure how best to get the 
pull request changes in to my own clone of master. It looks like the 
proper checks are happening, too:


--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I 
don't have much experience running off of the latest source. For now, 
I think I will try to generate a patch to the 1.8.8 configure scr

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
Hi David,

This issue is from hcoll library. This could be because of symbol conflict
with ml module.  This is fixed recently in HCOLL.  Can you try with "-mca
coll ^ml" and see if this workaround works in your setup?

-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader  wrote:

> Hello Gilles,
>
> Thank you very much for the patch! It is much more complete than mine.
> Using that patch and re-running autogen.pl, I am able to build 1.8.8 with
> './configure --with-hcoll' without errors.
>
> I do have issues when it comes to running 1.8.8 with hcoll built in,
> however. In my quick sanity test of running a basic parallel hello world C
> program, I get the following:
>
> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f776e mca_coll_base_comm_select()  ??:0
> 12 0x00074ee4 ompi_mpi_init()  ??:0
> 13 0x00093dc0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f776e mca_coll_base_comm_select()  ??:0
> 12 0x00074ee4 ompi_mpi_init()  ??:0
> 13 0x00093dc0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
> --
> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on
> signal 11 (Segmentation fault).
> --
>
> I do not get this message with only 1 process.
>
> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or
> something with my ompi build?
>
> Thanks,
> David
>
> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>
> Thanks David,
>
> i made a PR for the v1.8 branch at
> https://github.com/open-mpi/ompi-release/pull/492
>
> the patch is attached (it required some back-porting)
>
> Cheers,
>
> Gilles
>
> On 8/12/2015 4:01 AM, David Shrader wrote:
>
> I have cloned Gilles' topic/hcoll_config branch and, after running
> autogen.pl, have found that './configure --with-hcoll' does indeed work
> now. I used Gilles' branch as I wasn't sure how best to get the pull
> request changes in to my own clone of master. It looks like the proper
> checks are happening, too:
>
> --- MCA component coll:hcoll (m4 configuration macro)
> checking for MCA component coll:hcoll compile mode... dso
> checking --with-hcoll value... simple ok (unspecified)
> checking hcoll/api/hcoll_api.h usability... yes
> checking hcoll/api/hcoll_api.h pre

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641 

3 0x00056e4c mxm_error_signal_handler() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616 


4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery() 
 coll_ml_module.c:0

8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641 

3 0x00056e4c mxm_error_signal_handler() 
 /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616 


4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery() 
 coll_ml_module.c:0

8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
--
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited 
on signal 11 (Segmentation fault).

--

Thanks,
David

On 08/12/2015 10:42 AM, Deva wrote:

Hi David,

This issue is from hcoll library. This could be because of symbol 
conflict with ml module.  This is fixed recently in HCOLL.  Can you 
try with "-mca coll ^ml" and see if this workaround works in your setup?


-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader > wrote:


Hello Gilles,

Thank you very much for the patch! It is much more complete than
mine. Using that patch and re-running autogen.pl
, I am able to build 1.8.8 with './configure
--with-hcoll' without errors.

I do have issues when it comes to running 1.8.8 with hcoll built
in, however. In my quick sanity test of running a basic parallel
hello world C program, I get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN
 Could not open the KNEM device file at /dev/knem : No such file
or direc
tory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN
 Could not open the KNEM device file at /dev/knem : No such file
or direc
tory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h

pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641

3 0x00056e4c mxm_error_signal_handler()
 
/scrap/jenkins/workspace/hpc-power-pack/label

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  wrote:

> Hey Devendar,
>
> It looks like I still get the error:
>
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f684e mca_coll_base_comm_select()  ??:0
> 12 0x00073fc4 ompi_mpi_init()  ??:0
> 13 0x00092ea0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f684e mca_coll_base_comm_select()  ??:0
> 12 0x00073fc4 ompi_mpi_init()  ??:0
> 13 0x00092ea0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
> --
> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on
> signal 11 (Segmentation fault).
> --
>
> Thanks,
> David
>
> On 08/12/2015 10:42 AM, Deva wrote:
>
> Hi David,
>
> This issue is from hcoll library. This could be because of symbol conflict
> with ml module.  This is fixed recently in HCOLL.  Can you try with "-mca
> coll ^ml" and see if this workaround works in your setup?
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader  wrote:
>
>> Hello Gilles,
>>
>> Thank you very much for the patch! It is much more complete than mine.
>> Using that patch and re-running autogen.pl, I am able to build 1.8.8
>> with './configure --with-hcoll' without errors.
>>
>> I do have issues when it comes to running 1.8.8 with hcoll built in,
>> however. In my quick sanity test of running a basic parallel hello world C
>> program, I get the following:
>>
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could
>> not open the KNEM device file at /dev/knem : No such file or direc
>> tory. Won't use knem.
>> [1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN  Could
>> not open the KNEM device file at /dev/knem : No such file or direc
>> tory. Won't use knem.
>> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
>> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>>  backtrace 
>> 2 0x00056cdc mxm_handle_error()
>>  
>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
>> pcx-v1.3.336

Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-12 Thread Geoffrey Paulsen
I'm confused why this application needs an asynchronous cuMemcpyAsync() in a
blocking MPI call.   Rolf, could you please explain?

And how is a call to cuMemcpyAsync() followed by a synchronization any
different than a cuMemcpy() in this use case?

I would still expect that if the MPI_Send / Recv call issued the
cuMemcpyAsync() that it would be MPI's responsibility to issue the
synchronization call as well.

---
Geoffrey Paulsen
Software Engineer, IBM Platform MPI
IBM Platform-MPI
Phone: 720-349-2832
Email: gpaul...@us.ibm.com
www.ibm.com

- Original message -
From: Rolf vandeVaart
Sent by: "users"
To: Open MPI Users
Cc:
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
Date: Tue, Aug 11, 2015 1:45 PM

I talked with Jeremia off list and we figured out what was going on.  There is
the ability to use the cuMemcpyAsync/cuStreamSynchronize rather than the
cuMemcpy but it was never made the default for Open MPI 1.8 series.  So, to get
that behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior
in 1.10 and all future versions.  In addition, he is right about not being able
to see these variables in the Open MPI 1.8 series.  This was a bug and it has
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that
back into 1.10.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of a Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMempcy rather than the asynchornous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw support for async memcpy's are
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from /home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_cuda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from /home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from /home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from /home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pml_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from /home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pml_ob1.so
>#15 0x2bf4f322 in PMPI_Send () from /users/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmpi.so.1
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27424.php

---
This email message is for the sole use of the intended recipient(s) and may contain
confidential information.  Any unauthorized review, use, disclosure or distribution
is prohibited.  If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
---

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27431.php



Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-12 Thread Rolf vandeVaart
Hi Geoff:

Our original implementation used cuMemcpy for copying GPU memory into and out 
of host memory.  However, what we learned is that the cuMemcpy causes a 
synchronization for all work on the GPU.  This means that one could not overlap 
very well running a kernel and doing communication.  So, now we create an 
internal stream and then use that along with cuMemcpyAsync/cuStreamSynchronize 
for doing the copy.

It turns out that in Jeremia’s case, he wanted to have a long-running kernel and he 
wanted the MPI_Send/MPI_Recv to happen at the same time.  With the use of 
cuMemcpy, the MPI library was waiting for his kernel to complete before doing 
the cuMemcpy.
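
To make the difference concrete, here is a small, self-contained sketch against
the CUDA driver API.  It is not Open MPI's actual internal code (Open MPI wraps
these calls inside its common cuda code), just an illustration of the two copy
paths described above; note that the async path only truly overlaps a running
kernel when the host buffer is pinned and the hardware has a free copy engine.

/* cumemcpy_vs_async.c -- illustrative only.
 * Build (paths may differ): gcc cumemcpy_vs_async.c -I/usr/local/cuda/include -lcuda
 */
#include <cuda.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define CHECK(call)  do { CUresult r = (call); if (r != CUDA_SUCCESS) {        \
        fprintf(stderr, "%s failed: %d\n", #call, (int)r); exit(1); } } while (0)

int main(void)
{
    const size_t bytes = 1 << 20;
    CUdevice dev;  CUcontext ctx;  CUstream stream;
    CUdeviceptr dbuf;  void *hbuf;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));
    CHECK(cuMemAlloc(&dbuf, bytes));
    /* Pinned host memory: needed if the async copy is to overlap a running kernel. */
    CHECK(cuMemHostAlloc(&hbuf, bytes, 0));
    /* A private stream that does not synchronize with the default stream. */
    CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));

    /* Path 1: the Open MPI 1.8 default.  As described above, cuMemcpy
     * synchronizes with the work already outstanding on the GPU, so if the
     * application has a long-running kernel in flight, this call (and hence
     * the MPI_Send/MPI_Recv that issued it) waits for that kernel to finish.
     * The cast relies on unified addressing (standard on 64-bit Linux). */
    CHECK(cuMemcpy((CUdeviceptr)(uintptr_t)hbuf, dbuf, bytes));

    /* Path 2: the cuMemcpyAsync/cuStreamSynchronize pattern.  The copy is
     * queued on the private stream and can run concurrently with kernels on
     * other streams (hardware permitting); the synchronize waits only for
     * this copy, not for unrelated kernels. */
    CHECK(cuMemcpyAsync((CUdeviceptr)(uintptr_t)hbuf, dbuf, bytes, stream));
    CHECK(cuStreamSynchronize(stream));

    printf("both copy paths completed\n");

    CHECK(cuMemFreeHost(hbuf));
    CHECK(cuMemFree(dbuf));
    CHECK(cuStreamDestroy(stream));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}

This is essentially why the internal stream plus cuMemcpyAsync/cuStreamSynchronize
combination lets MPI_Send/MPI_Recv make progress while an application kernel is
still running.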

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Geoffrey Paulsen
Sent: Wednesday, August 12, 2015 12:55 PM
To: us...@open-mpi.org
Cc: us...@open-mpi.org; Sameh S Sharkawi
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

I'm confused why this application needs an asynchronous cuMemcpyAsync() in a 
blocking MPI call.   Rolf, could you please explain?

And how is a call to cuMemcpyAsync() followed by a synchronization any 
different than a cuMemcpy() in this use case?

I would still expect that if the MPI_Send / Recv call issued the 
cuMemcpyAsync() that it would be MPI's responsibility to issue the 
synchronization call as well.



---
Geoffrey Paulsen
Software Engineer, IBM Platform MPI
IBM Platform-MPI
Phone: 720-349-2832
Email: gpaul...@us.ibm.com
www.ibm.com


- Original message -
From: Rolf vandeVaart <rvandeva...@nvidia.com>
Sent by: "users" <users-boun...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Cc:
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
Date: Tue, Aug 11, 2015 1:45 PM

I talked with Jeremia off list and we figured out what was going on.  There is 
the ability to use the cuMemcpyAsync/cuStreamSynchronize rather than the 
cuMemcpy but it was never made the default for Open MPI 1.8 series.  So, to get 
that behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior 
in 1.10 and all future versions.  In addition, he is right about not being able 
to see these variables in the Open MPI 1.8 series.  This was a bug and it has 
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that 
back into 1.10.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of a Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMempcy rather than the asynchornous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw support for async memcpy's are
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_c
>uda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#15 0x2bf4f322 in PMPI_Send () from
>/users/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmpi.so.1
>
>___
>users mailing list
>us...@open-mpi.org
>Subscriptio

Re: [OMPI users] Problem in using openmpi-1.8.7

2015-08-12 Thread Jeff Squyres (jsquyres)
This is likely because you installed Open MPI 1.8.7 into the same directory as 
a prior Open MPI installation.

You probably want to uninstall the old version first (e.g., run "make 
uninstall" from the old version's build tree), or just install 1.8.7 into a new 
tree.
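
(Concretely, something along these lines -- the --prefix value is just an
example:

   cd /path/to/old/openmpi-build && make uninstall
   cd /path/to/openmpi-1.8.7
   ./configure --prefix=/opt/openmpi-1.8.7
   make all install
   /opt/openmpi-1.8.7/bin/mpicc examples/hello_c.c -o hello_c
   /opt/openmpi-1.8.7/bin/mpiexec -np 2 hello_c

Invoking the new mpicc/mpiexec by full path makes sure the freshly built plugins
in the new tree are used instead of the stale ones under /usr/local/lib/openmpi.)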



> On Aug 11, 2015, at 2:22 PM, Amos Leffler  wrote:
> 
> Dear Users,
> I have run into a problem with openmpi-1.8.7.  It configures and 
> installs properly but when I tested it using examples it gave me numerous 
> errors with mpicc as shown in the output below.  Have I made an error in the 
> process?
> 
> Amoss-MacBook-Pro:openmpi-1.8.7 amosleff$ cd examples
> Amoss-MacBook-Pro:examples amosleff$ mpicc hello_c.c -o hello_c -g
> Amoss-MacBook-Pro:examples amosleff$ mpiexec hello_c
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_ess_slurmd: 
> dlopen(/usr/local/lib/openmpi/mca_ess_slurmd.so, 9): Symbol not found: 
> _orte_jmap_t_class
>   Referenced from: /usr/local/lib/openmpi/mca_ess_slurmd.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_ess_slurmd.so (ignored)
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_errmgr_default: 
> dlopen(/usr/local/lib/openmpi/mca_errmgr_default.so, 9): Symbol not found: 
> _orte_errmgr_base_error_abort
>   Referenced from: /usr/local/lib/openmpi/mca_errmgr_default.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_errmgr_default.so (ignored)
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_routed_cm: 
> dlopen(/usr/local/lib/openmpi/mca_routed_cm.so, 9): Symbol not found: 
> _orte_message_event_t_class
>   Referenced from: /usr/local/lib/openmpi/mca_routed_cm.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_routed_cm.so (ignored)
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_routed_linear: 
> dlopen(/usr/local/lib/openmpi/mca_routed_linear.so, 9): Symbol not found: 
> _orte_message_event_t_class
>   Referenced from: /usr/local/lib/openmpi/mca_routed_linear.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_routed_linear.so (ignored)
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_grpcomm_basic: 
> dlopen(/usr/local/lib/openmpi/mca_grpcomm_basic.so, 9): Symbol not found: 
> _opal_profile
>   Referenced from: /usr/local/lib/openmpi/mca_grpcomm_basic.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_grpcomm_basic.so (ignored)
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_grpcomm_hier: 
> dlopen(/usr/local/lib/openmpi/mca_grpcomm_hier.so, 9): Symbol not found: 
> _orte_daemon_cmd_processor
>   Referenced from: /usr/local/lib/openmpi/mca_grpcomm_hier.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_grpcomm_hier.so (ignored)
> [Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_filem_rsh: 
> dlopen(/usr/local/lib/openmpi/mca_filem_rsh.so, 9): Symbol not found: 
> _opal_uses_threads
>   Referenced from: /usr/local/lib/openmpi/mca_filem_rsh.so
>   Expected in: flat namespace
>  in /usr/local/lib/openmpi/mca_filem_rsh.so (ignored)
> [Amoss-MacBook-Pro:61027] *** Process received signal ***
> [Amoss-MacBook-Pro:61027] Signal: Segmentation fault: 11 (11)
> [Amoss-MacBook-Pro:61027] Signal code: Address not mapped (1)
> [Amoss-MacBook-Pro:61027] Failing at address: 0x10013
> [Amoss-MacBook-Pro:61027] [ 0] 0   libsystem_platform.dylib
> 0x7fff92aebf1a _sigtramp + 26
> [Amoss-MacBook-Pro:61027] [ 1] 0   ??? 
> 0x7fff508ce0af 0x0 + 140734544797871
> [Amoss-MacBook-Pro:61027] [ 2] 0   libopen-rte.7.dylib 
> 0x00010f386e45 orte_rmaps_base_map_job + 2789
> [Amoss-MacBook-Pro:61027] [ 3] 0   libopen-pal.6.dylib 
> 0x00010f3ffaed opal_libevent2021_event_base_loop + 2333
> [Amoss-MacBook-Pro:61027] [ 4] 0   mpiexec 
> 0x00010f333288 orterun + 6440
> [Amoss-MacBook-Pro:61027] [ 5] 0   mpiexec 
> 0x00010f331942 main + 34
> [Amoss-MacBook-Pro:61027] [ 6] 0   libdyld.dylib   
> 0x7fff94d455c9 start + 1
> [Amoss-MacBook-Pro:61027] [ 7] 0   ??? 
> 0x0002 0x0 + 2
> [Amoss-MacBook-Pro:61027] *** End of error message ***
> Segmentation fault: 11
> 
> Your help would be much appreciated.
> Amos Leffler
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/08/27430.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader
The admin that rolled the hcoll rpm that we're using (and got it in 
system space) said that she got it from 
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.


Thanks,
David

On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader > wrote:


Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN
 Could not open the KNEM device file at /dev/knem : No such file
or direc
tory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN
 Could not open the KNEM device file at /dev/knem : No such file
or direc
tory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h

pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641

3 0x00056e4c mxm_error_signal_handler()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro

ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616

4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()
 ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
 coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h

pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641

3 0x00056e4c mxm_error_signal_handler()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro

ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616

4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()
 ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
 coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
--

mpirun noticed that process rank 1 with PID 14678 on node zo-fe1
exited on signal 11 (Segmentation fault).
--

Thanks,
David

On 08/12/2015 10:42 AM, Deva wrote:

Hi David,

This issue is from hcoll library. This could be because of symbol
conflict with ml module.  This is fixed recently in HCOLL.  Can
you try with "-mca coll ^ml" and see if this workaround works in
your setup?

-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader mailto:dshra...@lanl.gov>> wrote:

Hello Gilles,

Thank you very much for the patch! It is much more complete
than mine. Using that patch and re-running autogen.pl
, I am able to build 1.8.8 with
'./configure --with-hcoll' without errors.

I do have issues when it comes to running 1.8.8 with hcoll
built in, however. In my quick sanity test of running a basic
parallel hello world C program, I get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM
 WARN  Could not open the KNEM device file at /dev/knem : No
such file or direc
  

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-12 Thread Howard Pritchard
Hi Nate,

Sorry for the delay in getting back to you.

We're somewhat stuck on how to help you, but here are two suggestions.

Could you add the following to your launch command line

--mca odls_base_verbose 100

so we can see exactly what arguments are being fed to java when launching
your app.
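
(For example, with the launch line from earlier in the thread:

   mpirun --mca odls_base_verbose 100 -np 1 java MPITestBroke twitter/
)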

Also, if you could put your MPITestBroke.class file somewhere (like google
drive) where we could get it and try to run locally or at NERSC, that might
help us narrow down the problem.  Better yet, if you have the class or jar
file for the entire app plus some data sets, we could try that out as well.

All the config outputs, etc. you've sent so far indicate a correct
installation
of open mpi.

Howard


On Aug 6, 2015 1:54 PM, "Nate Chambers"  wrote:

> Howard,
>
> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still segfaults
> as before. I must admit I am new to MPI, so is it possible I'm just
> configuring or running incorrectly? Let me list my steps for you, and maybe
> something will jump out? Also attached is my config.log.
>
>
> CONFIGURE
> ./configure --prefix= --enable-mpi-java CC=gcc
>
> MAKE
> make all install
>
> RUN
> /mpirun -np 1 java MPITestBroke twitter/
>
>
> DEFAULT JAVA AND GCC
>
> $ java -version
> java version "1.7.0_21"
> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>
> $ gcc --v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=
> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
> --enable-threads=posix --enable-checking=release --with-system-zlib
> --enable-__cxa_atexit --disable-libunwind-exceptions
> --enable-gnu-unique-object
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
> --enable-java-awt=gtk --disable-dssi
> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
> --enable-libgcj-multifile --enable-java-maintainer-mode
> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
> --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>
>
>
>
>
> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard 
> wrote:
>
>> HI Nate,
>>
>> We're trying this out on a mac running mavericks and a cray xc system.
>> the mac has java 8
>> while the cray xc has java 7.
>>
>> We could not get the code to run just using the java launch command,
>> although we noticed if you add
>>
>> catch(NoClassDefFoundError e) {
>>
>>   System.out.println("Not using MPI its out to lunch for now");
>>
>> }
>>
>> as one of the catches after the try for firing up MPI, you can get
>> further.
>>
>> Instead we tried on the two systems using
>>
>> mpirun -np 1 java MPITestBroke tweets repeat.txt
>>
>> and, you guessed it, we can't reproduce the error, at least using master.
>>
>> Would you mind trying to get a copy of nightly master build off of
>>
>> http://www.open-mpi.org/nightly/master/
>>
>> and install that version and give it a try.
>>
>> If that works, then I'd suggest using master (or v2.0) for now.
>>
>> Howard
>>
>>
>>
>>
>> 2015-08-05 14:41 GMT-06:00 Nate Chambers :
>>
>>> Howard,
>>>
>>> Thanks for looking at all this. Adding System.gc() did not cause it to
>>> segfault. The segfault still comes much later in the processing.
>>>
>>> I was able to reduce my code to a single test file without other
>>> dependencies. It is attached. This code simply opens a text file and reads
>>> its lines, one by one. Once finished, it closes and opens the same file and
>>> reads the lines again. On my system, it does this about 4 times until the
>>> segfault fires. Obviously this code makes no sense, but it's based on our
>>> actual code that reads millions of lines of data and does various
>>> processing to it.
>>>
>>> Attached is a tweets.tgz file that you can uncompress to have an input
>>> directory. The text file is just the same line over and over again. Run it
>>> as:
>>>
>>> *java MPITestBroke tweets/*
>>>
>>>
>>> Nate
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard 
>>> wrote:
>>>
 Hi Nate,

 Sorry for the delay in getting back.  Thanks for the sanity check.  You
 may have a point about the args string to MPI.init -
 there's nothing the Open MPI is needing from this but that is a
 difference with your use case - your app has an argument.

 Would you mind adding a

 System.gc()

 call immediately after MPI.init call and see if the gc blows up with a
 segfault?

 Also, may be interesting to add the -verbose:jni to your command line.

 We'll do some experiments here with the init string arg.

 Is your app open source where we could download it and try to reproduce
 the problem locally?

 thanks,

 Howard


 2015-08-04 18:

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
David,

This is because of a symbol conflict between hcoll and the ml coll module inside
OMPI (HCOLL is derived from the ml module). This issue is fixed in the hcoll
library and the fix will be available in the next HPCX release.
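
(A quick way to see the overlap, and to check which libhcoll the loader would
actually pick up, is something along these lines -- $MPI_ROOT stands for
whatever prefix this Open MPI build is installed under, and the exact component
file names can differ between versions:

   ldconfig -p | grep hcoll
   nm -D /usr/lib64/libhcoll.so | grep basesmuma | head
   for f in $MPI_ROOT/lib/libmpi.so $MPI_ROOT/lib/openmpi/mca_coll_ml.so \
            $MPI_ROOT/lib/openmpi/mca_bcol_*.so; do
     echo "== $f"; nm -D "$f" 2>/dev/null | grep basesmuma | head -3
   done

If the same bcol/basesmuma symbol names show up both in libhcoll and in the
Open MPI libraries, the dynamic linker can bind hcoll's internal calls to Open
MPI's copies (or vice versa), which is the conflict described above; the
LD_PRELOAD test quoted below forces hcoll's own symbols to win.)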

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:

> Interesting... the seg faults went away:
>
> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> 0: Running on host zo-fe1.lanl.gov
> 0: We have 2 processors
> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>
> This implies to me that some other library is being used instead of
> /usr/lib64/libhcoll.so, but I am not sure how that could be...
>
> Thanks,
> David
>
> On 08/12/2015 03:30 PM, Deva wrote:
>
> Hi David,
>
> I tried same tarball on OFED-1.5.4.1 and I could not reproduce the issue.
> Can you do one more quick test with setting LD_PRELOAD to hcoll lib?
>
> $LD_PRELOAD=  mpirun -n 2  -mca coll ^ml
> ./a.out
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  wrote:
>
>> The admin that rolled the hcoll rpm that we're using (and got it in
>> system space) said that she got it from
>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>
>> Thanks,
>> David
>>
>>
>> On 08/12/2015 10:51 AM, Deva wrote:
>>
>> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
>>
>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader < 
>> dshra...@lanl.gov> wrote:
>>
>>> Hey Devendar,
>>>
>>> It looks like I still get the error:
>>>
>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could
>>> not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem.
>>> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could
>>> not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem.
>>> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
>>> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>>>  backtrace 
>>> 2 0x00056cdc mxm_handle_error()
>>>  
>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
>>> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>
>>> 3 0x00056e4c mxm_error_signal_handler()
>>>  
>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
>>> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>
>>> 4 0x000326a0 killpg()  ??:0
>>> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>>>  coll_ml_module.c:0
>>> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
>>> 9 0x0006ace9 hcoll_create_context()  ??:0
>>> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
>>> 11 0x000f684e mca_coll_base_comm_select()  ??:0
>>> 12 0x00073fc4 ompi_mpi_init()  ??:0
>>> 13 0x00092ea0 PMPI_Init()  ??:0
>>> 14 0x004009b6 main()  ??:0
>>> 15 0x0001ed5d __libc_start_main()  ??:0
>>> 16 0x004008c9 _start()  ??:0
>>> ===
>>>  backtrace 
>>> 2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>> 3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>> 4 0x000326a0 killpg()  ??:0
>>> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
>>> 9 0x0006ace9 hcoll_create_context()  ??:0
>>> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
>>> 11 0x000f684e mca_coll_base_comm_select()  ??:0
>>> 12

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread David Shrader
I remember seeing those, but forgot about them. I am curious, though, 
why using '-mca coll ^ml' wouldn't work for me.


We'll watch for the next HPCX release. Is there an ETA on when that 
release may happen? Thank you for the help!

David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because hcoll's symbols conflict with the ml coll module inside 
OMPI. HCOLL is derived from the ml module. This issue is fixed in the hcoll 
library and the fix will be available in the next HPCX release.


Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader wrote:


Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead
of /usr/lib64/libhcoll.so, but I am not sure how that could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not reproduce
the issue. Can you do one more quick test with setting LD_PRELOAD
to the hcoll lib?

$LD_PRELOAD= mpirun -n 2 -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:

The admin that rolled the hcoll rpm that we're using (and got
it in system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib?  MOFED or HPCX? what
version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2)
procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
===
 backtrace 
2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
do you have "--disable-dlopen" in your configure options? This might force
coll_ml to be loaded first even with -mca coll ^ml.

The next HPCX release is expected by the end of August.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader  wrote:

> I remember seeing those, but forgot about them. I am curious, though, why
> using '-mca coll ^ml' wouldn't work for me.
>
> We'll watch for the next HPCX release. Is there an ETA on when that
> release may happen? Thank you for the help!
> David
>
>
> On 08/12/2015 04:04 PM, Deva wrote:
>
> David,
>
> This is because hcoll's symbols conflict with the ml coll module inside OMPI.
> HCOLL is derived from the ml module. This issue is fixed in the hcoll library and
> the fix will be available in the next HPCX release.
>
> Some earlier discussion on this issue:
> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
>
>> Interesting... the seg faults went away:
>>
>> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> 0: Running on host zo-fe1.lanl.gov
>> 0: We have 2 processors
>> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>>
>> This implies to me that some other library is being used instead of
>> /usr/lib64/libhcoll.so, but I am not sure how that could be...
>>
>> Thanks,
>> David
>>
>> On 08/12/2015 03:30 PM, Deva wrote:
>>
>> Hi David,
>>
>> I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the
>> issue. Can you do one more quick test with setting LD_PRELOAD to the hcoll lib?
>>
>> $LD_PRELOAD=  mpirun -n 2 -mca coll ^ml ./a.out
>>
>> -Devendar
>>
>> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>>
>>> The admin that rolled the hcoll rpm that we're using (and got it in
>>> system space) said that she got it from
>>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> On 08/12/2015 10:51 AM, Deva wrote:
>>>
>>> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
>>>
>>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
>>>
 Hey Devendar,

 It looks like I still get the error:

 [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
 App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
 [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
 [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
 [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
 [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
  backtrace 
 2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x000326a0 killpg()  ??:0
 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x0006ace9 hcoll_create_context()  ??:0
 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
 11 0x000f684e mca_coll_base_comm_select()  ??:0
 12 0x00073fc4 ompi_mpi_init()  ??:0
 13 0x00092ea0 PMPI_Init()  ??:0
 14 0x004009b6 main()  ??:0
 15 0x0001ed5d __libc_start_main()  ??:0
 16 0x004008c9 _start()  ??:0
 ===
  backtrace 
 2 0x00056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x00056e4c mxm_error_signal_handler()  /scrap/jenkins/wo

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-12 Thread Nate Chambers
*I appreciate you trying to help! I put the Java and its compiled .class
file on Dropbox. The directory contains the .java and .class files, as well
as a data/ directory:*

http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0

*You can run it with and without MPI:*

>  java MPITestBroke data/
>  mpirun -np 1 java MPITestBroke data/
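A minimal sketch of a test in this spirit (this is not the exact MPITestBroke source from the Dropbox link; the class name, input file name, and pass count below are only illustrative) looks roughly like:

import java.io.BufferedReader;
import java.io.FileReader;
import mpi.MPI;

// Illustrative stand-in for MPITestBroke: initialize MPI, then repeatedly
// open, read line by line, and close the same text file in the directory
// given on the command line.
public class MPITestSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        String file = args[0] + "/tweets.txt";    // assumed input file name
        for (int pass = 0; pass < 10; pass++) {
            BufferedReader in = new BufferedReader(new FileReader(file));
            long lines = 0;
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                lines++;
            }
            in.close();
            System.out.println(rank + ": pass " + pass + " read " + lines + " lines");
        }

        MPI.Finalize();
    }
}

Compile it with Open MPI's mpijavac wrapper and launch it with either of the two command lines above (the plain java launch additionally needs Open MPI's mpi.jar on the classpath).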

*Attached is a text file of what I see when I run it with mpirun and your
debug flag. Lots of debug lines.*


Nate





On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard 
wrote:

> Hi Nate,
>
> Sorry for the delay in getting back to you.
>
> We're somewhat stuck on how to help you, but here are two suggestions.
>
> Could you add the following to your launch command line
>
> --mca odls_base_verbose 100
>
> so we can see exactly what arguments are being fed to java when launching
> your app.
>
> Also, if you could put your MPITestBroke.class file somewhere (like google
> drive)
> where we could get it and try to run locally or at NERSC, that might help
> us
> narrow down the problem. Better yet, if you have the class or jar file
> for
> the entire app plus some data sets, we could try that out as well.
>
> All the config outputs, etc. you've sent so far indicate a correct
> installation
> of open mpi.
>
> Howard
>
>
> On Aug 6, 2015 1:54 PM, "Nate Chambers"  wrote:
>
>> Howard,
>>
>> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still
>> segfaults as before. I must admit I am new to MPI, so is it possible I'm
>> just configuring or running incorrectly? Let me list my steps for you, and
>> maybe something will jump out? Also attached is my config.log.
>>
>>
>> CONFIGURE
>> ./configure --prefix= --enable-mpi-java CC=gcc
>>
>> MAKE
>> make all install
>>
>> RUN
>> /mpirun -np 1 java MPITestBroke twitter/
>>
>>
>> DEFAULT JAVA AND GCC
>>
>> $ java -version
>> java version "1.7.0_21"
>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>
>> $ gcc -v
>> Using built-in specs.
>> Target: x86_64-redhat-linux
>> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
>> --infodir=/usr/share/info --with-bugurl=
>> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
>> --enable-threads=posix --enable-checking=release --with-system-zlib
>> --enable-__cxa_atexit --disable-libunwind-exceptions
>> --enable-gnu-unique-object
>> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
>> --enable-java-awt=gtk --disable-dssi
>> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
>> --enable-libgcj-multifile --enable-java-maintainer-mode
>> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
>> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
>> --build=x86_64-redhat-linux
>> Thread model: posix
>> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>>
>>
>>
>>
>>
>> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard 
>> wrote:
>>
>>> HI Nate,
>>>
>>> We're trying this out on a mac running mavericks and a cray xc system.
>>> the mac has java 8
>>> while the cray xc has java 7.
>>>
>>> We could not get the code to run just using the java launch command,
>>> although we noticed if you add
>>>
>>> catch(NoClassDefFoundError e) {
>>>
>>>   System.out.println("Not using MPI its out to lunch for now");
>>>
>>> }
>>>
>>> as one of the catches after the try for firing up MPI, you can get
>>> further.
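>>>
>>> Spelled out (a rough sketch only; the class name and messages here are
>>> illustrative, not the exact code we ran), the start-up guard looks like:
>>>
>>> import mpi.MPI;
>>>
>>> public class GuardedInit {
>>>     public static void main(String[] args) {
>>>         boolean usingMpi = false;
>>>         try {
>>>             MPI.Init(args);               // fire up MPI
>>>             usingMpi = true;
>>>         } catch (NoClassDefFoundError e) {
>>>             // the extra catch: keeps a plain "java" launch without the
>>>             // MPI classes on the classpath from dying right here
>>>             System.out.println("Not using MPI its out to lunch for now");
>>>         } catch (Exception e) {
>>>             // MPI.Init declares mpi.MPIException; Exception keeps the sketch short
>>>             System.out.println("MPI.Init failed: " + e);
>>>         }
>>>
>>>         // ... rest of the application ...
>>>
>>>         if (usingMpi) {
>>>             try { MPI.Finalize(); } catch (Exception e) { }
>>>         }
>>>     }
>>> }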
>>>
>>> Instead we tried on the two systems using
>>>
>>> mpirun -np 1 java MPITestBroke tweets repeat.txt
>>>
>>> and, you guessed it, we can't reproduce the error, at least using master.
>>>
>>> Would you mind trying to get a copy of nightly master build off of
>>>
>>> http://www.open-mpi.org/nightly/master/
>>>
>>> and install that version and give it a try.
>>>
>>> If that works, then I'd suggest using master (or v2.0) for now.
>>>
>>> Howard
>>>
>>>
>>>
>>>
>>> 2015-08-05 14:41 GMT-06:00 Nate Chambers :
>>>
 Howard,

 Thanks for looking at all this. Adding System.gc() did not cause it to
 segfault. The segfault still comes much later in the processing.

 I was able to reduce my code to a single test file without other
 dependencies. It is attached. This code simply opens a text file and reads
 its lines, one by one. Once finished, it closes and opens the same file and
 reads the lines again. On my system, it does this about 4 times until the
 segfault fires. Obviously this code makes no sense, but it's based on our
 actual code that reads millions of lines of data and does various
 processing to it.

 Attached is a tweets.tgz file that you can uncompress to have an input
 directory. The text file is just the same line over and over again. Run it
 as:

 *java MPITestBroke tweets/*


 Nate





 On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard 
 wrote:

> Hi Nate,
>
> Sorry for the delay in getti