I don't have that option on the configure command line, but my platform
file is using "enable_dlopen=no." I imagine that is getting the same
result. Thank you for the pointer!
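For anyone following along, the two should be equivalent; a minimal sketch of
both spellings (the platform file name is just a placeholder):

  # inside a platform file passed with ./configure --with-platform=<file>
  enable_dlopen=no

  # or directly on the configure command line
  ./configure --disable-dlopen ...

Either way the components should end up built into the libraries rather than
dlopen'ed at run time.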
Thanks,
David
On 08/12/2015 05:04 PM, Deva wrote:
Do you have "--disable-dlopen" in your configure options? This might
force coll_ml to be loaded first even with "-mca coll ^ml".
The next HPCX release is expected by the end of August.
-Devendar
On Wed, Aug 12, 2015 at 3:30 PM, David Shrader <dshra...@lanl.gov> wrote:
I remember seeing those, but forgot about them. I am curious,
though, why using '-mca coll ^ml' wouldn't work for me.
We'll watch for the next HPCX release. Is there an ETA on when
that release may happen? Thank you for the help!
David
On 08/12/2015 04:04 PM, Deva wrote:
David,
This is because hcoll's symbols conflict with the ml coll module
inside OMPI; HCOLL is derived from the ml module. This issue is fixed
in the hcoll library, and the fix will be available in the next HPCX release.
Some earlier discussion of this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
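A rough way to see the overlap, if you are curious, is to compare the exported
symbol names from libhcoll and from OMPI's coll_ml component (paths and prefix
below are assumptions; use libmpi.so instead if the components are built into
the library):

  nm -D --defined-only /usr/lib64/libhcoll.so | awk '{print $3}' | sort > /tmp/hcoll.syms
  nm -D --defined-only <ompi-prefix>/lib/openmpi/mca_coll_ml.so | awk '{print $3}' | sort > /tmp/coll_ml.syms
  comm -12 /tmp/hcoll.syms /tmp/coll_ml.syms   # names defined by both

Any names printed by the last command are defined in both libraries and can get
resolved to the wrong copy at run time.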
-Devendar
On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:
Interesting... the seg faults went away:
[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
This implies to me that some other library is being used
instead of /usr/lib64/libhcoll.so, but I am not sure how that
could be...
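If it helps narrow that down, here are the quick checks I can think of for
which libhcoll actually gets resolved (paths are assumptions based on our
install prefix):

  # what libmpi (or the hcoll component, if it was built as a DSO) links against
  ldd <ompi-prefix>/lib/libmpi.so | grep -i hcoll
  ldd <ompi-prefix>/lib/openmpi/mca_coll_hcoll.so | grep -i hcoll

  # what the dynamic loader actually loads at run time
  LD_DEBUG=libs mpirun -n 2 -mca coll ^ml ./a.out 2>&1 | grep -i libhcoll

A second libhcoll earlier in the runtime search path would explain why the
LD_PRELOAD changes the behavior.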
Thanks,
David
On 08/12/2015 03:30 PM, Deva wrote:
Hi David,
I tried the same tarball on OFED-1.5.4.1 and I could not
reproduce the issue. Can you do one more quick test with
LD_PRELOAD set to the hcoll lib?
$ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
-Devendar
On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
The admin who rolled the hcoll rpm that we're using
(and got it into system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
Thanks,
David
On 08/12/2015 10:51 AM, Deva wrote:
From where did you grab this HCOLL lib? MOFED or HPCX?
What version?
On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
Hey Devendar,
It looks like I still get the error:
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
12 0x0000000000073fc4 ompi_mpi_init()  ??:0
13 0x0000000000092ea0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
12 0x0000000000073fc4 ompi_mpi_init()  ??:0
13 0x0000000000092ea0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Thanks,
David
On 08/12/2015 10:42 AM, Deva wrote:
Hi David,
This issue is from the hcoll library. It could be because
of a symbol conflict with the ml module, which was recently
fixed in HCOLL. Can you try with "-mca coll ^ml" and see if
this workaround works in your setup?
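One more data point that may help: bumping the coll framework verbosity
should show which coll components get queried and selected on each
communicator, e.g.

  mpirun -n 2 -mca coll ^ml -mca coll_base_verbose 10 ./a.out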
-Devendar
On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
Hello Gilles,
Thank you very much for the patch! It is much
more complete than mine. Using that patch and
re-running autogen.pl, I am able to build 1.8.8 with
'./configure --with-hcoll' without errors.
I do have issues when it comes to running
1.8.8 with hcoll built in, however. In my
quick sanity test of running a basic parallel
hello world C program, I get the following:
[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I do not get this message with only 1 process.
I am using hcoll 3.2.748. Could this be an
issue with hcoll itself or something with my
ompi build?
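For reference, the sanity test is just a basic MPI hello world along the
lines of the sketch below (reconstructed to match the output above, not the
exact source):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, i;
      char host[256], msg[512];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      gethostname(host, sizeof(host));

      if (rank == 0) {
          /* rank 0 announces itself and collects a greeting from every other rank */
          printf("%d: Running on host %s\n", rank, host);
          printf("%d: We have %d processors\n", rank, size);
          for (i = 1; i < size; i++) {
              MPI_Recv(msg, sizeof(msg), MPI_CHAR, MPI_ANY_SOURCE, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              printf("%d: %s\n", rank, msg);
          }
      } else {
          /* every other rank sends its greeting to rank 0 */
          snprintf(msg, sizeof(msg),
                   "Hello %d! Processor %d on host %s reporting for duty",
                   rank, rank, host);
          MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }

With hcoll built in, the crash happens before any of this output appears,
during coll component selection inside MPI_Init (per the backtrace above).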
Thanks,
David
On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
Thanks David,
I made a PR for the v1.8 branch at
https://github.com/open-mpi/ompi-release/pull/492;
the patch is attached (it required some back-porting).
Cheers,
Gilles
On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config
branch and, after running autogen.pl, have found that
'./configure --with-hcoll' does indeed work
now. I used Gilles' branch as I wasn't sure
how best to get the pull request changes into
my own clone of master. It looks like the
proper checks are happening, too:
--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes
I haven't checked whether or not Open MPI
builds successfully as I don't have much
experience running off of the latest source.
For now, I think I will try to generate a
patch to the 1.8.8 configure script and see
if that works as expected.
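In case it is useful to anyone else, the rough sequence I have in mind (the
patch file name is a placeholder for whatever I end up extracting from the PR):

  cd openmpi-1.8.8
  patch -p1 < hcoll-configury.patch        # hypothetical back-port of the PR 492 changes
  ./configure --with-hcoll --prefix=$HOME/ompi-1.8.8-hcoll
  make -j 8 install
  ompi_info | grep -i hcoll                # should list the coll hcoll component if it built

The ompi_info check at the end is just a quick way to confirm the component
actually made it into the install.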
Thanks,
David
On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
Please fix the hcoll test (and code) to be correct.
Any configure test that adds /usr/lib and/or /usr/include
to any compile flags is broken.
+1
Gilles filed https://github.com/open-mpi/ompi/pull/796; I
just added some comments to it.
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov