I remember seeing those, but had forgotten about them. I am curious, though,
why using '-mca coll ^ml' wouldn't work for me.
We'll watch for the next HPCX release. Is there an ETA on when that
release may happen? Thank you for the help!
David
On 08/12/2015 04:04 PM, Deva wrote:
David,
This is because of a conflict between hcoll symbols and the ml coll module
inside OMPI; HCOLL is derived from the ml module. This issue is fixed in the
hcoll library, and the fix will be available in the next HPCX release.
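To illustrate the kind of clash being described, here is a toy sketch (the file and symbol names below are invented; this is not hcoll or Open MPI code) of two shared libraries that export the same symbol, where load order decides which definition a caller actually gets:

/* symclash.c -- a toy sketch, not hcoll or Open MPI code.
 * Build the same file into two shared libraries and one executable:
 *   gcc -shared -fPIC -DBUILD_LIB -DLIB_TAG='"libfirst.so"'  -o libfirst.so  symclash.c
 *   gcc -shared -fPIC -DBUILD_LIB -DLIB_TAG='"libsecond.so"' -o libsecond.so symclash.c
 *   gcc symclash.c -o symclash -ldl
 *   ./symclash
 * Both libraries export comm_query(); once both are in the global symbol
 * scope, a lookup returns whichever definition was loaded first.  Preloading
 * the other library (or not loading one of them at all) changes which
 * definition wins -- the same class of problem as the hcoll-vs-coll/ml
 * clash described above. */
#ifdef BUILD_LIB
#include <stdio.h>
void comm_query(void) { printf("comm_query() came from %s\n", LIB_TAG); }
#else
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load both libraries into the global scope, in this order. */
    if (!dlopen("./libfirst.so", RTLD_NOW | RTLD_GLOBAL) ||
        !dlopen("./libsecond.so", RTLD_NOW | RTLD_GLOBAL)) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    /* Global lookup: both libraries define the symbol, but the one loaded
     * first wins, silently hiding the other definition. */
    void (*fn)(void) = (void (*)(void)) dlsym(RTLD_DEFAULT, "comm_query");
    if (fn != NULL)
        fn();
    return 0;
}
#endif

Disabling OMPI's own ml component ("-mca coll ^ml") or preloading the intended libhcoll.so are both ways of controlling which definitions end up in that global scope.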
Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
-Devendar
On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:
Interesting... the seg faults went away:
[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
This implies to me that some other library is being used instead
of /usr/lib64/libhcoll.so, but I am not sure how that could be...
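If it helps to confirm which library is actually being picked up at run time, here is a small diagnostic sketch (my own, not part of the test program in this thread). It assumes an exported symbol such as hcoll_get_version, which the configure output further down the thread probes for, and that a libhcoll.so is reachable on the loader's search path:

/* which_hcoll.c -- a diagnostic sketch (not part of the test program above):
 * resolve a symbol at run time and report which shared object provides it.
 * hcoll_get_version is used only because configure probes for it; any
 * exported hcoll symbol would do.
 * Build and run (paths are examples):
 *   gcc which_hcoll.c -o which_hcoll -ldl
 *   ./which_hcoll
 *   LD_PRELOAD=/usr/lib64/libhcoll.so ./which_hcoll */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* First see if the symbol is already visible (e.g. via LD_PRELOAD);
     * otherwise load libhcoll.so from the default library search path. */
    void *sym = dlsym(RTLD_DEFAULT, "hcoll_get_version");
    if (sym == NULL) {
        void *lib = dlopen("libhcoll.so", RTLD_NOW | RTLD_GLOBAL);
        if (lib == NULL) {
            fprintf(stderr, "dlopen(libhcoll.so) failed: %s\n", dlerror());
            return 1;
        }
        sym = dlsym(lib, "hcoll_get_version");
    }

    Dl_info info;
    if (sym != NULL && dladdr(sym, &info) && info.dli_fname != NULL)
        printf("hcoll_get_version resolves from: %s\n", info.dli_fname);
    else
        printf("could not resolve hcoll_get_version\n");
    return 0;
}

Comparing the reported path with and without the LD_PRELOAD should show whether some other copy of libhcoll is being resolved.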
Thanks,
David
On 08/12/2015 03:30 PM, Deva wrote:
Hi David,
I tried the same tarball on OFED-1.5.4.1 and I could not reproduce
the issue. Can you do one more quick test with setting LD_PRELOAD
to the hcoll lib?
$ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
-Devendar
On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
The admin who rolled the hcoll rpm that we're using (and got
it into system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
Thanks,
David
On 08/12/2015 10:51 AM, Deva wrote:
Where did you grab this HCOLL lib from? MOFED or HPCX? What version?
On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
Hey Devendar,
It looks like I still get the error:
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
12 0x0000000000073fc4 ompi_mpi_init()  ??:0
13 0x0000000000092ea0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
12 0x0000000000073fc4 ompi_mpi_init()  ??:0
13 0x0000000000092ea0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Thanks,
David
On 08/12/2015 10:42 AM, Deva wrote:
Hi David,
This issue is from the hcoll library. It could be because of a
symbol conflict with the ml module. This was fixed recently in
HCOLL. Can you try with "-mca coll ^ml" and see if this
workaround works in your setup?
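If it is easier to test without editing the mpirun command line, the same setting can also be supplied through the environment: Open MPI reads MCA parameters of the form OMPI_MCA_<name> during MPI_Init, so exporting OMPI_MCA_coll="^ml" in the shell should be equivalent to "-mca coll ^ml". A minimal sketch (my own, for illustration) that sets it from inside the program before MPI_Init:

/* A sketch of applying the same workaround programmatically: set the
 * OMPI_MCA_coll environment variable before MPI_Init, which is when Open
 * MPI reads its MCA parameters.  This mirrors "-mca coll ^ml" (exclude the
 * coll/ml component); whether it helps in this setup is exactly what the
 * test would show. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Equivalent of "mpirun -mca coll ^ml". */
    setenv("OMPI_MCA_coll", "^ml", 1);

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Init completed with coll/ml excluded\n");

    MPI_Finalize();
    return 0;
}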
-Devendar
On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
Hello Gilles,
Thank you very much for the patch! It is much more
complete than mine. Using that patch and re-running
autogen.pl, I am able to build
1.8.8 with './configure --with-hcoll' without errors.
I do have issues when it comes to running 1.8.8
with hcoll built in, however. In my quick sanity
test of running a basic parallel hello world C
program, I get the following:
[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x00000000000326a0 killpg()  ??:0
 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
 9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I do not get this message with only 1 process.
I am using hcoll 3.2.748. Could this be an issue
with hcoll itself or something with my ompi build?
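For reference, the "basic parallel hello world C program" used for the sanity test is along these lines; this is a reconstruction from the output quoted elsewhere in the thread, not the exact source:

/* hello.c -- a reconstruction of a basic MPI hello-world whose output
 * matches the style quoted in this thread: rank 0 prints its own host and
 * the process count, then prints a greeting gathered from each other rank.
 * Build and run (example): mpicc hello.c && mpirun -n 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);

    if (rank == 0) {
        printf("0: Running on host %s\n", host);
        printf("0: We have %d processors\n", size);
        for (int src = 1; src < size; src++) {
            char peer[MPI_MAX_PROCESSOR_NAME];
            /* Receive each rank's host name and report it. */
            MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("0: Hello %d! Processor %d on host %s reporting for duty\n",
                   src, src, peer);
        }
    } else {
        MPI_Send(host, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}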
Thanks,
David
On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
Thanks David,
I made a PR for the v1.8 branch at
https://github.com/open-mpi/ompi-release/pull/492
The patch is attached (it required some back-porting).
Cheers,
Gilles
On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch
and, after running autogen.pl, have found that
'./configure --with-hcoll' does indeed work now. I used
Gilles' branch as I wasn't sure how best to get
the pull request changes into my own clone of
master. It looks like the proper checks are
happening, too:
--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes
I haven't checked whether or not Open MPI builds
successfully as I don't have much experience
running off of the latest source. For now, I
think I will try to generate a patch to the 1.8.8
configure script and see if that works as expected.
Thanks,
David
On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres)
wrote:
On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
Please fix the hcoll test (and code) to be correct.
Any configure test that adds /usr/lib and/or /usr/include to
any compile flags is broken.
+1
Gilles filed https://github.com/open-mpi/ompi/pull/796; I just
added some comments to it.
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov