Hello Gilles,

Thank you very much for the patch! It is much more complete than mine. Using that patch and re-running autogen.pl, I am able to build 1.8.8 with './configure --with-hcoll' without errors.

I do have issues running 1.8.8 with hcoll built in, however. In my quick sanity test of running a basic parallel hello world C program, I get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x0000000000056cdc mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x0000000000056e4c mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x00000000000326a0 killpg()  ??:0
5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
==== backtrace ====
2 0x0000000000056cdc mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x0000000000056e4c mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x00000000000326a0 killpg()  ??:0
5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x000000000006ace9 hcoll_create_context()  ??:0
10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
12 0x0000000000074ee4 ompi_mpi_init()  ??:0
13 0x0000000000093dc0 PMPI_Init()  ??:0
14 0x00000000004009b6 main()  ??:0
15 0x000000000001ed5d __libc_start_main()  ??:0
16 0x00000000004008c9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I do not get this message with only 1 process.
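For reference, the test program is just the canonical MPI hello world, roughly the following (a sketch; my actual source may differ slightly). Per the backtraces above, the crash happens inside MPI_Init(), during communicator/collective selection:

```c
/* Minimal MPI "hello world" sanity test.
 * Build and run (assuming an MPI install):
 *   mpicc hello.c -o a.out
 *   mpirun -n 2 ./a.out
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* The backtrace shows the segfault under PMPI_Init ->
     * mca_coll_base_comm_select -> hcoll, i.e. before any
     * user code after this call runs. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```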

I am using hcoll 3.2.748. Could this be an issue with hcoll itself, or something with my Open MPI build?

Thanks,
David

On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
Thanks David,

I made a PR for the v1.8 branch at https://github.com/open-mpi/ompi-release/pull/492

The patch is attached (it required some back-porting).

Cheers,

Gilles

On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch and, after running autogen.pl, have found that './configure --with-hcoll' does indeed work now. I used Gilles' branch as I wasn't sure how best to get the pull request changes into my own clone of master. It looks like the proper checks are happening, too:

--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether Open MPI builds successfully, as I don't have much experience running off the latest source. For now, I think I will try to generate a patch to the 1.8.8 configure script and see if that works as expected.

Thanks,
David

On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile flags is broken.
+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.


--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27432.php



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27434.php

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
