From where did you grab this HCOLL lib? MOFED or HPCX? What version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

> Hey Devendar,
>
> It looks like I still get the error:
>
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
> [1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
> ==== backtrace ====
> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
> 4 0x00000000000326a0 killpg()  ??:0
> 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x000000000006ace9 hcoll_create_context()  ??:0
> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
> 13 0x0000000000092ea0 PMPI_Init()  ??:0
> 14 0x00000000004009b6 main()  ??:0
> 15 0x000000000001ed5d __libc_start_main()  ??:0
> 16 0x00000000004008c9 _start()  ??:0
> ===================
> ==== backtrace ====
> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
> 4 0x00000000000326a0 killpg()  ??:0
> 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x000000000006ace9 hcoll_create_context()  ??:0
> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
> 13 0x0000000000092ea0 PMPI_Init()  ??:0
> 14 0x00000000004009b6 main()  ??:0
> 15 0x000000000001ed5d __libc_start_main()  ??:0
> 16 0x00000000004008c9 _start()  ??:0
> ===================
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Thanks,
> David
>
> On 08/12/2015 10:42 AM, Deva wrote:
>
> Hi David,
>
> This issue is coming from the hcoll library. It could be caused by a symbol
> conflict with the ml module, which was fixed recently in HCOLL. Can you try
> running with "-mca coll ^ml" and see if that workaround helps in your setup?
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Hello Gilles,
>>
>> Thank you very much for the patch! It is much more complete than mine.
>> Using that patch and re-running autogen.pl, I am able to build 1.8.8
>> with './configure --with-hcoll' without errors.
>>
>> I do have issues when it comes to running 1.8.8 with hcoll built in,
>> however. In my quick sanity test of running a basic parallel hello world C
>> program, I get the following:
>>
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
>> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>> ==== backtrace ====
>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>> 4 0x00000000000326a0 killpg()  ??:0
>> 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>> 14 0x00000000004009b6 main()  ??:0
>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>> 16 0x00000000004008c9 _start()  ??:0
>> ===================
>> ==== backtrace ====
>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>> 4 0x00000000000326a0 killpg()  ??:0
>> 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>> 14 0x00000000004009b6 main()  ??:0
>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>> 16 0x00000000004008c9 _start()  ??:0
>> ===================
>> --------------------------------------------------------------------------
>>
>> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited
>> on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> I do not get this message with only 1 process.
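>>
>> For reference, the sanity test is just a minimal MPI hello world along
>> these lines (an illustrative sketch rather than my exact source; the
>> backtrace above shows the crash happening inside MPI_Init, during
>> collective component selection):
>>
>> /* hello.c - minimal MPI sanity test (illustrative sketch) */
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size;
>>
>>     /* The backtrace above points here: the segfault happens inside
>>      * MPI_Init(), while the coll/hcoll and coll/ml components are
>>      * being queried and selected. */
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>     printf("Hello from rank %d of %d\n", rank, size);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> It is compiled with something like "mpicc hello.c -o a.out" (the file
>> name here is just a placeholder).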
>>
>> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or
>> something with my ompi build?
>>
>> Thanks,
>> David
>>
>> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>>
>> Thanks David,
>>
>> I made a PR for the v1.8 branch at
>> https://github.com/open-mpi/ompi-release/pull/492
>>
>> the patch is attached (it required some back-porting)
>>
>> Cheers,
>>
>> Gilles
>>
>> On 8/12/2015 4:01 AM, David Shrader wrote:
>>
>> I have cloned Gilles' topic/hcoll_config branch and, after running
>> autogen.pl, have found that './configure --with-hcoll' does indeed work
>> now. I used Gilles' branch as I wasn't sure how best to get the pull
>> request changes into my own clone of master. It looks like the proper
>> checks are happening, too:
>>
>> --- MCA component coll:hcoll (m4 configuration macro)
>> checking for MCA component coll:hcoll compile mode... dso
>> checking --with-hcoll value... simple ok (unspecified)
>> checking hcoll/api/hcoll_api.h usability... yes
>> checking hcoll/api/hcoll_api.h presence... yes
>> checking for hcoll/api/hcoll_api.h... yes
>> looking for library without search path
>> checking for library containing hcoll_get_version... -lhcoll
>> checking if MCA component coll:hcoll can compile... yes
>>
>> I haven't checked whether Open MPI builds successfully, as I don't have
>> much experience running off the latest source. For now, I think I will
>> try to generate a patch for the 1.8.8 configure script and see if that
>> works as expected.
>>
>> Thanks,
>> David
>>
>> On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
>>
>> On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
>>
>> Please fix the hcoll test (and code) to be correct.
>>
>> Any configure test that adds /usr/lib and/or /usr/include to any compile 
>> flags is broken.
>>
>> +1
>>
>> Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
>> comments to it.
>>
>>
>>
>> --
>> David Shrader
>> HPC-3 High Performance Computer Systems
>> Los Alamos National Lab
>> Email: dshrader <at> lanl.gov
>>
>>
>>
>> --
>> David Shrader
>> HPC-3 High Performance Computer Systems
>> Los Alamos National Lab
>> Email: dshrader <at> lanl.gov
>>
>>
>>
>
>
>
> --
>
>
> -Devendar
>
>
>
> --
> David Shrader
> HPC-3 High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader <at> lanl.gov
>
>
>



-- 


-Devendar
