The admin who rolled the hcoll rpm we're using (and installed it in system space) said that she got it from hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
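
(For anyone double-checking a similar install, querying the rpm metadata is one way to confirm where an hcoll package came from; the package name "hcoll" below is an assumption and may differ on other systems.)

    # hypothetical check; package name and available fields vary by distribution
    rpm -qi hcoll | grep -E 'Version|Release|Vendor|Source RPM'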

Thanks,
David

On 08/12/2015 10:51 AM, Deva wrote:
From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

    Hey Devendar,

    It looks like I still get the error:

    [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
    App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
    [1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
    [1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
    [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
    [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
    ==== backtrace ====
    2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
    3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
    4 0x00000000000326a0 killpg()  ??:0
    5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
    6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
    7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
    8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
    9 0x000000000006ace9 hcoll_create_context()  ??:0
    10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
    11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
    12 0x0000000000073fc4 ompi_mpi_init()  ??:0
    13 0x0000000000092ea0 PMPI_Init()  ??:0
    14 0x00000000004009b6 main()  ??:0
    15 0x000000000001ed5d __libc_start_main()  ??:0
    16 0x00000000004008c9 _start()  ??:0
    ===================
    ==== backtrace ====
    2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
    3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
    4 0x00000000000326a0 killpg()  ??:0
    5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
    6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
    7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
    8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
    9 0x000000000006ace9 hcoll_create_context()  ??:0
    10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
    11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
    12 0x0000000000073fc4 ompi_mpi_init()  ??:0
    13 0x0000000000092ea0 PMPI_Init()  ??:0
    14 0x00000000004009b6 main()  ??:0
    15 0x000000000001ed5d __libc_start_main()  ??:0
    16 0x00000000004008c9 _start()  ??:0
    ===================
    --------------------------------------------------------------------------
    mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

    Thanks,
    David

    On 08/12/2015 10:42 AM, Deva wrote:
    Hi David,

    This issue is from the hcoll library. It could be caused by a symbol
    conflict with the ml module, which was fixed recently in HCOLL. Can
    you try with "-mca coll ^ml" and see if this workaround works in
    your setup?
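
    (Equivalent ways to apply the same workaround, for reference; the
    per-user file path below is the usual default location and is an
    assumption about this setup:)

        # environment-variable form of the same MCA setting
        export OMPI_MCA_coll="^ml"
        # or add this line to $HOME/.openmpi/mca-params.conf:
        # coll = ^ml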

    -Devendar

    On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:

        Hello Gilles,

        Thank you very much for the patch! It is much more complete
        than mine. Using that patch and re-running autogen.pl, I am
        able to build 1.8.8 with './configure --with-hcoll' without
        errors.

        I do still have issues running 1.8.8 with hcoll built in,
        however. In a quick sanity test running a basic parallel
        hello world C program, I get the following:

        [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
        App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
        [1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
        [1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
        [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
        [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
        ==== backtrace ====
        2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
        3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
        4 0x00000000000326a0 killpg()  ??:0
        5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
        6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
        7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
        8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
        9 0x000000000006ace9 hcoll_create_context()  ??:0
        10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
        11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
        12 0x0000000000074ee4 ompi_mpi_init()  ??:0
        13 0x0000000000093dc0 PMPI_Init()  ??:0
        14 0x00000000004009b6 main()  ??:0
        15 0x000000000001ed5d __libc_start_main()  ??:0
        16 0x00000000004008c9 _start()  ??:0
        ===================
        ==== backtrace ====
        2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
        3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
        4 0x00000000000326a0 killpg()  ??:0
        5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
        6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
        7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
        8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
        9 0x000000000006ace9 hcoll_create_context()  ??:0
        10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
        11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
        12 0x0000000000074ee4 ompi_mpi_init()  ??:0
        13 0x0000000000093dc0 PMPI_Init()  ??:0
        14 0x00000000004009b6 main()  ??:0
        15 0x000000000001ed5d __libc_start_main()  ??:0
        16 0x00000000004008c9 _start()  ??:0
        ===================
        --------------------------------------------------------------------------
        mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
        --------------------------------------------------------------------------

        I do not get this message with only 1 process.
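
        (For reference, the test program is just a standard MPI hello
        world along these lines; this is a minimal sketch, not the
        exact source used:)

        /* hello.c, built with something like: mpicc hello.c -o a.out */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size;
            /* the backtraces above show the crash inside MPI_Init, during
               hcoll's communicator query on MPI_COMM_WORLD */
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            printf("Hello from rank %d of %d\n", rank, size);
            MPI_Finalize();
            return 0;
        }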

        I am using hcoll 3.2.748. Could this be an issue with hcoll
        itself or something with my Open MPI build?

        Thanks,
        David

        On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
        Thanks David,

        I made a PR for the v1.8 branch at
        https://github.com/open-mpi/ompi-release/pull/492

        The patch is attached (it required some back-porting).

        Cheers,

        Gilles

        On 8/12/2015 4:01 AM, David Shrader wrote:
        I have cloned Gilles' topic/hcoll_config branch and, after
        running autogen.pl, have found that './configure --with-hcoll'
        does indeed work now. I used Gilles' branch as I wasn't sure
        how best to get the pull request changes into my own clone of
        master. It looks like the proper checks are happening, too:
        --- MCA component coll:hcoll (m4 configuration macro)
        checking for MCA component coll:hcoll compile mode... dso
        checking --with-hcoll value... simple ok (unspecified)
        checking hcoll/api/hcoll_api.h usability... yes
        checking hcoll/api/hcoll_api.h presence... yes
        checking for hcoll/api/hcoll_api.h... yes
        looking for library without search path
        checking for library containing hcoll_get_version... -lhcoll
        checking if MCA component coll:hcoll can compile... yes

        I haven't checked whether Open MPI builds successfully, as I
        don't have much experience running off the latest source. For
        now, I think I will try to generate a patch for the 1.8.8
        configure script and see if that works as expected.
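
        (The intended 1.8.8 rebuild would look roughly like the
        following; the patch file name and the optional hcoll prefix
        are placeholders, not the actual paths used:)

        # sketch only, assuming the back-ported configure changes are applied
        patch -p1 < hcoll-configure.patch
        ./configure --with-hcoll        # or --with-hcoll=/opt/mellanox/hcoll
        make -j && make install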

        Thanks,
        David

        On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
        On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
        Please fix the hcoll test (and code) to be correct.

        Any configure test that adds /usr/lib and/or /usr/include to any compile flags is broken.

        +1

        Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.
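
        (A minimal sketch of the kind of guard being asked for, in
        configure-style shell; the variable names here are made up for
        illustration and are not taken from the actual m4:)

        # only add -I/-L flags when hcoll lives outside the default search paths
        if test -n "$with_hcoll" && test "$with_hcoll" != "yes" && test "$with_hcoll" != "/usr"; then
            hcoll_CPPFLAGS="-I$with_hcoll/include"
            hcoll_LDFLAGS="-L$with_hcoll/lib"
        fi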


        --
        David Shrader
        HPC-3 High Performance Computer Systems
        Los Alamos National Lab
        Email: dshrader <at> lanl.gov


        _______________________________________________
        users mailing list
        us...@open-mpi.org
        Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
        Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27432.php



        _______________________________________________
        users mailing list
        us...@open-mpi.org
        Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
        Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27434.php

        --
        David Shrader
        HPC-3 High Performance Computer Systems
        Los Alamos National Lab
        Email: dshrader <at> lanl.gov


        _______________________________________________
        users mailing list
        us...@open-mpi.org
        Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
        Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27438.php




    --
    -Devendar


    _______________________________________________
    users mailing list
    us...@open-mpi.org
    Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
    Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27439.php

    --
    David Shrader
    HPC-3 High Performance Computer Systems
    Los Alamos National Lab
    Email: dshrader <at> lanl.gov


    _______________________________________________
    users mailing list
    us...@open-mpi.org
    Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
    Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27440.php




--
-Devendar


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27441.php

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
