I don't have that option on the configure command line, but my platform file is using "enable_dlopen=no". I imagine that has the same effect. Thank you for the pointer!
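(For reference, a platform file entry and the explicit configure flag should be roughly equivalent; a minimal sketch of the two forms, with paths and the trailing "..." purely illustrative:)

    # flag passed directly on the configure command line
    ./configure --disable-dlopen ...

    # or the same setting picked up from a platform file containing the line "enable_dlopen=no"
    ./configure --with-platform=path/to/my_platform_file ...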

Thanks,
David

On 08/12/2015 05:04 PM, Deva wrote:
Do you have "--disable-dlopen" in your configure options? This might force coll_ml to be loaded first even with "-mca coll ^ml".

The next HPCX release is expected by the end of August.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader <dshra...@lanl.gov> wrote:

    I remember seeing those, but forgot about them. I am curious,
    though, why using '-mca coll ^ml' wouldn't work for me.

    We'll watch for the next HPCX release. Is there an ETA on when
    that release may happen? Thank you for the help!
    David


    On 08/12/2015 04:04 PM, Deva wrote:
    David,

    This is because of a symbol conflict between hcoll and the ml coll module
    inside OMPI. HCOLL is derived from the ml module. This issue is fixed in the
    hcoll library and the fix will be available in the next HPCX release.
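    (As an illustration of the kind of conflict meant here — a sketch only; the component path is an assumption that depends on the Open MPI install prefix, and the symbol name is taken from the backtraces further down in this thread — one can compare the dynamic symbols exported by libhcoll and by OMPI's own basesmuma component:)

    nm -D /usr/lib64/libhcoll.so | grep basesmuma
    nm -D <ompi-prefix>/lib/openmpi/mca_bcol_basesmuma.so | grep basesmuma
    # if both libraries export the same un-prefixed symbol names, the dynamic
    # linker can bind one library's internal calls to the other's copy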

    Some earlier discussion on this issue:
    http://www.open-mpi.org/community/lists/users/2015/06/27154.php
    http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

    -Devendar

    On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:

        Interesting... the seg faults went away:

        [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
        [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
        App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
        [1439416182.732720] [zo-fe1:14690:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
        [1439416182.733640] [zo-fe1:14689:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
        0: Running on host zo-fe1.lanl.gov
        0: We have 2 processors
        0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

        This implies to me that some other library is being used
        instead of /usr/lib64/libhcoll.so, but I am not sure how that
        could be...
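        (One way to check which libhcoll.so is actually being resolved — a sketch; the component path below is an assumption and depends on the Open MPI install prefix and on hcoll being built as a DSO:)

        ldd <ompi-prefix>/lib/openmpi/mca_coll_hcoll.so | grep hcoll
        ls -l /usr/lib64/libhcoll.so*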

        Thanks,
        David

        On 08/12/2015 03:30 PM, Deva wrote:
        Hi David,

        I tried the same tarball on OFED-1.5.4.1 and I could not
        reproduce the issue. Can you do one more quick test with
        setting LD_PRELOAD to the hcoll lib?

        $LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out

        -Devendar

        On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:

            The admin who rolled the hcoll rpm that we're using
            (and got it into system space) said that she got it from
            hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

            Thanks,
            David


            On 08/12/2015 10:51 AM, Deva wrote:
            From where did you grab this HCOLL lib? MOFED or HPCX?
            What version?

            On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

                Hey Devendar,

                It looks like I still get the error:

                [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
                App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
                [1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
                [1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
                [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
                [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
                ==== backtrace ====
                 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
                 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
                 4 0x00000000000326a0 killpg()  ??:0
                 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
                 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
                 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
                 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
                 9 0x000000000006ace9 hcoll_create_context()  ??:0
                10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
                11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
                12 0x0000000000073fc4 ompi_mpi_init()  ??:0
                13 0x0000000000092ea0 PMPI_Init()  ??:0
                14 0x00000000004009b6 main()  ??:0
                15 0x000000000001ed5d __libc_start_main()  ??:0
                16 0x00000000004008c9 _start()  ??:0
                ===================
                ==== backtrace ====
                 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
                 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
                 4 0x00000000000326a0 killpg()  ??:0
                 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
                 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
                 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
                 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
                 9 0x000000000006ace9 hcoll_create_context()  ??:0
                10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
                11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
                12 0x0000000000073fc4 ompi_mpi_init()  ??:0
                13 0x0000000000092ea0 PMPI_Init()  ??:0
                14 0x00000000004009b6 main()  ??:0
                15 0x000000000001ed5d __libc_start_main()  ??:0
                16 0x00000000004008c9 _start()  ??:0
                ===================
                
                --------------------------------------------------------------------------
                mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on
                signal 11 (Segmentation fault).
                --------------------------------------------------------------------------

                Thanks,
                David

                On 08/12/2015 10:42 AM, Deva wrote:
                Hi David,

                This issue is from the hcoll library. It could be
                because of a symbol conflict with the ml module. This
                was fixed recently in HCOLL. Can you try with "-mca
                coll ^ml" and see if this workaround works in your
                setup?
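                (If the workaround helps, the same exclusion can also be set through
                Open MPI's standard MCA environment-variable convention instead of on
                every command line — a sketch:)

                export OMPI_MCA_coll=^ml
                mpirun -n 2 ./a.out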

                -Devendar

                On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:

                    Hello Gilles,

                    Thank you very much for the patch! It is much
                    more complete than mine. Using that patch and
                    re-running autogen.pl, I am able to build 1.8.8
                    with './configure --with-hcoll' without errors.
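                    (Roughly, the sequence being described — a sketch; the patch file
                    name here is hypothetical, standing in for the patch attached to
                    Gilles' mail:)

                    cd openmpi-1.8.8
                    patch -p1 < hcoll_config.patch   # hypothetical name for the attached patch
                    ./autogen.pl
                    ./configure --with-hcoll
                    make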

                    I do have issues when it comes to running
                    1.8.8 with hcoll built in, however. In my
                    quick sanity test of running a basic parallel
                    hello world C program, I get the following:

                    [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
                    App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
                    [1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
                    [1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
                    [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
                    [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
                    ==== backtrace ====
                     2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
                     3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
                     4 0x00000000000326a0 killpg()  ??:0
                     5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
                     6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
                     7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
                     8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
                     9 0x000000000006ace9 hcoll_create_context()  ??:0
                    10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
                    11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
                    12 0x0000000000074ee4 ompi_mpi_init()  ??:0
                    13 0x0000000000093dc0 PMPI_Init()  ??:0
                    14 0x00000000004009b6 main()  ??:0
                    15 0x000000000001ed5d __libc_start_main()  ??:0
                    16 0x00000000004008c9 _start()  ??:0
                    ===================
                    ==== backtrace ====
                     2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
                     3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
                     4 0x00000000000326a0 killpg()  ??:0
                     5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
                     6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
                     7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
                     8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
                     9 0x000000000006ace9 hcoll_create_context()  ??:0
                    10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
                    11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
                    12 0x0000000000074ee4 ompi_mpi_init()  ??:0
                    13 0x0000000000093dc0 PMPI_Init()  ??:0
                    14 0x00000000004009b6 main()  ??:0
                    15 0x000000000001ed5d __libc_start_main()  ??:0
                    16 0x00000000004008c9 _start()  ??:0
                    ===================
                    
                    --------------------------------------------------------------------------
                    mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on
                    signal 11 (Segmentation fault).
                    --------------------------------------------------------------------------

                    I do not get this message with only 1 process.

                    I am using hcoll 3.2.748. Could this be an
                    issue with hcoll itself or something with my
                    ompi build?
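                    (Two quick ways to confirm which hcoll the build actually sees — a
                    sketch; the rpm package name is an assumption based on the system
                    rpm mentioned earlier in this thread:)

                    rpm -q hcoll             # version of the installed hcoll package, if it came from an rpm
                    ompi_info | grep hcoll   # confirms the coll/hcoll component made it into this Open MPI build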

                    Thanks,
                    David

                    On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
                    Thanks David,

                    I made a PR for the v1.8 branch at
                    https://github.com/open-mpi/ompi-release/pull/492

                    The patch is attached (it required some back-porting).

                    Cheers,

                    Gilles

                    On 8/12/2015 4:01 AM, David Shrader wrote:
                    I have cloned Gilles' topic/hcoll_config
                    branch and, after running autogen.pl, have found that
                    './configure --with-hcoll' does indeed work
                    now. I used Gilles' branch as I wasn't sure
                    how best to get the pull request changes into
                    my own clone of master. It looks like the
                    proper checks are happening, too:

                    --- MCA component coll:hcoll (m4 configuration macro)
                    checking for MCA component coll:hcoll compile mode... dso
                    checking --with-hcoll value... simple ok (unspecified)
                    checking hcoll/api/hcoll_api.h usability... yes
                    checking hcoll/api/hcoll_api.h presence... yes
                    checking for hcoll/api/hcoll_api.h... yes
                    looking for library without search path
                    checking for library containing hcoll_get_version... -lhcoll
                    checking if MCA component coll:hcoll can compile... yes

                    I haven't checked whether or not Open MPI
                    builds successfully as I don't have much
                    experience running off of the latest source.
                    For now, I think I will try to generate a
                    patch to the 1.8.8 configure script and see
                    if that works as expected.

                    Thanks,
                    David

                    On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
                    On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
                    Please fix the hcoll test (and code) to be correct.

                    Any configure test that adds /usr/lib and/or /usr/include to any compile flags is broken.
                    +1

                    Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.


                    --
                    David Shrader
                    HPC-3 High Performance Computer Systems
                    Los Alamos National Lab
                    Email: dshrader <at> lanl.gov






                    --
                    David Shrader
                    HPC-3 High Performance Computer Systems
                    Los Alamos National Lab
                    Email: dshrader <at> lanl.gov






--

                -Devendar



                --
                David Shrader
                HPC-3 High Performance Computer Systems
                Los Alamos National Lab
                Email: dshrader <at> lanl.gov






--

            -Devendar



            --
            David Shrader
            HPC-3 High Performance Computer Systems
            Los Alamos National Lab
            Email: dshrader <at> lanl.gov






--

        -Devendar

        --
        David Shrader
        HPC-3 High Performance Computer Systems
        Los Alamos National Lab
        Email: dshrader <at> lanl.gov




--

    -Devendar

    --
    David Shrader
    HPC-3 High Performance Computer Systems
    Los Alamos National Lab
    Email: dshrader <at> lanl.gov




--


-Devendar

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
