Hi Gilles,

so if I use the option --mca pml ob1, I will still be using InfiniBand and it will be as fast as normal, right?


Thanks


On 22/08/16 14:22, Gilles Gouaillardet wrote:
Juan,

to keep things simple, --mca pml ob1 ensures you are not using mxm
(yet another way to use InfiniBand).

IPoIB is unlikely to be working on your system right now, so for inter-node
communications you will be using tcp over the interconnect you have (GbE, or
10 GbE if you are lucky). In terms of performance, GbE (Gigabit Ethernet) is
an order of magnitude (or two) slower than InfiniBand, in both bandwidth and
latency.

if you still face issues with the version mismatch, you can run
mpirun --tag-output ldd a.out
in your script, and double check that all nodes are using the correct libs
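To illustrate how the ldd check above can be interpreted, here is a minimal sketch: it extracts the resolved libmpi path from each rank's tagged output and flags a mismatch. The sample lines and paths are hypothetical stand-ins; with a real run you would pipe the output of `mpirun --tag-output ldd a.out` instead.

```shell
# Stand-in for real "mpirun --tag-output ldd a.out" output (hypothetical paths).
sample='[1,0]<stdout>: libmpi.so.1 => /opt/openmpi-1.8.1/lib/libmpi.so.1 (0x7f...)
[1,1]<stdout>: libmpi.so.1 => /opt/openmpi-1.8.1/lib/libmpi.so.1 (0x7f...)'

# Pull out the resolved libmpi path per rank and count distinct values.
distinct=$(printf '%s\n' "$sample" | awk '/libmpi\.so/ {print $4}' | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
  echo "libmpi consistent across ranks"
else
  echo "MISMATCH: ranks resolve different libmpi"
fi
```

If any node resolves libmpi from a different prefix, that node's environment is the one to fix.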

Cheers,

Gilles

On Monday, August 22, 2016, Juan A. Cordero Varelaq <bioinformatica-i...@us.es> wrote:

    Hi Gilles,

    adding *,usnic* made it work :) so --mca pml ob1 would not be
    needed then.

    Does MPI become very slow if InfiniBand is disabled (and what does
    --mca pml ob1 do)?

    Regarding the version mismatch, everything seems to be right. When
    only one version is loaded, I see the PATH and the LD_LIBRARY_PATH
    for only one version, and with strings, everything seems to
    reference the right version.

    Thanks a lot for the quick answers!
    On 22/08/16 13:09, Gilles Gouaillardet wrote:
    Juan,

    can you try to
    mpirun --mca btl ^openib,usnic --mca pml ob1 ...

    note this simply disables native InfiniBand. from a performance
    point of view, you should have your sysadmin fix the InfiniBand
    fabric.
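As a side note, component names such as openib and usnic can be sanity-checked against what a given Open MPI build actually ships, e.g. with `ompi_info`. The sketch below filters a stand-in for `ompi_info --parsable` output (the sample lines are hypothetical) down to the BTL component names:

```shell
# Stand-in for real "ompi_info --parsable" output (hypothetical component list).
sample='mca:btl:self
mca:btl:sm
mca:btl:tcp
mca:btl:openib'

# Keep only BTL entries and print the component name (third colon-field).
printf '%s\n' "$sample" | grep '^mca:btl:' | cut -d: -f3
```

Any name you pass to "--mca btl ^..." should appear in that list.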

    about the version mismatch, please double check your environment
    (e.g. $PATH and $LD_LIBRARY_PATH); it is likely v2.0 is in your
    environment when you are using v1.8, or the other way around.
    also, make sure orted is launched with the right environment.
    if you are using ssh, then
    ssh node env
    should not contain any reference to the version you are not using.
    (that typically occurs when the environment is set in the
    .bashrc, directly or via modules)
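A quick way to automate that check is to grep the remote environment dump for the version you are not using. In this sketch the environment dump and paths are hypothetical stand-ins; in practice you would capture it with `env_dump=$(ssh node env)`:

```shell
# Hypothetical environment dump; in practice: env_dump=$(ssh node env)
env_dump='PATH=/opt/openmpi-2.0.0/bin:/usr/bin
LD_LIBRARY_PATH=/opt/openmpi-1.8.1/lib'

unwanted='openmpi-1.8.1'   # the version you are NOT using
if printf '%s\n' "$env_dump" | grep -q "$unwanted"; then
  echo "stale reference to $unwanted found"
else
  echo "environment looks clean"
fi
```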

    last but not least, you can run
    strings /.../bin/orted
    strings /.../lib/libmpi.so
    and check that they do not reference the wrong version
    (that can happen if a library was built and then moved)
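The strings check can likewise be reduced to a grep for the other version's tag. Here the embedded strings are a hypothetical stand-in; with a real install you would run something like `strings /opt/openmpi-2.0.0/bin/orted | grep 1.8.1` (paths are assumptions, adjust to your prefixes):

```shell
# Stand-in for the output of "strings" on the binary under suspicion.
stand_in='Open MPI v2.0.0
/opt/openmpi-2.0.0/lib'

if printf '%s\n' "$stand_in" | grep -q '1\.8\.1'; then
  echo "stale 1.8.1 string found: binary may have been built elsewhere and moved"
else
  echo "no stale version strings"
fi
```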


    Cheers,

    Gilles

    On Monday, August 22, 2016, Juan A. Cordero Varelaq
    <bioinformatica-i...@us.es> wrote:

        Dear Ralph,

        The existence of the two versions does not seem to be the
        source of the problems, since they are in different locations. I
        uninstalled the most recent version and tried again with no
        luck, getting the same warnings/errors. However, after a deeper
        search I found a couple of hints and executed this:

        mpirun *-mca btl ^openib -mca btl_sm_use_knem 0* -np 5 myscript

        and got only a fraction of the previous errors (before, I had
        run the same command without the arguments in bold), related to
        OpenFabrics:

        Open MPI failed to open an OpenFabrics device.  This is an
        unusual
        error; the system reported the OpenFabrics device as being
        present,
        but then later failed to access it successfully.  This usually
        indicates either a misconfiguration or a failed OpenFabrics
        hardware
        device.

        All OpenFabrics support has been disabled in this MPI
        process; your
        job may or may not continue.

          Hostname:    MYMACHINE
          Device name: mlx4_0
          Errror (22): Invalid argument
        
--------------------------------------------------------------------------
        
--------------------------------------------------------------------------
        [[52062,1],0]: A high-performance Open MPI point-to-point
        messaging module
        was unable to find any relevant network interfaces:

        Module: usNIC
          Host: MYMACHINE

        Another transport will be used instead, although this may
        result in
        lower performance.
        
--------------------------------------------------------------------------

        Do you have any idea why this could happen?


        Thanks a lot

        On 19/08/16 17:11, r...@open-mpi.org wrote:
        The rdma error sounds like something isn’t right with your
        machine’s Infiniband installation.

        The cross-version problem sounds like you installed both
        OMPI versions into the same location - did you do that?? If
        so, then that might be the root cause of both problems. You
        need to install them in totally different locations. Then
        you need to _prefix_ your PATH and LD_LIBRARY_PATH with the
        location of the version you want to use.

        HTH
        Ralph

        On Aug 19, 2016, at 12:53 AM, Juan A. Cordero Varelaq
        <bioinformatica-i...@us.es> wrote:

        Dear users,

        I am totally stuck using Open MPI. I have two versions on my
        machine, 1.8.1 and 2.0.0, and neither of them works. When I use
        the mpirun *1.8.1 version*, I get the following error:

        librdmacm: Fatal: unable to open RDMA device
        librdmacm: Fatal: unable to open RDMA device
        librdmacm: Fatal: unable to open RDMA device
        librdmacm: Fatal: unable to open RDMA device
        librdmacm: Fatal: unable to open RDMA device
        
--------------------------------------------------------------------------
        Open MPI failed to open the /dev/knem device due to a local
        error.
        Please check with your system administrator to get the
        problem fixed,
        or set the btl_sm_use_knem MCA parameter to 0 to run
        without /dev/knem
        support.

          Local host: MYMACHINE
          Errno:      2 (No such file or directory)
        
--------------------------------------------------------------------------
        
--------------------------------------------------------------------------
        Open MPI failed to open an OpenFabrics device.  This is an
        unusual
        error; the system reported the OpenFabrics device as being
        present,
        but then later failed to access it successfully.  This usually
        indicates either a misconfiguration or a failed OpenFabrics
        hardware
        device.

        All OpenFabrics support has been disabled in this MPI
        process; your
        job may or may not continue.

          Hostname:    MYMACHINE
          Device name: mlx4_0
          Errror (22): Invalid argument
        
--------------------------------------------------------------------------
        
--------------------------------------------------------------------------
        [[60527,1],4]: A high-performance Open MPI point-to-point
        messaging module
        was unable to find any relevant network interfaces:

        Module: usNIC
          Host: MYMACHINE

        When I use the *2.0.0 version*, I get something strange: it
        seems openmpi-2.0.0 is looking for openmpi-1.8.1 libraries?

        A requested component was not found, or was unable to be
        opened.  This
        means that this component is either not installed or is
        unable to be
        used on your system (e.g., sometimes this means that shared
        libraries
        that the component requires are unable to be
        found/loaded).  Note that
        Open MPI stopped checking at the first component that it
        did not find.

        Host:      MYMACHINE
        Framework: ess
        Component: pmi
        
--------------------------------------------------------------------------
        [MYMACHINE:126820] *** Process received signal ***
        [MYMACHINE:126820] Signal: Segmentation fault (11)
        [MYMACHINE:126820] Signal code: Address not mapped (1)
        [MYMACHINE:126820] Failing at address: 0x1c0
        [MYMACHINE:126820] [ 0]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f39b2ec4cb0]
        [MYMACHINE:126820] [ 1]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f39b23e7430]
        [MYMACHINE:126820] [ 2]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f39b2676a57]
        [MYMACHINE:126820] [ 3]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f39b2676fb7]
        [MYMACHINE:126820] [ 4]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f39b267718f]
        [MYMACHINE:126820] [ 5]
        /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f39b23c5f2a]
        [MYMACHINE:126820] [ 6]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f39b23c70c3]
        [MYMACHINE:126820] [ 7]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f39b23c8278]
        [MYMACHINE:126820] [ 8]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f39b23d1e6c]
        [MYMACHINE:126820] [ 9]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f39b2666e21]
        [MYMACHINE:126820] [10]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f39b3115c92]
        [MYMACHINE:126820] [11]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f39b31387bb]
        [MYMACHINE:126820] [12] mb[0x402024]
        [MYMACHINE:126820] [13]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f39b2b187ed]
        [MYMACHINE:126820] [14] mb[0x402111]
        [MYMACHINE:126820] *** End of error message ***
        
--------------------------------------------------------------------------
        A requested component was not found, or was unable to be
        opened. This
        means that this component is either not installed or is
        unable to be
        used on your system (e.g., sometimes this means that shared
        libraries
        that the component requires are unable to be
        found/loaded).  Note that
        Open MPI stopped checking at the first component that it
        did not find.

        Host:      MYMACHINE
        Framework: ess
        Component: pmi
        
--------------------------------------------------------------------------
        [MYMACHINE:126821] *** Process received signal ***
        [MYMACHINE:126821] Signal: Segmentation fault (11)
        [MYMACHINE:126821] Signal code: Address not mapped (1)
        [MYMACHINE:126821] Failing at address: 0x1c0
        [MYMACHINE:126821] [ 0]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fed834bbcb0]
        [MYMACHINE:126821] [ 1]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7fed829de430]
        [MYMACHINE:126821] [ 2]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7fed82c6da57]
        [MYMACHINE:126821] [ 3]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7fed82c6dfb7]
        [MYMACHINE:126821] [ 4]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7fed82c6e18f]
        [MYMACHINE:126821] [ 5]
        /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7fed829bcf2a]
        [MYMACHINE:126821] [ 6]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7fed829be0c3]
        [MYMACHINE:126821] [ 7]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7fed829bf278]
        [MYMACHINE:126821] [ 8]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7fed829c8e6c]
        [MYMACHINE:126821] [ 9]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7fed82c5de21]
        [MYMACHINE:126821] [10]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7fed8370cc92]
        [MYMACHINE:126821] [11]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7fed8372f7bb]
        [MYMACHINE:126821] [12] mb[0x402024]
        [MYMACHINE:126821] [13]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fed8310f7ed]
        [MYMACHINE:126821] [14] mb[0x402111]
        [MYMACHINE:126821] *** End of error message ***
        
--------------------------------------------------------------------------
        A requested component was not found, or was unable to be
        opened. This
        means that this component is either not installed or is
        unable to be
        used on your system (e.g., sometimes this means that shared
        libraries
        that the component requires are unable to be
        found/loaded).  Note that
        Open MPI stopped checking at the first component that it
        did not find.

        Host:      MYMACHINE
        Framework: ess
        Component: pmi
        
--------------------------------------------------------------------------
        [MYMACHINE:126822] *** Process received signal ***
        [MYMACHINE:126822] Signal: Segmentation fault (11)
        [MYMACHINE:126822] Signal code: Address not mapped (1)
        [MYMACHINE:126822] Failing at address: 0x1c0
        [MYMACHINE:126822] [ 0]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f0174bc0cb0]
        [MYMACHINE:126822] [ 1]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f01740e3430]
        [MYMACHINE:126822] [ 2]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f0174372a57]
        [MYMACHINE:126822] [ 3]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f0174372fb7]
        [MYMACHINE:126822] [ 4]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f017437318f]
        [MYMACHINE:126822] [ 5]
        /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f01740c1f2a]
        [MYMACHINE:126822] [ 6]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f01740c30c3]
        [MYMACHINE:126822] [ 7]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f01740c4278]
        [MYMACHINE:126822] [ 8]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f01740cde6c]
        [MYMACHINE:126822] [ 9]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f0174362e21]
        [MYMACHINE:126822] [10]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f0174e11c92]
        [MYMACHINE:126822] [11]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f0174e347bb]
        [MYMACHINE:126822] [12] mb[0x402024]
        [MYMACHINE:126822] [13]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f01748147ed]
        [MYMACHINE:126822] [14] mb[0x402111]
        [MYMACHINE:126822] *** End of error message ***
        
--------------------------------------------------------------------------
        A requested component was not found, or was unable to be
        opened. This
        means that this component is either not installed or is
        unable to be
        used on your system (e.g., sometimes this means that shared
        libraries
        that the component requires are unable to be
        found/loaded).  Note that
        Open MPI stopped checking at the first component that it
        did not find.

        Host:      MYMACHINE
        Framework: ess
        Component: pmi
        
--------------------------------------------------------------------------
        [MYMACHINE:126823] *** Process received signal ***
        [MYMACHINE:126823] Signal: Segmentation fault (11)
        [MYMACHINE:126823] Signal code: Address not mapped (1)
        [MYMACHINE:126823] Failing at address: 0x1c0
        
--------------------------------------------------------------------------
        A requested component was not found, or was unable to be
        opened. This
        means that this component is either not installed or is
        unable to be
        used on your system (e.g., sometimes this means that shared
        libraries
        that the component requires are unable to be
        found/loaded).  Note that
        Open MPI stopped checking at the first component that it
        did not find.

        Host:      MYMACHINE
        Framework: ess
        Component: pmi
        
--------------------------------------------------------------------------
        [MYMACHINE:126823] [ 0]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fcd9cb58cb0]
        [MYMACHINE:126823] [ 1] [MYMACHINE:126824] *** Process
        received signal ***
        [MYMACHINE:126824] Signal: Segmentation fault (11)
        [MYMACHINE:126824] Signal code: Address not mapped (1)
        [MYMACHINE:126824] Failing at address: 0x1c0
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7fcd9c07b430]
        [MYMACHINE:126823] [ 2]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7fcd9c30aa57]
        [MYMACHINE:126823] [ 3]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7fcd9c30afb7]
        [MYMACHINE:126823] [ 4]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7fcd9c30b18f]
        [MYMACHINE:126823] [ 5]
        /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7fcd9c059f2a]
        [MYMACHINE:126823] [MYMACHINE:126824] [ 0] [ 6]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f2f0c611cb0]
        [MYMACHINE:126824] [ 1]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7fcd9c05b0c3]
        [MYMACHINE:126823] [ 7]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(opal_libevent2021_event_add+0x10)[0x7f2f0bb34430]
        [MYMACHINE:126824] [ 2]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(+0x25a57)[0x7f2f0bdc3a57]
        [MYMACHINE:126824] [ 3]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help_norender+0x197)[0x7f2f0bdc3fb7]
        [MYMACHINE:126824] [ 4]
        
/opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_show_help+0x10f)[0x7f2f0bdc418f]
        [MYMACHINE:126824] [ 5]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7fcd9c05c278]
        [MYMACHINE:126823] [ 8]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7fcd9c065e6c]
        [MYMACHINE:126823] [ 9]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7fcd9c2fae21]
        [MYMACHINE:126823] [10]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7fcd9cda9c92]
        [MYMACHINE:126823] [11]
        /opt/openmpi-1.8.1/lib/libopen-pal.so.6(+0x41f2a)[0x7f2f0bb12f2a]
        [MYMACHINE:126824] [ 6]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_components_filter+0x273)[0x7f2f0bb140c3]
        [MYMACHINE:126824] [ 7]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7fcd9cdcc7bb]
        [MYMACHINE:126823] [12] mb[0x402024]
        [MYMACHINE:126823] [13]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_components_open+0x58)[0x7f2f0bb15278]
        [MYMACHINE:126824] [ 8]
        
/opt/openmpi-1.8.1/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7f2f0bb1ee6c]
        [MYMACHINE:126824] [ 9]
        /opt/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x111)[0x7f2f0bdb3e21]
        [MYMACHINE:126824] [10]
        /opt/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x1c2)[0x7f2f0c862c92]
        [MYMACHINE:126824] [11]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fcd9c7ac7ed]
        [MYMACHINE:126823] [14] mb[0x402111]
        [MYMACHINE:126823] *** End of error message ***
        /opt/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0x1ab)[0x7f2f0c8857bb]
        [MYMACHINE:126824] [12] mb[0x402024]
        [MYMACHINE:126824] [13]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f2f0c2657ed]
        [MYMACHINE:126824] [14] mb[0x402111]
        [MYMACHINE:126824] *** End of error message ***
        
--------------------------------------------------------------------------
        mpirun noticed that process rank 2 with PID 0 on node
        MYMACHINE exited on signal 11 (Segmentation fault).
        
--------------------------------------------------------------------------

        I am running my script with *mpirun* on a *single node of an
        SGE cluster*.

        I would be very grateful if somebody could give me some
        hints to solve this issue.

        Thanks a lot in advance
        _______________________________________________
        users mailing list
        users@lists.open-mpi.org
        https://rfd.newmexicoconsortium.org/mailman/listinfo/users




