Hi Tom,

As per your suggestion, I tried

*./configure --with-psm --prefix=/opt/openmpi-1.7.2 --enable-event-thread-support --enable-opal-multi-threads --enable-orte-progress-threads --enable-mpi-thread-multiple*

but I am getting this error:

--- MCA component mtl:psm (m4 configuration macro)
checking for MCA component mtl:psm compile mode... dso
checking --with-psm value... simple ok (unspecified)
checking --with-psm-libdir value... simple ok (unspecified)
checking psm.h usability... no
checking psm.h presence... yes
configure: WARNING: psm.h: present but cannot be compiled
configure: WARNING: psm.h:     check for missing prerequisite headers?
configure: WARNING: psm.h: see the Autoconf documentation
configure: WARNING: psm.h:     section "Present But Cannot Be Compiled"
configure: WARNING: psm.h: proceeding with the compiler's result
configure: WARNING: ## ------------------------------------------------------ ##
configure: WARNING: ## Report this to http://www.open-mpi.org/community/help/ ##
configure: WARNING: ## ------------------------------------------------------ ##
checking for psm.h... no
configure: error: PSM support requested but not found.  Aborting
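
I suppose the next step is to look at config.log, since "present but cannot be compiled" means configure's test compile of psm.h failed. Something like this is what I have in mind (just a sketch, assuming gcc is the compiler configure picked; config.log is written by configure in the build directory):

# show the failing test compile for psm.h recorded in config.log
grep -n -B 2 -A 20 "psm.h" config.log | less

# or reproduce the failure directly with a one-line test program
echo '#include <psm.h>' > psm_test.c
echo 'int main(void) { return 0; }' >> psm_test.c
gcc -c psm_test.c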


Any feedback will be helpful. Thanks for your time!

Mr. Beans



On 8/4/13 10:31 AM, Elken, Tom wrote:

On 8/3/13 7:09 PM, RoboBeans wrote:

    On first 7 nodes:

    *[mpidemo@SERVER-3 ~]$ ofed_info | head -n 1*
    OFED-1.5.3.2:

    On last 4 nodes:

    *[mpidemo@sv-2 ~]$ ofed_info | head -n 1*
    -bash: ofed_info: command not found

    [Tom] This is a pretty good clue that OFED is not installed on
    the last 4 nodes.  You should fix that by installing OFED 1.5.3.2
    on the last 4 nodes, OR better (but more work) install a newer
    OFED, such as 1.5.4.1 or 3.5, on ALL the nodes.  (You need to
    look at the OFED release notes to see if your OS is supported by
    these OFEDs.)

    BTW, since you are using QLogic HCAs, they typically perform best
    when using the PSM API to the HCA.  PSM is part of OFED.  To use
    it by default with Open MPI, you can build Open MPI as follows:

    ./configure --with-psm --prefix=<install directory>

    make

    make install

    With an Open MPI that is already built, you can try to use PSM
    as follows:
    mpirun ... --mca mtl psm --mca btl ^openib ...

    -Tom

    *[mpidemo@sv-2 ~]$ which ofed_info*
    /usr/bin/which: no ofed_info in
    
(/usr/OPENMPI/openmpi-1.7.2/bin:/usr/OPENMPI/openmpi-1.7.2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin/:/usr/lib/:/usr/lib:/usr:/usr/:/bin/:/usr/lib/:/usr/lib:/usr:/usr/)


    Are there specific locations where I should look for ofed_info?
    And how can I verify whether OFED is installed on a node at all?
    (Is something like the check sketched below the right idea?)
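
    # just a sketch, assuming an rpm-based CentOS system; package
    # names can vary by OFED version
    rpm -qa | grep -i -E 'ofed|infinipath|psm'
    # the OFED installer normally puts ofed_info in /usr/bin
    ls -l /usr/bin/ofed_info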

    Thanks again!!!


    On 8/3/13 5:52 PM, Ralph Castain wrote:

        Are the ofed versions the same across all the machines? I
        would suspect that might be the problem.

        On Aug 3, 2013, at 4:06 PM, RoboBeans <robobe...@gmail.com> wrote:



        Hi Ralph, I tried using 1.5.4, 1.6.5, and 1.7.2 (compiled
        from source) with no configure arguments, but I am facing the
        same issue. When I run a job using 1.5.4 (installed using
        yum), I get warnings, but they don't affect my output.

        Example of warning that I get:

        sv-2.7960ipath_userinit: Mismatched user minor version (12)
        and driver minor version (11) while context sharing. Ensure
        that driver and library are from the same release.

        Each system has a QLogic card ("QLE7342-CK dual port IB
        card") and runs the same OS, but the kernel revisions differ
        across nodes (e.g. 2.6.32-358.2.1.el6.x86_64 vs.
        2.6.32-358.el6.x86_64).
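
        (I assume comparing the installed PSM library and driver
        versions directly would confirm the mismatch; a sketch, with
        the CentOS/OFED package and module names assumed:)

        # user-space PSM library version (package name assumed)
        rpm -q infinipath-psm
        # kernel driver version for the QLogic HCA (module name assumed)
        modinfo ib_qib | grep -i version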

        Thank you for your time.

        On 8/3/13 2:05 PM, Ralph Castain wrote:

            Hmmm...strange indeed. I would remove those four configure
            options and give it a try. That will eliminate all the
            obvious things, I would think, though they aren't
            generally involved in the issue shown here. Still, worth
            taking out potential trouble sources.
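
            (Concretely, that would be something like the following,
            reusing the prefix from your earlier build and dropping
            the four thread options:)

            ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2
            make all install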

            What is the connectivity between SERVER-2 and node 100?
            Should I assume that the first seven nodes are connected
            via one type of interconnect, and the other four are
            connected to those seven by another type?

            On Aug 3, 2013, at 1:30 PM, RoboBeans <robobe...@gmail.com> wrote:



            Thanks for looking into it, Ralph. I modified the hosts
            file but I am still getting the same error. Any other
            pointers you can think of? The difference between this
            1.7.2 installation and 1.5.4 is that I installed 1.5.4
            using yum, while for 1.7.2 I built from source and
            configured with *--enable-event-thread-support
            --enable-opal-multi-threads --enable-orte-progress-threads
            --enable-mpi-thread-multiple*. Am I missing something here?

            //******************************************************************

            *$ cat mpi_hostfile*

            x.x.x.22 slots=15 max-slots=15
            x.x.x.24 slots=2 max-slots=2
            x.x.x.26 slots=14 max-slots=14
            x.x.x.28 slots=16 max-slots=16
            x.x.x.29 slots=14 max-slots=14
            x.x.x.30 slots=16 max-slots=16
            x.x.x.41 slots=46 max-slots=46
            x.x.x.101 slots=46 max-slots=46
            x.x.x.100 slots=46 max-slots=46
            x.x.x.102 slots=22 max-slots=22
            x.x.x.103 slots=22 max-slots=22

            //******************************************************************
            *$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test*
            [SERVER-2:08907] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0/0
            [SERVER-2:08907] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0
            [SERVER-2:08907] top: openmpi-sessions-mpidemo@SERVER-2_0
            [SERVER-2:08907] tmp: /tmp
            CentOS release 6.4 (Final)
            Kernel \r on an \m
            CentOS release 6.4 (Final)
            Kernel \r on an \m
            CentOS release 6.4 (Final)
            Kernel \r on an \m
            [SERVER-3:32517] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0/1
            [SERVER-3:32517] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0
            [SERVER-3:32517] top: openmpi-sessions-mpidemo@SERVER-3_0
            [SERVER-3:32517] tmp: /tmp
            CentOS release 6.4 (Final)
            Kernel \r on an \m
            CentOS release 6.4 (Final)
            Kernel \r on an \m
            [SERVER-6:11595] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0/4
            [SERVER-6:11595] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0
            [SERVER-6:11595] top: openmpi-sessions-mpidemo@SERVER-6_0
            [SERVER-6:11595] tmp: /tmp
            [SERVER-4:27445] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0/2
            [SERVER-4:27445] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0
            [SERVER-4:27445] top: openmpi-sessions-mpidemo@SERVER-4_0
            [SERVER-4:27445] tmp: /tmp
            [SERVER-7:02607] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0/5
            [SERVER-7:02607] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0
            [SERVER-7:02607] top: openmpi-sessions-mpidemo@SERVER-7_0
            [SERVER-7:02607] tmp: /tmp
            [sv-1:46100] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0/8
            [sv-1:46100] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0
            [sv-1:46100] top: openmpi-sessions-mpidemo@sv-1_0
            [sv-1:46100] tmp: /tmp
            CentOS release 6.4 (Final)
            Kernel \r on an \m
            [SERVER-5:16404] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0/3
            [SERVER-5:16404] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0
            [SERVER-5:16404] top: openmpi-sessions-mpidemo@SERVER-5_0
            [SERVER-5:16404] tmp: /tmp
            [sv-3:08575] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0/9
            [sv-3:08575] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0
            [sv-3:08575] top: openmpi-sessions-mpidemo@sv-3_0
            [sv-3:08575] tmp: /tmp
            [SERVER-14:10755] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0/6
            [SERVER-14:10755] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0
            [SERVER-14:10755] top: openmpi-sessions-mpidemo@SERVER-14_0
            [SERVER-14:10755] tmp: /tmp
            [sv-4:12040] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0/10
            [sv-4:12040] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0
            [sv-4:12040] top: openmpi-sessions-mpidemo@sv-4_0
            [sv-4:12040] tmp: /tmp
            [sv-2:07725] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0/7
            [sv-2:07725] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0
            [sv-2:07725] top: openmpi-sessions-mpidemo@sv-2_0
            [sv-2:07725] tmp: /tmp

             Mapper requested: NULL  Last mapper: round_robin  Mapping
            policy: BYNODE Ranking policy: NODE  Binding policy:
            NONE[NODE]  Cpu set: NULL  PPR: NULL
                 Num new daemons: 0    New daemon starting vpid INVALID
                 Num nodes: 10

             Data for node: SERVER-2         Launch id: -1    State: 2
                 Daemon: [[62216,0],0]    Daemon launched: True
                 Num slots: 15    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 15    Max slots: 15
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],0]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 0
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-15  Binding: NULL[0]

             Data for node: x.x.x.24         Launch id: -1    State: 0
                 Daemon: [[62216,0],1]    Daemon launched: False
                 Num slots: 2    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 2    Max slots: 2
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],1]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 1
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.26         Launch id: -1    State: 0
                 Daemon: [[62216,0],2]    Daemon launched: False
                 Num slots: 14    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 14    Max slots: 14
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],2]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 2
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.28         Launch id: -1    State: 0
                 Daemon: [[62216,0],3]    Daemon launched: False
                 Num slots: 16    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 16    Max slots: 16
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],3]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 3
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.29         Launch id: -1    State: 0
                 Daemon: [[62216,0],4]    Daemon launched: False
                 Num slots: 14    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 14    Max slots: 14
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],4]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 4
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.30         Launch id: -1    State: 0
                 Daemon: [[62216,0],5]    Daemon launched: False
                 Num slots: 16    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 16    Max slots: 16
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],5]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 5
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.41         Launch id: -1    State: 0
                 Daemon: [[62216,0],6]    Daemon launched: False
                 Num slots: 46    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 46    Max slots: 46
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],6]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 6
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.101         Launch id: -1    State: 0
                 Daemon: [[62216,0],7]    Daemon launched: False
                 Num slots: 46    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 46    Max slots: 46
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],7]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 7
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.100         Launch id: -1    State: 0
                 Daemon: [[62216,0],8]    Daemon launched: False
                 Num slots: 46    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 46    Max slots: 46
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],8]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 8
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]

             Data for node: x.x.x.102         Launch id: -1    State: 0
                 Daemon: [[62216,0],9]    Daemon launched: False
                 Num slots: 22    Slots in use: 1  Oversubscribed: FALSE
                 Num slots allocated: 22    Max slots: 22
                 Username on node: NULL
                 Num procs: 1    Next node_rank: 1
                 Data for proc: [[62216,1],9]
                     Pid: 0    Local rank: 0    Node rank: 0    App
            rank: 9
                     State: INITIALIZED    Restarts: 0    App_context:
            0    Locale: 0-7  Binding: NULL[0]
            [sv-1:46111] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1/8
            [sv-1:46111] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1
            [sv-1:46111] top: openmpi-sessions-mpidemo@sv-1_0
            [sv-1:46111] tmp: /tmp
            [SERVER-14:10768] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1/6
            [SERVER-14:10768] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1
            [SERVER-14:10768] top: openmpi-sessions-mpidemo@SERVER-14_0
            [SERVER-14:10768] tmp: /tmp
            [SERVER-2:08912] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1/0
            [SERVER-2:08912] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1
            [SERVER-2:08912] top: openmpi-sessions-mpidemo@SERVER-2_0
            [SERVER-2:08912] tmp: /tmp
            [SERVER-4:27460] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1/2
            [SERVER-4:27460] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1
            [SERVER-4:27460] top: openmpi-sessions-mpidemo@SERVER-4_0
            [SERVER-4:27460] tmp: /tmp
            [SERVER-6:11608] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1/4
            [SERVER-6:11608] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1
            [SERVER-6:11608] top: openmpi-sessions-mpidemo@SERVER-6_0
            [SERVER-6:11608] tmp: /tmp
            [SERVER-7:02620] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1/5
            [SERVER-7:02620] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1
            [SERVER-7:02620] top: openmpi-sessions-mpidemo@SERVER-7_0
            [SERVER-7:02620] tmp: /tmp
            [sv-3:08586] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1/9
            [sv-3:08586] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1
            [sv-3:08586] top: openmpi-sessions-mpidemo@sv-3_0
            [sv-3:08586] tmp: /tmp
            [sv-2:07736] procdir:
            /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1/7
            [sv-2:07736] jobdir:
            /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1
            [sv-2:07736] top: openmpi-sessions-mpidemo@sv-2_0
            [sv-2:07736] tmp: /tmp
            [SERVER-5:16418] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1/3
            [SERVER-5:16418] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1
            [SERVER-5:16418] top: openmpi-sessions-mpidemo@SERVER-5_0
            [SERVER-5:16418] tmp: /tmp
            [SERVER-3:32533] procdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1/1
            [SERVER-3:32533] jobdir:
            /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1
            [SERVER-3:32533] top: openmpi-sessions-mpidemo@SERVER-3_0
            [SERVER-3:32533] tmp: /tmp
              MPIR_being_debugged = 0
              MPIR_debug_state = 1
              MPIR_partial_attach_ok = 1
              MPIR_i_am_starter = 0
              MPIR_forward_output = 0
              MPIR_proctable_size = 10
              MPIR_proctable:
                (i, host, exe, pid) = (0, SERVER-2,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
                (i, host, exe, pid) = (1, x.x.x.24,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
                (i, host, exe, pid) = (2, x.x.x.26,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
                (i, host, exe, pid) = (3, x.x.x.28,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
                (i, host, exe, pid) = (4, x.x.x.29,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
                (i, host, exe, pid) = (5, x.x.x.30,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
                (i, host, exe, pid) = (6, x.x.x.41,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
                (i, host, exe, pid) = (7, x.x.x.101,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
                (i, host, exe, pid) = (8, x.x.x.100,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
                (i, host, exe, pid) = (9, x.x.x.102,
            /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
            MPIR_executable_path: NULL
            MPIR_server_arguments: NULL
            
--------------------------------------------------------------------------
            It looks like MPI_INIT failed for some reason; your
            parallel process is
            likely to abort.  There are many reasons that a parallel
            process can
            fail during MPI_INIT; some of which are due to
            configuration or environment
            problems.  This failure appears to be an internal failure;
            here's some
            additional information (which may only be relevant to an
            Open MPI
            developer):

              PML add procs failed
              --> Returned "Error" (-1) instead of "Success" (0)
            
--------------------------------------------------------------------------
            [SERVER-2:8912] *** An error occurred in MPI_Init
            [SERVER-2:8912] *** reported by process
            [140393673392129,140389596004352]
            [SERVER-2:8912] *** on a NULL communicator
            [SERVER-2:8912] *** Unknown error
            [SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in
            this communicator will now abort,
            [SERVER-2:8912] ***    and potentially your MPI job)
            
--------------------------------------------------------------------------
            An MPI process is aborting at a time when it cannot
            guarantee that all of its peer processes in the job will
            be killed properly. You should double check that
            everything has shut down cleanly.

              Reason:     Before MPI_INIT completed
              Local host: SERVER-2
              PID:        8912
            
--------------------------------------------------------------------------
            
[sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
            [btl_openib_proc.c:157] ompi_modex_recv failed for peer
            [[62216,1],0]
            [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
            mca_base_modex_recv: failed with return value=-13
            [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
            mca_base_modex_recv: failed with return value=-13
            
--------------------------------------------------------------------------
            At least one pair of MPI processes are unable to reach
            each other for
            MPI communications.  This means that no Open MPI device
            has indicated
            that it can be used to communicate between these
            processes.  This is
            an error; Open MPI requires that all MPI processes be able
            to reach
            each other.  This error can sometimes be the result of
            forgetting to
            specify the "self" BTL.

              Process 1 ([[62216,1],8]) is on host: sv-1
              Process 2 ([[62216,1],0]) is on host: SERVER-2
              BTLs attempted: openib self sm tcp

            Your MPI job is now going to abort; sorry.
            
--------------------------------------------------------------------------
            
[sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
            [btl_openib_proc.c:157] ompi_modex_recv failed for peer
            [[62216,1],0]
            [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
            mca_base_modex_recv: failed with return value=-13
            [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
            mca_base_modex_recv: failed with return value=-13
            
--------------------------------------------------------------------------
            MPI_INIT has failed because at least one MPI process is
            unreachable
            from another.  This *usually* means that an underlying
            communication
            plugin -- such as a BTL or an MTL -- has either not loaded
            or not
            allowed itself to be used.  Your MPI job will now abort.

            You may wish to try to narrow down the problem;

             * Check the output of ompi_info to see which BTL/MTL
            plugins are
               available.
             * Run your application with MPI_THREAD_SINGLE.
             * Set the MCA parameter btl_base_verbose to 100 (or
            mtl_base_verbose,
               if using MTL-based communications) to see exactly which
               communication plugins were considered and/or discarded.
            
--------------------------------------------------------------------------
            
[sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
            [btl_openib_proc.c:157] ompi_modex_recv failed for peer
            [[62216,1],0]
            [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
            mca_base_modex_recv: failed with return value=-13
            [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
            mca_base_modex_recv: failed with return value=-13
            [SERVER-2:08907] sess_dir_finalize: proc session dir not
            empty - leaving
            [sv-4:12040] sess_dir_finalize: job session dir not empty
            - leaving
            [SERVER-14:10755] sess_dir_finalize: job session dir not
            empty - leaving
            [SERVER-2:08907] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-6:11595] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-6:11595] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-4:27445] sess_dir_finalize: proc session dir not
            empty - leaving
            exiting with status 0
            [SERVER-4:27445] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-6:11595] sess_dir_finalize: job session dir not
            empty - leaving
            [SERVER-7:02607] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-7:02607] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-7:02607] sess_dir_finalize: job session dir not
            empty - leaving
            [SERVER-5:16404] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-5:16404] sess_dir_finalize: proc session dir not
            empty - leaving
            exiting with status 0
            exiting with status 0
            exiting with status 0
            [SERVER-4:27445] sess_dir_finalize: job session dir not
            empty - leaving
            exiting with status 0
            [SERVER-3:32517] sess_dir_finalize: proc session dir not
            empty - leaving
            [SERVER-3:32517] sess_dir_finalize: proc session dir not
            empty - leaving
            [sv-3:08575] sess_dir_finalize: proc session dir not empty
            - leaving
            [sv-3:08575] sess_dir_finalize: job session dir not empty
            - leaving
            exiting with status 0
            [sv-1:46100] sess_dir_finalize: proc session dir not empty
            - leaving
            [sv-1:46100] sess_dir_finalize: job session dir not empty
            - leaving
            exiting with status 0
            [sv-2:07725] sess_dir_finalize: proc session dir not empty
            - leaving
            [sv-2:07725] sess_dir_finalize: job session dir not empty
            - leaving
            exiting with status 0
            [SERVER-5:16404] sess_dir_finalize: job session dir not
            empty - leaving
            exiting with status 0
            [SERVER-3:32517] sess_dir_finalize: job session dir not
            empty - leaving
            exiting with status 0
            
--------------------------------------------------------------------------
            mpirun has exited due to process rank 6 with PID 10768 on
            node x.x.x.41 exiting improperly. There are three reasons
            this could occur:

            1. this process did not call "init" before exiting, but
            others in
            the job did. This can cause a job to hang indefinitely
            while it waits
            for all processes to call "init". By rule, if one process
            calls "init",
            then ALL processes must call "init" prior to termination.

            2. this process called "init", but exited without calling
            "finalize".
            By rule, all processes that call "init" MUST call
            "finalize" prior to
            exiting or it will be considered an "abnormal termination"

            3. this process called "MPI_Abort" or "orte_abort" and the
            mca parameter
            orte_create_session_dirs is set to false. In this case,
            the run-time cannot
            detect that the abort call was an abnormal termination.
            Hence, the only
            error message you will receive is this one.

            This may have caused other processes in the application to be
            terminated by signals sent by mpirun (as reported here).

            You can avoid this message by specifying -quiet on the
            mpirun command line.

            
--------------------------------------------------------------------------
            [SERVER-2:08907] 6 more processes have sent help message
            help-mpi-runtime / mpi_init:startup:internal-failure
            [SERVER-2:08907] Set MCA parameter
            "orte_base_help_aggregate" to 0 to see all help / error
            messages
            [SERVER-2:08907] 9 more processes have sent help message
            help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
            [SERVER-2:08907] 9 more processes have sent help message
            help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all
            killed
            [SERVER-2:08907] 2 more processes have sent help message
            help-mca-bml-r2.txt / unreachable proc
            [SERVER-2:08907] 2 more processes have sent help message
            help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
            [SERVER-2:08907] sess_dir_finalize: job session dir not
            empty - leaving
            exiting with status 1

            //******************************************************************

            On 8/3/13 4:34 AM, Ralph Castain wrote:

                It looks like SERVER-2 cannot talk to your x.x.x.100
                machine. I note that you have some entries at the end
                of the hostfile that I don't understand - a list of
                hosts that can be reached? And I see that your
                x.x.x.22 machine isn't on it. Is that SERVER-2 by chance?

                Our hostfile parsing changed between the release
                series, but I know we never consciously supported the
                syntax you show below where you list capabilities, and
                then re-list the hosts in an apparent attempt to
                filter which ones can actually be used. It is possible
                that the 1.5 series somehow used that to exclude the
                22 machine, and that the 1.7 parser now doesn't do that.

                If you only include machines you actually intend to
                use in your hostfile, does the 1.7 series work?

                On Aug 3, 2013, at 3:58 AM, RoboBeans <robobe...@gmail.com> wrote:



                Hello everyone,

                I have installed openmpi 1.5.4 on an 11-node cluster
                using "yum install openmpi openmpi-devel" and
                everything seems to be working fine. For testing, I
                am using this test program:

                
//******************************************************************

                *$ cat test.cpp*

                #include <stdio.h>
                #include <mpi.h>

                int main (int argc, char *argv[])
                {
                  int id, np;
                  char name[MPI_MAX_PROCESSOR_NAME];
                  int namelen;
                  int i;

                  MPI_Init (&argc, &argv);

                  MPI_Comm_size (MPI_COMM_WORLD, &np);
                  MPI_Comm_rank (MPI_COMM_WORLD, &id);
                  MPI_Get_processor_name (name, &namelen);

                  printf ("This is Process %2d out of %2d running on
                host %s\n", id, np, name);

                  MPI_Finalize ();

                  return (0);
                }

                
//******************************************************************

                and my hostfile looks like this:

                *$ cat mpi_hostfile*

                # The Hostfile for Open MPI

                # specify number of slots for processes to run locally.
                #localhost slots=12
                #x.x.x.16 slots=12 max-slots=12
                #x.x.x.17 slots=12 max-slots=12
                #x.x.x.18 slots=12 max-slots=12
                #x.x.1x.19 slots=12 max-slots=12
                #x.x.x.20 slots=12 max-slots=12
                #x.x.x.55 slots=46 max-slots=46
                #x.x.x.56 slots=46 max-slots=46

                x.x.x.22 slots=15 max-slots=15
                x.x.x.24 slots=2 max-slots=2
                x.x.x.26 slots=14 max-slots=14
                x.x.x.28 slots=16 max-slots=16
                x.x.x.29 slots=14 max-slots=14
                x.x.x.30 slots=16 max-slots=16
                x.x.x.41 slots=46 max-slots=46
                x.x.x.101 slots=46 max-slots=46
                x.x.x.100 slots=46 max-slots=46
                x.x.x.102 slots=22 max-slots=22
                x.x.x.103 slots=22 max-slots=22

                # The following slave nodes are available to this machine:
                x.x.x.24
                x.x.x.26
                x.x.x.28
                x.x.x.29
                x.x.x.30
                x.x.x.41
                x.x.x.101
                x.x.x.100
                x.x.x.102
                x.x.x.103

                
//******************************************************************

                this is what my .bashrc looks like on each node:

                *$ cat ~/.bashrc*

                # .bashrc

                # Source global definitions
                if [ -f /etc/bashrc ]; then
                    . /etc/bashrc
                fi

                # User specific aliases and functions
                umask 077

                export PSM_SHAREDCONTEXTS_MAX=20

                #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
                export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}

                #export
                
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
                export
                
LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

                export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/

                
//******************************************************************

                *$ mpic++ test.cpp -o test*

                *$ mpirun -d --display-map -np 10 --hostfile
                mpi_hostfile --bynode ./test*

                
//******************************************************************

                These nodes are running the 2.6.32-358.2.1.el6.x86_64
                kernel release:

                *$ uname*
                Linux
                *$ uname -r*
                2.6.32-358.2.1.el6.x86_64
                *$ cat /etc/issue*
                CentOS release 6.4 (Final)
                Kernel \r on an \m

                
//******************************************************************

                Now, if I install openmpi 1.7.2 on each node
                separately, I can use it only on either the first 7
                nodes or the last 4 nodes, but not on all of them.

                
//******************************************************************

                *$ gunzip -c openmpi-1.7.2.tar.gz | tar xf -*

                *$ cd openmpi-1.7.2*

                *$ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2
                --enable-event-thread-support
                --enable-opal-multi-threads
                --enable-orte-progress-threads
                --enable-mpi-thread-multiple*

                *$ make all install*

                
//******************************************************************

                This is the error message that I am receiving:


                *$ mpirun -d --display-map -np 10 --hostfile
                mpi_hostfile --bynode ./test*

                [SERVER-2:05284] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0/0
                [SERVER-2:05284] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0
                [SERVER-2:05284] top: openmpi-sessions-mpidemo@SERVER-2_0
                [SERVER-2:05284] tmp: /tmp
                CentOS release 6.4 (Final)
                Kernel \r on an \m
                CentOS release 6.4 (Final)
                Kernel \r on an \m
                CentOS release 6.4 (Final)
                Kernel \r on an \m
                [SERVER-3:28993] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0/1
                [SERVER-3:28993] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0
                [SERVER-3:28993] top: openmpi-sessions-mpidemo@SERVER-3_0
                [SERVER-3:28993] tmp: /tmp
                CentOS release 6.4 (Final)
                Kernel \r on an \m
                CentOS release 6.4 (Final)
                Kernel \r on an \m
                [SERVER-6:09087] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0/4
                [SERVER-6:09087] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0
                [SERVER-6:09087] top: openmpi-sessions-mpidemo@SERVER-6_0
                [SERVER-6:09087] tmp: /tmp
                [SERVER-7:32563] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0/5
                [SERVER-7:32563] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0
                [SERVER-7:32563] top: openmpi-sessions-mpidemo@SERVER-7_0
                [SERVER-7:32563] tmp: /tmp
                [SERVER-4:15711] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0/2
                [SERVER-4:15711] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0
                [SERVER-4:15711] top: openmpi-sessions-mpidemo@SERVER-4_0
                [SERVER-4:15711] tmp: /tmp
                [sv-1:45701] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0/8
                [sv-1:45701] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0
                [sv-1:45701] top: openmpi-sessions-mpidemo@sv-1_0
                [sv-1:45701] tmp: /tmp
                CentOS release 6.4 (Final)
                Kernel \r on an \m
                [sv-3:08352] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0/9
                [sv-3:08352] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0
                [sv-3:08352] top: openmpi-sessions-mpidemo@sv-3_0
                [sv-3:08352] tmp: /tmp
                [SERVER-5:12534] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0/3
                [SERVER-5:12534] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0
                [SERVER-5:12534] top: openmpi-sessions-mpidemo@SERVER-5_0
                [SERVER-5:12534] tmp: /tmp
                [SERVER-14:08399] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0/6
                [SERVER-14:08399] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0
                [SERVER-14:08399] top:
                openmpi-sessions-mpidemo@SERVER-14_0
                [SERVER-14:08399] tmp: /tmp
                [sv-4:11802] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0/10
                [sv-4:11802] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0
                [sv-4:11802] top: openmpi-sessions-mpidemo@sv-4_0
                [sv-4:11802] tmp: /tmp
                [sv-2:07503] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0/7
                [sv-2:07503] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0
                [sv-2:07503] top: openmpi-sessions-mpidemo@sv-2_0
                [sv-2:07503] tmp: /tmp

                 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYNODE  Ranking policy: NODE  Binding policy: NONE[NODE]  Cpu set: NULL  PPR: NULL
                     Num new daemons: 0    New daemon starting vpid INVALID
                     Num nodes: 10

                 Data for node: SERVER-2  Launch id: -1    State: 2
                     Daemon: [[50535,0],0]    Daemon launched: True
                     Num slots: 15    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 15    Max slots: 15
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],0]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 0
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-15  Binding: NULL[0]

                 Data for node: x.x.x.24  Launch id: -1    State: 0
                     Daemon: [[50535,0],1]    Daemon launched: False
                     Num slots: 3    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 3    Max slots: 2
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],1]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 1
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.26  Launch id: -1    State: 0
                     Daemon: [[50535,0],2]    Daemon launched: False
                     Num slots: 15    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 15    Max slots: 14
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],2]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 2
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.28  Launch id: -1    State: 0
                     Daemon: [[50535,0],3]    Daemon launched: False
                     Num slots: 17    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 17    Max slots: 16
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],3]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 3
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.29  Launch id: -1    State: 0
                     Daemon: [[50535,0],4]    Daemon launched: False
                     Num slots: 15    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 15    Max slots: 14
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],4]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 4
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.30  Launch id: -1    State: 0
                     Daemon: [[50535,0],5]    Daemon launched: False
                     Num slots: 17    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 17    Max slots: 16
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],5]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 5
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.41  Launch id: -1    State: 0
                     Daemon: [[50535,0],6]    Daemon launched: False
                     Num slots: 47    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 47    Max slots: 46
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],6]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 6
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.101  Launch id: -1    State: 0
                     Daemon: [[50535,0],7]    Daemon launched: False
                     Num slots: 47    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 47    Max slots: 46
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],7]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 7
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.100  Launch id: -1    State: 0
                     Daemon: [[50535,0],8]    Daemon launched: False
                     Num slots: 47    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 47    Max slots: 46
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],8]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 8
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]

                 Data for node: x.x.x.102  Launch id: -1    State: 0
                     Daemon: [[50535,0],9]    Daemon launched: False
                     Num slots: 23    Slots in use: 1  Oversubscribed: FALSE
                     Num slots allocated: 23    Max slots: 22
                     Username on node: NULL
                     Num procs: 1    Next node_rank: 1
                     Data for proc: [[50535,1],9]
                         Pid: 0    Local rank: 0    Node rank: 0    App rank: 9
                         State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  Binding: NULL[0]
                [sv-1:45712] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1/8
                [sv-1:45712] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1
                [sv-1:45712] top: openmpi-sessions-mpidemo@sv-1_0
                [sv-1:45712] tmp: /tmp
                [SERVER-14:08412] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1/6
                [SERVER-14:08412] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1
                [SERVER-14:08412] top:
                openmpi-sessions-mpidemo@SERVER-14_0
                [SERVER-14:08412] tmp: /tmp
                [SERVER-2:05291] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1/0
                [SERVER-2:05291] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1
                [SERVER-2:05291] top: openmpi-sessions-mpidemo@SERVER-2_0
                [SERVER-2:05291] tmp: /tmp
                [SERVER-4:15726] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1/2
                [SERVER-4:15726] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1
                [SERVER-4:15726] top: openmpi-sessions-mpidemo@SERVER-4_0
                [SERVER-4:15726] tmp: /tmp
                [SERVER-6:09100] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1/4
                [SERVER-6:09100] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1
                [SERVER-6:09100] top: openmpi-sessions-mpidemo@SERVER-6_0
                [SERVER-6:09100] tmp: /tmp
                [SERVER-7:32576] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1/5
                [SERVER-7:32576] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1
                [SERVER-7:32576] top: openmpi-sessions-mpidemo@SERVER-7_0
                [SERVER-7:32576] tmp: /tmp
                [sv-3:08363] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1/9
                [sv-3:08363] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1
                [sv-3:08363] top: openmpi-sessions-mpidemo@sv-3_0
                [sv-3:08363] tmp: /tmp
                [sv-2:07514] procdir:
                /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1/7
                [sv-2:07514] jobdir:
                /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1
                [sv-2:07514] top: openmpi-sessions-mpidemo@sv-2_0
                [sv-2:07514] tmp: /tmp
                [SERVER-5:12548] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1/3
                [SERVER-5:12548] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1
                [SERVER-5:12548] top: openmpi-sessions-mpidemo@SERVER-5_0
                [SERVER-5:12548] tmp: /tmp
                [SERVER-3:29009] procdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1/1
                [SERVER-3:29009] jobdir:
                /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1
                [SERVER-3:29009] top: openmpi-sessions-mpidemo@SERVER-3_0
                [SERVER-3:29009] tmp: /tmp
                  MPIR_being_debugged = 0
                  MPIR_debug_state = 1
                  MPIR_partial_attach_ok = 1
                  MPIR_i_am_starter = 0
                  MPIR_forward_output = 0
                  MPIR_proctable_size = 10
                  MPIR_proctable:
                    (i, host, exe, pid) = (0, SERVER-2,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
                    (i, host, exe, pid) = (1, x.x.x.24,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
                    (i, host, exe, pid) = (2, x.x.x.26,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
                    (i, host, exe, pid) = (3, x.x.x.28,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
                    (i, host, exe, pid) = (4, x.x.x.29,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
                    (i, host, exe, pid) = (5, x.x.x.30,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
                    (i, host, exe, pid) = (6, x.x.x.41,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
                    (i, host, exe, pid) = (7, x.x.x.101,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
                    (i, host, exe, pid) = (8, x.x.x.100,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
                    (i, host, exe, pid) = (9, x.x.x.102,
                /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
                MPIR_executable_path: NULL
                MPIR_server_arguments: NULL
                
--------------------------------------------------------------------------
                It looks like MPI_INIT failed for some reason; your
                parallel process is
                likely to abort.  There are many reasons that a
                parallel process can
                fail during MPI_INIT; some of which are due to
                configuration or environment
                problems.  This failure appears to be an internal
                failure; here's some
                additional information (which may only be relevant to
                an Open MPI
                developer):

                  PML add procs failed
                  --> Returned "Error" (-1) instead of "Success" (0)
                
--------------------------------------------------------------------------
                [SERVER-2:5291] *** An error occurred in MPI_Init
                [SERVER-2:5291] *** reported by process
                [140508871983105,140505560121344]
                [SERVER-2:5291] *** on a NULL communicator
                [SERVER-2:5291] *** Unknown error
                [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in
                this communicator will now abort,
                [SERVER-2:5291] ***    and potentially your MPI job)
                
--------------------------------------------------------------------------
                An MPI process is aborting at a time when it cannot
                guarantee that all
                of its peer processes in the job will be killed
                properly.  You should
                double check that everything has shut down cleanly.

                  Reason:     Before MPI_INIT completed
                  Local host: SERVER-2
                  PID:        5291
                
--------------------------------------------------------------------------
                
[sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
                [btl_openib_proc.c:157] ompi_modex_recv failed for
                peer [[50535,1],0]
                
[sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
                [btl_openib_proc.c:157] ompi_modex_recv failed for
                peer [[50535,1],0]
                [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
                mca_base_modex_recv: failed with return value=-13
                [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
                mca_base_modex_recv: failed with return value=-13
                [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
                mca_base_modex_recv: failed with return value=-13
                [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
                mca_base_modex_recv: failed with return value=-13
                
--------------------------------------------------------------------------
                At least one pair of MPI processes are unable to reach
                each other for
                MPI communications.  This means that no Open MPI
                device has indicated
                that it can be used to communicate between these
                processes.  This is
                an error; Open MPI requires that all MPI processes be
                able to reach
                each other.  This error can sometimes be the result of
                forgetting to
                specify the "self" BTL.

                  Process 1 ([[50535,1],8]) is on host: sv-1
                  Process 2 ([[50535,1],0]) is on host: SERVER-2
                  BTLs attempted: openib self sm tcp

                Your MPI job is now going to abort; sorry.
                
--------------------------------------------------------------------------
                
--------------------------------------------------------------------------
                MPI_INIT has failed because at least one MPI process
                is unreachable
                from another.  This *usually* means that an underlying
                communication
                plugin -- such as a BTL or an MTL -- has either not
                loaded or not
                allowed itself to be used.  Your MPI job will now abort.

                You may wish to try to narrow down the problem;

                 * Check the output of ompi_info to see which BTL/MTL
                plugins are
                   available.
                 * Run your application with MPI_THREAD_SINGLE.
                 * Set the MCA parameter btl_base_verbose to 100 (or
                mtl_base_verbose,
                   if using MTL-based communications) to see exactly which
                   communication plugins were considered and/or discarded.
                
--------------------------------------------------------------------------
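
                (For reference, the checks suggested above could look like
                the following; placeholders as before:)

                # List which BTL/MTL plugins this Open MPI build can load
                ompi_info | grep -E 'btl|mtl'

                # Re-run with verbose component selection to see which
                # plugins were considered and/or discarded on each node
                mpirun --mca btl_base_verbose 100 --mca mtl_base_verbose 100 -np <nprocs> -hostfile <hostfile> <your_app>
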
                
                [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
                [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] mca_base_modex_recv: failed with return value=-13
                [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] mca_base_modex_recv: failed with return value=-13
                [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving
                [sv-4:11802] sess_dir_finalize: job session dir not empty - leaving
                [SERVER-14:08399] sess_dir_finalize: job session dir not empty - leaving
                [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-6:09087] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 0
                [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving
                [SERVER-7:32563] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 0
                exiting with status 0
                exiting with status 0
                [SERVER-4:15711] sess_dir_finalize: job session dir not empty - leaving
                [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving
                exiting with status 0
                [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving
                [sv-3:08352] sess_dir_finalize: proc session dir not empty - leaving
                [sv-3:08352] sess_dir_finalize: job session dir not empty - leaving
                [sv-1:45701] sess_dir_finalize: proc session dir not empty - leaving
                [sv-1:45701] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 0
                exiting with status 0
                [sv-2:07503] sess_dir_finalize: proc session dir not empty - leaving
                [sv-2:07503] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 0
                [SERVER-5:12534] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 0
                [SERVER-3:28993] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 0
                
--------------------------------------------------------------------------
                mpirun has exited due to process rank 6 with PID 8412 on
                node x.x.x.41 exiting improperly. There are three reasons
                this could occur:

                1. this process did not call "init" before exiting, but
                others in the job did. This can cause a job to hang
                indefinitely while it waits for all processes to call
                "init". By rule, if one process calls "init", then ALL
                processes must call "init" prior to termination.

                2. this process called "init", but exited without calling
                "finalize". By rule, all processes that call "init" MUST
                call "finalize" prior to exiting or it will be considered
                an "abnormal termination"

                3. this process called "MPI_Abort" or "orte_abort" and the
                mca parameter orte_create_session_dirs is set to false. In
                this case, the run-time cannot detect that the abort call
                was an abnormal termination. Hence, the only error message
                you will receive is this one.

                This may have caused other processes in the application to
                be terminated by signals sent by mpirun (as reported here).

                You can avoid this message by specifying -quiet on the
                mpirun command line.

                
--------------------------------------------------------------------------
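
                (Reasons 1 and 2 above can be ruled out by making sure the
                test program calls MPI_Init and MPI_Finalize on every
                rank; the message itself can be silenced, though of course
                not fixed, with -quiet. A sketch, placeholders as before:)

                # -quiet only suppresses this diagnostic; the underlying
                # failure still has to be fixed
                mpirun -quiet -np <nprocs> -hostfile <hostfile> <your_app>
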
                [SERVER-2:05284] 6 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
                [SERVER-2:05284] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
                [SERVER-2:05284] 9 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
                [SERVER-2:05284] 9 more processes have sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
                [SERVER-2:05284] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
                [SERVER-2:05284] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
                [SERVER-2:05284] sess_dir_finalize: job session dir not empty - leaving
                exiting with status 1
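
                (To see every individual help/error message instead of the
                aggregated summary, the run can be repeated with
                aggregation disabled, as the output above suggests:)

                mpirun --mca orte_base_help_aggregate 0 -np <nprocs> -hostfile <hostfile> <your_app>
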

                
******************************************************************

                Any feedback will be helpful. Thank you!

                Mr. Beans


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
