I'll let Tom suggest a solution for the psm error, but you really need to 
remove those thread-related config params. OMPI isn't really thread safe at 
this point.
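A stripped-down configure along those lines (reusing the install prefix from the
quoted message below; shown only as an illustration, not part of the original
reply) would be:

./configure --with-psm --prefix=/opt/openmpi-1.7.2
make && make install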

On Aug 4, 2013, at 6:26 PM, RoboBeans <robobe...@gmail.com> wrote:

> Hi Tom,
> 
> As per your suggestion, I tried
> 
> ./configure --with-psm --prefix=/opt/openmpi-1.7.2 
> --enable-event-thread-support --enable-opal-multi-threads 
> --enable-orte-progress-threads --enable-mpi-thread-multiple
> 
> but I am getting this error:
> 
> --- MCA component mtl:psm (m4 configuration macro)
> checking for MCA component mtl:psm compile mode... dso
> checking --with-psm value... simple ok (unspecified)
> checking --with-psm-libdir value... simple ok (unspecified)
> checking psm.h usability... no
> checking psm.h presence... yes
> configure: WARNING: psm.h: present but cannot be compiled
> configure: WARNING: psm.h:     check for missing prerequisite headers?
> configure: WARNING: psm.h: see the Autoconf documentation
> configure: WARNING: psm.h:     section "Present But Cannot Be Compiled"
> configure: WARNING: psm.h: proceeding with the compiler's result
> configure: WARNING:     ## 
> ------------------------------------------------------ ##
> configure: WARNING:     ## Report this to 
> http://www.open-mpi.org/community/help/ ##
> configure: WARNING:     ## 
> ------------------------------------------------------ ##
> checking for psm.h... no
> configure: error: PSM support requested but not found.  Aborting
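> 
> (General autoconf note, not from the original post: the real compiler error
> behind a "present but cannot be compiled" warning is recorded in config.log
> in the build directory, e.g.:)
> 
> grep -n -B 2 -A 20 'psm.h' config.log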
> 
> 
> Any feedback will be helpful. Thanks for your time!
> 
> Mr. Beans
> 
> 
> 
> On 8/4/13 10:31 AM, Elken, Tom wrote:
>> On 8/3/13 7:09 PM, RoboBeans wrote:
>> On the first 7 nodes:
>> 
>> [mpidemo@SERVER-3 ~]$ ofed_info | head -n 1
>> OFED-1.5.3.2:
>> 
>> On the last 4 nodes:
>> 
>> [mpidemo@sv-2 ~]$ ofed_info | head -n 1
>> -bash: ofed_info: command not found
>> [Tom]
>> This is a pretty good clue that OFED is not installed on the last 4 nodes.
>> You should fix that by installing OFED 1.5.3.2 on the last 4 nodes, or,
>> better (but more work), install a newer OFED such as 1.5.4.1 or 3.5 on ALL
>> the nodes (check the OFED release notes to confirm your OS is supported by
>> these OFEDs).
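>> For example, to check the OFED version on the last 4 nodes in one pass
>> (hostnames taken from the logs later in this thread; an illustrative
>> one-liner, adjust to your environment):
>> 
>> for h in sv-1 sv-2 sv-3 sv-4; do ssh $h 'ofed_info | head -n 1'; done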
>>  
>> BTW, since you are using QLogic HCAs, they typically deliver the best
>> performance when using the PSM API to the HCA.  PSM is part of OFED.  To use
>> it by default with Open MPI, you can build Open MPI as follows:
>> ./configure --with-psm --prefix=<install directory>
>> make
>> make install
>> 
>> With an Open MPI that is already built, you can try to use PSM as follows:
>> mpirun … --mca mtl psm --mca btl ^openib …
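>> For example, with the hostfile and test program used elsewhere in this
>> thread (an illustrative command line, not part of the original message):
>> 
>> mpirun --mca mtl psm --mca btl ^openib -np 10 --hostfile mpi_hostfile --bynode ./test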
>>  
>> -Tom
>> 
>> [mpidemo@sv-2 ~]$ which ofed_info
>> /usr/bin/which: no ofed_info in 
>> (/usr/OPENMPI/openmpi-1.7.2/bin:/usr/OPENMPI/openmpi-1.7.2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin/:/usr/lib/:/usr/lib:/usr:/usr/:/bin/:/usr/lib/:/usr/lib:/usr:/usr/)
>> 
>> 
>> Are there specific locations where I should look for ofed_info? How can I
>> tell whether OFED is installed on a node or not?
>> 
>> Thanks again!!!
>> 
>> 
>> On 8/3/13 5:52 PM, Ralph Castain wrote:
>> Are the OFED versions the same across all the machines? I would suspect that
>> might be the problem.
>>  
>>  
>> On Aug 3, 2013, at 4:06 PM, RoboBeans <robobe...@gmail.com> wrote:
>> 
>> 
>> Hi Ralph, I tried 1.5.4, 1.6.5, and 1.7.2 (compiled from source with no
>> configure arguments), but I am facing the same issue. When I run a job using
>> 1.5.4 (installed via yum), I get warnings, but they don't affect my output.
>> 
>> Example of warning that I get:
>> 
>> sv-2.7960ipath_userinit: Mismatched user minor version (12) and driver minor 
>> version (11) while context sharing. Ensure that driver and library are from 
>> the same release.
>> 
>> Each system has a QLogic card ("QLE7342-CK dual-port IB card") with the same
>> OS but a different kernel revision (e.g. 2.6.32-358.2.1.el6.x86_64 vs.
>> 2.6.32-358.el6.x86_64).
>> 
>> Thank you for your time. 
>>  
>> On 8/3/13 2:05 PM, Ralph Castain wrote:
>> Hmmm...strange indeed. I would remove those four configure options and give 
>> it a try. That will eliminate all the obvious things, I would think, though 
>> they aren't generally involved in the issue shown here. Still, worth taking 
>> out potential trouble sources.
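>> For example, the plain build (same install prefix as in the configure line
>> quoted later in this thread; shown here only as an illustration) would be:
>> 
>> ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2
>> make all install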
>>  
>> What is the connectivity between SERVER-2 and node 100? Should I assume that 
>> the first seven nodes are connected via one type of interconnect, and the 
>> other four are connected to those seven by another type?
>>  
>>  
>> On Aug 3, 2013, at 1:30 PM, RoboBeans <robobe...@gmail.com> wrote:
>> 
>> 
>> Thanks for looking into it, Ralph. I modified the hostfile but I am still
>> getting the same error. Any other pointers you can think of? The difference
>> between this 1.7.2 installation and 1.5.4 is that I installed 1.5.4 using
>> yum, while for 1.7.2 I built from source and configured with
>> --enable-event-thread-support --enable-opal-multi-threads
>> --enable-orte-progress-threads --enable-mpi-thread-multiple.
>> Am I missing something here?
>> 
>> //******************************************************************
>> 
>> $ cat mpi_hostfile
>> 
>> x.x.x.22 slots=15 max-slots=15
>> x.x.x.24 slots=2 max-slots=2
>> x.x.x.26 slots=14 max-slots=14
>> x.x.x.28 slots=16 max-slots=16
>> x.x.x.29 slots=14 max-slots=14
>> x.x.x.30 slots=16 max-slots=16
>> x.x.x.41 slots=46 max-slots=46
>> x.x.x.101 slots=46 max-slots=46
>> x.x.x.100 slots=46 max-slots=46
>> x.x.x.102 slots=22 max-slots=22
>> x.x.x.103 slots=22 max-slots=22
>> 
>> //******************************************************************
>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>> 
>> [SERVER-2:08907] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0/0
>> [SERVER-2:08907] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0
>> [SERVER-2:08907] top: openmpi-sessions-mpidemo@SERVER-2_0
>> [SERVER-2:08907] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-3:32517] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0/1
>> [SERVER-3:32517] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0
>> [SERVER-3:32517] top: openmpi-sessions-mpidemo@SERVER-3_0
>> [SERVER-3:32517] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-6:11595] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0/4
>> [SERVER-6:11595] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0
>> [SERVER-6:11595] top: openmpi-sessions-mpidemo@SERVER-6_0
>> [SERVER-6:11595] tmp: /tmp
>> [SERVER-4:27445] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0/2
>> [SERVER-4:27445] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0
>> [SERVER-4:27445] top: openmpi-sessions-mpidemo@SERVER-4_0
>> [SERVER-4:27445] tmp: /tmp
>> [SERVER-7:02607] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0/5
>> [SERVER-7:02607] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0
>> [SERVER-7:02607] top: openmpi-sessions-mpidemo@SERVER-7_0
>> [SERVER-7:02607] tmp: /tmp
>> [sv-1:46100] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0/8
>> [sv-1:46100] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0
>> [sv-1:46100] top: openmpi-sessions-mpidemo@sv-1_0
>> [sv-1:46100] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-5:16404] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0/3
>> [SERVER-5:16404] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0
>> [SERVER-5:16404] top: openmpi-sessions-mpidemo@SERVER-5_0
>> [SERVER-5:16404] tmp: /tmp
>> [sv-3:08575] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0/9
>> [sv-3:08575] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0
>> [sv-3:08575] top: openmpi-sessions-mpidemo@sv-3_0
>> [sv-3:08575] tmp: /tmp
>> [SERVER-14:10755] procdir: 
>> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0/6
>> [SERVER-14:10755] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0
>> [SERVER-14:10755] top: openmpi-sessions-mpidemo@SERVER-14_0
>> [SERVER-14:10755] tmp: /tmp
>> [sv-4:12040] procdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0/10
>> [sv-4:12040] jobdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0
>> [sv-4:12040] top: openmpi-sessions-mpidemo@sv-4_0
>> [sv-4:12040] tmp: /tmp
>> [sv-2:07725] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0/7
>> [sv-2:07725] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0
>> [sv-2:07725] top: openmpi-sessions-mpidemo@sv-2_0
>> [sv-2:07725] tmp: /tmp
>> 
>>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYNODE  
>> Ranking policy: NODE  Binding policy: NONE[NODE]  Cpu set: NULL  PPR: NULL
>>      Num new daemons: 0    New daemon starting vpid INVALID
>>      Num nodes: 10
>> 
>>  Data for node: SERVER-2         Launch id: -1    State: 2
>>      Daemon: [[62216,0],0]    Daemon launched: True
>>      Num slots: 15    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 15    Max slots: 15
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],0]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 0
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-15 
>>    Binding: NULL[0]
>> 
>>  Data for node: x.x.x.24         Launch id: -1    State: 0
>>      Daemon: [[62216,0],1]    Daemon launched: False
>>      Num slots: 2    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 2    Max slots: 2
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],1]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 1
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.26         Launch id: -1    State: 0
>>      Daemon: [[62216,0],2]    Daemon launched: False
>>      Num slots: 14    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 14    Max slots: 14
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],2]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 2
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.28         Launch id: -1    State: 0
>>      Daemon: [[62216,0],3]    Daemon launched: False
>>      Num slots: 16    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 16    Max slots: 16
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],3]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 3
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.29         Launch id: -1    State: 0
>>      Daemon: [[62216,0],4]    Daemon launched: False
>>      Num slots: 14    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 14    Max slots: 14
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],4]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 4
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.30         Launch id: -1    State: 0
>>      Daemon: [[62216,0],5]    Daemon launched: False
>>      Num slots: 16    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 16    Max slots: 16
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],5]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 5
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.41         Launch id: -1    State: 0
>>      Daemon: [[62216,0],6]    Daemon launched: False
>>      Num slots: 46    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 46    Max slots: 46
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],6]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 6
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.101         Launch id: -1    State: 0
>>      Daemon: [[62216,0],7]    Daemon launched: False
>>      Num slots: 46    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 46    Max slots: 46
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],7]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 7
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.100         Launch id: -1    State: 0
>>      Daemon: [[62216,0],8]    Daemon launched: False
>>      Num slots: 46    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 46    Max slots: 46
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],8]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 8
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.102         Launch id: -1    State: 0
>>      Daemon: [[62216,0],9]    Daemon launched: False
>>      Num slots: 22    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 22    Max slots: 22
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[62216,1],9]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 9
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> [sv-1:46111] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1/8
>> [sv-1:46111] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1
>> [sv-1:46111] top: openmpi-sessions-mpidemo@sv-1_0
>> [sv-1:46111] tmp: /tmp
>> [SERVER-14:10768] procdir: 
>> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1/6
>> [SERVER-14:10768] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1
>> [SERVER-14:10768] top: openmpi-sessions-mpidemo@SERVER-14_0
>> [SERVER-14:10768] tmp: /tmp
>> [SERVER-2:08912] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1/0
>> [SERVER-2:08912] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1
>> [SERVER-2:08912] top: openmpi-sessions-mpidemo@SERVER-2_0
>> [SERVER-2:08912] tmp: /tmp
>> [SERVER-4:27460] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1/2
>> [SERVER-4:27460] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1
>> [SERVER-4:27460] top: openmpi-sessions-mpidemo@SERVER-4_0
>> [SERVER-4:27460] tmp: /tmp
>> [SERVER-6:11608] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1/4
>> [SERVER-6:11608] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1
>> [SERVER-6:11608] top: openmpi-sessions-mpidemo@SERVER-6_0
>> [SERVER-6:11608] tmp: /tmp
>> [SERVER-7:02620] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1/5
>> [SERVER-7:02620] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1
>> [SERVER-7:02620] top: openmpi-sessions-mpidemo@SERVER-7_0
>> [SERVER-7:02620] tmp: /tmp
>> [sv-3:08586] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1/9
>> [sv-3:08586] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1
>> [sv-3:08586] top: openmpi-sessions-mpidemo@sv-3_0
>> [sv-3:08586] tmp: /tmp
>> [sv-2:07736] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1/7
>> [sv-2:07736] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1
>> [sv-2:07736] top: openmpi-sessions-mpidemo@sv-2_0
>> [sv-2:07736] tmp: /tmp
>> [SERVER-5:16418] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1/3
>> [SERVER-5:16418] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1
>> [SERVER-5:16418] top: openmpi-sessions-mpidemo@SERVER-5_0
>> [SERVER-5:16418] tmp: /tmp
>> [SERVER-3:32533] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1/1
>> [SERVER-3:32533] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1
>> [SERVER-3:32533] top: openmpi-sessions-mpidemo@SERVER-3_0
>> [SERVER-3:32533] tmp: /tmp
>>   MPIR_being_debugged = 0
>>   MPIR_debug_state = 1
>>   MPIR_partial_attach_ok = 1
>>   MPIR_i_am_starter = 0
>>   MPIR_forward_output = 0
>>   MPIR_proctable_size = 10
>>   MPIR_proctable:
>>     (i, host, exe, pid) = (0, SERVER-2, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
>>     (i, host, exe, pid) = (1, x.x.x.24, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
>>     (i, host, exe, pid) = (2, x.x.x.26, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
>>     (i, host, exe, pid) = (3, x.x.x.28, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
>>     (i, host, exe, pid) = (4, x.x.x.29, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
>>     (i, host, exe, pid) = (5, x.x.x.30, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
>>     (i, host, exe, pid) = (6, x.x.x.41, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
>>     (i, host, exe, pid) = (7, x.x.x.101, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
>>     (i, host, exe, pid) = (8, x.x.x.100, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
>>     (i, host, exe, pid) = (9, x.x.x.102, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
>> MPIR_executable_path: NULL
>> MPIR_server_arguments: NULL
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>>   PML add procs failed
>>   --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [SERVER-2:8912] *** An error occurred in MPI_Init
>> [SERVER-2:8912] *** reported by process [140393673392129,140389596004352]
>> [SERVER-2:8912] *** on a NULL communicator
>> [SERVER-2:8912] *** Unknown error
>> [SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
>> will now abort,
>> [SERVER-2:8912] ***    and potentially your MPI job)
>> --------------------------------------------------------------------------
>> An MPI process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly.  You should
>> double check that everything has shut down cleanly.
>> 
>>   Reason:     Before MPI_INIT completed
>>   Local host: SERVER-2
>>   PID:        8912
>> --------------------------------------------------------------------------
>> [sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> 
>>   Process 1 ([[62216,1],8]) is on host: sv-1
>>   Process 2 ([[62216,1],0]) is on host: SERVER-2
>>   BTLs attempted: openib self sm tcp
>> 
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> [sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> MPI_INIT has failed because at least one MPI process is unreachable
>> from another.  This *usually* means that an underlying communication
>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>> allowed itself to be used.  Your MPI job will now abort.
>> 
>> You may wish to try to narrow down the problem;
>> 
>>  * Check the output of ompi_info to see which BTL/MTL plugins are
>>    available.
>>  * Run your application with MPI_THREAD_SINGLE.
>>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>>    if using MTL-based communications) to see exactly which
>>    communication plugins were considered and/or discarded.
>> --------------------------------------------------------------------------
>> [sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-4:12040] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-14:10755] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving
>> exiting with status 0
>> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:11595] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:02607] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving
>> exiting with status 0
>> exiting with status 0
>> exiting with status 0
>> [SERVER-4:27445] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08575] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08575] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [sv-1:46100] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-1:46100] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [sv-2:07725] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-2:07725] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-5:16404] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:32517] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 6 with PID 10768 on
>> node x.x.x.41 exiting improperly. There are three reasons this could occur:
>> 
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>> 
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>> 
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>> 
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> 
>> You can avoid this message by specifying -quiet on the mpirun command line.
>> 
>> --------------------------------------------------------------------------
>> [SERVER-2:08907] 6 more processes have sent help message help-mpi-runtime / 
>> mpi_init:startup:internal-failure
>> [SERVER-2:08907] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>> all help / error messages
>> [SERVER-2:08907] 9 more processes have sent help message help-mpi-errors.txt 
>> / mpi_errors_are_fatal unknown handle
>> [SERVER-2:08907] 9 more processes have sent help message 
>> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>> [SERVER-2:08907] 2 more processes have sent help message help-mca-bml-r2.txt 
>> / unreachable proc
>> [SERVER-2:08907] 2 more processes have sent help message help-mpi-runtime / 
>> mpi_init:startup:pml-add-procs-fail
>> [SERVER-2:08907] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 1 
>> 
>> //******************************************************************
>> 
>> On 8/3/13 4:34 AM, Ralph Castain wrote:
>> It looks like SERVER-2 cannot talk to your x.x.x.100 machine. I note that 
>> you have some entries at the end of the hostfile that I don't understand - a 
>> list of hosts that can be reached? And I see that your x.x.x.22 machine 
>> isn't on it. Is that SERVER-2 by chance?
>>  
>> Our hostfile parsing changed between the release series, but I know we never 
>> consciously supported the syntax you show below where you list capabilities, 
>> and then re-list the hosts in an apparent attempt to filter which ones can 
>> actually be used. It is possible that the 1.5 series somehow used that to 
>> exclude the 22 machine, and that the 1.7 parser now doesn't do that.
>>  
>> If you only include machines you actually intend to use in your hostfile, 
>> does the 1.7 series work?
>>  
>> On Aug 3, 2013, at 3:58 AM, RoboBeans <robobe...@gmail.com> wrote:
>> 
>> 
>> Hello everyone,
>> 
>> I have installed Open MPI 1.5.4 on an 11-node cluster using "yum install
>> openmpi openmpi-devel" and everything seems to be working fine. For testing
>> I am using this test program:
>> 
>> //******************************************************************
>> 
>> $ cat test.cpp
>> 
>> #include <stdio.h>
>> #include <mpi.h>
>> 
>> int main (int argc, char *argv[])
>> {
>>   int id, np;
>>   char name[MPI_MAX_PROCESSOR_NAME];
>>   int namelen;
>>   int i;
>> 
>>   MPI_Init (&argc, &argv);
>> 
>>   MPI_Comm_size (MPI_COMM_WORLD, &np);
>>   MPI_Comm_rank (MPI_COMM_WORLD, &id);
>>   MPI_Get_processor_name (name, &namelen);
>> 
>>   printf ("This is Process %2d out of %2d running on host %s\n", id, np, 
>> name);
>> 
>>   MPI_Finalize ();
>> 
>>   return (0);
>> }
>> 
>> //******************************************************************
>> 
>> and my hostfile looks like this:
>> 
>> $ cat mpi_hostfile
>> 
>> # The Hostfile for Open MPI
>> 
>> # specify number of slots for processes to run locally.
>> #localhost slots=12
>> #x.x.x.16 slots=12 max-slots=12
>> #x.x.x.17 slots=12 max-slots=12
>> #x.x.x.18 slots=12 max-slots=12
>> #x.x.x.19 slots=12 max-slots=12
>> #x.x.x.20 slots=12 max-slots=12
>> #x.x.x.55 slots=46 max-slots=46
>> #x.x.x.56 slots=46 max-slots=46
>> 
>> x.x.x.22 slots=15 max-slots=15
>> x.x.x.24 slots=2 max-slots=2
>> x.x.x.26 slots=14 max-slots=14
>> x.x.x.28 slots=16 max-slots=16
>> x.x.x.29 slots=14 max-slots=14
>> x.x.x.30 slots=16 max-slots=16
>> x.x.x.41 slots=46 max-slots=46
>> x.x.x.101 slots=46 max-slots=46
>> x.x.x.100 slots=46 max-slots=46
>> x.x.x.102 slots=22 max-slots=22
>> x.x.x.103 slots=22 max-slots=22
>> 
>> # The following slave nodes are available to this machine:
>> x.x.x.24
>> x.x.x.26
>> x.x.x.28
>> x.x.x.29
>> x.x.x.30
>> x.x.x.41
>> x.x.x.101
>> x.x.x.100
>> x.x.x.102
>> x.x.x.103
>> 
>> //******************************************************************
>> 
>> This is what my .bashrc looks like on each node:
>> 
>> $ cat ~/.bashrc
>> 
>> # .bashrc
>> 
>> # Source global definitions
>> if [ -f /etc/bashrc ]; then
>>     . /etc/bashrc
>> fi
>> 
>> # User specific aliases and functions
>> umask 077
>> 
>> export PSM_SHAREDCONTEXTS_MAX=20
>> 
>> #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
>> export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}
>> 
>> #export 
>> LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>> export 
>> LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>> 
>> export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/
>> 
>> //******************************************************************
>> 
>> $ mpic++ test.cpp -o test
>> 
>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>> 
>> //******************************************************************
>> 
>> These nodes are running 2.6.32-358.2.1.el6.x86_64 release
>> 
>> $ uname
>> Linux
>> $ uname -r
>> 2.6.32-358.2.1.el6.x86_64
>> $ cat /etc/issue
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> 
>> //******************************************************************
>> 
>> Now, if I install Open MPI 1.7.2 on each node separately, I can only use it
>> on either the first 7 nodes or the last 4 nodes, but not on all of them.
>> 
>> //******************************************************************
>> 
>> $ gunzip -c openmpi-1.7.2.tar.gz | tar xf -
>> 
>> $ cd openmpi-1.7.2
>>     
>> $ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2 
>> --enable-event-thread-support --enable-opal-multi-threads 
>> --enable-orte-progress-threads --enable-mpi-thread-multiple
>> 
>> $ make all install
>> 
>> //******************************************************************
>> 
>> This is the error message that I am receiving:
>> 
>> 
>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>> 
>> [SERVER-2:05284] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0/0
>> [SERVER-2:05284] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0
>> [SERVER-2:05284] top: openmpi-sessions-mpidemo@SERVER-2_0
>> [SERVER-2:05284] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-3:28993] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0/1
>> [SERVER-3:28993] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0
>> [SERVER-3:28993] top: openmpi-sessions-mpidemo@SERVER-3_0
>> [SERVER-3:28993] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-6:09087] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0/4
>> [SERVER-6:09087] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0
>> [SERVER-6:09087] top: openmpi-sessions-mpidemo@SERVER-6_0
>> [SERVER-6:09087] tmp: /tmp
>> [SERVER-7:32563] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0/5
>> [SERVER-7:32563] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0
>> [SERVER-7:32563] top: openmpi-sessions-mpidemo@SERVER-7_0
>> [SERVER-7:32563] tmp: /tmp
>> [SERVER-4:15711] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0/2
>> [SERVER-4:15711] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0
>> [SERVER-4:15711] top: openmpi-sessions-mpidemo@SERVER-4_0
>> [SERVER-4:15711] tmp: /tmp
>> [sv-1:45701] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0/8
>> [sv-1:45701] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0
>> [sv-1:45701] top: openmpi-sessions-mpidemo@sv-1_0
>> [sv-1:45701] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [sv-3:08352] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0/9
>> [sv-3:08352] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0
>> [sv-3:08352] top: openmpi-sessions-mpidemo@sv-3_0
>> [sv-3:08352] tmp: /tmp
>> [SERVER-5:12534] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0/3
>> [SERVER-5:12534] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0
>> [SERVER-5:12534] top: openmpi-sessions-mpidemo@SERVER-5_0
>> [SERVER-5:12534] tmp: /tmp
>> [SERVER-14:08399] procdir: 
>> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0/6
>> [SERVER-14:08399] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0
>> [SERVER-14:08399] top: openmpi-sessions-mpidemo@SERVER-14_0
>> [SERVER-14:08399] tmp: /tmp
>> [sv-4:11802] procdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0/10
>> [sv-4:11802] jobdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0
>> [sv-4:11802] top: openmpi-sessions-mpidemo@sv-4_0
>> [sv-4:11802] tmp: /tmp
>> [sv-2:07503] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0/7
>> [sv-2:07503] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0
>> [sv-2:07503] top: openmpi-sessions-mpidemo@sv-2_0
>> [sv-2:07503] tmp: /tmp
>> 
>>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYNODE  
>> Ranking policy: NODE  Binding policy: NONE[NODE]  Cpu set: NULL  PPR: NULL
>>      Num new daemons: 0    New daemon starting vpid INVALID
>>      Num nodes: 10
>> 
>>  Data for node: SERVER-2         Launch id: -1    State: 2
>>      Daemon: [[50535,0],0]    Daemon launched: True
>>      Num slots: 15    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 15    Max slots: 15
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],0]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 0
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-15 
>>    Binding: NULL[0]
>> 
>>  Data for node: x.x.x.24         Launch id: -1    State: 0
>>      Daemon: [[50535,0],1]    Daemon launched: False
>>      Num slots: 3    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 3    Max slots: 2
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],1]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 1
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.26         Launch id: -1    State: 0
>>      Daemon: [[50535,0],2]    Daemon launched: False
>>      Num slots: 15    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 15    Max slots: 14
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],2]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 2
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.28         Launch id: -1    State: 0
>>      Daemon: [[50535,0],3]    Daemon launched: False
>>      Num slots: 17    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 17    Max slots: 16
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],3]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 3
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.29         Launch id: -1    State: 0
>>      Daemon: [[50535,0],4]    Daemon launched: False
>>      Num slots: 15    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 15    Max slots: 14
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],4]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 4
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.30         Launch id: -1    State: 0
>>      Daemon: [[50535,0],5]    Daemon launched: False
>>      Num slots: 17    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 17    Max slots: 16
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],5]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 5
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.41         Launch id: -1    State: 0
>>      Daemon: [[50535,0],6]    Daemon launched: False
>>      Num slots: 47    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 47    Max slots: 46
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],6]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 6
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.101         Launch id: -1    State: 0
>>      Daemon: [[50535,0],7]    Daemon launched: False
>>      Num slots: 47    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 47    Max slots: 46
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],7]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 7
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.100         Launch id: -1    State: 0
>>      Daemon: [[50535,0],8]    Daemon launched: False
>>      Num slots: 47    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 47    Max slots: 46
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],8]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 8
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> 
>>  Data for node: x.x.x.102         Launch id: -1    State: 0
>>      Daemon: [[50535,0],9]    Daemon launched: False
>>      Num slots: 23    Slots in use: 1    Oversubscribed: FALSE
>>      Num slots allocated: 23    Max slots: 22
>>      Username on node: NULL
>>      Num procs: 1    Next node_rank: 1
>>      Data for proc: [[50535,1],9]
>>          Pid: 0    Local rank: 0    Node rank: 0    App rank: 9
>>          State: INITIALIZED    Restarts: 0    App_context: 0    Locale: 0-7  
>>   Binding: NULL[0]
>> [sv-1:45712] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1/8
>> [sv-1:45712] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1
>> [sv-1:45712] top: openmpi-sessions-mpidemo@sv-1_0
>> [sv-1:45712] tmp: /tmp
>> [SERVER-14:08412] procdir: 
>> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1/6
>> [SERVER-14:08412] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1
>> [SERVER-14:08412] top: openmpi-sessions-mpidemo@SERVER-14_0
>> [SERVER-14:08412] tmp: /tmp
>> [SERVER-2:05291] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1/0
>> [SERVER-2:05291] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1
>> [SERVER-2:05291] top: openmpi-sessions-mpidemo@SERVER-2_0
>> [SERVER-2:05291] tmp: /tmp
>> [SERVER-4:15726] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1/2
>> [SERVER-4:15726] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1
>> [SERVER-4:15726] top: openmpi-sessions-mpidemo@SERVER-4_0
>> [SERVER-4:15726] tmp: /tmp
>> [SERVER-6:09100] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1/4
>> [SERVER-6:09100] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1
>> [SERVER-6:09100] top: openmpi-sessions-mpidemo@SERVER-6_0
>> [SERVER-6:09100] tmp: /tmp
>> [SERVER-7:32576] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1/5
>> [SERVER-7:32576] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1
>> [SERVER-7:32576] top: openmpi-sessions-mpidemo@SERVER-7_0
>> [SERVER-7:32576] tmp: /tmp
>> [sv-3:08363] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1/9
>> [sv-3:08363] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1
>> [sv-3:08363] top: openmpi-sessions-mpidemo@sv-3_0
>> [sv-3:08363] tmp: /tmp
>> [sv-2:07514] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1/7
>> [sv-2:07514] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1
>> [sv-2:07514] top: openmpi-sessions-mpidemo@sv-2_0
>> [sv-2:07514] tmp: /tmp
>> [SERVER-5:12548] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1/3
>> [SERVER-5:12548] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1
>> [SERVER-5:12548] top: openmpi-sessions-mpidemo@SERVER-5_0
>> [SERVER-5:12548] tmp: /tmp
>> [SERVER-3:29009] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1/1
>> [SERVER-3:29009] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1
>> [SERVER-3:29009] top: openmpi-sessions-mpidemo@SERVER-3_0
>> [SERVER-3:29009] tmp: /tmp
>>   MPIR_being_debugged = 0
>>   MPIR_debug_state = 1
>>   MPIR_partial_attach_ok = 1
>>   MPIR_i_am_starter = 0
>>   MPIR_forward_output = 0
>>   MPIR_proctable_size = 10
>>   MPIR_proctable:
>>     (i, host, exe, pid) = (0, SERVER-2, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
>>     (i, host, exe, pid) = (1, x.x.x.24, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
>>     (i, host, exe, pid) = (2, x.x.x.26, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
>>     (i, host, exe, pid) = (3, x.x.x.28, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
>>     (i, host, exe, pid) = (4, x.x.x.29, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
>>     (i, host, exe, pid) = (5, x.x.x.30, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
>>     (i, host, exe, pid) = (6, x.x.x.41, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
>>     (i, host, exe, pid) = (7, x.x.x.101, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
>>     (i, host, exe, pid) = (8, x.x.x.100, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
>>     (i, host, exe, pid) = (9, x.x.x.102, 
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
>> MPIR_executable_path: NULL
>> MPIR_server_arguments: NULL
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>>   PML add procs failed
>>   --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [SERVER-2:5291] *** An error occurred in MPI_Init
>> [SERVER-2:5291] *** reported by process [140508871983105,140505560121344]
>> [SERVER-2:5291] *** on a NULL communicator
>> [SERVER-2:5291] *** Unknown error
>> [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
>> will now abort,
>> [SERVER-2:5291] ***    and potentially your MPI job)
>> --------------------------------------------------------------------------
>> An MPI process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly.  You should
>> double check that everything has shut down cleanly.
>> 
>>   Reason:     Before MPI_INIT completed
>>   Local host: SERVER-2
>>   PID:        5291
>> --------------------------------------------------------------------------
>> [sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>> [sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> 
>>   Process 1 ([[50535,1],8]) is on host: sv-1
>>   Process 2 ([[50535,1],0]) is on host: SERVER-2
>>   BTLs attempted: openib self sm tcp
>> 
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_INIT has failed because at least one MPI process is unreachable
>> from another.  This *usually* means that an underlying communication
>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>> allowed itself to be used.  Your MPI job will now abort.
>> 
>> You may wish to try to narrow down the problem;
>> 
>>  * Check the output of ompi_info to see which BTL/MTL plugins are
>>    available.
>>  * Run your application with MPI_THREAD_SINGLE.
>>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>>    if using MTL-based communications) to see exactly which
>>    communication plugins were considered and/or discarded.
>> --------------------------------------------------------------------------
>> [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] 
>> mca_base_modex_recv: failed with return value=-13
>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-4:11802] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-14:08399] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:09087] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:32563] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> exiting with status 0
>> exiting with status 0
>> [SERVER-4:15711] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08352] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08352] sess_dir_finalize: job session dir not empty - leaving
>> [sv-1:45701] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-1:45701] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> exiting with status 0
>> [sv-2:07503] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-2:07503] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-5:12534] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:28993] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 6 with PID 8412 on
>> node x.x.x.41 exiting improperly. There are three reasons this could occur:
>> 
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>> 
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>> 
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>> 
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> 
>> You can avoid this message by specifying -quiet on the mpirun command line.
>> 
>> --------------------------------------------------------------------------
>> [SERVER-2:05284] 6 more processes have sent help message help-mpi-runtime / 
>> mpi_init:startup:internal-failure
>> [SERVER-2:05284] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>> all help / error messages
>> [SERVER-2:05284] 9 more processes have sent help message help-mpi-errors.txt 
>> / mpi_errors_are_fatal unknown handle
>> [SERVER-2:05284] 9 more processes have sent help message 
>> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>> [SERVER-2:05284] 2 more processes have sent help message help-mca-bml-r2.txt 
>> / unreachable proc
>> [SERVER-2:05284] 2 more processes have sent help message help-mpi-runtime / 
>> mpi_init:startup:pml-add-procs-fail
>> [SERVER-2:05284] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 1
>> 
>> //******************************************************************
>> 
>> Any feedback will be helpful. Thank you!
>> 
>> Mr. Beans