On 8/3/13 7:09 PM, RoboBeans wrote:
On the first 7 nodes:
[mpidemo@SERVER-3 ~]$ ofed_info | head -n 1
OFED-1.5.3.2:
On the last 4 nodes:
[mpidemo@sv-2 ~]$ ofed_info | head -n 1
-bash: ofed_info: command not found
[Tom]
This is a pretty good clue that OFED is not installed on the
last 4 nodes. You should fix that by installing OFED 1.5.3.2 on
the last 4 nodes, or better (but more work), install a newer OFED
such as 1.5.4.1 or 3.5 on ALL the nodes. (Check the OFED release
notes to see whether your OS is supported by these OFEDs.)
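A quick way to see which nodes currently have ofed_info (a sketch, assuming
password-less ssh and a hosts.txt file listing the node hostnames; the file
name is just an example):
for h in $(cat hosts.txt); do
    echo -n "$h: "
    ssh "$h" 'command -v ofed_info >/dev/null && ofed_info | head -n 1 || echo "ofed_info not found"'
done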
BTW, since you are using QLogic HCAs, they typically perform best
when using the PSM API to the HCA. PSM is part of OFED. To use it
by default with Open MPI, you can build Open MPI as follows:
./configure --with-psm --prefix=<install directory>
make
make install
With an Open MPI that is already built, you can try to use PSM
as follows:
mpirun ... --mca mtl psm --mca btl ^openib ...
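For example, using the hostfile and test binary quoted later in this
thread (a sketch):
mpirun -np 10 --hostfile mpi_hostfile --mca mtl psm --mca btl ^openib ./test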
-Tom
[mpidemo@sv-2 ~]$ which ofed_info
/usr/bin/which: no ofed_info in
(/usr/OPENMPI/openmpi-1.7.2/bin:/usr/OPENMPI/openmpi-1.7.2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin/:/usr/lib/:/usr/lib:/usr:/usr/:/bin/:/usr/lib/:/usr/lib:/usr:/usr/)
Are there specific locations where I should look for
ofed_info? How can I check whether OFED is installed on a node
or not?
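One way to check, as a sketch, assuming an RPM-based install like the one
on the first 7 nodes (package names can vary between OFED distributions):
rpm -qa | grep -i -E 'ofed|openib|infinipath'
ls -l /usr/bin/ofed_info    # typical install location; may differ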
Thanks again!!!
On 8/3/13 5:52 PM, Ralph Castain wrote:
Are the OFED versions the same across all the machines? I
suspect that might be the problem.
On Aug 3, 2013, at 4:06 PM, RoboBeans <robobe...@gmail.com> wrote:
Hi Ralph, I tried using 1.5.4, 1.6.5, and 1.7.2 (compiled from
source) with no configure arguments, but I am facing the same
issue. When I run a job using 1.5.4 (installed using yum), I get
warnings, but they don't affect my output.
An example of the warning I get:
sv-2.7960ipath_userinit: Mismatched user minor version (12)
and driver minor version (11) while context sharing. Ensure
that driver and library are from the same release.
Each system has a QLogic card ("QLE7342-CK dual port IB
card"), with the same OS but different kernel revision numbers
(e.g. 2.6.32-358.2.1.el6.x86_64, 2.6.32-358.el6.x86_64).
Thank you for your time.
On 8/3/13 2:05 PM, Ralph Castain wrote:
Hmmm...strange indeed. I would remove those four configure
options and give it a try. That will eliminate all the
obvious things, I would think, though they aren't
generally involved in the issue shown here. Still, worth
taking out potential trouble sources.
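A rebuild without those options could be as simple as (a sketch, reusing
the install prefix quoted later in this thread):
./configure --prefix=/usr/OPENMPI/openmpi-1.7.2
make all install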
What is the connectivity between SERVER-2 and node 100?
Should I assume that the first seven nodes are connected
via one type of interconnect, and the other four are
connected to those seven by another type?
On Aug 3, 2013, at 1:30 PM, RoboBeans <robobe...@gmail.com> wrote:
Thanks for looking into it, Ralph. I modified the hostfile
but I am still getting the same error. Any other pointers you
can think of? The difference between this 1.7.2 installation
and 1.5.4 is that I installed 1.5.4 using yum, whereas for
1.7.2 I built from source and configured with
--enable-event-thread-support --enable-opal-multi-threads
--enable-orte-progress-threads --enable-mpi-thread-multiple.
Am I missing something here?
//******************************************************************
$ cat mpi_hostfile
x.x.x.22 slots=15 max-slots=15
x.x.x.24 slots=2 max-slots=2
x.x.x.26 slots=14 max-slots=14
x.x.x.28 slots=16 max-slots=16
x.x.x.29 slots=14 max-slots=14
x.x.x.30 slots=16 max-slots=16
x.x.x.41 slots=46 max-slots=46
x.x.x.101 slots=46 max-slots=46
x.x.x.100 slots=46 max-slots=46
x.x.x.102 slots=22 max-slots=22
x.x.x.103 slots=22 max-slots=22
//******************************************************************
$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
[SERVER-2:08907] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0/0
[SERVER-2:08907] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0
[SERVER-2:08907] top: openmpi-sessions-mpidemo@SERVER-2_0
[SERVER-2:08907] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-3:32517] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0/1
[SERVER-3:32517] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0
[SERVER-3:32517] top: openmpi-sessions-mpidemo@SERVER-3_0
[SERVER-3:32517] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-6:11595] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0/4
[SERVER-6:11595] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0
[SERVER-6:11595] top: openmpi-sessions-mpidemo@SERVER-6_0
[SERVER-6:11595] tmp: /tmp
[SERVER-4:27445] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0/2
[SERVER-4:27445] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0
[SERVER-4:27445] top: openmpi-sessions-mpidemo@SERVER-4_0
[SERVER-4:27445] tmp: /tmp
[SERVER-7:02607] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0/5
[SERVER-7:02607] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0
[SERVER-7:02607] top: openmpi-sessions-mpidemo@SERVER-7_0
[SERVER-7:02607] tmp: /tmp
[sv-1:46100] procdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0/8
[sv-1:46100] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0
[sv-1:46100] top: openmpi-sessions-mpidemo@sv-1_0
[sv-1:46100] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-5:16404] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0/3
[SERVER-5:16404] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0
[SERVER-5:16404] top: openmpi-sessions-mpidemo@SERVER-5_0
[SERVER-5:16404] tmp: /tmp
[sv-3:08575] procdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0/9
[sv-3:08575] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0
[sv-3:08575] top: openmpi-sessions-mpidemo@sv-3_0
[sv-3:08575] tmp: /tmp
[SERVER-14:10755] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0/6
[SERVER-14:10755] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0
[SERVER-14:10755] top: openmpi-sessions-mpidemo@SERVER-14_0
[SERVER-14:10755] tmp: /tmp
[sv-4:12040] procdir:
/tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0/10
[sv-4:12040] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0
[sv-4:12040] top: openmpi-sessions-mpidemo@sv-4_0
[sv-4:12040] tmp: /tmp
[sv-2:07725] procdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0/7
[sv-2:07725] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0
[sv-2:07725] top: openmpi-sessions-mpidemo@sv-2_0
[sv-2:07725] tmp: /tmp
Mapper requested: NULL Last mapper: round_robin Mapping
policy: BYNODE Ranking policy: NODE Binding policy:
NONE[NODE] Cpu set: NULL PPR: NULL
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 10
Data for node: SERVER-2 Launch id: -1 State: 2
Daemon: [[62216,0],0] Daemon launched: True
Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 15 Max slots: 15
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],0]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 0
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-15 Binding: NULL[0]
Data for node: x.x.x.24 Launch id: -1 State: 0
Daemon: [[62216,0],1] Daemon launched: False
Num slots: 2 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 2 Max slots: 2
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],1]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 1
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.26 Launch id: -1 State: 0
Daemon: [[62216,0],2] Daemon launched: False
Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 14 Max slots: 14
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],2]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 2
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.28 Launch id: -1 State: 0
Daemon: [[62216,0],3] Daemon launched: False
Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 16 Max slots: 16
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],3]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 3
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.29 Launch id: -1 State: 0
Daemon: [[62216,0],4] Daemon launched: False
Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 14 Max slots: 14
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],4]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 4
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.30 Launch id: -1 State: 0
Daemon: [[62216,0],5] Daemon launched: False
Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 16 Max slots: 16
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],5]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 5
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.41 Launch id: -1 State: 0
Daemon: [[62216,0],6] Daemon launched: False
Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 46 Max slots: 46
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],6]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 6
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.101 Launch id: -1 State: 0
Daemon: [[62216,0],7] Daemon launched: False
Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 46 Max slots: 46
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],7]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 7
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.100 Launch id: -1 State: 0
Daemon: [[62216,0],8] Daemon launched: False
Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 46 Max slots: 46
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],8]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 8
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.102 Launch id: -1 State: 0
Daemon: [[62216,0],9] Daemon launched: False
Num slots: 22 Slots in use: 1 Oversubscribed: FALSE
Num slots allocated: 22 Max slots: 22
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[62216,1],9]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 9
State: INITIALIZED Restarts: 0 App_context:
0 Locale: 0-7 Binding: NULL[0]
[sv-1:46111] procdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1/8
[sv-1:46111] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1
[sv-1:46111] top: openmpi-sessions-mpidemo@sv-1_0
[sv-1:46111] tmp: /tmp
[SERVER-14:10768] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1/6
[SERVER-14:10768] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1
[SERVER-14:10768] top: openmpi-sessions-mpidemo@SERVER-14_0
[SERVER-14:10768] tmp: /tmp
[SERVER-2:08912] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1/0
[SERVER-2:08912] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1
[SERVER-2:08912] top: openmpi-sessions-mpidemo@SERVER-2_0
[SERVER-2:08912] tmp: /tmp
[SERVER-4:27460] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1/2
[SERVER-4:27460] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1
[SERVER-4:27460] top: openmpi-sessions-mpidemo@SERVER-4_0
[SERVER-4:27460] tmp: /tmp
[SERVER-6:11608] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1/4
[SERVER-6:11608] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1
[SERVER-6:11608] top: openmpi-sessions-mpidemo@SERVER-6_0
[SERVER-6:11608] tmp: /tmp
[SERVER-7:02620] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1/5
[SERVER-7:02620] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1
[SERVER-7:02620] top: openmpi-sessions-mpidemo@SERVER-7_0
[SERVER-7:02620] tmp: /tmp
[sv-3:08586] procdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1/9
[sv-3:08586] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1
[sv-3:08586] top: openmpi-sessions-mpidemo@sv-3_0
[sv-3:08586] tmp: /tmp
[sv-2:07736] procdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1/7
[sv-2:07736] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1
[sv-2:07736] top: openmpi-sessions-mpidemo@sv-2_0
[sv-2:07736] tmp: /tmp
[SERVER-5:16418] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1/3
[SERVER-5:16418] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1
[SERVER-5:16418] top: openmpi-sessions-mpidemo@SERVER-5_0
[SERVER-5:16418] tmp: /tmp
[SERVER-3:32533] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1/1
[SERVER-3:32533] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1
[SERVER-3:32533] top: openmpi-sessions-mpidemo@SERVER-3_0
[SERVER-3:32533] tmp: /tmp
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 10
MPIR_proctable:
(i, host, exe, pid) = (0, SERVER-2,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
(i, host, exe, pid) = (1, x.x.x.24,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
(i, host, exe, pid) = (2, x.x.x.26,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
(i, host, exe, pid) = (3, x.x.x.28,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
(i, host, exe, pid) = (4, x.x.x.29,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
(i, host, exe, pid) = (5, x.x.x.30,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
(i, host, exe, pid) = (6, x.x.x.41,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
(i, host, exe, pid) = (7, x.x.x.101,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
(i, host, exe, pid) = (8, x.x.x.100,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
(i, host, exe, pid) = (9, x.x.x.102,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your
parallel process is
likely to abort. There are many reasons that a parallel
process can
fail during MPI_INIT; some of which are due to
configuration or environment
problems. This failure appears to be an internal failure;
here's some
additional information (which may only be relevant to an
Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[SERVER-2:8912] *** An error occurred in MPI_Init
[SERVER-2:8912] *** reported by process
[140393673392129,140389596004352]
[SERVER-2:8912] *** on a NULL communicator
[SERVER-2:8912] *** Unknown error
[SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in
this communicator will now abort,
[SERVER-2:8912] *** and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot
guarantee that all
of its peer processes in the job will be killed properly.
You should
double check that everything has shut down cleanly.
Reason: Before MPI_INIT completed
Local host: SERVER-2
PID: 8912
--------------------------------------------------------------------------
[sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer
[[62216,1],0]
[sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach
each other for
MPI communications. This means that no Open MPI device
has indicated
that it can be used to communicate between these
processes. This is
an error; Open MPI requires that all MPI processes be able
to reach
each other. This error can sometimes be the result of
forgetting to
specify the "self" BTL.
Process 1 ([[62216,1],8]) is on host: sv-1
Process 2 ([[62216,1],0]) is on host: SERVER-2
BTLs attempted: openib self sm tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer
[[62216,1],0]
[sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is
unreachable
from another. This *usually* means that an underlying
communication
plugin -- such as a BTL or an MTL -- has either not loaded
or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL
plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or
mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer
[[62216,1],0]
[sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[SERVER-2:08907] sess_dir_finalize: proc session dir not
empty - leaving
[sv-4:12040] sess_dir_finalize: job session dir not empty
- leaving
[SERVER-14:10755] sess_dir_finalize: job session dir not
empty - leaving
[SERVER-2:08907] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-6:11595] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-6:11595] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-4:27445] sess_dir_finalize: proc session dir not
empty - leaving
exiting with status 0
[SERVER-4:27445] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-6:11595] sess_dir_finalize: job session dir not
empty - leaving
[SERVER-7:02607] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-7:02607] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-7:02607] sess_dir_finalize: job session dir not
empty - leaving
[SERVER-5:16404] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-5:16404] sess_dir_finalize: proc session dir not
empty - leaving
exiting with status 0
exiting with status 0
exiting with status 0
[SERVER-4:27445] sess_dir_finalize: job session dir not
empty - leaving
exiting with status 0
[SERVER-3:32517] sess_dir_finalize: proc session dir not
empty - leaving
[SERVER-3:32517] sess_dir_finalize: proc session dir not
empty - leaving
[sv-3:08575] sess_dir_finalize: proc session dir not empty
- leaving
[sv-3:08575] sess_dir_finalize: job session dir not empty
- leaving
exiting with status 0
[sv-1:46100] sess_dir_finalize: proc session dir not empty
- leaving
[sv-1:46100] sess_dir_finalize: job session dir not empty
- leaving
exiting with status 0
[sv-2:07725] sess_dir_finalize: proc session dir not empty
- leaving
[sv-2:07725] sess_dir_finalize: job session dir not empty
- leaving
exiting with status 0
[SERVER-5:16404] sess_dir_finalize: job session dir not
empty - leaving
exiting with status 0
[SERVER-3:32517] sess_dir_finalize: job session dir not
empty - leaving
exiting with status 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 10768 on
node x.x.x.41 exiting improperly. There are three reasons
this could occur:
1. this process did not call "init" before exiting, but
others in
the job did. This can cause a job to hang indefinitely
while it waits
for all processes to call "init". By rule, if one process
calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling
"finalize".
By rule, all processes that call "init" MUST call
"finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the
mca parameter
orte_create_session_dirs is set to false. In this case,
the run-time cannot
detect that the abort call was an abnormal termination.
Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the
mpirun command line.
--------------------------------------------------------------------------
[SERVER-2:08907] 6 more processes have sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
[SERVER-2:08907] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error
messages
[SERVER-2:08907] 9 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[SERVER-2:08907] 9 more processes have sent help message
help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all
killed
[SERVER-2:08907] 2 more processes have sent help message
help-mca-bml-r2.txt / unreachable proc
[SERVER-2:08907] 2 more processes have sent help message
help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[SERVER-2:08907] sess_dir_finalize: job session dir not
empty - leaving
exiting with status 1
//******************************************************************
On 8/3/13 4:34 AM, Ralph Castain wrote:
It looks like SERVER-2 cannot talk to your x.x.x.100
machine. I note that you have some entries at the end
of the hostfile that I don't understand - a list of
hosts that can be reached? And I see that your
x.x.x.22 machine isn't on it. Is that SERVER-2 by chance?
Our hostfile parsing changed between the release
series, but I know we never consciously supported the
syntax you show below where you list capabilities, and
then re-list the hosts in an apparent attempt to
filter which ones can actually be used. It is possible
that the 1.5 series somehow used that to exclude the
22 machine, and that the 1.7 parser now doesn't do that.
If you only include machines you actually intend to
use in your hostfile, does the 1.7 series work?
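For example, a trimmed hostfile might keep only the slots lines for the
machines actually in use and drop the trailing re-list of hosts (a sketch
based on the entries quoted below):
x.x.x.22 slots=15 max-slots=15
x.x.x.24 slots=2 max-slots=2
x.x.x.26 slots=14 max-slots=14
# ... remaining machines you intend to use, one per line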
On Aug 3, 2013, at 3:58 AM, RoboBeans <robobe...@gmail.com> wrote:
Hello everyone,
I have installed Open MPI 1.5.4 on an 11-node cluster
using "yum install openmpi openmpi-devel" and
everything seems to be working fine. For testing, I am
using this test program:
//******************************************************************
$ cat test.cpp
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int id, np;
    char name[MPI_MAX_PROCESSOR_NAME];
    int namelen;
    int i;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &np);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Get_processor_name (name, &namelen);

    printf ("This is Process %2d out of %2d running on host %s\n", id, np, name);

    MPI_Finalize ();
    return (0);
}
//******************************************************************
and my hostfile looks like this:
$ cat mpi_hostfile
# The Hostfile for Open MPI
# specify number of slots for processes to run locally.
#localhost slots=12
#x.x.x.16 slots=12 max-slots=12
#x.x.x.17 slots=12 max-slots=12
#x.x.x.18 slots=12 max-slots=12
#x.x.1x.19 slots=12 max-slots=12
#x.x.x.20 slots=12 max-slots=12
#x.x.x.55 slots=46 max-slots=46
#x.x.x.56 slots=46 max-slots=46
x.x.x.22 slots=15 max-slots=15
x.x.x.24 slots=2 max-slots=2
x.x.x.26 slots=14 max-slots=14
x.x.x.28 slots=16 max-slots=16
x.x.x.29 slots=14 max-slots=14
x.x.x.30 slots=16 max-slots=16
x.x.x.41 slots=46 max-slots=46
x.x.x.101 slots=46 max-slots=46
x.x.x.100 slots=46 max-slots=46
x.x.x.102 slots=22 max-slots=22
x.x.x.103 slots=22 max-slots=22
# The following slave nodes are available to this machine:
x.x.x.24
x.x.x.26
x.x.x.28
x.x.x.29
x.x.x.30
x.x.x.41
x.x.x.101
x.x.x.100
x.x.x.102
x.x.x.103
//******************************************************************
this is what my .bashrc looks like on each node:
$ cat ~/.bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
umask 077
export PSM_SHAREDCONTEXTS_MAX=20
#export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}
#export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/
//******************************************************************
$ mpic++ test.cpp -o test
$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
//******************************************************************
These nodes are running the 2.6.32-358.2.1.el6.x86_64 kernel:
$ uname
Linux
$ uname -r
2.6.32-358.2.1.el6.x86_64
$ cat /etc/issue
CentOS release 6.4 (Final)
Kernel \r on an \m
//******************************************************************
Now, if I install Open MPI 1.7.2 on each node
separately, I can only use it on either the first 7
nodes or the last 4 nodes, but not on all of them.
//******************************************************************
$ gunzip -c openmpi-1.7.2.tar.gz | tar xf -
$ cd openmpi-1.7.2
$ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2 --enable-event-thread-support --enable-opal-multi-threads --enable-orte-progress-threads --enable-mpi-thread-multiple
$ make all install
//******************************************************************
This is the error message that I am receiving:
$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
[SERVER-2:05284] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0/0
[SERVER-2:05284] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0
[SERVER-2:05284] top: openmpi-sessions-mpidemo@SERVER-2_0
[SERVER-2:05284] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-3:28993] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0/1
[SERVER-3:28993] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0
[SERVER-3:28993] top: openmpi-sessions-mpidemo@SERVER-3_0
[SERVER-3:28993] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-6:09087] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0/4
[SERVER-6:09087] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0
[SERVER-6:09087] top: openmpi-sessions-mpidemo@SERVER-6_0
[SERVER-6:09087] tmp: /tmp
[SERVER-7:32563] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0/5
[SERVER-7:32563] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0
[SERVER-7:32563] top: openmpi-sessions-mpidemo@SERVER-7_0
[SERVER-7:32563] tmp: /tmp
[SERVER-4:15711] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0/2
[SERVER-4:15711] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0
[SERVER-4:15711] top: openmpi-sessions-mpidemo@SERVER-4_0
[SERVER-4:15711] tmp: /tmp
[sv-1:45701] procdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0/8
[sv-1:45701] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0
[sv-1:45701] top: openmpi-sessions-mpidemo@sv-1_0
[sv-1:45701] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
[sv-3:08352] procdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0/9
[sv-3:08352] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0
[sv-3:08352] top: openmpi-sessions-mpidemo@sv-3_0
[sv-3:08352] tmp: /tmp
[SERVER-5:12534] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0/3
[SERVER-5:12534] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0
[SERVER-5:12534] top: openmpi-sessions-mpidemo@SERVER-5_0
[SERVER-5:12534] tmp: /tmp
[SERVER-14:08399] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0/6
[SERVER-14:08399] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0
[SERVER-14:08399] top:
openmpi-sessions-mpidemo@SERVER-14_0
[SERVER-14:08399] tmp: /tmp
[sv-4:11802] procdir:
/tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0/10
[sv-4:11802] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0
[sv-4:11802] top: openmpi-sessions-mpidemo@sv-4_0
[sv-4:11802] tmp: /tmp
[sv-2:07503] procdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0/7
[sv-2:07503] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0
[sv-2:07503] top: openmpi-sessions-mpidemo@sv-2_0
[sv-2:07503] tmp: /tmp
Mapper requested: NULL Last mapper: round_robin
Mapping policy: BYNODE Ranking policy: NODE Binding
policy: NONE[NODE] Cpu set: NULL PPR: NULL
Num new daemons: 0 New daemon starting vpid
INVALID
Num nodes: 10
Data for node: SERVER-2 Launch id: -1 State: 2
Daemon: [[50535,0],0] Daemon launched: True
Num slots: 15 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 15 Max slots: 15
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],0]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 0
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-15 Binding: NULL[0]
Data for node: x.x.x.24 Launch id: -1 State: 0
Daemon: [[50535,0],1] Daemon launched: False
Num slots: 3 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 3 Max slots: 2
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],1]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 1
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.26 Launch id: -1 State: 0
Daemon: [[50535,0],2] Daemon launched: False
Num slots: 15 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 15 Max slots: 14
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],2]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 2
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.28 Launch id: -1 State: 0
Daemon: [[50535,0],3] Daemon launched: False
Num slots: 17 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 17 Max slots: 16
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],3]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 3
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.29 Launch id: -1 State: 0
Daemon: [[50535,0],4] Daemon launched: False
Num slots: 15 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 15 Max slots: 14
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],4]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 4
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.30 Launch id: -1 State: 0
Daemon: [[50535,0],5] Daemon launched: False
Num slots: 17 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 17 Max slots: 16
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],5]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 5
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.41 Launch id: -1 State: 0
Daemon: [[50535,0],6] Daemon launched: False
Num slots: 47 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 47 Max slots: 46
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],6]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 6
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.101 Launch id: -1 State: 0
Daemon: [[50535,0],7] Daemon launched: False
Num slots: 47 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 47 Max slots: 46
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],7]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 7
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.100 Launch id: -1 State: 0
Daemon: [[50535,0],8] Daemon launched: False
Num slots: 47 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 47 Max slots: 46
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],8]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 8
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
Data for node: x.x.x.102 Launch id: -1 State: 0
Daemon: [[50535,0],9] Daemon launched: False
Num slots: 23 Slots in use: 1
Oversubscribed: FALSE
Num slots allocated: 23 Max slots: 22
Username on node: NULL
Num procs: 1 Next node_rank: 1
Data for proc: [[50535,1],9]
Pid: 0 Local rank: 0 Node rank: 0 App
rank: 9
State: INITIALIZED Restarts: 0
App_context: 0 Locale: 0-7 Binding: NULL[0]
[sv-1:45712] procdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1/8
[sv-1:45712] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1
[sv-1:45712] top: openmpi-sessions-mpidemo@sv-1_0
[sv-1:45712] tmp: /tmp
[SERVER-14:08412] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1/6
[SERVER-14:08412] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1
[SERVER-14:08412] top:
openmpi-sessions-mpidemo@SERVER-14_0
[SERVER-14:08412] tmp: /tmp
[SERVER-2:05291] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1/0
[SERVER-2:05291] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1
[SERVER-2:05291] top: openmpi-sessions-mpidemo@SERVER-2_0
[SERVER-2:05291] tmp: /tmp
[SERVER-4:15726] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1/2
[SERVER-4:15726] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1
[SERVER-4:15726] top: openmpi-sessions-mpidemo@SERVER-4_0
[SERVER-4:15726] tmp: /tmp
[SERVER-6:09100] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1/4
[SERVER-6:09100] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1
[SERVER-6:09100] top: openmpi-sessions-mpidemo@SERVER-6_0
[SERVER-6:09100] tmp: /tmp
[SERVER-7:32576] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1/5
[SERVER-7:32576] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1
[SERVER-7:32576] top: openmpi-sessions-mpidemo@SERVER-7_0
[SERVER-7:32576] tmp: /tmp
[sv-3:08363] procdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1/9
[sv-3:08363] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1
[sv-3:08363] top: openmpi-sessions-mpidemo@sv-3_0
[sv-3:08363] tmp: /tmp
[sv-2:07514] procdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1/7
[sv-2:07514] jobdir:
/tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1
[sv-2:07514] top: openmpi-sessions-mpidemo@sv-2_0
[sv-2:07514] tmp: /tmp
[SERVER-5:12548] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1/3
[SERVER-5:12548] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1
[SERVER-5:12548] top: openmpi-sessions-mpidemo@SERVER-5_0
[SERVER-5:12548] tmp: /tmp
[SERVER-3:29009] procdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1/1
[SERVER-3:29009] jobdir:
/tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1
[SERVER-3:29009] top: openmpi-sessions-mpidemo@SERVER-3_0
[SERVER-3:29009] tmp: /tmp
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 10
MPIR_proctable:
(i, host, exe, pid) = (0, SERVER-2,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
(i, host, exe, pid) = (1, x.x.x.24,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
(i, host, exe, pid) = (2, x.x.x.26,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
(i, host, exe, pid) = (3, x.x.x.28,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
(i, host, exe, pid) = (4, x.x.x.29,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
(i, host, exe, pid) = (5, x.x.x.30,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
(i, host, exe, pid) = (6, x.x.x.41,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
(i, host, exe, pid) = (7, x.x.x.101,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
(i, host, exe, pid) = (8, x.x.x.100,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
(i, host, exe, pid) = (9, x.x.x.102,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your
parallel process is
likely to abort. There are many reasons that a
parallel process can
fail during MPI_INIT; some of which are due to
configuration or environment
problems. This failure appears to be an internal
failure; here's some
additional information (which may only be relevant to
an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[SERVER-2:5291] *** An error occurred in MPI_Init
[SERVER-2:5291] *** reported by process
[140508871983105,140505560121344]
[SERVER-2:5291] *** on a NULL communicator
[SERVER-2:5291] *** Unknown error
[SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in
this communicator will now abort,
[SERVER-2:5291] *** and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot
guarantee that all
of its peer processes in the job will be killed
properly. You should
double check that everything has shut down cleanly.
Reason: Before MPI_INIT completed
Local host: SERVER-2
PID: 5291
--------------------------------------------------------------------------
[sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for
peer [[50535,1],0]
[sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for
peer [[50535,1],0]
[sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach
each other for
MPI communications. This means that no Open MPI
device has indicated
that it can be used to communicate between these
processes. This is
an error; Open MPI requires that all MPI processes be
able to reach
each other. This error can sometimes be the result of
forgetting to
specify the "self" BTL.
Process 1 ([[50535,1],8]) is on host: sv-1
Process 2 ([[50535,1],0]) is on host: SERVER-2
BTLs attempted: openib self sm tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process
is unreachable
from another. This *usually* means that an underlying
communication
plugin -- such as a BTL or an MTL -- has either not
loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL
plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or
mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for
peer [[50535,1],0]
[sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[SERVER-2:05284] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-2:05284] sess_dir_finalize: proc session dir
not empty - leaving
[sv-4:11802] sess_dir_finalize: job session dir not
empty - leaving
[SERVER-14:08399] sess_dir_finalize: job session dir
not empty - leaving
[SERVER-6:09087] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-6:09087] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-4:15711] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-4:15711] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-6:09087] sess_dir_finalize: job session dir
not empty - leaving
exiting with status 0
[SERVER-7:32563] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-7:32563] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-5:12534] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-5:12534] sess_dir_finalize: proc session dir
not empty - leaving
[SERVER-7:32563] sess_dir_finalize: job session dir
not empty - leaving
exiting with status 0
exiting with status 0
exiting with status 0
[SERVER-4:15711] sess_dir_finalize: job session dir
not empty - leaving
[SERVER-3:28993] sess_dir_finalize: proc session dir
not empty - leaving
exiting with status 0
[SERVER-3:28993] sess_dir_finalize: proc session dir
not empty - leaving
[sv-3:08352] sess_dir_finalize: proc session dir not
empty - leaving
[sv-3:08352] sess_dir_finalize: job session dir not
empty - leaving
[sv-1:45701] sess_dir_finalize: proc session dir not
empty - leaving
[sv-1:45701] sess_dir_finalize: job session dir not
empty - leaving
exiting with status 0
exiting with status 0
[sv-2:07503] sess_dir_finalize: proc session dir not
empty - leaving
[sv-2:07503] sess_dir_finalize: job session dir not
empty - leaving
exiting with status 0
[SERVER-5:12534] sess_dir_finalize: job session dir
not empty - leaving
exiting with status 0
[SERVER-3:28993] sess_dir_finalize: job session dir
not empty - leaving
exiting with status 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 8412 on
node x.x.x.41 exiting improperly. There are three
reasons this could occur:
1. this process did not call "init" before exiting,
but others in
the job did. This can cause a job to hang indefinitely
while it waits
for all processes to call "init". By rule, if one
process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without
calling "finalize".
By rule, all processes that call "init" MUST call
"finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and
the mca parameter
orte_create_session_dirs is set to false. In this
case, the run-time cannot
detect that the abort call was an abnormal
termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the
application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the
mpirun command line.
--------------------------------------------------------------------------
[SERVER-2:05284] 6 more processes have sent help
message help-mpi-runtime /
mpi_init:startup:internal-failure
[SERVER-2:05284] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help /
error messages
[SERVER-2:05284] 9 more processes have sent help
message help-mpi-errors.txt / mpi_errors_are_fatal
unknown handle
[SERVER-2:05284] 9 more processes have sent help
message help-mpi-runtime.txt / ompi mpi abort:cannot
guarantee all killed
[SERVER-2:05284] 2 more processes have sent help
message help-mca-bml-r2.txt / unreachable proc
[SERVER-2:05284] 2 more processes have sent help
message help-mpi-runtime /
mpi_init:startup:pml-add-procs-fail
[SERVER-2:05284] sess_dir_finalize: job session dir
not empty - leaving
exiting with status 1
//******************************************************************
Any feedback will be helpful. Thank you!
Mr. Beans
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users