I'll let Tom suggest a solution for the PSM error, but you really need to remove those thread-related configure options. Open MPI isn't really thread safe at this point.
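For reference, a rebuild along the lines Tom suggested, but without the threading flags, would look something like this (just a sketch; adjust the prefix to wherever you want 1.7.2 installed):

  ./configure --with-psm --prefix=/opt/openmpi-1.7.2
  make
  make install

and then run over PSM as Tom described, e.g.:

  mpirun --mca mtl psm --mca btl ^openib -np 10 --hostfile mpi_hostfile ./test

In the meantime, the "present but cannot be compiled" warning for psm.h means the configure compile test failed; the exact compiler error is recorded in config.log, so something like "grep -B 5 -A 20 psm.h config.log" will usually show which prerequisite header or devel package is missing on that node.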
On Aug 4, 2013, at 6:26 PM, RoboBeans <robobe...@gmail.com> wrote: > Hi Tom, > > As per your suggestion, i tried > > ./configure --with-psm --prefix=/opt/openmpi-1.7.2 > --enable-event-thread-support --enable-opal-multi-threads > --enable-orte-progress-threads --enable-mpi-thread-multiple > > but I am getting this error: > > --- MCA component mtl:psm (m4 configuration macro) > checking for MCA component mtl:psm compile mode... dso > checking --with-psm value... simple ok (unspecified) > checking --with-psm-libdir value... simple ok (unspecified) > checking psm.h usability... no > checking psm.h presence... yes > configure: WARNING: psm.h: present but cannot be compiled > configure: WARNING: psm.h: check for missing prerequisite headers? > configure: WARNING: psm.h: see the Autoconf documentation > configure: WARNING: psm.h: section "Present But Cannot Be Compiled" > configure: WARNING: psm.h: proceeding with the compiler's result > configure: WARNING: ## > ------------------------------------------------------ ## > configure: WARNING: ## Report this to > http://www.open-mpi.org/community/help/ ## > configure: WARNING: ## > ------------------------------------------------------ ## > checking for psm.h... no > configure: error: PSM support requested but not found. Aborting > > > Any feedback will be helpful. Thanks for your time! > > Mr. Beans > > > > On 8/4/13 10:31 AM, Elken, Tom wrote: >> On 8/3/13 7:09 PM, RoboBeans wrote: >> On first 7 nodes: >> >> [mpidemo@SERVER-3 ~]$ ofed_info | head -n 1 >> OFED-1.5.3.2: >> >> On last 4 nodes: >> >> [mpidemo@sv-2 ~]$ ofed_info | head -n 1 >> -bash: ofed_info: command not found >> [Tom] >> This is a pretty good clue that OFED is not installed on the last 4 nodes. >> You should fix that by installing OFED 1.5.3.2 on the last 4 nodes, OR >> better (but more work) install a newer OFED such as 1.5.4.1 or 3.5 on ALL >> the nodes (You need to look at the OFED release notes to see if your OS is >> supported by these OFEDs). >> >> BTW, since you are using QLogic HCAs, they typically work with the best >> performance when using the PSM API to the HCA. PSM is part of OFED. To use >> this by default with Open MPI, you can build Open MPI as follows: >> ./configure --with-psm --prefix=<install directory> >> make >> make install >> >> With an Open MPI that is already built, you can try to use PSM as follows: >> mpirun … --mca mtl psm --mca btl ^openib … >> >> -Tom >> >> [mpidemo@sv-2 ~]$ which ofed_info >> /usr/bin/which: no ofed_info in >> (/usr/OPENMPI/openmpi-1.7.2/bin:/usr/OPENMPI/openmpi-1.7.2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin/:/usr/lib/:/usr/lib:/usr:/usr/:/bin/:/usr/lib/:/usr/lib:/usr:/usr/) >> >> >> Are there some specific locations where I should look for ofed_info? How can >> I make sure that ofed was installed on a node or not? >> >> Thanks again!!! >> >> >> On 8/3/13 5:52 PM, Ralph Castain wrote: >> Are the ofed versions the same across all the machines? I would suspect that >> might be the problem. >> >> >> On Aug 3, 2013, at 4:06 PM, RoboBeans <robobe...@gmail.com> wrote: >> >> >> Hi Ralph, I tried using 1.5.4, 1.6.5 and 1.7.2 (compiled from source code) >> with no configuration arguments but I am facing the same issue. When I run a >> job using 1.5.4 (installed using yum), I get warnings but it doesn't affect >> my output. >> >> Example of warning that I get: >> >> sv-2.7960ipath_userinit: Mismatched user minor version (12) and driver minor >> version (11) while context sharing. 
Ensure that driver and library are from >> the same release. >> >> Each system has a QLogic card ("QLE7342-CK dual port IB card"), with the >> same OS but different kernel revision no. (e.g. 2.6.32-358.2.1.el6.x86_64, >> 2.6.32-358.el6.x86_64). >> >> Thank you for your time. >> >> On 8/3/13 2:05 PM, Ralph Castain wrote: >> Hmmm...strange indeed. I would remove those four configure options and give >> it a try. That will eliminate all the obvious things, I would think, though >> they aren't generally involved in the issue shown here. Still, worth taking >> out potential trouble sources. >> >> What is the connectivity between SERVER-2 and node 100? Should I assume that >> the first seven nodes are connected via one type of interconnect, and the >> other four are connected to those seven by another type? >> >> >> On Aug 3, 2013, at 1:30 PM, RoboBeans <robobe...@gmail.com> wrote: >> >> >> Thanks for looking into in Ralph. I modified the hosts file but I am still >> getting the same error. Any other pointers you can think of? The difference >> between this 1.7.2 installation and 1.5.4 is that I installed 1.5.4 using >> yum and for 1.7.2, I used the source code and configured with >> --enable-event-thread-support --enable-opal-multi-threads >> --enable-orte-progress-threads --enable-mpi-thread-multiple >> . Am I missing something here? >> >> //****************************************************************** >> >> $ cat mpi_hostfile >> >> x.x.x.22 slots=15 max-slots=15 >> x.x.x.24 slots=2 max-slots=2 >> x.x.x.26 slots=14 max-slots=14 >> x.x.x.28 slots=16 max-slots=16 >> x.x.x.29 slots=14 max-slots=14 >> x.x.x.30 slots=16 max-slots=16 >> x.x.x.41 slots=46 max-slots=46 >> x.x.x.101 slots=46 max-slots=46 >> x.x.x.100 slots=46 max-slots=46 >> x.x.x.102 slots=22 max-slots=22 >> x.x.x.103 slots=22 max-slots=22 >> >> //****************************************************************** >> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test >> >> [SERVER-2:08907] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0/0 >> [SERVER-2:08907] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/0 >> [SERVER-2:08907] top: openmpi-sessions-mpidemo@SERVER-2_0 >> [SERVER-2:08907] tmp: /tmp >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> [SERVER-3:32517] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0/1 >> [SERVER-3:32517] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/0 >> [SERVER-3:32517] top: openmpi-sessions-mpidemo@SERVER-3_0 >> [SERVER-3:32517] tmp: /tmp >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> [SERVER-6:11595] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0/4 >> [SERVER-6:11595] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/0 >> [SERVER-6:11595] top: openmpi-sessions-mpidemo@SERVER-6_0 >> [SERVER-6:11595] tmp: /tmp >> [SERVER-4:27445] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0/2 >> [SERVER-4:27445] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/0 >> [SERVER-4:27445] top: openmpi-sessions-mpidemo@SERVER-4_0 >> [SERVER-4:27445] tmp: /tmp >> [SERVER-7:02607] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0/5 >> [SERVER-7:02607] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/0 >> [SERVER-7:02607] top: openmpi-sessions-mpidemo@SERVER-7_0 >> [SERVER-7:02607] tmp: /tmp >> [sv-1:46100] procdir: 
/tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0/8 >> [sv-1:46100] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/0 >> [sv-1:46100] top: openmpi-sessions-mpidemo@sv-1_0 >> [sv-1:46100] tmp: /tmp >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> [SERVER-5:16404] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0/3 >> [SERVER-5:16404] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/0 >> [SERVER-5:16404] top: openmpi-sessions-mpidemo@SERVER-5_0 >> [SERVER-5:16404] tmp: /tmp >> [sv-3:08575] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0/9 >> [sv-3:08575] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/0 >> [sv-3:08575] top: openmpi-sessions-mpidemo@sv-3_0 >> [sv-3:08575] tmp: /tmp >> [SERVER-14:10755] procdir: >> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0/6 >> [SERVER-14:10755] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/0 >> [SERVER-14:10755] top: openmpi-sessions-mpidemo@SERVER-14_0 >> [SERVER-14:10755] tmp: /tmp >> [sv-4:12040] procdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0/10 >> [sv-4:12040] jobdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/62216/0 >> [sv-4:12040] top: openmpi-sessions-mpidemo@sv-4_0 >> [sv-4:12040] tmp: /tmp >> [sv-2:07725] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0/7 >> [sv-2:07725] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/0 >> [sv-2:07725] top: openmpi-sessions-mpidemo@sv-2_0 >> [sv-2:07725] tmp: /tmp >> >> Mapper requested: NULL Last mapper: round_robin Mapping policy: BYNODE >> Ranking policy: NODE Binding policy: NONE[NODE] Cpu set: NULL PPR: NULL >> Num new daemons: 0 New daemon starting vpid INVALID >> Num nodes: 10 >> >> Data for node: SERVER-2 Launch id: -1 State: 2 >> Daemon: [[62216,0],0] Daemon launched: True >> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 15 Max slots: 15 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],0] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 >> Binding: NULL[0] >> >> Data for node: x.x.x.24 Launch id: -1 State: 0 >> Daemon: [[62216,0],1] Daemon launched: False >> Num slots: 2 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 2 Max slots: 2 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],1] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.26 Launch id: -1 State: 0 >> Daemon: [[62216,0],2] Daemon launched: False >> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 14 Max slots: 14 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],2] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.28 Launch id: -1 State: 0 >> Daemon: [[62216,0],3] Daemon launched: False >> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 16 Max slots: 16 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],3] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.29 Launch id: -1 State: 0 >> Daemon: [[62216,0],4] Daemon launched: False >> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 14 Max slots: 14 >> Username on node: NULL >> Num procs: 1 Next 
node_rank: 1 >> Data for proc: [[62216,1],4] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.30 Launch id: -1 State: 0 >> Daemon: [[62216,0],5] Daemon launched: False >> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 16 Max slots: 16 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],5] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.41 Launch id: -1 State: 0 >> Daemon: [[62216,0],6] Daemon launched: False >> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 46 Max slots: 46 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],6] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.101 Launch id: -1 State: 0 >> Daemon: [[62216,0],7] Daemon launched: False >> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 46 Max slots: 46 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],7] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.100 Launch id: -1 State: 0 >> Daemon: [[62216,0],8] Daemon launched: False >> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 46 Max slots: 46 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],8] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.102 Launch id: -1 State: 0 >> Daemon: [[62216,0],9] Daemon launched: False >> Num slots: 22 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 22 Max slots: 22 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[62216,1],9] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> [sv-1:46111] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1/8 >> [sv-1:46111] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/62216/1 >> [sv-1:46111] top: openmpi-sessions-mpidemo@sv-1_0 >> [sv-1:46111] tmp: /tmp >> [SERVER-14:10768] procdir: >> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1/6 >> [SERVER-14:10768] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/62216/1 >> [SERVER-14:10768] top: openmpi-sessions-mpidemo@SERVER-14_0 >> [SERVER-14:10768] tmp: /tmp >> [SERVER-2:08912] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1/0 >> [SERVER-2:08912] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/62216/1 >> [SERVER-2:08912] top: openmpi-sessions-mpidemo@SERVER-2_0 >> [SERVER-2:08912] tmp: /tmp >> [SERVER-4:27460] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1/2 >> [SERVER-4:27460] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/62216/1 >> [SERVER-4:27460] top: openmpi-sessions-mpidemo@SERVER-4_0 >> [SERVER-4:27460] tmp: /tmp >> [SERVER-6:11608] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1/4 >> [SERVER-6:11608] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/62216/1 >> [SERVER-6:11608] top: openmpi-sessions-mpidemo@SERVER-6_0 >> [SERVER-6:11608] tmp: /tmp >> [SERVER-7:02620] procdir: 
/tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1/5 >> [SERVER-7:02620] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/62216/1 >> [SERVER-7:02620] top: openmpi-sessions-mpidemo@SERVER-7_0 >> [SERVER-7:02620] tmp: /tmp >> [sv-3:08586] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1/9 >> [sv-3:08586] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/62216/1 >> [sv-3:08586] top: openmpi-sessions-mpidemo@sv-3_0 >> [sv-3:08586] tmp: /tmp >> [sv-2:07736] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1/7 >> [sv-2:07736] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/62216/1 >> [sv-2:07736] top: openmpi-sessions-mpidemo@sv-2_0 >> [sv-2:07736] tmp: /tmp >> [SERVER-5:16418] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1/3 >> [SERVER-5:16418] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/62216/1 >> [SERVER-5:16418] top: openmpi-sessions-mpidemo@SERVER-5_0 >> [SERVER-5:16418] tmp: /tmp >> [SERVER-3:32533] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1/1 >> [SERVER-3:32533] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/62216/1 >> [SERVER-3:32533] top: openmpi-sessions-mpidemo@SERVER-3_0 >> [SERVER-3:32533] tmp: /tmp >> MPIR_being_debugged = 0 >> MPIR_debug_state = 1 >> MPIR_partial_attach_ok = 1 >> MPIR_i_am_starter = 0 >> MPIR_forward_output = 0 >> MPIR_proctable_size = 10 >> MPIR_proctable: >> (i, host, exe, pid) = (0, SERVER-2, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912) >> (i, host, exe, pid) = (1, x.x.x.24, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533) >> (i, host, exe, pid) = (2, x.x.x.26, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460) >> (i, host, exe, pid) = (3, x.x.x.28, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418) >> (i, host, exe, pid) = (4, x.x.x.29, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608) >> (i, host, exe, pid) = (5, x.x.x.30, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620) >> (i, host, exe, pid) = (6, x.x.x.41, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768) >> (i, host, exe, pid) = (7, x.x.x.101, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736) >> (i, host, exe, pid) = (8, x.x.x.100, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111) >> (i, host, exe, pid) = (9, x.x.x.102, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586) >> MPIR_executable_path: NULL >> MPIR_server_arguments: NULL >> -------------------------------------------------------------------------- >> It looks like MPI_INIT failed for some reason; your parallel process is >> likely to abort. There are many reasons that a parallel process can >> fail during MPI_INIT; some of which are due to configuration or environment >> problems. 
This failure appears to be an internal failure; here's some >> additional information (which may only be relevant to an Open MPI >> developer): >> >> PML add procs failed >> --> Returned "Error" (-1) instead of "Success" (0) >> -------------------------------------------------------------------------- >> [SERVER-2:8912] *** An error occurred in MPI_Init >> [SERVER-2:8912] *** reported by process [140393673392129,140389596004352] >> [SERVER-2:8912] *** on a NULL communicator >> [SERVER-2:8912] *** Unknown error >> [SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in this communicator >> will now abort, >> [SERVER-2:8912] *** and potentially your MPI job) >> -------------------------------------------------------------------------- >> An MPI process is aborting at a time when it cannot guarantee that all >> of its peer processes in the job will be killed properly. You should >> double check that everything has shut down cleanly. >> >> Reason: Before MPI_INIT completed >> Local host: SERVER-2 >> PID: 8912 >> -------------------------------------------------------------------------- >> [sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create] >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0] >> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> -------------------------------------------------------------------------- >> At least one pair of MPI processes are unable to reach each other for >> MPI communications. This means that no Open MPI device has indicated >> that it can be used to communicate between these processes. This is >> an error; Open MPI requires that all MPI processes be able to reach >> each other. This error can sometimes be the result of forgetting to >> specify the "self" BTL. >> >> Process 1 ([[62216,1],8]) is on host: sv-1 >> Process 2 ([[62216,1],0]) is on host: SERVER-2 >> BTLs attempted: openib self sm tcp >> >> Your MPI job is now going to abort; sorry. >> -------------------------------------------------------------------------- >> [sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create] >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0] >> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> -------------------------------------------------------------------------- >> MPI_INIT has failed because at least one MPI process is unreachable >> from another. This *usually* means that an underlying communication >> plugin -- such as a BTL or an MTL -- has either not loaded or not >> allowed itself to be used. Your MPI job will now abort. >> >> You may wish to try to narrow down the problem; >> >> * Check the output of ompi_info to see which BTL/MTL plugins are >> available. >> * Run your application with MPI_THREAD_SINGLE. >> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, >> if using MTL-based communications) to see exactly which >> communication plugins were considered and/or discarded. 
>> -------------------------------------------------------------------------- >> [sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create] >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0] >> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving >> [sv-4:12040] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-14:10755] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving >> exiting with status 0 >> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-6:11595] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-7:02607] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving >> exiting with status 0 >> exiting with status 0 >> exiting with status 0 >> [SERVER-4:27445] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving >> [sv-3:08575] sess_dir_finalize: proc session dir not empty - leaving >> [sv-3:08575] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [sv-1:46100] sess_dir_finalize: proc session dir not empty - leaving >> [sv-1:46100] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [sv-2:07725] sess_dir_finalize: proc session dir not empty - leaving >> [sv-2:07725] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [SERVER-5:16404] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [SERVER-3:32517] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> -------------------------------------------------------------------------- >> mpirun has exited due to process rank 6 with PID 10768 on >> node x.x.x.41 exiting improperly. There are three reasons this could occur: >> >> 1. this process did not call "init" before exiting, but others in >> the job did. This can cause a job to hang indefinitely while it waits >> for all processes to call "init". By rule, if one process calls "init", >> then ALL processes must call "init" prior to termination. >> >> 2. this process called "init", but exited without calling "finalize". >> By rule, all processes that call "init" MUST call "finalize" prior to >> exiting or it will be considered an "abnormal termination" >> >> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter >> orte_create_session_dirs is set to false. In this case, the run-time cannot >> detect that the abort call was an abnormal termination. 
Hence, the only >> error message you will receive is this one. >> >> This may have caused other processes in the application to be >> terminated by signals sent by mpirun (as reported here). >> >> You can avoid this message by specifying -quiet on the mpirun command line. >> >> -------------------------------------------------------------------------- >> [SERVER-2:08907] 6 more processes have sent help message help-mpi-runtime / >> mpi_init:startup:internal-failure >> [SERVER-2:08907] Set MCA parameter "orte_base_help_aggregate" to 0 to see >> all help / error messages >> [SERVER-2:08907] 9 more processes have sent help message help-mpi-errors.txt >> / mpi_errors_are_fatal unknown handle >> [SERVER-2:08907] 9 more processes have sent help message >> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed >> [SERVER-2:08907] 2 more processes have sent help message help-mca-bml-r2.txt >> / unreachable proc >> [SERVER-2:08907] 2 more processes have sent help message help-mpi-runtime / >> mpi_init:startup:pml-add-procs-fail >> [SERVER-2:08907] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 1 >> >> //****************************************************************** >> >> On 8/3/13 4:34 AM, Ralph Castain wrote: >> It looks like SERVER-2 cannot talk to your x.x.x.100 machine. I note that >> you have some entries at the end of the hostfile that I don't understand - a >> list of hosts that can be reached? And I see that your x.x.x.22 machine >> isn't on it. Is that SERVER-2 by chance? >> >> Our hostfile parsing changed between the release series, but I know we never >> consciously supported the syntax you show below where you list capabilities, >> and then re-list the hosts in an apparent attempt to filter which ones can >> actually be used. It is possible that the 1.5 series somehow used that to >> exclude the 22 machine, and that the 1.7 parser now doesn't do that. >> >> If you only include machines you actually intend to use in your hostfile, >> does the 1.7 series work? >> >> On Aug 3, 2013, at 3:58 AM, RoboBeans <robobe...@gmail.com> wrote: >> >> >> Hello everyone, >> >> I have installed openmpi 1.5.4 on 11 node cluster using "yum install openmpi >> openmpi-devel" and everything seems to be working fine. For testing I am >> using this test program >> >> //****************************************************************** >> >> $ cat test.cpp >> >> #include <stdio.h> >> #include <mpi.h> >> >> int main (int argc, char *argv[]) >> { >> int id, np; >> char name[MPI_MAX_PROCESSOR_NAME]; >> int namelen; >> int i; >> >> MPI_Init (&argc, &argv); >> >> MPI_Comm_size (MPI_COMM_WORLD, &np); >> MPI_Comm_rank (MPI_COMM_WORLD, &id); >> MPI_Get_processor_name (name, &namelen); >> >> printf ("This is Process %2d out of %2d running on host %s\n", id, np, >> name); >> >> MPI_Finalize (); >> >> return (0); >> } >> >> //****************************************************************** >> >> and my hosts file look like this: >> >> $ cat mpi_hostfile >> >> # The Hostfile for Open MPI >> >> # specify number of slots for processes to run locally. 
>> #localhost slots=12 >> #x.x.x.16 slots=12 max-slots=12 >> #x.x.x.17 slots=12 max-slots=12 >> #x.x.x.18 slots=12 max-slots=12 >> #x.x.1x.19 slots=12 max-slots=12 >> #x.x.x.20 slots=12 max-slots=12 >> #x.x.x.55 slots=46 max-slots=46 >> #x.x.x.56 slots=46 max-slots=46 >> >> x.x.x.22 slots=15 max-slots=15 >> x.x.x.24 slots=2 max-slots=2 >> x.x.x.26 slots=14 max-slots=14 >> x.x.x.28 slots=16 max-slots=16 >> x.x.x.29 slots=14 max-slots=14 >> x.x.x.30 slots=16 max-slots=16 >> x.x.x.41 slots=46 max-slots=46 >> x.x.x.101 slots=46 max-slots=46 >> x.x.x.100 slots=46 max-slots=46 >> x.x.x.102 slots=22 max-slots=22 >> x.x.x.103 slots=22 max-slots=22 >> >> # The following slave nodes are available to this machine: >> x.x.x.24 >> x.x.x.26 >> x.x.x.28 >> x.x.x.29 >> x.x.x.30 >> x.x.x.41 >> x.x.x.101 >> x.x.x.100 >> x.x.x.102 >> x.x.x.103 >> >> //****************************************************************** >> >> this is how my .bashrc looks like on each node: >> >> $ cat ~/.bashrc >> >> # .bashrc >> >> # Source global definitions >> if [ -f /etc/bashrc ]; then >> . /etc/bashrc >> fi >> >> # User specific aliases and functions >> umask 077 >> >> export PSM_SHAREDCONTEXTS_MAX=20 >> >> #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH} >> export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH} >> >> #export >> LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} >> export >> LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} >> >> export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/ >> >> //****************************************************************** >> >> $ mpic++ test.cpp -o test >> >> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test >> >> //****************************************************************** >> >> These nodes are running 2.6.32-358.2.1.el6.x86_64 release >> >> $ uname >> Linux >> $ uname -r >> 2.6.32-358.2.1.el6.x86_64 >> $ cat /etc/issue >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> >> //****************************************************************** >> >> Now, if I install openmpi 1.7.2 on each node separately then I can only use >> it on either first 7 nodes or last 4 nodes but not on all of them. 
>> >> //****************************************************************** >> >> $ gunzip -c openmpi-1.7.2.tar.gz | tar xf - >> >> $ cd openmpi-1.7.2 >> >> $ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2 >> --enable-event-thread-support --enable-opal-multi-threads >> --enable-orte-progress-threads --enable-mpi-thread-multiple >> >> $ make all install >> >> //****************************************************************** >> >> This is the error message that i am receiving: >> >> >> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test >> >> [SERVER-2:05284] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0/0 >> [SERVER-2:05284] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/0 >> [SERVER-2:05284] top: openmpi-sessions-mpidemo@SERVER-2_0 >> [SERVER-2:05284] tmp: /tmp >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> [SERVER-3:28993] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0/1 >> [SERVER-3:28993] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/0 >> [SERVER-3:28993] top: openmpi-sessions-mpidemo@SERVER-3_0 >> [SERVER-3:28993] tmp: /tmp >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> [SERVER-6:09087] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0/4 >> [SERVER-6:09087] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/0 >> [SERVER-6:09087] top: openmpi-sessions-mpidemo@SERVER-6_0 >> [SERVER-6:09087] tmp: /tmp >> [SERVER-7:32563] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0/5 >> [SERVER-7:32563] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/0 >> [SERVER-7:32563] top: openmpi-sessions-mpidemo@SERVER-7_0 >> [SERVER-7:32563] tmp: /tmp >> [SERVER-4:15711] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0/2 >> [SERVER-4:15711] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/0 >> [SERVER-4:15711] top: openmpi-sessions-mpidemo@SERVER-4_0 >> [SERVER-4:15711] tmp: /tmp >> [sv-1:45701] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0/8 >> [sv-1:45701] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/0 >> [sv-1:45701] top: openmpi-sessions-mpidemo@sv-1_0 >> [sv-1:45701] tmp: /tmp >> CentOS release 6.4 (Final) >> Kernel \r on an \m >> [sv-3:08352] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0/9 >> [sv-3:08352] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/0 >> [sv-3:08352] top: openmpi-sessions-mpidemo@sv-3_0 >> [sv-3:08352] tmp: /tmp >> [SERVER-5:12534] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0/3 >> [SERVER-5:12534] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/0 >> [SERVER-5:12534] top: openmpi-sessions-mpidemo@SERVER-5_0 >> [SERVER-5:12534] tmp: /tmp >> [SERVER-14:08399] procdir: >> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0/6 >> [SERVER-14:08399] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/0 >> [SERVER-14:08399] top: openmpi-sessions-mpidemo@SERVER-14_0 >> [SERVER-14:08399] tmp: /tmp >> [sv-4:11802] procdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0/10 >> [sv-4:11802] jobdir: /tmp/openmpi-sessions-mpidemo@sv-4_0/50535/0 >> [sv-4:11802] top: openmpi-sessions-mpidemo@sv-4_0 >> [sv-4:11802] tmp: /tmp >> [sv-2:07503] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0/7 >> [sv-2:07503] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/0 >> [sv-2:07503] top: openmpi-sessions-mpidemo@sv-2_0 >> [sv-2:07503] tmp: /tmp >> >> Mapper requested: NULL Last mapper: 
round_robin Mapping policy: BYNODE >> Ranking policy: NODE Binding policy: NONE[NODE] Cpu set: NULL PPR: NULL >> Num new daemons: 0 New daemon starting vpid INVALID >> Num nodes: 10 >> >> Data for node: SERVER-2 Launch id: -1 State: 2 >> Daemon: [[50535,0],0] Daemon launched: True >> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 15 Max slots: 15 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],0] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 >> Binding: NULL[0] >> >> Data for node: x.x.x.24 Launch id: -1 State: 0 >> Daemon: [[50535,0],1] Daemon launched: False >> Num slots: 3 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 3 Max slots: 2 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],1] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.26 Launch id: -1 State: 0 >> Daemon: [[50535,0],2] Daemon launched: False >> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 15 Max slots: 14 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],2] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.28 Launch id: -1 State: 0 >> Daemon: [[50535,0],3] Daemon launched: False >> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 17 Max slots: 16 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],3] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.29 Launch id: -1 State: 0 >> Daemon: [[50535,0],4] Daemon launched: False >> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 15 Max slots: 14 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],4] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.30 Launch id: -1 State: 0 >> Daemon: [[50535,0],5] Daemon launched: False >> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 17 Max slots: 16 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],5] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.41 Launch id: -1 State: 0 >> Daemon: [[50535,0],6] Daemon launched: False >> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 47 Max slots: 46 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],6] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.101 Launch id: -1 State: 0 >> Daemon: [[50535,0],7] Daemon launched: False >> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 47 Max slots: 46 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],7] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: 
x.x.x.100 Launch id: -1 State: 0 >> Daemon: [[50535,0],8] Daemon launched: False >> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 47 Max slots: 46 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],8] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> >> Data for node: x.x.x.102 Launch id: -1 State: 0 >> Daemon: [[50535,0],9] Daemon launched: False >> Num slots: 23 Slots in use: 1 Oversubscribed: FALSE >> Num slots allocated: 23 Max slots: 22 >> Username on node: NULL >> Num procs: 1 Next node_rank: 1 >> Data for proc: [[50535,1],9] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9 >> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-7 >> Binding: NULL[0] >> [sv-1:45712] procdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1/8 >> [sv-1:45712] jobdir: /tmp/openmpi-sessions-mpidemo@sv-1_0/50535/1 >> [sv-1:45712] top: openmpi-sessions-mpidemo@sv-1_0 >> [sv-1:45712] tmp: /tmp >> [SERVER-14:08412] procdir: >> /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1/6 >> [SERVER-14:08412] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-14_0/50535/1 >> [SERVER-14:08412] top: openmpi-sessions-mpidemo@SERVER-14_0 >> [SERVER-14:08412] tmp: /tmp >> [SERVER-2:05291] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1/0 >> [SERVER-2:05291] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-2_0/50535/1 >> [SERVER-2:05291] top: openmpi-sessions-mpidemo@SERVER-2_0 >> [SERVER-2:05291] tmp: /tmp >> [SERVER-4:15726] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1/2 >> [SERVER-4:15726] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-4_0/50535/1 >> [SERVER-4:15726] top: openmpi-sessions-mpidemo@SERVER-4_0 >> [SERVER-4:15726] tmp: /tmp >> [SERVER-6:09100] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1/4 >> [SERVER-6:09100] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-6_0/50535/1 >> [SERVER-6:09100] top: openmpi-sessions-mpidemo@SERVER-6_0 >> [SERVER-6:09100] tmp: /tmp >> [SERVER-7:32576] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1/5 >> [SERVER-7:32576] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-7_0/50535/1 >> [SERVER-7:32576] top: openmpi-sessions-mpidemo@SERVER-7_0 >> [SERVER-7:32576] tmp: /tmp >> [sv-3:08363] procdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1/9 >> [sv-3:08363] jobdir: /tmp/openmpi-sessions-mpidemo@sv-3_0/50535/1 >> [sv-3:08363] top: openmpi-sessions-mpidemo@sv-3_0 >> [sv-3:08363] tmp: /tmp >> [sv-2:07514] procdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1/7 >> [sv-2:07514] jobdir: /tmp/openmpi-sessions-mpidemo@sv-2_0/50535/1 >> [sv-2:07514] top: openmpi-sessions-mpidemo@sv-2_0 >> [sv-2:07514] tmp: /tmp >> [SERVER-5:12548] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1/3 >> [SERVER-5:12548] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-5_0/50535/1 >> [SERVER-5:12548] top: openmpi-sessions-mpidemo@SERVER-5_0 >> [SERVER-5:12548] tmp: /tmp >> [SERVER-3:29009] procdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1/1 >> [SERVER-3:29009] jobdir: /tmp/openmpi-sessions-mpidemo@SERVER-3_0/50535/1 >> [SERVER-3:29009] top: openmpi-sessions-mpidemo@SERVER-3_0 >> [SERVER-3:29009] tmp: /tmp >> MPIR_being_debugged = 0 >> MPIR_debug_state = 1 >> MPIR_partial_attach_ok = 1 >> MPIR_i_am_starter = 0 >> MPIR_forward_output = 0 >> MPIR_proctable_size = 10 >> MPIR_proctable: >> (i, host, exe, pid) = (0, SERVER-2, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291) >> (i, host, exe, pid) = (1, x.x.x.24, >> 
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009) >> (i, host, exe, pid) = (2, x.x.x.26, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726) >> (i, host, exe, pid) = (3, x.x.x.28, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548) >> (i, host, exe, pid) = (4, x.x.x.29, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100) >> (i, host, exe, pid) = (5, x.x.x.30, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576) >> (i, host, exe, pid) = (6, x.x.x.41, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412) >> (i, host, exe, pid) = (7, x.x.x.101, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514) >> (i, host, exe, pid) = (8, x.x.x.100, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712) >> (i, host, exe, pid) = (9, x.x.x.102, >> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363) >> MPIR_executable_path: NULL >> MPIR_server_arguments: NULL >> -------------------------------------------------------------------------- >> It looks like MPI_INIT failed for some reason; your parallel process is >> likely to abort. There are many reasons that a parallel process can >> fail during MPI_INIT; some of which are due to configuration or environment >> problems. This failure appears to be an internal failure; here's some >> additional information (which may only be relevant to an Open MPI >> developer): >> >> PML add procs failed >> --> Returned "Error" (-1) instead of "Success" (0) >> -------------------------------------------------------------------------- >> [SERVER-2:5291] *** An error occurred in MPI_Init >> [SERVER-2:5291] *** reported by process [140508871983105,140505560121344] >> [SERVER-2:5291] *** on a NULL communicator >> [SERVER-2:5291] *** Unknown error >> [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in this communicator >> will now abort, >> [SERVER-2:5291] *** and potentially your MPI job) >> -------------------------------------------------------------------------- >> An MPI process is aborting at a time when it cannot guarantee that all >> of its peer processes in the job will be killed properly. You should >> double check that everything has shut down cleanly. >> >> Reason: Before MPI_INIT completed >> Local host: SERVER-2 >> PID: 5291 >> -------------------------------------------------------------------------- >> [sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create] >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0] >> [sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create] >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0] >> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> -------------------------------------------------------------------------- >> At least one pair of MPI processes are unable to reach each other for >> MPI communications. This means that no Open MPI device has indicated >> that it can be used to communicate between these processes. This is >> an error; Open MPI requires that all MPI processes be able to reach >> each other. This error can sometimes be the result of forgetting to >> specify the "self" BTL. 
>> >> Process 1 ([[50535,1],8]) is on host: sv-1 >> Process 2 ([[50535,1],0]) is on host: SERVER-2 >> BTLs attempted: openib self sm tcp >> >> Your MPI job is now going to abort; sorry. >> -------------------------------------------------------------------------- >> -------------------------------------------------------------------------- >> MPI_INIT has failed because at least one MPI process is unreachable >> from another. This *usually* means that an underlying communication >> plugin -- such as a BTL or an MTL -- has either not loaded or not >> allowed itself to be used. Your MPI job will now abort. >> >> You may wish to try to narrow down the problem; >> >> * Check the output of ompi_info to see which BTL/MTL plugins are >> available. >> * Run your application with MPI_THREAD_SINGLE. >> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, >> if using MTL-based communications) to see exactly which >> communication plugins were considered and/or discarded. >> -------------------------------------------------------------------------- >> [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create] >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0] >> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create] >> mca_base_modex_recv: failed with return value=-13 >> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving >> [sv-4:11802] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-14:08399] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-6:09087] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving >> [SERVER-7:32563] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> exiting with status 0 >> exiting with status 0 >> [SERVER-4:15711] sess_dir_finalize: job session dir not empty - leaving >> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving >> exiting with status 0 >> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving >> [sv-3:08352] sess_dir_finalize: proc session dir not empty - leaving >> [sv-3:08352] sess_dir_finalize: job session dir not empty - leaving >> [sv-1:45701] sess_dir_finalize: proc session dir not empty - leaving >> [sv-1:45701] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> exiting with status 0 >> [sv-2:07503] sess_dir_finalize: proc session dir not empty - leaving >> [sv-2:07503] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [SERVER-5:12534] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 0 >> [SERVER-3:28993] sess_dir_finalize: job session dir not empty - leaving >> exiting 
with status 0 >> -------------------------------------------------------------------------- >> mpirun has exited due to process rank 6 with PID 8412 on >> node x.x.x.41 exiting improperly. There are three reasons this could occur: >> >> 1. this process did not call "init" before exiting, but others in >> the job did. This can cause a job to hang indefinitely while it waits >> for all processes to call "init". By rule, if one process calls "init", >> then ALL processes must call "init" prior to termination. >> >> 2. this process called "init", but exited without calling "finalize". >> By rule, all processes that call "init" MUST call "finalize" prior to >> exiting or it will be considered an "abnormal termination" >> >> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter >> orte_create_session_dirs is set to false. In this case, the run-time cannot >> detect that the abort call was an abnormal termination. Hence, the only >> error message you will receive is this one. >> >> This may have caused other processes in the application to be >> terminated by signals sent by mpirun (as reported here). >> >> You can avoid this message by specifying -quiet on the mpirun command line. >> >> -------------------------------------------------------------------------- >> [SERVER-2:05284] 6 more processes have sent help message help-mpi-runtime / >> mpi_init:startup:internal-failure >> [SERVER-2:05284] Set MCA parameter "orte_base_help_aggregate" to 0 to see >> all help / error messages >> [SERVER-2:05284] 9 more processes have sent help message help-mpi-errors.txt >> / mpi_errors_are_fatal unknown handle >> [SERVER-2:05284] 9 more processes have sent help message >> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed >> [SERVER-2:05284] 2 more processes have sent help message help-mca-bml-r2.txt >> / unreachable proc >> [SERVER-2:05284] 2 more processes have sent help message help-mpi-runtime / >> mpi_init:startup:pml-add-procs-fail >> [SERVER-2:05284] sess_dir_finalize: job session dir not empty - leaving >> exiting with status 1 >> >> //****************************************************************** >> >> Any feedback will be helpful. Thank you! >> >> Mr. Beans >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users