Yes, the same version of Open MPI (1.6.5) is running on all of the machines;
I verified this with 'mpirun --version' and also checked the library paths with 'ldd'.
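
(Roughly the sort of check involved; the ssh loop is just an illustration,
with foo.hosts being the machinefile from the failed run quoted below:)

    # Sketch: compare the Open MPI build and the libraries the test binary
    # resolves on every host listed in the machinefile.
    for h in $(cat foo.hosts); do
        echo "== $h =="
        ssh "$h" 'which mpirun && mpirun --version'
        ssh "$h" 'ldd /homes/kevin/alltoall.mpi-1.6.5 | grep -i mpi'
    done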

Non-MPI programs work fine.
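
(Per the suggestion in the help output below, comparing 'ompi_info' on a node
of each card type is another quick sanity check; a sketch:)

    # Sketch: list the BTL/MTL components each build provides, to compare a
    # Mellanox node against a QLogic node side by side.
    ompi_info | grep -i "MCA btl"
    ompi_info | grep -i "MCA mtl"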

Kevin

On Fri, 11 Oct 2013, at 20:06:22 +0000, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

1. Can you verify that you're running the same version/build of Open MPI on all 
three machines (mpirun machine, g18-6, and g17-33)?

2. Can you mpirun non-MPI programs, like hostname?
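
(For example, something like this, reusing the machinefile from the failed run
below; the process count is arbitrary:)

    # Non-MPI launch test: no BTL is involved, only the runtime plumbing.
    mpirun -np 2 -machinefile foo.hosts hostname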


On Oct 10, 2013, at 8:41 AM, Kevin M. Hildebrand <ke...@umd.edu> wrote:

> Hi, I'm trying to run an OpenMPI 1.6.5 job across a set of nodes, some with 
> Mellanox cards and some with Qlogic cards.  I'm getting errors indicating "At 
> least one pair of MPI processes are unable to reach each other for MPI 
> communications".  As far as I can tell all of the nodes are properly 
> configured and able to reach each other, via IP and non-IP connections.
> I've also discovered that even if I turn off the IB transport via "--mca btl 
> tcp,self" I'm still getting the same issue.
> The test works fine if I run it confined to hosts with identical IB cards.
> I'd appreciate some assistance in figuring out what I'm doing wrong.
>  
> Thanks,
> Kevin
>  
> Here's a log of a failed run:
> 
> > mpirun -d --debug-daemons --mca btl tcp,self --mca orte_base_help_aggregate 
> > 0 --mca btl_base_verbose 100 -np 2 -machinefile foo.hosts 
> > /homes/kevin/alltoall.mpi-1.6.5
> [compute-g18-5.deepthought.umd.edu:20574] procdir: 
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0/0
> [compute-g18-5.deepthought.umd.edu:20574] jobdir: 
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0
> [compute-g18-5.deepthought.umd.edu:20574] top: 
> openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
> [compute-g18-5.deepthought.umd.edu:20574] tmp: /tmp
> [compute-g18-5.deepthought.umd.edu:20574] mpirun: reset PATH: 
> /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/gcc/4.8.1/sys/bin:/cell_root/software/moab/bin:/cell_root/software/gold/bin:/usr/local/ofed/1.5.4/sbin:/usr/local/ofed/1.5.4/bin:/homes/kevin/bin:/homes/kevin/bin/amd64:/dept/oit/glue/scripts:/usr/local/scripts:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/afsws/bin:/usr/afsws/etc
> [compute-g18-5.deepthought.umd.edu:20574] mpirun: reset LD_LIBRARY_PATH: 
> /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/lib:/usr/local/ofed/1.5.4/lib64
> Daemon was launched on compute-g17-33.deepthought.umd.edu - beginning to 
> initialize
> [compute-g17-33.deepthought.umd.edu:20174] procdir: 
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0/1
> [compute-g17-33.deepthought.umd.edu:20174] jobdir: 
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0
> [compute-g17-33.deepthought.umd.edu:20174] top: 
> openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
> [compute-g17-33.deepthought.umd.edu:20174] tmp: /tmp
> Daemon [[63142,0],1] checking in as pid 20174 on host 
> compute-g17-33.deepthought.umd.edu
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: up and 
> running - waiting for commands!
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
> add_local_procs
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[0].name 
> compute-g18-5 daemon 0
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[1].name 
> compute-g17-33 daemon 1
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
> add_local_procs
>   MPIR_being_debugged = 0
>   MPIR_debug_state = 1
>   MPIR_partial_attach_ok = 1
>   MPIR_i_am_starter = 0
>   MPIR_forward_output = 0
>   MPIR_proctable_size = 2
>   MPIR_proctable:
>     (i, host, exe, pid) = (0, compute-g18-5.deepthought.umd.edu, 
> /homes/kevin/alltoall.mpi-1.6.5, 20576)
>     (i, host, exe, pid) = (1, compute-g17-33, 
> /homes/kevin/alltoall.mpi-1.6.5, 20175)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [compute-g18-5.deepthought.umd.edu:20576] procdir: 
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1/0
> [compute-g18-5.deepthought.umd.edu:20576] jobdir: 
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1
> [compute-g18-5.deepthought.umd.edu:20576] top: 
> openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
> [compute-g18-5.deepthought.umd.edu:20576] tmp: /tmp
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_recv: received 
> sync+nidmap from local proc [[63142,1],0]
> [compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[0].name 
> compute-g18-5 daemon 0
> [compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[1].name 
> compute-g17-33 daemon 1
> [compute-g17-33.deepthought.umd.edu:20175] procdir: 
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1/1
> [compute-g17-33.deepthought.umd.edu:20175] jobdir: 
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1
> [compute-g17-33.deepthought.umd.edu:20175] top: 
> openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
> [compute-g17-33.deepthought.umd.edu:20175] tmp: /tmp
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_recv: received 
> sync+nidmap from local proc [[63142,1],1]
> [compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[0].name 
> compute-g18-5 daemon 0
> [compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[1].name 
> compute-g17-33 daemon 1
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: Looking 
> for btl components
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: opening 
> btl components
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found 
> loaded component self
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
> self has no register function
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
> self open function successful
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found 
> loaded component tcp
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
> tcp register function successful
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
> tcp open function successful
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: Looking 
> for btl components
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: opening 
> btl components
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found 
> loaded component self
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
> self has no register function
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
> self open function successful
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found 
> loaded component tcp
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
> tcp register function successful
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
> tcp open function successful
> [compute-g17-33.deepthought.umd.:20175] select: initializing btl component 
> self
> [compute-g17-33.deepthought.umd.:20175] select: init of component self 
> returned success
> [compute-g17-33.deepthought.umd.:20175] select: initializing btl component tcp
> [compute-g17-33.deepthought.umd.:20175] btl: tcp: Searching for exclude 
> address+prefix: 127.0.0.1 / 8
> [compute-g17-33.deepthought.umd.:20175] btl: tcp: Found match: 127.0.0.1 (lo)
> [compute-g17-33.deepthought.umd.:20175] select: init of component tcp 
> returned success
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: component self 
> closed
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component 
> self
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: component tcp closed
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component 
> tcp
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
> message_local_procs
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
> message_local_procs
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>  
>   PML add procs failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>  
>   Reason:     Before MPI_INIT completed
>   Local host: compute-g18-5.deepthought.umd.edu
>   PID:        20576
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>  
>   Process 1 ([[63142,1],1]) is on host: compute-g17-33.deepthought.umd.edu
>   Process 2 ([[63142,1],0]) is on host: compute-g18-5
>   BTLs attempted: self tcp
>  
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
>  
> You may wish to try to narrow down the problem;
>  
> * Check the output of ompi_info to see which BTL/MTL plugins are
>    available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>    if using MTL-based communications) to see exactly which
>    communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>  
>   Reason:     Before MPI_INIT completed
>   Local host: compute-g17-33.deepthought.umd.edu
>   PID:        20175
> --------------------------------------------------------------------------
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
> waitpid_fired cmd
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
> iof_complete cmd
> [compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: proc session 
> dir not empty - leaving
> [compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir 
> not empty - leaving
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
> iof_complete cmd
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 20175 on
> node compute-g17-33 exiting improperly. There are two reasons this could 
> occur:
>  
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>  
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>  
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
> exit cmd
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
> exit cmd
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: finalizing
> [compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: job session dir 
> not empty - leaving
> [compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: job session dir 
> not empty - leaving
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data 
> for [63142,0]
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data 
> for [63142,1]
> [compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir 
> not empty - leaving
> orterun: exiting with status 1
>  


-- 
Jeff Squyres
jsquy...@cisco.com


