Yes, the same version of Open MPI (1.6.5) is running on all of the machines, verified via 'mpirun --version'; I also checked the library paths via 'ldd'.
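(For reference, checks of roughly this form cover both of the questions quoted below; the binary and machinefile names are the ones from the quoted log, the rest is generic illustration:

  mpirun --version                               # on the mpirun host and on each compute node
  which mpirun                                   # should point at the same 1.6.5 install everywhere
  ldd /homes/kevin/alltoall.mpi-1.6.5            # libmpi should resolve under the 1.6.5 lib directory
  mpirun -np 2 -machinefile foo.hosts hostname   # non-MPI sanity check across the same pair of hosts
)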
Non-MPI programs work fine.

Kevin


List-Post: users@lists.open-mpi.org
Date: Fri, 11 Oct 2013 20:06:22 +0000
From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>

1. Can you verify that you're running the same version/build of Open MPI on all three machines (mpirun machine, g18-6, and g17-33)?

2. Can you mpirun non-MPI programs, like hostname?


On Oct 10, 2013, at 8:41 AM, Kevin M. Hildebrand <ke...@umd.edu> wrote:

> Hi, I'm trying to run an OpenMPI 1.6.5 job across a set of nodes, some with
> Mellanox cards and some with Qlogic cards. I'm getting errors indicating "At
> least one pair of MPI processes are unable to reach each other for MPI
> communications". As far as I can tell all of the nodes are properly
> configured and able to reach each other, via IP and non-IP connections.
> I've also discovered that even if I turn off the IB transport via "--mca btl
> tcp,self" I'm still getting the same issue.
> The test works fine if I run it confined to hosts with identical IB cards.
> I'd appreciate some assistance in figuring out what I'm doing wrong.
>
> Thanks,
> Kevin
>
> Here's a log of a failed run:
>
> > mpirun -d --debug-daemons --mca btl tcp,self --mca orte_base_help_aggregate
> > 0 --mca btl_base_verbose 100 -np 2 -machinefile foo.hosts
> > /homes/kevin/alltoall.mpi-1.6.5
> [compute-g18-5.deepthought.umd.edu:20574] procdir:
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0/0
> [compute-g18-5.deepthought.umd.edu:20574] jobdir:
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0
> [compute-g18-5.deepthought.umd.edu:20574] top:
> openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
> [compute-g18-5.deepthought.umd.edu:20574] tmp: /tmp
> [compute-g18-5.deepthought.umd.edu:20574] mpirun: reset PATH:
> /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_r
> > ftware/gcc/4.8.1/sys/bin:/cell_root/software/moab/bin:/cell_root/software/gold/bin:/usr/local/ofed/1.5.4/sbin:/usr/local/ofed/1.5.4/bin:/homes/kevin/bin:/homes/kevin/bin/amd64:/dept/oit/glue/
> > scripts:/usr/local/scripts:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/afsws/bin:/usr/afsws/etc
> [compute-g18-5.deepthought.umd.edu:20574] mpirun: reset LD_LIBRARY_PATH:
> /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/lib:/usr/local/ofed/1.5.4/lib64
> Daemon was launched on compute-g17-33.deepthought.umd.edu - beginning to
> initialize
> [compute-g17-33.deepthought.umd.edu:20174] procdir:
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0/1
> [compute-g17-33.deepthought.umd.edu:20174] jobdir:
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0
> [compute-g17-33.deepthought.umd.edu:20174] top:
> openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
> [compute-g17-33.deepthought.umd.edu:20174] tmp: /tmp
> Daemon [[63142,0],1] checking in as pid 20174 on host
> compute-g17-33.deepthought.umd.edu
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: up and
> running - waiting for commands!
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received
> add_local_procs
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[0].name
> compute-g18-5 daemon 0
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[1].name
> compute-g17-33 daemon 1
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received
> add_local_procs
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, compute-g18-5.deepthought.umd.edu,
> /homes/kevin/alltoall.mpi-1.6.5, 20576)
> (i, host, exe, pid) = (1, compute-g17-33,
> /homes/kevin/alltoall.mpi-1.6.5, 20175)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [compute-g18-5.deepthought.umd.edu:20576] procdir:
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1/0
> [compute-g18-5.deepthought.umd.edu:20576] jobdir:
> /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1
> [compute-g18-5.deepthought.umd.edu:20576] top:
> openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
> [compute-g18-5.deepthought.umd.edu:20576] tmp: /tmp
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_recv: received
> sync+nidmap from local proc [[63142,1],0]
> [compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[0].name
> compute-g18-5 daemon 0
> [compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[1].name
> compute-g17-33 daemon 1
> [compute-g17-33.deepthought.umd.edu:20175] procdir:
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1/1
> [compute-g17-33.deepthought.umd.edu:20175] jobdir:
> /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1
> [compute-g17-33.deepthought.umd.edu:20175] top:
> openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
> [compute-g17-33.deepthought.umd.edu:20175] tmp: /tmp
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_recv: received
> sync+nidmap from local proc [[63142,1],1]
> [compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[0].name
> compute-g18-5 daemon 0
> [compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[1].name
> compute-g17-33 daemon 1
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: Looking
> for btl components
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: opening
> btl components
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found
> loaded component self
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component
> self has no register function
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component
> self open function successful
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found
> loaded component tcp
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component
> tcp register function successful
> [compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component
> tcp open function successful
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: Looking
> for btl components
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: opening
> btl components
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found
> loaded component self
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component
> self has no register function
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component
> self open function successful
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found
> loaded component tcp
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component
> tcp register function successful
> [compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component
> tcp open function successful
> [compute-g17-33.deepthought.umd.:20175] select: initializing btl component
> self
> [compute-g17-33.deepthought.umd.:20175] select: init of component self
> returned success
> [compute-g17-33.deepthought.umd.:20175] select: initializing btl component tcp
> [compute-g17-33.deepthought.umd.:20175] btl: tcp: Searching for exclude
> address+prefix: 127.0.0.1 / 8
> [compute-g17-33.deepthought.umd.:20175] btl: tcp: Found match: 127.0.0.1 (lo)
> [compute-g17-33.deepthought.umd.:20175] select: init of component tcp
> returned success
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: component self
> closed
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component
> self
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: component tcp closed
> [compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component
> tcp
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received
> message_local_procs
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received
> message_local_procs
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: compute-g18-5.deepthought.umd.edu
> PID: 20576
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[63142,1],1]) is on host: compute-g17-33.deepthought.umd.edu
> Process 2 ([[63142,1],0]) is on host: compute-g18-5
> BTLs attempted: self tcp
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
> * Check the output of ompi_info to see which BTL/MTL plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: compute-g17-33.deepthought.umd.edu
> PID: 20175
> --------------------------------------------------------------------------
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received
> waitpid_fired cmd
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received
> iof_complete cmd
> [compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: proc session
> dir not empty - leaving
> [compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir
> not empty - leaving
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received
> iof_complete cmd
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 20175 on
> node compute-g17-33 exiting improperly. There are two reasons this could
> occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received
> exit cmd
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received
> exit cmd
> [compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: finalizing
> [compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: job session dir
> not empty - leaving
> [compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: job session dir
> not empty - leaving
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data
> for [63142,0]
> [compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data
> for [63142,1]
> [compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir
> not empty - leaving
> orterun: exiting with status 1

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/