Hi Jeff,
On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
> On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault
> <maxime.boissonnea...@calculquebec.ca> wrote:
>> Correct.
>> Could it be because Torque (pbs_mom) is not running on the head node
>> and mpiexec attempts to contact it?
> Not for Open MPI's mpiexec, no.
> Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM
> stuff (i.e., Torque stuff) if it sees the environment variable markers
> indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or
> localhost launch in your case, since you didn't specify any hosts).
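> For instance, a quick way to look for those markers -- assuming the usual
> Torque variable names -- is:
>
>     env | grep -E '^PBS_(ENVIRONMENT|JOBID|NODEFILE)'
>
> If that prints nothing (as it should on your head node, outside any job),
> the TM launcher shouldn't get selected.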
> If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI
> "hostname" command from Linux), then something is seriously borked with
> your Open MPI installation.
mpirun -np 4 hostname works fine:
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0
> Try running with:
>     mpirun -np 4 --mca plm_base_verbose 10 hostname
> This should show the steps OMPI is trying to take to launch the 4 copies of
> "hostname" and potentially give some insight into where it's hanging.
> Also, just to make sure: you have ensured that you're compiling everything
> with a single compiler toolchain, and the support libraries from that
> specific compiler toolchain are available on every server on which you're
> running (including the head node and compute nodes), right?
Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6
with the same results). Almost all software (compilers, toolchain,
etc.) is installed on Lustre, built from source, and is identical on
both the login (head) node and the compute nodes.
The few differences between the head node and the compute nodes:
1) Compute nodes run from RAMFS - the login node is installed on disk
2) Compute nodes and the login node have different hardware configurations
(computes have GPUs, the head node does not).
3) The login node has MORE CentOS 6 packages than the computes (such as the
-devel packages, some fonts/X11 libraries, etc.), but all the packages
that are on the computes are also on the login node.
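For what it's worth, the toolchain itself can be sanity-checked the same way
on both node types; the output of these should match on the login node and on
a compute node, since everything comes from the same Lustre install:
    which gcc && gcc --version
    gcc -print-file-name=libstdc++.so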
> And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right
> places -- i.e., to the Open MPI installation that you expect them to point
> to. E.g., if you "ldd ring_c", it shows the libmpi.so that you expect. And
> "which mpiexec" shows the mpirun that you expect. Etc.
As per the content of "env.out" in the archive, yes. They point to the
OMPI 1.8.2rc4 installation directories on Lustre and are shared
between the head node and the compute nodes.
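Concretely, this is the sort of check it corresponds to:
    which mpiexec            # resolves under the 1.8.2rc4 prefix on Lustre
    ldd ring_c | grep libmpi # shows libmpi.so from that same prefix's lib dir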
Maxime