Hi all,

Your help with the following Torque integration issue would be much
appreciated: whenever I try to start an Open MPI job on more than one
node, it simply does not start up on the nodes.
The Torque job fails with the following:

> Fri Dec 18 22:11:07 CET 2009
>  OpenMPI with PPU-GCC was loaded
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
>         node2 - daemon did not report back when launched
> Fri Dec 18 22:12:47 CET 2009
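
For reference, the job script is essentially the following (the module
name and the binary path are placeholders for my local setup):

  #!/bin/bash
  #PBS -l nodes=2:ppn=1

  date
  # my environment setup prints the "OpenMPI with PPU-GCC was loaded" line
  module load openmpi_gcc_ppc

  mpirun -np 2 /shared/helloworld
  date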

I am quite confident about the compilation and installation of Torque
and Open MPI, since the same job runs without error on a single node:
> Fri Dec 18 22:14:11 CET 2009
>  OpenMPI with PPU-GCC was loaded
> Process 1 on node1 out of 2
> Process 0 on node1 out of 2
> Fri Dec 18 22:14:12 CET 2009

The program being called is a simple hello world that runs without
errors when started manually on the nodes; it likewise runs without
errors when the daemons are launched on more than one node via a
hostfile (the manual launch is sketched after the configure line
below). I have also already tried compiling Open MPI with the default
prefix; my usual configure line is:
>   $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
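
For completeness, the Torque (TM) support can be verified in the build
like this, and the manual hostfile launch that does work looks roughly
as follows (hostnames and paths are just examples):

  # confirm the TM components were compiled into Open MPI
  $ ompi_info | grep -i tm

  # manual launch across both nodes works fine
  $ cat hostfile
  node1
  node2
  $ mpirun -np 2 --hostfile hostfile /shared/helloworld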

I also compiled the hello world both with and without -rpath, just to
be sure it is not a linked-library issue.
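
The library resolution on the remote node can be double-checked like
this (paths are examples from my setup):

  # the hello world binary should resolve all shared libraries on node2
  $ ssh node2 ldd /shared/helloworld
  # orted itself should also be found and resolvable there
  $ ssh node2 ldd /shared/openmpi_gcc_ppc/bin/orted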

Now for the interesting part: on one node I built a kernel with
CONFIG_BSD_PROCESS_ACCT_V3 so I could monitor the startup of the PBS,
MPI, and helloworld processes. Based on that, and on the error shown at
the beginning, I assume it is the MPI startup within Torque that is not
working for me.
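
The accounting check itself looks roughly like this (the pacct file
location varies by distribution):

  # enable process accounting (kernel built with CONFIG_BSD_PROCESS_ACCT_V3)
  $ touch /var/log/account/pacct
  $ accton /var/log/account/pacct
  # after a failed job, see which commands actually ran on the node
  $ lastcomm | grep -E 'orted|pbs_mom|helloworld'
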
Please ask for any further logs or other details you want to review; I
did not want to make this mail too large at first.
Any ideas?

Greetings,
Johann


