Are the OMPI libraries and binaries installed in the same place on all the remote nodes? Are you setting LD_LIBRARY_PATH correctly? Are the Torque libraries available in the same place on the remote nodes? Remember, Torque runs mpirun on a backend node - not on the frontend. These are the most typical problems.
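As a sanity check, something along the lines of the job script below usually avoids these issues. This is only a sketch, not a drop-in fix: the install prefix is guessed from the /shared/openmpi_gcc_ppc in your configure line, the binary name ./helloworld, the resource request, and the passwordless-ssh check are all assumptions about your setup.

```sh
#!/bin/sh
#PBS -N ompi_hello
#PBS -l nodes=2:ppn=1

# Point the job at the Open MPI install on the shared filesystem.  With
# --enable-mpirun-prefix-by-default, mpirun also passes this prefix on to
# the orted daemons it starts on the other nodes.
export PATH=/shared/openmpi_gcc_ppc/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi_gcc_ppc/lib:$LD_LIBRARY_PATH

cd "$PBS_O_WORKDIR"

# Sanity check (assumes passwordless ssh between the nodes): is the same
# install visible from every node in the allocation?
for node in $(sort -u "$PBS_NODEFILE"); do
    ssh "$node" ls /shared/openmpi_gcc_ppc/bin/orted
done

# No hostfile needed: an Open MPI built --with-tm reads the node list from
# Torque and launches its daemons through the TM interface.
mpirun ./helloworld
```

If the ssh check fails on node2 but not on the node where the script runs, that points straight at the library/path problem described in the error message.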
On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:

> Hi all,
>
> Your help with the following Torque integration issue would be much
> appreciated: whenever I try to start an Open MPI job on more than one
> node, it simply does not start up on the nodes.
> The Torque job fails with the following:
>
>> Fri Dec 18 22:11:07 CET 2009
>> OpenMPI with PPU-GCC was loaded
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> node2 - daemon did not report back when launched
>> Fri Dec 18 22:12:47 CET 2009
>
> I am quite confident about the compilation and installation of Torque
> and Open MPI, since the job runs without error on a single node:
>
>> Fri Dec 18 22:14:11 CET 2009
>> OpenMPI with PPU-GCC was loaded
>> Process 1 on node1 out of 2
>> Process 0 on node1 out of 2
>> Fri Dec 18 22:14:12 CET 2009
>
> The program being called is a simple hello world that runs without
> errors when started manually on the nodes; it also runs without errors
> when the daemons are launched on more than one node via a hostfile.
> I have already tried compiling Open MPI with the default prefix enabled:
>
>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>
> The hello world program has also been compiled both with and without
> -rpath, just to rule out any linked-library issue.
>
> Now for the interesting part: on one node I built a kernel with
> CONFIG_BSD_PROCESS_ACCT_V3 so that I could monitor the startup of the
> PBS, MPI, and hello world daemons. As already mentioned at the
> beginning, this is why I assume that the MPI startup within Torque is
> not working for me.
> Please let me know which further logs or other details you would like
> to review; I did not want to make this first mail too large.
> Any ideas?
>
> Greetings,
> Johann
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users