Ah, and do I have to take care of the MCA ras plugin on my own? I tried something like

  mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir /var/spool/torque/aux/ hellocluster

but that did not help either; it fails with

  [node3:22726] mca: base: components_open: component pml / csum open function failed

and it also does not look very convenient to me...
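(A side note on the command above, with a sketch of what was presumably intended: --mca btl selects the network transport components (e.g. tcp, sm, self), while ras and plm are separate frameworks, so "--mca btl ras,plm" leaves mpirun with no usable transport at all. Assuming an Open MPI 1.3/1.4-style build with tm support compiled in, forcing the Torque components explicitly would look more like the lines below; they are normally auto-selected, so this is only a sketch. A fuller job-script sketch follows at the end of this mail.)

  # run from inside the Torque job script, on the first allocated node
  export LD_LIBRARY_PATH=/shared/openmpi_gcc_ppc/lib:$LD_LIBRARY_PATH
  # force the Torque allocator (ras) and launcher (plm) explicitly, and
  # name real transports for the btl framework instead of ras/plm
  mpirun --mca ras tm --mca plm tm --mca btl tcp,sm,self ./hellocluster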
Greetings,
Johann

Johann Knechtel wrote:
> Hi Ralph and all,
>
> Yes, the OMPI libs and binaries are in the same place on all nodes; I
> packed OMPI via checkinstall and installed the .deb on the nodes via pdsh.
> LD_LIBRARY_PATH is set; I can run, for example, "mpirun --hostfile
> nodefile hellocluster" without problems. But when it is started via a
> Torque job, it does not work. Am I correct in assuming that
> LD_LIBRARY_PATH is exported by Torque to the daemonized mpirun processes?
> The Torque libs are all in the same place; I installed the package shell
> scripts via pdsh.
>
> Greetings,
> Johann
>
>
> Ralph Castain wrote:
>
>> Are the OMPI libraries and binaries installed at the same place on all the
>> remote nodes?
>>
>> Are you setting the LD_LIBRARY_PATH correctly?
>>
>> Are the Torque libs available in the same place on the remote nodes?
>> Remember, Torque runs mpirun on a backend node - not on the frontend.
>>
>> These are the most typical problems.
>>
>>
>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>
>>> Hi all,
>>>
>>> Your help with the following Torque integration issue will be much
>>> appreciated: whenever I try to start an Open MPI job on more than one
>>> node, it simply does not start up on the nodes.
>>> The Torque job fails with the following:
>>>
>>>> Fri Dec 18 22:11:07 CET 2009
>>>> OpenMPI with PPU-GCC was loaded
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>> launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> node2 - daemon did not report back when launched
>>>> Fri Dec 18 22:12:47 CET 2009
>>>
>>> I am quite confident about the compilation and installation of Torque
>>> and Open MPI, since the job runs without error on a single node:
>>>
>>>> Fri Dec 18 22:14:11 CET 2009
>>>> OpenMPI with PPU-GCC was loaded
>>>> Process 1 on node1 out of 2
>>>> Process 0 on node1 out of 2
>>>> Fri Dec 18 22:14:12 CET 2009
>>>
>>> The program in question is a simple hello world that runs without errors
>>> when started manually on the nodes; it also runs without errors when
>>> launched across more than one node using a hostfile.
>>> I already tried to compile Open MPI with the default prefix:
>>>
>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>
>>> The hello world itself is compiled both with and without -rpath, just
>>> to rule out any linked-library issue.
>>>
>>> Now, the interesting part is the following: on one node I compiled a
>>> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
>>> PBS, MPI and hello world daemons. As already mentioned at the
>>> beginning, from this I concluded that the MPI startup within Torque is
>>> not working for me.
>>> Please request any further logs you would like to review; I did not
>>> want to make this mail too large at first.
>>> Any ideas?
>>>
>>> Greetings,
>>> Johann
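PS: For completeness, a Torque job script of roughly the following shape is what the output above suggests. This is only a sketch: the install prefix is taken from the configure line quoted above, the date calls and the "OpenMPI with PPU-GCC" line are inferred from the job output, and the resource request and explicit exports are placeholders rather than the script actually in use.

  #!/bin/sh
  #PBS -l nodes=2:ppn=1
  #PBS -j oe
  # Make the Open MPI install visible to this script; with
  # --enable-mpirun-prefix-by-default, mpirun prepends the same prefix to
  # PATH and LD_LIBRARY_PATH for the daemons launched on the other nodes.
  export PATH=/shared/openmpi_gcc_ppc/bin:$PATH
  export LD_LIBRARY_PATH=/shared/openmpi_gcc_ppc/lib:$LD_LIBRARY_PATH
  date
  echo "OpenMPI with PPU-GCC was loaded"  # stands in for the real environment setup
  # With tm support compiled in, mpirun takes the node list from Torque,
  # so no --hostfile is needed inside the job.
  mpirun ./hellocluster
  date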