Sorry - hit "send" and then saw the version sitting right there in the subject! Doh...
First, let's try verifying what components are actually getting used. Run this:

  mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted

Then get an allocation and run

  mpirun -pernode which orted

and

  mpirun -pernode -mca plm rsh which orted

and see what happens.
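If it helps, all three checks can be wrapped in a single Torque job script along the lines of the sketch below (untested; the first command could also be run interactively outside an allocation, and the resource request and install prefix are only placeholders, the prefix being taken from the configure line quoted further down):

  #!/bin/bash
  #PBS -N ompi-component-check
  #PBS -l nodes=2:ppn=1
  #PBS -l walltime=00:05:00
  #PBS -j oe

  # Placeholder paths: point these at the OMPI install the job should use.
  export PATH=/shared/openmpi_gcc_ppc/bin:$PATH
  export LD_LIBRARY_PATH=/shared/openmpi_gcc_ppc/lib:$LD_LIBRARY_PATH

  cd $PBS_O_WORKDIR

  # 1) Which ras/plm components get opened and selected for a one-process run?
  mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted

  # 2) One process per allocated node, using whatever launcher is picked by default.
  mpirun -pernode which orted

  # 3) Same thing, but forcing the rsh launcher for comparison.
  mpirun -pernode -mca plm rsh which orted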
On Dec 19, 2009, at 5:17 PM, Ralph Castain wrote:

> That error has nothing to do with Torque. The cmd line is simply wrong - you
> are specifying a btl that doesn't exist.
>
> It should work just fine with
>
>   mpirun -n X hellocluster
>
> Nothing else is required. When you run
>
>   mpirun --hostfile nodefile hellocluster
>
> OMPI will still use Torque to do the launch - it just gets the list of nodes
> from your nodefile instead of the PBS_NODEFILE.
>
> You may have stated it below, but I can't find it: what version of OMPI are
> you using? Are there additional versions installed on your system?
>
>
> On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:
>
>> Ah, and do I have to take care of the MCA ras plugin on my own?
>> I tried something like
>>
>>> mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir
>>> /var/spool/torque/aux/ hellocluster
>>
>> but apart from the fact that it has not worked ([node3:22726] mca: base:
>> components_open: component pml / csum open function failed), it also does
>> not look very convenient to me...
>>
>> Greetings
>> Johann
>>
>>
>> Johann Knechtel schrieb:
>>
>>> Hi Ralph and all,
>>>
>>> Yes, the OMPI libs and binaries are in the same place on the nodes; I
>>> packed OMPI via checkinstall and installed the deb via pdsh on the nodes.
>>> The LD_LIBRARY_PATH is set; I can run, for example, "mpirun --hostfile
>>> nodefile hellocluster" without problems. But when started via a Torque job
>>> it does not work. Am I correct in assuming that the LD_LIBRARY_PATH will
>>> be exported by Torque to the daemonized mpirun processes?
>>> The Torque libs are all in the same place; I installed the package shell
>>> scripts via pdsh.
>>>
>>> Greetings,
>>> Johann
>>>
>>>
>>> Ralph Castain schrieb:
>>>
>>>> Are the OMPI libraries and binaries installed at the same place on all the
>>>> remote nodes?
>>>>
>>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>>
>>>> Are the Torque libs available in the same place on the remote nodes?
>>>> Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>>
>>>> These are the most typical problems.
>>>>
>>>>
>>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Your help with the following Torque integration issue will be much
>>>>> appreciated: whenever I try to start an Open MPI job on more than one
>>>>> node, it simply does not start up on the nodes.
>>>>> The Torque job fails with the following:
>>>>>
>>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>> --------------------------------------------------------------------------
>>>>>> node2 - daemon did not report back when launched
>>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>
>>>>> I am quite confident about the compilation and installation of Torque
>>>>> and Open MPI, since it runs without error on one node:
>>>>>
>>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>> Process 1 on node1 out of 2
>>>>>> Process 0 on node1 out of 2
>>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>
>>>>> The called program is a simple helloworld which runs without errors when
>>>>> started manually on the nodes; it also runs without errors when using a
>>>>> hostfile to launch on more than one node. I already tried to compile
>>>>> Open MPI with the default prefix:
>>>>>
>>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>
>>>>> I also compiled the helloworld with and without -rpath, just to be sure
>>>>> regarding any linked-library issue.
>>>>>
>>>>> Now, the interesting fact is the following: I compiled on one node a
>>>>> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
>>>>> pbs, mpi and helloworld daemons. As already mentioned at the beginning,
>>>>> this is why I assume that the MPI startup within Torque is not working
>>>>> for me.
>>>>> Please request any further logs you want to review; I did not want to
>>>>> make the mail too large at first.
>>>>> Any ideas?
>>>>>
>>>>> Greetings,
>>>>> Johann
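The Torque job script itself is not quoted anywhere in the thread; a minimal sketch for the hello-world case, assuming the install prefix from the configure line above and a two-node allocation (program path and resource request are placeholders), might look like this:

  #!/bin/bash
  #PBS -N hellocluster
  #PBS -l nodes=2:ppn=1
  #PBS -j oe

  # Assumption: OMPI lives under the prefix from the configure line above.
  # Per the error text quoted above, an LD_LIBRARY_PATH set here should be
  # forwarded to the remote nodes automatically.
  export PATH=/shared/openmpi_gcc_ppc/bin:$PATH
  export LD_LIBRARY_PATH=/shared/openmpi_gcc_ppc/lib:$LD_LIBRARY_PATH

  cd $PBS_O_WORKDIR
  date
  # Under Torque, mpirun takes the node list from $PBS_NODEFILE itself;
  # no --hostfile and no extra MCA parameters should be needed.
  mpirun -n 2 ./hellocluster
  date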