Sorry - hit "send" and then saw the version sitting right there in the subject! 
Doh...

First, let's try verifying what components are actually getting used. Run this:

mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted

Then get an allocation and run:

mpirun -pernode which orted

and

mpirun -pernode -mca plm rsh which orted

and see what happens.
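
If it is easier, the same checks can go into a small diagnostic job script, so the output comes from a backend node where Torque actually runs mpirun. This is only a sketch; the resource request is a placeholder:

#!/bin/bash
#PBS -l nodes=2:ppn=1
#PBS -j oe

# where do mpirun/orted resolve on the backend node, and what library path is set?
hostname
which mpirun
which orted
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# was Torque (tm) support built into this install at all?
ompi_info | grep " tm "

# component selection with verbosity, then one orted per allocated node
mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted
mpirun -pernode which orted
mpirun -pernode -mca plm rsh which orted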


On Dec 19, 2009, at 5:17 PM, Ralph Castain wrote:

> That error has nothing to do with Torque. The cmd line is simply wrong - you 
> are specifying a btl that doesn't exist.
> 
> It should work just fine with
> 
> mpirun -n X hellocluster
> 
> Nothing else is required. When you run
> 
> mpirun --hostfile nodefile hellocluster
> 
> OMPI will still use Torque to do the launch - it just gets the list of nodes 
> from your nodefile instead of the PBS_NODEFILE.
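> 
> For reference, a bare-bones job script along those lines should be all it takes. This is only a sketch: the resource request is a placeholder, and the install path is taken from the --prefix in your configure line below:
> 
> #!/bin/bash
> #PBS -l nodes=2:ppn=2
> #PBS -j oe
> 
> # make the Open MPI install visible on the backend node that runs mpirun
> export PATH=/shared/openmpi_gcc_ppc/bin:$PATH
> export LD_LIBRARY_PATH=/shared/openmpi_gcc_ppc/lib:$LD_LIBRARY_PATH
> 
> cd $PBS_O_WORKDIR
> # no hostfile and no extra MCA options needed; the tm components pick up the allocation from Torque
> mpirun -n 4 hellocluster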
> 
> You may have stated it below, but I can't find it: what version of OMPI are 
> you using? Are there additional versions installed on your system?
> 
> 
> On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:
> 
>> Ah, and do I have to take care of the MCA ras plugin on my own?
>> I tried something like
>>> mpirun --mca ras tm --mca btl ras,plm  --mca ras_tm_nodefile_dir
>>> /var/spool/torque/aux/ hellocluster
>> but aside from the fact that it did not work ([node3:22726] mca: base:
>> components_open: component pml / csum open function failed), it also does
>> not look very convenient to me...
>> 
>> Greetings
>> Johann
>> 
>> 
>> Johann Knechtel wrote:
>>> Hi Ralph and all,
>>> 
>>> Yes, the OMPI libs and binaries are in the same place on the nodes; I
>>> packed OMPI via checkinstall and installed the deb on the nodes via pdsh.
>>> The LD_LIBRARY_PATH is set; I can run, for example, "mpirun --hostfile
>>> nodefile hellocluster" without problems. But when it is started via a Torque
>>> job, it does not work. Am I correct in assuming that LD_LIBRARY_PATH
>>> will be exported by Torque to the daemonized mpirun processes?
>>> The Torque libs are all in the same place; I installed the packages via
>>> shell scripts and pdsh.
>>> 
>>> Greetings,
>>> Johann
>>> 
>>> 
>>> Ralph Castain wrote:
>>> 
>>>> Are the OMPI libraries and binaries installed at the same place on all the 
>>>> remote nodes?
>>>> 
>>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>> 
>>>> Are the Torque libs available in the same place on the remote nodes? 
>>>> Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>> 
>>>> These are the most typical problems. 
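>>>> 
>>>> A quick way to check the remote side is something like the following (just a sketch; the node names and the /usr/local Torque location from your --with-tm are examples only):
>>>> 
>>>> for n in node2 node3; do
>>>>   ssh $n 'hostname; which orted; echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH; ls /usr/local/lib/libtorque*'
>>>> done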
>>>> 
>>>> 
>>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>> 
>>>> 
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Your help with the following Torque integration issue will be much
>>>>> appreciated: whenever I try to start an Open MPI job on more than one
>>>>> node, it simply does not start up on the nodes.
>>>>> The Torque job fails with the following:
>>>>> 
>>>>> 
>>>>> 
>>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>>>>> launch so we are aborting.
>>>>>> 
>>>>>> There may be more information reported by the environment (see above).
>>>>>> 
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have 
>>>>>> the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>> --------------------------------------------------------------------------
>>>>>>      node2 - daemon did not report back when launched
>>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>> 
>>>>>> 
>>>>> I am quite confident about the compilation and installation of Torque
>>>>> and Open MPI, since it runs without error on one node:
>>>>> 
>>>>> 
>>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>> Process 1 on node1 out of 2
>>>>>> Process 0 on node1 out of 2
>>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>> 
>>>>>> 
>>>>> The program being called is a simple hello world, which runs without
>>>>> errors when started manually on the nodes; it also runs without errors
>>>>> when a hostfile is used to launch it on more than one node. I already
>>>>> tried to compile Open MPI with the default prefix:
>>>>> 
>>>>> 
>>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>> 
>>>>>> 
>>>>> I also compiled the hello world both with and without -rpath, just to
>>>>> be sure there is no linked-library issue.
>>>>> 
>>>>> Now, the interesting part is the following: on one node I compiled a
>>>>> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
>>>>> PBS, MPI, and hello world processes. As already mentioned at the
>>>>> beginning, from this I concluded that the MPI startup within Torque is
>>>>> not working for me.
>>>>> Please ask for any further logs or other details you want to review; I
>>>>> did not want to make this mail too large at first.
>>>>> Any ideas?
>>>>> 
>>>>> Greetings,
>>>>> Johann
>>>>> 
>>>>> 