The "mca plm rsh" param tells OMPI to use the rsh launcher instead of the Torque launcher. The only real difference between them is that the rsh launcher "pre-sets" the prefix into the remote environment prior to executing the orted - and the Torque launcher doesn't.
So it sounds like you aren't getting the remote path set up properly by Torque when it starts the remote daemon. This might have something to do with your Torque config - IIRC, there are some controls relating to that behavior. I'm not an expert in that area, so you might want to chat with the Torque folks or look at their FAQ area.

Meantime, the rsh launcher works just as well and is just as fast as the Torque launcher. The only negative is that the Torque local daemons don't know of the existence of your procs, which can impact cleanup if something goes wrong.

Ralph

On Dec 20, 2009, at 5:28 AM, Johann Knechtel wrote:

> Ralph, thank you very much for your input! The parameter "mca plm rsh" did it. I am just curious about the reasons for that behavior?
> You can find the complete output of the different commands embedded in your mail below. The first line states the successful load of the OMPI environment; we use the modules package on our cluster.
>
> Greetings
> Johann
>
> Ralph Castain wrote:
>> Sorry - hit "send" and then saw the version sitting right there in the subject! Doh...
>>
>> First, let's try verifying what components are actually getting used. Run this:
>>
>> mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted
>>
> OpenMPI with PPU-GCC was loaded
> [node1:00706] mca: base: components_open: Looking for plm components
> [node1:00706] mca: base: components_open: opening plm components
> [node1:00706] mca: base: components_open: found loaded component rsh
> [node1:00706] mca: base: components_open: component rsh has no register function
> [node1:00706] mca: base: components_open: component rsh open function successful
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no register function
> [node1:00706] mca: base: components_open: component slurm open function successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register function
> [node1:00706] mca: base: components_open: component tm open function successful
> [node1:00706] mca:base:select: Auto-selecting plm components
> [node1:00706] mca:base:select:( plm) Querying component [rsh]
> [node1:00706] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [node1:00706] mca:base:select:( plm) Querying component [slurm]
> [node1:00706] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [node1:00706] mca:base:select:( plm) Querying component [tm]
> [node1:00706] mca:base:select:( plm) Query of component [tm] set priority to 75
> [node1:00706] mca:base:select:( plm) Selected component [tm]
> [node1:00706] mca: base: close: component rsh closed
> [node1:00706] mca: base: close: unloading component rsh
> [node1:00706] mca: base: close: component slurm closed
> [node1:00706] mca: base: close: unloading component slurm
> [node1:00706] mca: base: components_open: Looking for ras components
> [node1:00706] mca: base: components_open: opening ras components
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no register function
> [node1:00706] mca: base: components_open: component slurm open function successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register function
> [node1:00706] mca: base: components_open: component tm open function successful
> [node1:00706] mca:base:select: Auto-selecting ras components
> [node1:00706] mca:base:select:( ras) Querying component [slurm]
> [node1:00706] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [node1:00706] mca:base:select:( ras) Querying component [tm]
> [node1:00706] mca:base:select:( ras) Query of component [tm] set priority to 100
> [node1:00706] mca:base:select:( ras) Selected component [tm]
> [node1:00706] mca: base: close: unloading component slurm
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> [node1:00706] mca: base: close: unloading component tm
> [node1:00706] mca: base: close: component tm closed
> [node1:00706] mca: base: close: unloading component tm
>
>> Then get an allocation and run
>>
>> mpirun -pernode which orted
>>
> OpenMPI with PPU-GCC was loaded
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> node2 - daemon did not report back when launched
>
>> and
>>
>> mpirun -pernode -mca plm rsh which orted
>>
> OpenMPI with PPU-GCC was loaded
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
>
>> and see what happens
>>
>> On Dec 19, 2009, at 5:17 PM, Ralph Castain wrote:
>>
>>> That error has nothing to do with Torque. The cmd line is simply wrong - you are specifying a btl that doesn't exist.
>>>
>>> It should work just fine with
>>>
>>> mpirun -n X hellocluster
>>>
>>> Nothing else is required. When you run
>>>
>>> mpirun --hostfile nodefile hellocluster
>>>
>>> OMPI will still use Torque to do the launch - it just gets the list of nodes from your nodefile instead of the PBS_NODEFILE.
>>>
>>> You may have stated it below, but I can't find it: what version of OMPI are you using? Are there additional versions installed on your system?
>>>
>>> On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>
>>>> Ah, and do I have to take care of the MCA ras plugin on my own?
>>>> I tried something like
>>>>
>>>>> mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir /var/spool/torque/aux/ hellocluster
>>>>>
>>>> but apart from the fact that it has not helped/worked out ([node3:22726] mca: base: components_open: component pml / csum open function failed), it also does not look very convenient to me...
>>>>
>>>> Greetings
>>>> Johann
>>>>
>>>> Johann Knechtel wrote:
>>>>
>>>>> Hi Ralph and all,
>>>>>
>>>>> Yes, the OMPI libs and binaries are in the same place on the nodes; I packed OMPI via checkinstall and installed the deb via pdsh on the nodes.
>>>>> The LD_LIBRARY_PATH is set; I can run for example "mpirun --hostfile nodefile hellocluster" without problems. But when started via a Torque job it does not work out. I do assume correctly that the LD_LIBRARY_PATH will be exported by Torque to the daemonized mpirunners, don't I?
>>>>> The Torque libs are all in the same place; I installed the package shell scripts via pdsh.
>>>>>
>>>>> Greetings,
>>>>> Johann
>>>>>
>>>>> Ralph Castain wrote:
>>>>>
>>>>>> Are the OMPI libraries and binaries installed at the same place on all the remote nodes?
>>>>>>
>>>>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>>>>
>>>>>> Are the Torque libs available in the same place on the remote nodes? Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>>>>
>>>>>> These are the most typical problems.
>>>>>>
>>>>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Your help with the following Torque integration issue will be much appreciated: whenever I try to start an OpenMPI job on more than one node, it simply does not start up on the nodes.
>>>>>>> The Torque job fails with the following:
>>>>>>>
>>>>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
>>>>>>>>
>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> node2 - daemon did not report back when launched
>>>>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>>>>
>>>>>>> I am quite confident about the compilation and installation of Torque and OpenMPI, since it runs without error on one node:
>>>>>>>
>>>>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>>> Process 1 on node1 out of 2
>>>>>>>> Process 0 on node1 out of 2
>>>>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>>>>
>>>>>>> The called program is a simple helloworld which runs without errors when started manually on the nodes; it therefore also runs without errors when using a hostfile to daemonize on more than one node. I already tried to compile OpenMPI with the default prefix:
>>>>>>>
>>>>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32 CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32 CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32 --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32 --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32 --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>>>>
>>>>>>> Also, the called helloworld is compiled both with and without -rpath; I just wanted to be sure regarding any linked-library issue.
>>>>>>>
>>>>>>> Now, the interesting fact is the following: I compiled a kernel with CONFIG_BSD_PROCESS_ACCT_V3 on one node to monitor the startup of the pbs, mpi and helloworld daemons. And, as already mentioned at the beginning, from that I assumed that the MPI startup within Torque is not working for me.
>>>>>>> Please request any further logs you want to review; I did not want to make the mail too large at first.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Johann
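Putting the advice from this thread together, a minimal Torque job script could look roughly like the following. This is only a sketch, not a tested recipe: the module name "openmpi_gcc_ppc", the job name, and the nodes/ppn counts are assumptions; the install path is taken from the "which orted" output above, and "hellocluster" is the test program discussed in the thread.

    #!/bin/bash
    #PBS -N hellocluster
    #PBS -l nodes=2:ppn=1
    #PBS -j oe

    # load the OMPI environment on the node where Torque starts the script
    # (the module name here is an assumption - use whatever prints
    # "OpenMPI with PPU-GCC was loaded" on your cluster)
    module load openmpi_gcc_ppc

    # make sure the OMPI libs can also be found on the remote nodes;
    # per the error text above, mpirun forwards LD_LIBRARY_PATH to them
    export LD_LIBRARY_PATH=/opt/openmpi_1.3.4_gcc_ppc/lib:$LD_LIBRARY_PATH

    cd $PBS_O_WORKDIR

    # one copy per allocated node; "-mca plm rsh" forces the rsh launcher,
    # which pre-sets the prefix on the remote nodes as described above
    mpirun -pernode -mca plm rsh ./hellocluster

If the Torque (tm) launcher is used instead of rsh, whether that LD_LIBRARY_PATH export actually reaches the remote orteds depends on how the pbs_mom environment is set up, which is exactly the behavior discussed earlier in this thread.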