You can also run: ompi_info | grep 'plm: tm'
(note the quotes, because you need to include the space)

If you see a line listing the TM PLM plugin, then you have Torque / PBS support 
built into Open MPI.  If you don't, then you don't.  :-)
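
For example, on a build that does have TM support, you should see a line along
these lines (the version numbers will of course vary with your install):

$ ompi_info | grep 'plm: tm'
                MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)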


> On Sep 7, 2016, at 11:01 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> I will double check the name.
> If you did not configure with --disable-dlopen, then mpirun only links with 
> opal and orte.
> At run time, these libs will dlopen the plugins (from the openmpi 
> subdirectory; they are named mca_abc_xyz.so).
> If you have support for tm, then one of the plugins will be linked with the 
> torque libs.
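> 
> For example (the install prefix below is just a placeholder, adjust it to 
> your installation):
> 
> ls <prefix>/lib/openmpi/ | grep plm
> 
> should list components such as mca_plm_rsh.so and mca_plm_isolated.so, and, 
> if tm support was built, mca_plm_tm.so as well.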
> 
> Cheers,
> 
> Gilles
> 
> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>> Hi Gilles,
>> 
>> I do not have this library. Maybe this helps already...
>> 
>> libmca_common_sm.so  libmpi_mpifh.so  libmpi_usempif08.so          libompitrace.so  libopen-rte.so
>> libmpi_cxx.so        libmpi.so        libmpi_usempi_ignore_tkr.so  libopen-pal.so   liboshmem.so
>> 
>> and mpirun only links to libopen-pal/libopen-rte (aside from the standard 
>> stuff).
>> 
>> But it is still telling me that it has support for tm? libtorque is 
>> there, the headers are also there, and I have enabled tm... *sigh*
>> 
>> Thanks again!
>> 
>> Oswin
>> 
>> On 2016-09-07 16:21, Gilles Gouaillardet wrote:
>>> Note that the torque library will only show up in ldd mpirun if you 
>>> configured with --disable-dlopen. Otherwise, you can ldd
>>> /.../lib/openmpi/mca_plm_tm.so
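>>> 
>>> For example (replace the ... with your actual install prefix), on a build
>>> with tm support something like
>>> 
>>> ldd /.../lib/openmpi/mca_plm_tm.so | grep torque
>>> 
>>> should show libtorque being pulled in; if nothing comes back, the plugin
>>> was not linked against the Torque libs.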
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Bennet Fauber <ben...@umich.edu> wrote:
>>>> Oswin,
>>>> 
>>>> Does the torque library show up if you run
>>>> 
>>>> $ ldd mpirun
>>>> 
>>>> That would indicate that Torque support is compiled in.
>>>> 
>>>> Also, what happens if you pass the same hostfile (or some other hostfile) as
>>>> an explicit argument when you run mpirun from within the Torque job?
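>>>> 
>>>> For example, from inside the Torque job, something like
>>>> 
>>>> $ mpirun --hostfile $PBS_NODEFILE --tag-output -display-map hostname
>>>> 
>>>> would tell us whether the launch itself works when the allocation is
>>>> given explicitly.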
>>>> 
>>>> -- bennet
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause
>>>> <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>> Hi Gilles,
>>>>> 
>>>>> Thanks for the hint with the machinefile. I know it is not equivalent and I
>>>>> do not intend to use that approach. I just wanted to know whether I could
>>>>> start the program successfully at all.
>>>>> 
>>>>> Outside Torque (4.2), rsh seems to be used, which works fine, prompting for
>>>>> a password if no Kerberos ticket is there.
>>>>> 
>>>>> Here is the output:
>>>>> [zbh251@a00551 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 2.0.1
>>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>>                 MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>>>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>>>>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>>>>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>>>>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>>>>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [isolated]
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component [isolated] set priority to 0
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [rsh]
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [slurm]
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [tm]
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component [tm] set priority to 75
>>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component [tm]
>>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Querying component [rsh]
>>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Selected component [rsh]
>>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>> Data for JOB [53688,1] offset 0
>>>>> 
>>>>> ========================   JOB MAP   ========================
>>>>> 
>>>>> Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>
>>>>> Data for node: a00553.science.domain   Num slots: 1    Max slots: 0    Num procs: 1
>>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>> 
>>>>> =============================================================
>>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>>> [1,0]<stdout>:a00551.science.domain
>>>>> [1,2]<stdout>:a00551.science.domain
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>>> [1,1]<stdout>:a00551.science.domain
>>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>>> 
>>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Which version of Open MPI are you running?
>>>>>> 
>>>>>> I noted that although you are asking for three nodes with one task per node,
>>>>>> you have been allocated only two nodes.
>>>>>> I do not know if this is related to this issue.
>>>>>> 
>>>>>> Note that if you use the machinefile, a00551 has two slots (since it
>>>>>> appears twice in the machinefile) but a00553 has 20 slots (since it
>>>>>> appears only once in the machinefile, the number of slots is automatically
>>>>>> detected).
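>>>>>> 
>>>>>> If you want the machinefile to match the Torque allocation exactly, you
>>>>>> could also make the slot counts explicit, for example:
>>>>>> 
>>>>>> a00551.science.domain slots=2
>>>>>> a00553.science.domain slots=1
>>>>>> 
>>>>>> (the slots= syntax overrides the automatic detection for hosts that are
>>>>>> listed only once).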
>>>>>> 
>>>>>> Can you run
>>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>>> so we can confirm tm is used?
>>>>>> 
>>>>>> Before invoking mpirun, you might also want to clean up the ompi directory in
>>>>>> /tmp.
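>>>>>> 
>>>>>> For example (the session directory names differ between Open MPI versions,
>>>>>> so check what is actually there before removing anything):
>>>>>> 
>>>>>> ls -d /tmp/ompi* /tmp/openmpi-sessions-* 2>/dev/null
>>>>>> 
>>>>>> and remove only the directories owned by your user, while no jobs are
>>>>>> running.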
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I am currently trying to set up Open MPI under Torque. Open MPI is built
>>>>>>> with tm support. Torque is correctly assigning nodes, and I can run
>>>>>>> MPI programs on single nodes just fine. The problem starts when
>>>>>>> processes are split between nodes.
>>>>>>> 
>>>>>>> For example, I create an interactive session with Torque and start a
>>>>>>> program by
>>>>>>> 
>>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>>> mpirun --tag-output -display-map hostname
>>>>>>> 
>>>>>>> which leads to
>>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>>> Data for JOB [65415,1] offset 0
>>>>>>> 
>>>>>>> ========================   JOB MAP   ========================
>>>>>>> 
>>>>>>> Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>
>>>>>>> Data for node: a00553.science.domain   Num slots: 1    Max slots: 0    Num procs: 1
>>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>> 
>>>>>>> =============================================================
>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>> 
>>>>>>> 
>>>>>>> If I log in on a00551 and start mpirun using the hostfile generated by the
>>>>>>> PBS_NODEFILE, everything works:
>>>>>>> 
>>>>>>> (from within the interactive session)
>>>>>>> echo $PBS_NODEFILE
>>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>>> cat $PBS_NODEFILE
>>>>>>> a00551.science.domain
>>>>>>> a00553.science.domain
>>>>>>> a00551.science.domain
>>>>>>> 
>>>>>>> (from within the separate login)
>>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>>>> 
>>>>>>> Data for JOB [65445,1] offset 0
>>>>>>> 
>>>>>>> ========================   JOB MAP   ========================
>>>>>>> 
>>>>>>> Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>
>>>>>>> Data for node: a00553.science.domain   Num slots: 20   Max slots: 0    Num procs: 1
>>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>> 
>>>>>>> =============================================================
>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>> 
>>>>>>> I am kind of lost as to what is going on here. Does anyone have an idea? I
>>>>>>> seriously suspect this is a problem with the Kerberos authentication that we
>>>>>>> have to work with, but I fail to see how that should affect the sockets.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Oswin


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
