You can also run: ompi_info | grep 'plm: tm' (note the quotes, because you need to include the space)
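For example, a quick check might look like this (illustrative only; the exact version strings depend on your build):

    # illustrative check -- the version strings in the output will vary with your build
    $ ompi_info | grep 'plm: tm'
                MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)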
If you see a line listing the TM PLM plugin, then you have Torque / PBS support built into Open MPI. If you don't, then you don't. :-)

> On Sep 7, 2016, at 11:01 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> I will double check the name.
> If you did not configure with --disable-dlopen, then mpirun only links with opal and orte.
> At run time, these libs will dlopen the plugins (from the openmpi subdirectory; they are named mca_abc_xyz.so).
> If you have support for tm, then one of the plugins will be linked with the torque libs.
>
> Cheers,
>
> Gilles
>
> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>> Hi Gilles,
>>
>> I do not have this library. Maybe this helps already...
>>
>> libmca_common_sm.so  libmpi_mpifh.so  libmpi_usempif08.so          libompitrace.so  libopen-rte.so
>> libmpi_cxx.so        libmpi.so        libmpi_usempi_ignore_tkr.so  libopen-pal.so   liboshmem.so
>>
>> and mpirun only links to libopen-pal/libopen-rte (aside from the standard stuff).
>>
>> But it is still telling me that it has support for tm? libtorque is there and the headers are also there, and since I have enabled tm... *sigh*
>>
>> Thanks again!
>>
>> Oswin
>>
>> On 2016-09-07 16:21, Gilles Gouaillardet wrote:
>>> Note that the torque library will only show up if you configured with --disable-dlopen. Otherwise, you can ldd /.../lib/openmpi/mca_plm_tm.so
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Bennet Fauber <ben...@umich.edu> wrote:
>>>> Oswin,
>>>>
>>>> Does the torque library show up if you run
>>>>
>>>> $ ldd mpirun
>>>>
>>>> That would indicate that Torque support is compiled in.
>>>>
>>>> Also, what happens if you use the same hostfile, or some hostfile, as an explicit argument when you run mpirun from within the torque job?
>>>>
>>>> -- bennet
>>>>
>>>> On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>> Hi Gilles,
>>>>>
>>>>> Thanks for the hint with the machinefile. I know it is not equivalent and I do not intend to use that approach. I just wanted to know whether I could start the program successfully at all.
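To make the ldd check Gilles suggests above concrete, here is a minimal sketch, assuming a dlopen-enabled build and a purely hypothetical install prefix of /opt/openmpi-2.0.1 (substitute your own prefix); if the tm PLM was built against Torque, libtorque should show up in the output:

    # /opt/openmpi-2.0.1 is a hypothetical prefix -- adjust it to your installation
    $ ldd /opt/openmpi-2.0.1/lib/openmpi/mca_plm_tm.so | grep -i torque

If grep prints nothing, the plugin was not linked against the Torque libraries.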
>>>>>
>>>>> Outside Torque (4.2), rsh seems to be used, which works fine, prompting for a password if no Kerberos ticket is there.
>>>>>
>>>>> Here is the output:
>>>>> [zbh251@a00551 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 2.0.1
>>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>>             MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>             MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>             MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>>             MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>>>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>>>>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>>>>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>>>>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>>>>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
>>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
>>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>> Data for JOB [53688,1] offset 0
>>>>>
>>>>> ======================== JOB MAP ========================
>>>>>
>>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>
>>>>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>
>>>>> =============================================================
>>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>>> [1,0]<stdout>:a00551.science.domain
>>>>> [1,2]<stdout>:a00551.science.domain
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>>> [1,1]<stdout>:a00551.science.domain
>>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>>>
>>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Which version of Open MPI are you running?
>>>>>>
>>>>>> I noted that though you are asking for three nodes and one task per node, you have been allocated only 2 nodes.
>>>>>> I do not know if this is related to the issue.
>>>>>>
>>>>>> Note that if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile) but a00553 has 20 slots (since it appears only once in the machinefile, the number of slots is automatically detected).
>>>>>>
>>>>>> Can you run
>>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>>> so we can confirm tm is used?
>>>>>>
>>>>>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am currently trying to set up Open MPI in Torque. Open MPI is built with tm support. Torque is correctly assigning nodes and I can run MPI programs on single nodes just fine. The problem starts when processes are split between nodes.
>>>>>>>
>>>>>>> For example, I create an interactive session with Torque and start a program by
>>>>>>>
>>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>>> mpirun --tag-output -display-map hostname
>>>>>>>
>>>>>>> which leads to
>>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>>> Data for JOB [65415,1] offset 0
>>>>>>>
>>>>>>> ======================== JOB MAP ========================
>>>>>>>
>>>>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>
>>>>>>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>
>>>>>>> =============================================================
>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>
>>>>>>> If I log in on a00551 and start mpirun using the hostfile generated from PBS_NODEFILE, everything works:
>>>>>>>
>>>>>>> (from within the interactive session)
>>>>>>> echo $PBS_NODEFILE
>>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>>> cat $PBS_NODEFILE
>>>>>>> a00551.science.domain
>>>>>>> a00553.science.domain
>>>>>>> a00551.science.domain
>>>>>>>
>>>>>>> (from within the separate login)
>>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>>>>
>>>>>>> Data for JOB [65445,1] offset 0
>>>>>>>
>>>>>>> ======================== JOB MAP ========================
>>>>>>>
>>>>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>
>>>>>>> Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
>>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>
>>>>>>> =============================================================
>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>
>>>>>>> I am kind of lost as to what's going on here. Does anyone have an idea? I am seriously considering this to be a problem with the Kerberos authentication that we have to work with, but I fail to see how this should affect the sockets.
>>>>>>>
>>>>>>> Best,
>>>>>>> Oswin

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users