Hi Gilles,

I do not have this library. Maybe this helps already...
libmca_common_sm.so libmpi_mpifh.so libmpi_usempif08.so libompitrace.so libopen-rte.so libmpi_cxx.so libmpi.so libmpi_usempi_ignore_tkr.so libopen-pal.so liboshmem.so
and mpirun only links to libopen-pal/libopen-rte (aside from the standard libraries).
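(For reference, the check I ran on the mpirun binary itself was just the following; the grep pattern is only my guess at the relevant library names:)

# show which libraries mpirun is linked against; no libtorque shows up here,
# only the open-pal/open-rte runtime libraries plus the usual system ones
ldd $(which mpirun) | grep -i -E 'torque|open-pal|open-rte'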
But it is still telling me that it has support for tm? libtorque is there, the headers are there too, and since I have enabled tm... *sigh*
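For completeness, this is roughly how I am double-checking the tm situation; the $OMPI_PREFIX below is only a placeholder for wherever this Open MPI build is installed, so adjust as needed:

# list the plm components this build reports; tm should be listed here if it was built
ompi_info | grep ' plm:'
# try to locate the dlopen'ed tm plugin file itself
find $OMPI_PREFIX -name 'mca_plm_tm*' 2>/dev/null
# if the plugin file exists, this should show whether it links against libtorque
ldd $OMPI_PREFIX/lib/openmpi/mca_plm_tm.so | grep -i torque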
Thanks again!
Oswin

On 2016-09-07 16:21, Gilles Gouaillardet wrote:
Note the torque library will only show up if you configured with --disable-dlopen. Otherwise, you can ldd /.../lib/openmpi/mca_plm_tm.so

Cheers,
Gilles

Bennet Fauber <ben...@umich.edu> wrote:
Oswin,

Does the torque library show up if you run

$ ldd mpirun

That would indicate that Torque support is compiled in.

Also, what happens if you use the same hostfile, or some hostfile, as an explicit argument when you run mpirun from within the torque job?

-- bennet

On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
Hi Gilles,

Thanks for the hint with the machinefile. I know it is not equivalent and I do not intend to use that approach; I just wanted to know whether I could start the program successfully at all.

Outside torque (4.2), rsh seems to be used, which works fine, querying a password if no kerberos ticket is there.

Here is the output:

[zbh251@a00551 ~]$ mpirun -V
mpirun (Open MPI) 2.0.1

[zbh251@a00551 ~]$ ompi_info | grep ras
MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)

[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
[a00551.science.domain:04104] mca: base: components_register: registering framework plm components
[a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
[a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
[a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
[a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
[a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
[a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
[a00551.science.domain:04104] mca: base: components_register: found loaded component tm
[a00551.science.domain:04104] mca: base: components_register: component tm register function successful
[a00551.science.domain:04104] mca: base: components_open: opening plm components
[a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
[a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
[a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
[a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded component tm
[a00551.science.domain:04104] mca: base: components_open: component tm open function successful
[a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
[a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
[a00551.science.domain:04104] mca: base: close: component isolated closed
[a00551.science.domain:04104] mca: base: close: unloading component isolated
[a00551.science.domain:04104] mca: base: close: component rsh closed
[a00551.science.domain:04104] mca: base: close: unloading component rsh
[a00551.science.domain:04104] mca: base: close: component slurm closed
[a00551.science.domain:04104] mca: base: close: unloading component slurm
[a00551.science.domain:04109] mca: base: components_register: registering framework plm components
[a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
[a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
[a00551.science.domain:04109] mca: base: components_open: opening plm components
[a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
[a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
[a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
[a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
[a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
[a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
[a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
[a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228

Data for JOB [53688,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================

[a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:04109] mca: base: close: component rsh closed
[a00551.science.domain:04109] mca: base: close: unloading component rsh
[a00551.science.domain:04104] mca: base: close: component tm closed
[a00551.science.domain:04104] mca: base: close: unloading component tm

On 2016-09-07 14:41, Gilles Gouaillardet wrote:
Hi,

Which version of Open MPI are you running?

I noted that though you are asking for three nodes and one task per node, you have been allocated only 2 nodes. I do not know if this is related to this issue.

Note that if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile) but a00553 has 20 slots (since it appears once in the machinefile, the number of slots is automatically detected).

Can you run

mpirun --mca plm_base_verbose 10 ...

so we can confirm tm is used.

Before invoking mpirun, you might want to clean up the ompi directory in /tmp.

Cheers,
Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
Hi,

I am currently trying to set up Open MPI in torque. Open MPI is built with tm support. Torque is correctly assigning nodes, and I can run MPI programs on single nodes just fine; the problem starts when processes are split between nodes.

For example, I create an interactive session with torque and start a program by

qsub -I -n -l nodes=3:ppn=1
mpirun --tag-output -display-map hostname

which leads to

[a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
[a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228

Data for JOB [65415,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================

[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[1,1]<stdout>:a00551.science.domain

If I log in on a00551 and start using the hostfile generated from the PBS_NODEFILE, everything works:

(from within the interactive session)
echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain

(from within the separate login)
mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname

Data for JOB [65445,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================

[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00553.science.domain
[1,1]<stdout>:a00551.science.domain

I am kind of lost as to what is going on here. Does anyone have an idea? I seriously suspect this is a problem with the Kerberos authentication we have to work with, but I fail to see how that should affect the sockets.
Best,
Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users