Thanks for the logs.

From what I see now, it looks like a00551 is running both mpirun and orted,
though it should only run mpirun, and orted should run only on a00553.
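To double-check that, something like the following could be run from inside the
interactive session (an untested sketch: it assumes $PBS_NODEFILE is set, that
you can ssh to the allocated nodes, and that ps/grep behave as on a typical
Linux node):

    # list the Open MPI daemons on every node of the allocation;
    # per the above, mpirun should show up on a00551 only, and orted on a00553 only
    for node in $(sort -u $PBS_NODEFILE); do
        echo "== $node =="
        ssh "$node" "ps -fu $USER | grep -E '[o]rted|[m]pirun'"
    done

If an orted shows up on a00551 next to mpirun, that would be consistent with the
bind() / "Address already in use" error reported from oob_usock_component.c.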
I will check the code and see what could be happening here.

Btw, what is the output of
hostname
hostname -f
on a00551 ?

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installed
and running correctly on your cluster ?

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>Hi Gilles,
>
>Thanks for the hint with the machinefile. I know it is not equivalent
>and I do not intend to use that approach. I just wanted to know whether
>I could start the program successfully at all.
>
>Outside torque (4.2), rsh seems to be used, which works fine, querying a
>password if no Kerberos ticket is there.
>
>Here is the output:
>[zbh251@a00551 ~]$ mpirun -V
>mpirun (Open MPI) 2.0.1
>[zbh251@a00551 ~]$ ompi_info | grep ras
>          MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>          MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>          MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>          MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>[a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>[a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>[a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>[a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>[a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>[a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>[a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>[a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>[a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>[a00551.science.domain:04104] mca: base: components_open: opening plm components
>[a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>[a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>[a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>[a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>[a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>[a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>[a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>[a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>[a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>[a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
>[a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
>[a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
>[a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
>[a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
>[a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
>[a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
>[a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
>[a00551.science.domain:04104] mca: base: close: component isolated closed
>[a00551.science.domain:04104] mca: base: close: unloading component isolated
>[a00551.science.domain:04104] mca: base: close: component rsh closed
>[a00551.science.domain:04104] mca: base: close: unloading component rsh
>[a00551.science.domain:04104] mca: base: close: component slurm closed
>[a00551.science.domain:04104] mca: base: close: unloading component slurm
>[a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>[a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>[a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>[a00551.science.domain:04109] mca: base: components_open: opening plm components
>[a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>[a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>[a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>[a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
>[a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
>[a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
>[a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>[a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
> Data for JOB [53688,1] offset 0
>
> ========================   JOB MAP   ========================
>
> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
> Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
> Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>
> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
> Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>
> =============================================================
>[a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>[1,0]<stdout>:a00551.science.domain
>[1,2]<stdout>:a00551.science.domain
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>[1,1]<stdout>:a00551.science.domain
>[a00551.science.domain:04109] mca: base: close: component rsh closed
>[a00551.science.domain:04109] mca: base: close: unloading component rsh
>[a00551.science.domain:04104] mca: base: close: component tm closed
>[a00551.science.domain:04104] mca: base: close: unloading component tm
>
>On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>> Hi,
>>
>> Which version of Open MPI are you running ?
>>
>> I noted that though you are asking for three nodes and one task per node,
>> you have been allocated 2 nodes only.
>> I do not know if this is related to this issue.
>>
>> Note that if you use the machinefile, a00551 has two slots (since it
>> appears twice in the machinefile) but a00553 has 20 slots (since it
>> appears once in the machinefile, the number of slots is automatically
>> detected).
>>
>> Can you run
>> mpirun --mca plm_base_verbose 10 ...
>> so we can confirm tm is used ?
>>
>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
>>
>> Cheers,
>>
>> Gilles
>>
>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>> Hi,
>>>
>>> I am currently trying to set up OpenMPI in torque. OpenMPI is built with
>>> tm support. Torque is correctly assigning nodes and I can run
>>> MPI programs on single nodes just fine. The problem starts when
>>> processes are split between nodes.
>>>
>>> For example, I create an interactive session with torque and start a
>>> program by
>>>
>>> qsub -I -n -l nodes=3:ppn=1
>>> mpirun --tag-output -display-map hostname
>>>
>>> which leads to
>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>> Data for JOB [65415,1] offset 0
>>>
>>> ========================   JOB MAP   ========================
>>>
>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>> Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>
>>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>> Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>
>>> =============================================================
>>> [1,0]<stdout>:a00551.science.domain
>>> [1,2]<stdout>:a00551.science.domain
>>> [1,1]<stdout>:a00551.science.domain
>>>
>>> If I log in on a00551 and start using the hostfile generated by the
>>> PBS_NODEFILE, everything works:
>>>
>>> (from within the interactive session)
>>> echo $PBS_NODEFILE
>>> /var/lib/torque/aux//278.a00552.science.domain
>>> cat $PBS_NODEFILE
>>> a00551.science.domain
>>> a00553.science.domain
>>> a00551.science.domain
>>>
>>> (from within the separate login)
>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>
>>> Data for JOB [65445,1] offset 0
>>>
>>> ========================   JOB MAP   ========================
>>>
>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>> Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>
>>> Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
>>> Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>
>>> =============================================================
>>> [1,0]<stdout>:a00551.science.domain
>>> [1,2]<stdout>:a00553.science.domain
>>> [1,1]<stdout>:a00551.science.domain
>>>
>>> I am kind of lost as to what's going on here. Does anyone have an idea?
>>> I am seriously considering this to be a problem with the Kerberos
>>> authentication that we have to work with, but I fail to see how it
>>> should affect the sockets.
>>>
>>> Best,
>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
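For reference, the "clean up the ompi directory in /tmp" step suggested earlier
in the thread could look roughly like this (a sketch only; the session-directory
names below are an assumption based on common Open MPI defaults, so inspect what
actually exists before deleting anything):

    # inspect first: Open MPI session directories in /tmp are usually named
    # something like ompi.<hostname>.<uid> (2.x) or openmpi-sessions-* (1.x);
    # these patterns are assumptions, verify them on your system
    ls -ld /tmp/ompi.* /tmp/openmpi-sessions-* 2>/dev/null

    # then remove your own leftovers on every node of the allocation
    for node in $(sort -u $PBS_NODEFILE); do
        ssh "$node" 'rm -rf /tmp/ompi.* /tmp/openmpi-sessions-*'
    done

Only directories owned by your own user can actually be removed this way; stale
ones left behind by earlier failed runs are what the cleanup is meant to catch.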