Thanks for the logs.

From what I see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun, and orted should run only on a00553.

I will check the code and see what could be happening here
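
In the meantime, a quick way to confirm which daemons end up where is to
list the processes on each node while the job is running (a rough sketch;
adjust the host names and the grep pattern as needed, the [m]/[o] brackets
just keep grep from matching itself):
ssh a00551 'ps -eo pid,user,args | grep -E "[m]pirun|[o]rted"'
ssh a00553 'ps -eo pid,user,args | grep -E "[m]pirun|[o]rted"'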

By the way, what is the output of
hostname
hostname -f
on a00551?
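
Optionally, it would also help to see how those names resolve (a small
extra check, assuming getent is available):
getent hosts "$(hostname)" "$(hostname -f)"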

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installed
and running correctly on your cluster?

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>Hi Gilles,
>
>Thanks for the hint with the machinefile. I know it is not equivalent
>and I do not intend to use that approach. I just wanted to know whether
>I could start the program successfully at all.
>
>Outside Torque (4.2), rsh seems to be used, which works fine and prompts
>for a password if no Kerberos ticket is present.
>
>Here is the output:
>[zbh251@a00551 ~]$ mpirun -V
>mpirun (Open MPI) 2.0.1
>[zbh251@a00551 ~]$ ompi_info | grep ras
>                  MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component 
>v2.0.1)
>                  MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component 
>v2.0.1)
>                  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component 
>v2.0.1)
>                  MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output 
>-display-map hostname
>[a00551.science.domain:04104] mca: base: components_register: 
>registering framework plm components
>[a00551.science.domain:04104] mca: base: components_register: found 
>loaded component isolated
>[a00551.science.domain:04104] mca: base: components_register: component 
>isolated has no register or open function
>[a00551.science.domain:04104] mca: base: components_register: found 
>loaded component rsh
>[a00551.science.domain:04104] mca: base: components_register: component 
>rsh register function successful
>[a00551.science.domain:04104] mca: base: components_register: found 
>loaded component slurm
>[a00551.science.domain:04104] mca: base: components_register: component 
>slurm register function successful
>[a00551.science.domain:04104] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:04104] mca: base: components_register: component 
>tm register function successful
>[a00551.science.domain:04104] mca: base: components_open: opening plm 
>components
>[a00551.science.domain:04104] mca: base: components_open: found loaded 
>component isolated
>[a00551.science.domain:04104] mca: base: components_open: component 
>isolated open function successful
>[a00551.science.domain:04104] mca: base: components_open: found loaded 
>component rsh
>[a00551.science.domain:04104] mca: base: components_open: component rsh 
>open function successful
>[a00551.science.domain:04104] mca: base: components_open: found loaded 
>component slurm
>[a00551.science.domain:04104] mca: base: components_open: component 
>slurm open function successful
>[a00551.science.domain:04104] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:04104] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:04104] mca:base:select: Auto-selecting plm 
>components
>[a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>[isolated]
>[a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>[isolated] set priority to 0
>[a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>[rsh]
>[a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>[rsh] set priority to 10
>[a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>[slurm]
>[a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>[tm]
>[a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>[tm] set priority to 75
>[a00551.science.domain:04104] mca:base:select:(  plm) Selected component 
>[tm]
>[a00551.science.domain:04104] mca: base: close: component isolated 
>closed
>[a00551.science.domain:04104] mca: base: close: unloading component 
>isolated
>[a00551.science.domain:04104] mca: base: close: component rsh closed
>[a00551.science.domain:04104] mca: base: close: unloading component rsh
>[a00551.science.domain:04104] mca: base: close: component slurm closed
>[a00551.science.domain:04104] mca: base: close: unloading component 
>slurm
>[a00551.science.domain:04109] mca: base: components_register: 
>registering framework plm components
>[a00551.science.domain:04109] mca: base: components_register: found 
>loaded component rsh
>[a00551.science.domain:04109] mca: base: components_register: component 
>rsh register function successful
>[a00551.science.domain:04109] mca: base: components_open: opening plm 
>components
>[a00551.science.domain:04109] mca: base: components_open: found loaded 
>component rsh
>[a00551.science.domain:04109] mca: base: components_open: component rsh 
>open function successful
>[a00551.science.domain:04109] mca:base:select: Auto-selecting plm 
>components
>[a00551.science.domain:04109] mca:base:select:(  plm) Querying component 
>[rsh]
>[a00551.science.domain:04109] mca:base:select:(  plm) Query of component 
>[rsh] set priority to 10
>[a00551.science.domain:04109] mca:base:select:(  plm) Selected component 
>[rsh]
>[a00551.science.domain:04109] [[53688,0],1] bind() failed on error 
>Address already in use (98)
>[a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in 
>file oob_usock_component.c at line 228
>  Data for JOB [53688,1] offset 0
>
>  ========================   JOB MAP   ========================
>
>  Data for node: a00551        Num slots: 2    Max slots: 0    Num procs: 2
>       Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 
>0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
>0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 
>0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 
>0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>       Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 
>1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 
>0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 
>1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 
>0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 
>0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>
>  Data for node: a00553.science.domain Num slots: 1    Max slots: 0    Num 
>procs: 1
>       Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 
>0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
>0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 
>0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 
>0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>
>  =============================================================
>[a00551.science.domain:04104] [[53688,0],0] complete_setup on job 
>[53688,1]
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc 
>state command from [[53688,0],1]
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got 
>update_proc_state for job [53688,1]
>[1,0]<stdout>:a00551.science.domain
>[1,2]<stdout>:a00551.science.domain
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc 
>state command from [[53688,0],1]
>[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got 
>update_proc_state for job [53688,1]
>[1,1]<stdout>:a00551.science.domain
>[a00551.science.domain:04109] mca: base: close: component rsh closed
>[a00551.science.domain:04109] mca: base: close: unloading component rsh
>[a00551.science.domain:04104] mca: base: close: component tm closed
>[a00551.science.domain:04104] mca: base: close: unloading component tm
>
>On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>> Hi,
>> 
>> Which version of Open MPI are you running ?
>> 
>> I noticed that although you are asking for three nodes and one task per
>> node, you have been allocated only two nodes.
>> I do not know whether this is related to the issue.
>> 
>> Note that if you use the machinefile, a00551 has two slots (since it
>> appears twice in the machinefile), but a00553 has 20 slots (since it
>> appears only once, the number of slots is automatically detected from
>> the hardware), as illustrated below.
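>> 
>> For illustration, a machinefile with the same contents as your
>> $PBS_NODEFILE
>> a00551.science.domain
>> a00553.science.domain
>> a00551.science.domain
>> ends up with 2 slots on a00551 (it is listed twice) and 20 slots on
>> a00553 (listed once, so its core count is used), matching the "Num
>> slots" values in your second job map.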
>> 
>> Can you run
>> mpirun --mca plm_base_verbose 10 ...
>> so we can confirm that tm is used?
>> 
>> Before invoking mpirun, you might want to clean up the ompi directory
>> in /tmp.
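>> 
>> A cautious way to do that (a rough sketch; the session directory name
>> varies between Open MPI versions, so double-check what matches before
>> deleting anything):
>> find /tmp -maxdepth 1 -user "$USER" \( -name 'ompi*' -o -name 'openmpi-sessions*' \) -ls
>> # and once you are sure those entries are stale:
>> find /tmp -maxdepth 1 -user "$USER" \( -name 'ompi*' -o -name 'openmpi-sessions*' \) -exec rm -rf {} +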
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>> Hi,
>>> 
>>> I am currently trying to set up Open MPI in Torque. Open MPI is built
>>> with tm support. Torque is correctly assigning nodes and I can run
>>> MPI programs on single nodes just fine. The problem starts when
>>> processes are split between nodes.
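>>> 
>>> (For reference, the tm support in the build can be double-checked with
>>> something like
>>> ompi_info | grep "MCA plm"
>>> which should list a tm component next to rsh, slurm and isolated.)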
>>> 
>>> For example, I create an interactive session with torque and start a
>>> program by
>>> 
>>> qsub -I -n -l nodes=3:ppn=1
>>> mpirun --tag-output -display-map hostname
>>> 
>>> which leads to
>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error
>>> Address already in use (98)
>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
>>> file oob_usock_component.c at line 228
>>>  Data for JOB [65415,1] offset 0
>>> 
>>>  ========================   JOB MAP   ========================
>>> 
>>>  Data for node: a00551      Num slots: 2    Max slots: 0    Num procs: 2
>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket
>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket
>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>> 
>>>  Data for node: a00553.science.domain       Num slots: 1    Max slots: 0    
>>> Num
>>> procs: 1
>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket
>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> 
>>>  =============================================================
>>> [1,0]<stdout>:a00551.science.domain
>>> [1,2]<stdout>:a00551.science.domain
>>> [1,1]<stdout>:a00551.science.domain
>>> 
>>> 
>>> if I login on a00551 and start using the hostfile generated by the
>>> PBS_NODEFILE, everything works:
>>> 
>>> (from within the interactive session)
>>> echo $PBS_NODEFILE
>>> /var/lib/torque/aux//278.a00552.science.domain
>>> cat $PBS_NODEFILE
>>> a00551.science.domain
>>> a00553.science.domain
>>> a00551.science.domain
>>> 
>>> (from within the separate login)
>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
>>> --tag-output -display-map hostname
>>> 
>>>  Data for JOB [65445,1] offset 0
>>> 
>>>  ========================   JOB MAP   ========================
>>> 
>>>  Data for node: a00551      Num slots: 2    Max slots: 0    Num procs: 2
>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket
>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket
>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>> 
>>>  Data for node: a00553.science.domain       Num slots: 20   Max slots: 0    
>>> Num
>>> procs: 1
>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket
>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> 
>>>  =============================================================
>>> [1,0]<stdout>:a00551.science.domain
>>> [1,2]<stdout>:a00553.science.domain
>>> [1,1]<stdout>:a00551.science.domain
>>> 
>>> I am kind of lost as to what is going on here. Does anyone have an idea?
>>> I suspect this could be a problem with the Kerberos authentication that
>>> we have to work with, but I fail to see how that should affect the
>>> sockets.
>>> 
>>> Best,
>>> Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
