Hi,

Thanks for looking into it, and thanks to rhc as well. I tried to be very consistent with the naming after being asked to do so by our IT department.

[zbh251@a00551 ~]$ hostname
a00551.science.domain
[zbh251@a00551 ~]$ hostname -f
a00551.science.domain

This is, as far as I remember, the same name as given in $PBS_NODEFILE. Of course, I do not know what this looks like over the tm interface. Is there an easy way I could query this? I do not know the internals, but shouldn't an orted be spawned anyway, since through NUMA I have essentially got two "virtual" nodes? Or is Open MPI handling this as a special case?
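
One thing I could do, I guess, is compare what Torque itself reports for the job with what the node reports about itself. This is only a rough sketch of what I have in mind (assuming the usual Torque client tools are available on the node):

# inside the interactive session: what Torque assigned to the job
qstat -f $PBS_JOBID | grep exec_host
sort -u $PBS_NODEFILE

# what the node calls itself
hostname
hostname -f

If the names from both sides agree, I would assume the tm interface sees the same thing, but that part is speculation on my side.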

There is no previous version of MPI; this is a new setup with a hand-compiled version of Open MPI (because the standard package is not built with --with-tm). Should I compile another version of Open MPI and see whether that helps?
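
If it is worth a try, I would build it roughly like this (only a sketch; the Torque location and install prefix are placeholders for our local setup):

# hypothetical paths, adjust to where the Torque installation actually lives here
./configure --prefix=$HOME/opt/openmpi-1.10.4 --with-tm=/usr/local/torque
make -j8
make install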


On 2016-09-07 16:15, r...@open-mpi.org wrote:
The usual cause of this problem is that the nodename in the
machinefile is given as a00551, while Torque is assigning the node
name as a00551.science.domain. Thus, mpirun thinks those are two
separate nodes and winds up spawning an orted on its own node.

You might try ensuring that your machinefile is using the exact same names as provided in your allocation.
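
Something along these lines (just a rough check; "my_machinefile" stands for whatever file you are passing to mpirun) should make any mismatch obvious:

# unique names in your machinefile vs. unique names in the Torque allocation
sort -u my_machinefile
sort -u $PBS_NODEFILE

If one side shows a00551 and the other a00551.science.domain, that is the problem.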


On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Thanks for the logs.

From what I see now, it looks like a00551 is running both mpirun and an orted, though it should only run mpirun, and the orted should run only on a00553.

I will check the code and see what could be happening here

Btw, what is the output of
hostname
hostname -f
On a00551 ?

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installed and running correctly on your cluster?

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
Hi Gilles,

Thanks for the hint with the machinefile. I know it is not equivalent and I do not intend to use that approach; I just wanted to know whether I could start the program successfully at all.

Outside Torque (4.2), rsh seems to be used, which works fine and asks for a password if no Kerberos ticket is present.

Here is the output:
[zbh251@a00551 ~]$ mpirun -V
mpirun (Open MPI) 2.0.1
[zbh251@a00551 ~]$ ompi_info | grep ras
                 MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
-display-map hostname
[a00551.science.domain:04104] mca: base: components_register:
registering framework plm components
[a00551.science.domain:04104] mca: base: components_register: found
loaded component isolated
[a00551.science.domain:04104] mca: base: components_register: component
isolated has no register or open function
[a00551.science.domain:04104] mca: base: components_register: found
loaded component rsh
[a00551.science.domain:04104] mca: base: components_register: component
rsh register function successful
[a00551.science.domain:04104] mca: base: components_register: found
loaded component slurm
[a00551.science.domain:04104] mca: base: components_register: component
slurm register function successful
[a00551.science.domain:04104] mca: base: components_register: found
loaded component tm
[a00551.science.domain:04104] mca: base: components_register: component
tm register function successful
[a00551.science.domain:04104] mca: base: components_open: opening plm
components
[a00551.science.domain:04104] mca: base: components_open: found loaded
component isolated
[a00551.science.domain:04104] mca: base: components_open: component
isolated open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded
component rsh
[a00551.science.domain:04104] mca: base: components_open: component rsh
open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded
component slurm
[a00551.science.domain:04104] mca: base: components_open: component
slurm open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded
component tm
[a00551.science.domain:04104] mca: base: components_open: component tm
open function successful
[a00551.science.domain:04104] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:04104] mca:base:select:( plm) Querying component
[isolated]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component
[isolated] set priority to 0
[a00551.science.domain:04104] mca:base:select:( plm) Querying component
[rsh]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component
[rsh] set priority to 10
[a00551.science.domain:04104] mca:base:select:( plm) Querying component
[slurm]
[a00551.science.domain:04104] mca:base:select:( plm) Querying component
[tm]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component
[tm] set priority to 75
[a00551.science.domain:04104] mca:base:select:( plm) Selected component
[tm]
[a00551.science.domain:04104] mca: base: close: component isolated
closed
[a00551.science.domain:04104] mca: base: close: unloading component
isolated
[a00551.science.domain:04104] mca: base: close: component rsh closed
[a00551.science.domain:04104] mca: base: close: unloading component rsh [a00551.science.domain:04104] mca: base: close: component slurm closed
[a00551.science.domain:04104] mca: base: close: unloading component
slurm
[a00551.science.domain:04109] mca: base: components_register:
registering framework plm components
[a00551.science.domain:04109] mca: base: components_register: found
loaded component rsh
[a00551.science.domain:04109] mca: base: components_register: component
rsh register function successful
[a00551.science.domain:04109] mca: base: components_open: opening plm
components
[a00551.science.domain:04109] mca: base: components_open: found loaded
component rsh
[a00551.science.domain:04109] mca: base: components_open: component rsh
open function successful
[a00551.science.domain:04109] mca:base:select: Auto-selecting plm
components
[a00551.science.domain:04109] mca:base:select:( plm) Querying component
[rsh]
[a00551.science.domain:04109] mca:base:select:( plm) Query of component
[rsh] set priority to 10
[a00551.science.domain:04109] mca:base:select:( plm) Selected component
[rsh]
[a00551.science.domain:04109] [[53688,0],1] bind() failed on error
Address already in use (98)
[a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
Data for JOB [53688,1] offset 0

========================   JOB MAP   ========================

Data for node: a00551   Num slots: 2    Max slots: 0    Num procs: 2
        Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
        Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain    Num slots: 1    Max slots: 0    Num
procs: 1
        Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================
[a00551.science.domain:04104] [[53688,0],0] complete_setup on job
[53688,1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
update_proc_state for job [53688,1]
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
update_proc_state for job [53688,1]
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:04109] mca: base: close: component rsh closed
[a00551.science.domain:04109] mca: base: close: unloading component rsh
[a00551.science.domain:04104] mca: base: close: component tm closed
[a00551.science.domain:04104] mca: base: close: unloading component tm

On 2016-09-07 14:41, Gilles Gouaillardet wrote:
Hi,

Which version of Open MPI are you running ?

I noticed that although you are asking for three nodes with one task per node, you have been allocated only two nodes.
I do not know whether this is related to the issue.

Note that if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile), but a00553 has 20 slots (since it appears only once, the number of slots is automatically detected).
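
If you want to control that explicitly, you can list each node once and set the slot count yourself, for example (the numbers here are only illustrative):

# hostfile with explicit slot counts instead of repeated hostnames
a00551.science.domain slots=2
a00553.science.domain slots=1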

Can you run
mpirun --mca plm_base_verbose 10 ...
so we can confirm tm is used?

Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
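
Something like this should be enough (the exact directory names differ between Open MPI versions, so better list them first and remove only those owned by your user):

# look for leftover session directories from previous runs
ls -ld /tmp/ompi* /tmp/openmpi-sessions-* 2>/dev/null
# then remove only your own, e.g.
# rm -rf /tmp/ompi.<nodename>.<uid>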

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
Hi,

I am currently trying to set up Open MPI with Torque. Open MPI is built
with tm support. Torque is correctly assigning nodes, and I can run
MPI programs on single nodes just fine. The problem starts when
processes are split between nodes.

For example, I create an interactive session with Torque and start a
program by running

qsub -I -n -l nodes=3:ppn=1
mpirun --tag-output -display-map hostname

which leads to
[a00551.science.domain:15932] [[65415,0],1] bind() failed on error
Address already in use (98)
[a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
Data for JOB [65415,1] offset 0

========================   JOB MAP   ========================

Data for node: a00551   Num slots: 2    Max slots: 0    Num procs: 2
        Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
        Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain    Num slots: 1    Max slots: 0    Num
procs: 1
        Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[1,1]<stdout>:a00551.science.domain


If I log in to a00551 and start mpirun with the hostfile that
$PBS_NODEFILE points to, everything works:

(from within the interactive session)
echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain

(from within the separate login)
mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
--tag-output -display-map hostname

Data for JOB [65445,1] offset 0

========================   JOB MAP   ========================

Data for node: a00551   Num slots: 2    Max slots: 0    Num procs: 2
        Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
        Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain    Num slots: 20   Max slots: 0    Num
procs: 1
        Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00553.science.domain
[1,1]<stdout>:a00551.science.domain

I am kind of lost as to what is going on here. Does anyone have an idea? I
seriously suspect this could be a problem with the Kerberos
authentication we have to work with, but I fail to see how that would
affect the sockets.

Best,
Oswin