Hi,

Which version of Open MPI are you running?

I noted that although you requested three nodes with one task per node, you have 
been allocated only two distinct nodes (a00551 appears twice in the nodefile).
I do not know whether this is related to the issue.

Note that if you use the machinefile, a00551 has two slots (since it appears twice 
in the machinefile), but a00553 has 20 slots (since it appears only once, the 
number of slots is detected automatically).

Can you run
mpirun --mca plm_base_verbose 10 ...
so we can confirm that tm is used?
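
For example, from inside the interactive job, reusing your hostname test (the 
extra option only adds verbosity):

mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname

The verbose output should show which plm component (tm, rsh, ...) gets selected 
to launch the remote daemons.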

Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
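
For example, on each allocated node (the exact name of the session directory 
varies between Open MPI versions, so check what is actually in /tmp first, and 
only remove it if no other Open MPI job of yours is running there):

ls -d /tmp/ompi*
rm -rf /tmp/ompi*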

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>Hi,
>
>I am currently trying to set up Open MPI with Torque. Open MPI is built with 
>tm support. Torque is correctly assigning nodes, and I can run 
>MPI programs on single nodes just fine. The problem starts when 
>processes are split between nodes.
>
>For example, I create an interactive session with Torque and start a 
>program with
>
>qsub -I -n -l nodes=3:ppn=1
>mpirun --tag-output -display-map hostname
>
>which leads to
>[a00551.science.domain:15932] [[65415,0],1] bind() failed on error 
>Address already in use (98)
>[a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in 
>file oob_usock_component.c at line 228
>  Data for JOB [65415,1] offset 0
>
>  ========================   JOB MAP   ========================
>
>  Data for node: a00551        Num slots: 2    Max slots: 0    Num procs: 2
>       Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 
>0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
>0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 
>0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 
>0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>       Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 
>1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 
>0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 
>1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 
>0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 
>0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>
>  Data for node: a00553.science.domain Num slots: 1    Max slots: 0    Num 
>procs: 1
>       Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 
>0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
>0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 
>0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 
>0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>
>  =============================================================
>[1,0]<stdout>:a00551.science.domain
>[1,2]<stdout>:a00551.science.domain
>[1,1]<stdout>:a00551.science.domain
>
>
>If I log in on a00551 and use the hostfile generated from the 
>PBS_NODEFILE, everything works:
>
>(from within the interactive session)
>echo $PBS_NODEFILE
>/var/lib/torque/aux//278.a00552.science.domain
>cat $PBS_NODEFILE
>a00551.science.domain
>a00553.science.domain
>a00551.science.domain
>
>(from within the separate login)
>mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3  
>--tag-output -display-map hostname
>
>  Data for JOB [65445,1] offset 0
>
>  ========================   JOB MAP   ========================
>
>  Data for node: a00551        Num slots: 2    Max slots: 0    Num procs: 2
>       Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 
>0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
>0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 
>0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 
>0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>       Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 
>1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 
>0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 
>1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 
>0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 
>0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>
>  Data for node: a00553.science.domain Num slots: 20   Max slots: 0    Num 
>procs: 1
>       Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 
>0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
>0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 
>0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 
>0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>
>  =============================================================
>[1,0]<stdout>:a00551.science.domain
>[1,2]<stdout>:a00553.science.domain
>[1,1]<stdout>:a00551.science.domain
>
>I am kind of lost as to what is going on here. Does anyone have an idea? I 
>seriously suspect this is a problem with the Kerberos 
>authentication that we have to work with, but I fail to see how it 
>should affect the sockets.
>
>Best,
>Oswin