Hi,
I am currently trying to set up Open MPI under Torque. Open MPI is built with
tm support, Torque assigns the nodes correctly, and I can run MPI programs on
a single node just fine. The problems start as soon as the processes are split
across nodes.
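(Side note: I am judging the tm support from ompi_info, i.e. something along
the lines of
ompi_info | grep -i tm
which should list the tm plm/ras components; please tell me if that is not a
sufficient check.)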
For example, I create an interactive session with Torque and start a program
with:
qsub -I -n -l nodes=3:ppn=1
mpirun --tag-output -display-map hostname
which leads to:
[a00551.science.domain:15932] [[65415,0],1] bind() failed on error
Address already in use (98)
[a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
file oob_usock_component.c at line 228
Data for JOB [65415,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain Num slots: 1 Max slots: 0 Num
procs: 1
Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[1,1]<stdout>:a00551.science.domain
However, if I log in to a00551 in a separate session and start mpirun with the
hostfile that $PBS_NODEFILE points to, everything works:
(from within the interactive session)
echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain
(from within the separate login)
mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
--tag-output -display-map hostname
Data for JOB [65445,1] offset 0
======================== JOB MAP ========================
Data for node: a00551 Num slots: 2 Max slots: 0 Num procs: 2
Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket
1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain Num slots: 20 Max slots: 0 Num
procs: 1
Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00553.science.domain
[1,1]<stdout>:a00551.science.domain
I am kind of lost as to what is going on here. Does anyone have an idea? Note
that in the failing case all three ranks print a00551, even though the job map
places rank 2 on a00553. I am seriously wondering whether this is caused by the
Kerberos authentication we have to work with, but I fail to see how that should
affect the sockets.
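If it helps, I can rerun with more verbose output. I was planning to try
something along the lines of
mpirun --mca plm_base_verbose 10 --mca oob_base_verbose 10 --tag-output -display-map hostname
inside the interactive job (assuming those are the right verbosity knobs for
the tm launcher and the usock OOB component), and to force the rsh launcher as
a cross-check:
mpirun --mca plm rsh --hostfile $PBS_NODEFILE -np 3 --tag-output -display-map hostname
But I wanted to ask here first in case this is a known issue.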
Best,
Oswin