No, I didn't use Torque this time.

This issue occurs only when running outside a managed environment -
namely, when orte_managed_allocation is false (and orte_set_slots is
NULL).
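
To tie that to the code quoted below: as I read it, the block at lines
151-160 sits in the branch that is reached only in exactly that case.
A rough paraphrase of the surrounding logic (the shape I infer from the
behaviour, not the exact source):

    if (!orte_managed_allocation) {
        if (NULL != orte_set_slots) {
            /* slot counts are derived from the orte_set_slots policy */
        } else {
            /* lines 151-160: any node without an explicit "slots=" value
               is forced back to slots = 1, overwriting the count that
               was taken from the hostfile */
        }
    }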

Under Torque management, it works fine.

I hope you can understand the situation.

Tetsuya Mishima

> I'm sorry, but I'm really confused, so let me try to understand the situation.
>
> You use Torque to get an allocation, so you are running in a managed environment.
>
> You then use mpirun to start the job, but pass it a hostfile as shown below.
>
> Somehow, ORTE believes that there is only one slot on each host, and you
> believe the code you've identified is resetting the slot counts.
>
> Is that a correct summary of the situation?
>
> Thanks
> Ralph
>
> On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi Ralph,
> >
> > I encountered the hostfile issue again where slots are counted by
> > listing the node multiple times. This should be fixed by r29765
> > - Fix hostfile parsing for the case where RMs count slots ....
> >
> > The difference is whether an RM is used or not. At that time, I executed
> > mpirun through the Torque manager. This time I executed it directly from
> > the command line as shown at the bottom, where node05 and node06 each have
> > 8 cores.
> >
> > Then I checked the source files around it and found that lines 151-160 in
> > plm_base_launch_support.c cause this issue. As node->slots is already
> > counted in hostfile.c @ r29765 even when node->slots_given is false,
> > I think this part of plm_base_launch_support.c is unnecessary.
> >
> > orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> > 151             } else {
> > 152                 /* set any non-specified slot counts to 1 */
> > 153                 for (i=0; i < orte_node_pool->size; i++) {
> > 154                     if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> > 155                         continue;
> > 156                     }
> > 157                     if (!node->slots_given) {
> > 158                         node->slots = 1;
> > 159                     }
> > 160                 }
> > 161             }
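> >
> > For reference, my understanding of what hostfile.c does after r29765
> > (a sketch of the effect, not the exact code; find_node and the variable
> > names are placeholders):
> >
> >     if (NULL == (node = find_node(nodes, hostname))) {
> >         /* first listing of this host: create it with one slot */
> >         node = OBJ_NEW(orte_node_t);
> >         node->name = strdup(hostname);
> >         node->slots = 1;
> >         opal_list_append(nodes, &node->super);
> >     } else {
> >         /* every repeated listing adds one more slot */
> >         node->slots++;
> >     }
> >     /* node->slots_given stays false because the file has no "slots="
> >        keyword, so the block quoted above later resets the count to 1 */
> >
> > So for the pbs_hosts file below, node05 and node06 each end up with
> > node->slots = 8 before that reset happens.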
> >
> > With this part removed, it works very well, and the orte_set_default_slots
> > functionality is still alive. I think this would be better for a compatible
> > extension of openmpi-1.7.3.
> >
> > Regards,
> > Tetsuya Mishima
> >
> > [mishima@manage work]$ cat pbs_hosts
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
> > [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
> > [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
> > Hello world from process 0 of 4
> > Hello world from process 1 of 4
> > Hello world from process 3 of 4
> > Hello world from process 2 of 4
> >
