Hi Ralph,

I have run into the hostfile issue again, the one where slots are counted
by listing a node multiple times. This should have been fixed by r29765
- Fix hostfile parsing for the case where RMs count slots ....

The difference is whether a resource manager (RM) is used. Last time, I
ran mpirun through the Torque manager. This time I ran it directly from the
command line, as shown at the bottom; node05 and node06 each have 8 cores.

I then checked the source files around it and found that lines 151-160 in
plm_base_launch_support.c cause this issue. Since node->slots is already
counted in hostfile.c as of r29765, even when node->slots_given is false,
I think this part of plm_base_launch_support.c is unnecessary.

orte/mca/plm/base/plm_base_launch_support.c @ 30189:
151             } else {
152                 /* set any non-specified slot counts to 1 */
153                 for (i=0; i < orte_node_pool->size; i++) {
154                     if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
155                         continue;
156                     }
157                     if (!node->slots_given) {
158                         node->slots = 1;
159                     }
160                 }
161             }

With this part removed, it works very well, and the orte_set_default_slots
functionality still works. I think this would be better for a compatible
extension of openmpi-1.7.3.

Regards,
Tetsuya Mishima

[mishima@manage work]$ cat pbs_hosts
node05
node05
node05
node05
node05
node05
node05
node05
node06
node06
node06
node06
node06
node06
node06
node06
[mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
[node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
[node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 3 of 4
Hello world from process 2 of 4
