Hi Ralph,
I'm sorry that my explanation was not clear enough. Here is a summary of my situation:

1. I create a hostfile manually, as shown at the bottom.
2. I use mpirun to start the job without Torque, which means I'm running in an unmanaged environment.
3. First, ORTE detects 8 slots on each host (maybe in "orte_ras_base_allocate"):
     node05: slots=8 max_slots=0 slots_inuse=0
     node06: slots=8 max_slots=0 slots_inuse=0
4. Then, the code I identified resets the slot counts:
     node05: slots=1 max_slots=0 slots_inuse=0
     node06: slots=1 max_slots=0 slots_inuse=0
5. Therefore, ORTE believes that there is only one slot on each host.

Regards,
Tetsuya Mishima

> No, I didn't use Torque this time.
>
> This issue occurs only when it is not in a managed environment -
> namely, when orte_managed_allocation is false (and orte_set_slots
> is NULL).
>
> Under Torque management, it works fine.
>
> I hope you can understand the situation.
>
> Tetsuya Mishima
>
> > I'm sorry, but I'm really confused, so let me try to understand the
> > situation.
> >
> > You use Torque to get an allocation, so you are running in a managed
> > environment.
> >
> > You then use mpirun to start the job, but pass it a hostfile as shown
> > below.
> >
> > Somehow, ORTE believes that there is only one slot on each host, and
> > you believe the code you've identified is resetting the slot counts.
> >
> > Is that a correct summary of the situation?
> >
> > Thanks
> > Ralph
> >
> > On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > > Hi Ralph,
> > >
> > > I encountered the hostfile issue again where slots are counted by
> > > listing the node multiple times. This should have been fixed by
> > > r29765 - "Fix hostfile parsing for the case where RMs count slots ...".
> > >
> > > The difference is whether an RM is used or not. At that time, I
> > > executed mpirun through the Torque manager. This time I executed it
> > > directly from the command line, as shown at the bottom, where node05
> > > and node06 each have 8 cores.
> > >
> > > Then I checked the source files around it and found that lines
> > > 151-160 of plm_base_launch_support.c cause this issue. Since
> > > node->slots is already counted in hostfile.c @ r29765 even when
> > > node->slots_given is false, I think this part of
> > > plm_base_launch_support.c is unnecessary.
> > >
> > > orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> > > 151     } else {
> > > 152         /* set any non-specified slot counts to 1 */
> > > 153         for (i=0; i < orte_node_pool->size; i++) {
> > > 154             if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> > > 155                 continue;
> > > 156             }
> > > 157             if (!node->slots_given) {
> > > 158                 node->slots = 1;
> > > 159             }
> > > 160         }
> > > 161     }
> > >
> > > After removing this part, it works very well, and the function of
> > > orte_set_default_slots is still alive. I think this would be better
> > > for the compatible extension of openmpi-1.7.3.
> > >
> > > Regards,
> > > Tetsuya Mishima
> > >
> > > [mishima@manage work]$ cat pbs_hosts
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
> > > [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > > [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
> > > [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > > [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
> > > Hello world from process 0 of 4
> > > Hello world from process 1 of 4
> > > Hello world from process 3 of 4
> > > Hello world from process 2 of 4
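
P.S. To make the effect of that loop easier to see, here is a small
stand-alone toy program. It is not Open MPI code: the struct, the node
names and the count of 8 are invented for the illustration, and only
the slots / slots_given field names come from the quoted snippet. It
shows how a hostfile-derived count of 8 gets clobbered to 1 when
slots_given is false:

/* Toy illustration (not Open MPI code): mimics the slot-defaulting
 * loop quoted above on two made-up nodes. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int slots;         /* counted from repeated hostfile entries, e.g. 8 */
    bool slots_given;  /* false when no "slots=" keyword was used */
} toy_node_t;

int main(void)
{
    toy_node_t nodes[] = {
        { "node05", 8, false },   /* 8 repeated hostfile lines, no slots= */
        { "node06", 8, false },
    };
    const int n = (int)(sizeof(nodes) / sizeof(nodes[0]));

    for (int i = 0; i < n; i++) {
        /* Same test as the quoted snippet: because slots_given is false,
         * the count derived from repeated entries is overwritten. */
        if (!nodes[i].slots_given) {
            nodes[i].slots = 1;
        }
        printf("%s: slots=%d\n", nodes[i].name, nodes[i].slots);
    }
    /* Prints slots=1 for both nodes, matching step 4 of the summary;
     * the slots=8 from step 3 only survives if this reset is skipped. */
    return 0;
}

Compiled with e.g. "cc toy_slots.c && ./a.out", it prints slots=1 for
both nodes, which is exactly the behavior described in step 4 above.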