Hi Ralph,
I confirmed that it worked quite well for my purpose. Thank you very much.

I would just point out one small thing: since the debug information in the
rank-file block is useful even when a host is initially detected, the
OPAL_OUTPUT_VERBOSE at line 302 should be moved out of the else-clause as well.
(A sample mpirun command that enables this verbose output appears after the
quoted thread at the end of this mail.)

Regards,
Tetsuya Mishima

--- orte/util/hostfile/hostfile.rhc.c   2014-01-20 08:42:40.000000000 +0900
+++ orte/util/hostfile/hostfile.c       2014-01-20 08:51:26.000000000 +0900
@@ -299,14 +299,14 @@
         } else {
             /* add a slot */
             node->slots++;
-            OPAL_OUTPUT_VERBOSE((1, orte_ras_base_framework.framework_output,
-                                 "%s hostfile: node %s slots %d",
-                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), node->name, node->slots));
         }
         /* mark the slots as "given" since we take them as being the
          * number specified via the rankfile
          */
         node->slots_given = true;
+        OPAL_OUTPUT_VERBOSE((1, orte_ras_base_framework.framework_output,
+                             "%s hostfile: node %s slots %d",
+                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), node->name, node->slots));
         /* skip to end of line */
         while (!orte_util_hostfile_done &&
                ORTE_HOSTFILE_NEWLINE != token) {

> On Jan 19, 2014, at 1:36 AM, tmish...@jcity.maeda.co.jp wrote:
>
> > Thank you for your fix. I will try it tomorrow.
> >
> > Before that, although I could not understand everything,
> > let me ask some questions about the new hostfile.c.
> >
> > 1. Lines 244-248 are included in the else-clause, which might cause a
> > memory leak (it seems to me). Should they be out of the clause?
> >
> > 244        if (NULL != node_alias) {
> > 245            /* add to list of aliases for this node - only add if unique */
> > 246            opal_argv_append_unique_nosize(&node->alias, node_alias, false);
> > 247            free(node_alias);
> > 248        }
>
> Yes, although it shouldn't ever actually be true unless the node was previously seen anyway
>
> > 2. For a similar reason, should lines 306-314 be out of the else-clause?
>
> Those lines actually shouldn't exist as we don't define an alias in that code block, so node_alias is always NULL
>
> > 3. I think that node->slots_given of hosts detected through the rank-file
> > should always be true to avoid being overridden by orte_set_default_slots.
> > Should line 305 be out of the else-clause as well?
> >
> > 305            node->slots_given = true;
>
> Yes - thanks, it was meant to be outside the clause
>
> > Regards,
> > Tetsuya Mishima
> >
> >> I believe I now have this working correctly on the trunk and set up for
> >> 1.7.4. If you get a chance, please give it a try and confirm it solves
> >> the problem.
> >>
> >> Thanks
> >> Ralph
> >>
> >> On Jan 17, 2014, at 2:16 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>
> >>> Sorry for the delay - I understood and was just occupied with something
> >>> else for a while. Thanks for the follow-up. I'm looking at the issue and
> >>> trying to decipher the right solution.
> >>>
> >>> On Jan 17, 2014, at 2:00 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>
> >>>> Hi Ralph,
> >>>>
> >>>> I'm sorry that my explanation was not enough ...
> >>>> This is the summary of my situation:
> >>>>
> >>>> 1. I create a hostfile as shown below manually.
> >>>>
> >>>> 2. I use mpirun to start the job without Torque, which means I'm
> >>>> running in an unmanaged environment.
> >>>>
> >>>> 3. Firstly, ORTE detects 8 slots on each host (maybe in
> >>>> "orte_ras_base_allocate").
> >>>> node05: slots=8 max_slots=0 slots_inuse=0
> >>>> node06: slots=8 max_slots=0 slots_inuse=0
> >>>>
> >>>> 4. Then, the code I identified is resetting the slot counts.
> >>>> node05: slots=1 max_slots=0 slots_inuse=0
> >>>> node06: slots=1 max_slots=0 slots_inuse=0
> >>>>
> >>>> 5. Therefore, ORTE believes that there is only one slot on each host.
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>>> No, I didn't use Torque this time.
> >>>>>
> >>>>> This issue is caused only when it is not in a managed
> >>>>> environment - namely, orte_managed_allocation is false
> >>>>> (and orte_set_slots is NULL).
> >>>>>
> >>>>> Under Torque management, it works fine.
> >>>>>
> >>>>> I hope you can understand the situation.
> >>>>>
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>> I'm sorry, but I'm really confused, so let me try to understand the
> >>>>>> situation.
> >>>>>>
> >>>>>> You use Torque to get an allocation, so you are running in a managed
> >>>>>> environment.
> >>>>>>
> >>>>>> You then use mpirun to start the job, but pass it a hostfile as shown
> >>>>>> below.
> >>>>>>
> >>>>>> Somehow, ORTE believes that there is only one slot on each host, and
> >>>>>> you believe the code you've identified is resetting the slot counts.
> >>>>>>
> >>>>>> Is that a correct summary of the situation?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Ralph
> >>>>>>
> >>>>>> On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>> Hi Ralph,
> >>>>>>>
> >>>>>>> I encountered the hostfile issue again where slots are counted by
> >>>>>>> listing the node multiple times. This should be fixed by r29765
> >>>>>>> - Fix hostfile parsing for the case where RMs count slots ....
> >>>>>>>
> >>>>>>> The difference is using an RM or not. At that time, I executed mpirun
> >>>>>>> through the Torque manager. This time I executed it directly from the
> >>>>>>> command line as shown at the bottom, where node05 and node06 have 8
> >>>>>>> cores each.
> >>>>>>>
> >>>>>>> Then, I checked the source files around it and found that lines
> >>>>>>> 151-160 in plm_base_launch_support.c caused this issue. As
> >>>>>>> node->slots is already counted in hostfile.c @ r29765 even when
> >>>>>>> node->slots_given is false, I think this part of
> >>>>>>> plm_base_launch_support.c would be unnecessary.
> >>>>>>>
> >>>>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> >>>>>>> 151    } else {
> >>>>>>> 152        /* set any non-specified slot counts to 1 */
> >>>>>>> 153        for (i=0; i < orte_node_pool->size; i++) {
> >>>>>>> 154            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> >>>>>>> 155                continue;
> >>>>>>> 156            }
> >>>>>>> 157            if (!node->slots_given) {
> >>>>>>> 158                node->slots = 1;
> >>>>>>> 159            }
> >>>>>>> 160        }
> >>>>>>> 161    }
> >>>>>>>
> >>>>>>> Removing this part, it works very well, and the function of
> >>>>>>> orte_set_default_slots is still alive. I think this would be better
> >>>>>>> for the compatible extension of openmpi-1.7.3.
> >>>>>>> > >>>>>>> Regards, > >>>>>>> Tetsuya Mishima > >>>>>>> > >>>>>>> [mishima@manage work]$ cat pbs_hosts > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node05 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> node06 > >>>>>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts > >>>> -cpus-per-proc > >>>>> 4 > >>>>>>> -report-bindings myprog > >>>>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], > >>>>> socket > >>>>>>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so > >>>>>>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] > >>>>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all > >>>>> available > >>>>>>> processors) > >>>>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], > >>>>> socket > >>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > >>>>>>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] > >>>>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all > >>>>> available > >>>>>>> processors) > >>>>>>> Hello world from process 0 of 4 > >>>>>>> Hello world from process 1 of 4 > >>>>>>> Hello world from process 3 of 4 > >>>>>>> Hello world from process 2 of 4 > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> users mailing list > >>>>>>> us...@open-mpi.org > >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>>> > >>>>>> _______________________________________________ > >>>>>> users mailing list > >>>>>> us...@open-mpi.org > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>> > >>>>> _______________________________________________ > >>>>> users mailing list > >>>>> us...@open-mpi.org > >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>> > >>>> _______________________________________________ > >>>> users mailing list > >>>> us...@open-mpi.org > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users