I fixed this in the trunk (it was an issue regardless of the RM) and have scheduled it for 1.7.4.
Thanks!
Ralph

On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Thank you very much for your quick response.
>
> I'm afraid I have found one more issue...
> It's not so serious. Please check it when you have time.
>
> The problem is -cpus-per-proc combined with the -map-by option under the
> Torque manager. It doesn't work, as shown below. I guess you would get
> the same behaviour under the Slurm manager.
>
> Of course, if I remove the -map-by option, it works quite well.
>
> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> qsub: waiting for job 8116.manage.cluster to start
> qsub: job 8116.manage.cluster ready
>
> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        node03
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
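As a stopgap until the fix lands in 1.7.4, the protection can be overridden the way the message suggests, by attaching the "overload-allowed" qualifier to the binding directive. A minimal sketch (untested here; the qualifier spelling follows the help message above and may differ in this 1.7 snapshot, and it will deliberately pack more than one process per core):

    # override the overload protection; processes may then share cores
    mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket \
           -bind-to core:overload-allowed mPre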
> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]],
> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]],
> socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]],
> socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]],
> socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]],
> socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]],
> socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]],
> socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]],
> socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>
> Regards,
> Tetsuya Mishima
>
>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>
>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Thanks! That's precisely where I was going to look when I had time :-)
>>
>> I'll update tomorrow.
>>
>> Ralph
>>
>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>
>> Hi Ralph,
>>
>> This is a continuation of "Segmentation fault in oob_tcp.c of
>> openmpi-1.7.4a1r29646".
>>
>> I found the cause.
>>
>> Firstly, I noticed that your hostfile works and mine does not.
>>
>> Your host file:
>> cat hosts
>> bend001 slots=12
>>
>> My host file:
>> cat hosts
>> node08
>> node08
>> ...(total 8 lines)
>>
>> I modified my script file to add "slots=1" to each line of my hostfile
>> just before launching mpirun. Then it worked.
>>
>> My host file (modified):
>> cat hosts
>> node08 slots=1
>> node08 slots=1
>> ...(total 8 lines)
>>
>> Secondly, I confirmed that there's a slight difference between
>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>
>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>> 394,401c394,399
>> <     if (got_count) {
>> <         node->slots_given = true;
>> <     } else if (got_max) {
>> <         node->slots = node->slots_max;
>> <         node->slots_given = true;
>> <     } else {
>> <         /* should be set by obj_new, but just to be clear */
>> <         node->slots_given = false;
>> ---
>> >     if (!got_count) {
>> >         if (got_max) {
>> >             node->slots = node->slots_max;
>> >         } else {
>> >             ++node->slots;
>> >         }
>> ....
>>
>> Finally, I added line 402 below, just as a tentative trial. Then it
>> worked.
>>
>> cat -n orte/util/hostfile/hostfile.c:
>> ...
>>    394          if (got_count) {
>>    395              node->slots_given = true;
>>    396          } else if (got_max) {
>>    397              node->slots = node->slots_max;
>>    398              node->slots_given = true;
>>    399          } else {
>>    400              /* should be set by obj_new, but just to be clear */
>>    401              node->slots_given = false;
>>    402              ++node->slots;  /* added by tmishima */
>>    403          }
>> ...
>>
>> Please fix the problem properly, because my change is just based on a
>> rough guess. It's related to the treatment of a hostfile where no slots
>> information is given.
>>
>> Regards,
>> Tetsuya Mishima
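For anyone hitting this before the fix: the workaround Tetsuya describes (giving every hostfile line an explicit slot count just before launching) is easy to script. A minimal sketch, assuming the hostfile is built from Torque's $PBS_NODEFILE; the file name, process count, and binary are illustrative:

    # append an explicit "slots=1" to every host entry, then launch
    sed 's/$/ slots=1/' "$PBS_NODEFILE" > hosts
    mpirun -np 8 -hostfile hosts mPre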