Hmmm... that's strange. I only have 2 sockets on my system, but let me poke
around a bit and see what might be happening.

On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> Thanks. I didn't know the meaning of "socket:span".
> 
> But the problem still occurs; it seems socket:span doesn't take effect.
> 
> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
> qsub: waiting for job 8265.manage.cluster to start
> qsub: job 8265.manage.cluster ready
> 
> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> Hello world from process 0 of 8
> Hello world from process 3 of 8
> Hello world from process 1 of 8
> Hello world from process 4 of 8
> Hello world from process 6 of 8
> Hello world from process 5 of 8
> Hello world from process 2 of 8
> Hello world from process 7 of 8
> 
> Regards,
> Tetsuya Mishima
> 
>> No, that is actually correct. We map a socket until full, then move to
>> the next. What you want is --map-by socket:span
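>> 
>> Roughly speaking, the two policies differ only in when the mapper moves
>> on to the next socket. Here's a toy sketch of that difference
>> (illustrative C, not the actual ORTE mapper code; the constants match
>> your four-socket, eight-core-per-socket node and the flags above):
>> 
>> #include <stdio.h>
>> 
>> #define NSOCK 4   /* sockets on the node  */
>> #define CORES 8   /* cores per socket     */
>> #define NP    8   /* MPI processes        */
>> #define CPP   4   /* -cpus-per-proc value */
>> 
>> int main(void)
>> {
>>     /* -map-by socket: fill one socket before moving to the next */
>>     int used[NSOCK] = {0};
>>     for (int r = 0, s = 0; r < NP; r++) {
>>         if (used[s] + CPP > CORES) s++;   /* socket full, advance */
>>         printf("fill: rank %d -> socket %d, local cores %d-%d\n",
>>                r, s, used[s], used[s] + CPP - 1);
>>         used[s] += CPP;
>>     }
>>     /* -map-by socket:span: round-robin over all sockets first */
>>     int spun[NSOCK] = {0};
>>     for (int r = 0; r < NP; r++) {
>>         int s = r % NSOCK;
>>         printf("span: rank %d -> socket %d, local cores %d-%d\n",
>>                r, s, spun[s], spun[s] + CPP - 1);
>>         spun[s] += CPP;
>>     }
>>     return 0;
>> }
>> 
>> With these numbers, fill places ranks 0-1 on socket 0, ranks 2-3 on
>> socket 1, and so on, while span gives each of ranks 0-3 its own socket
>> before wrapping around.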
>> 
>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Ralph,
>>> 
>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
>>> 
>>> It stopped the error, but unfortunately "mapping by socket" itself
>>> didn't work well, as shown below:
>>> 
>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
>>> qsub: waiting for job 8260.manage.cluster to start
>>> qsub: job 8260.manage.cluster ready
>>> 
>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>> Hello world from process 2 of 8
>>> Hello world from process 1 of 8
>>> Hello world from process 3 of 8
>>> Hello world from process 0 of 8
>>> Hello world from process 6 of 8
>>> Hello world from process 5 of 8
>>> Hello world from process 4 of 8
>>> Hello world from process 7 of 8
>>> 
>>> I think the bindings should look like this:
>>> 
>>> rank 00
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> rank 01
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> rank 02
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> ...
>>> 
>>> Regards,
>>> Tetsuya Mishima
>>> 
>>>> I fixed this under the trunk (was an issue regardless of RM) and have
>>>> scheduled it for 1.7.4.
>>>> 
>>>> Thanks!
>>>> Ralph
>>>> 
>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> Hi Ralph,
>>>>> 
>>>>> Thank you very much for your quick response.
>>>>> 
>>>>> I'm afraid I have found one more issue...
>>>>> 
>>>>> It's not so serious; please check it when you have time.
>>>>> 
>>>>> The problem is cpus-per-proc combined with the -map-by option under
>>>>> the Torque manager. It doesn't work, as shown below. I guess you
>>>>> would see the same behaviour under the Slurm manager.
>>>>> 
>>>>> Of course, if I remove the -map-by option, it works quite well.
>>>>> 
>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>> qsub: job 8116.manage.cluster ready
>>>>> 
>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>> 
>>>>> Bind to:     CORE
>>>>> Node:        node03
>>>>> #processes:  2
>>>>> #cpus:       1
>>>>> 
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> 
>>>>> --------------------------------------------------------------------------
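>>>>> 
>>>>> (As the message suggests, the check can presumably be bypassed by
>>>>> adding the overload-allowed qualifier to the binding directive, e.g.
>>>>> "-bind-to core:overload-allowed", but that would only mask the
>>>>> mapping problem, so it is not what I want here.)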
>>>>> 
>>>>> 
>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>> 
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>> 
>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>> 
>>>>>> 
>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> 
>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>>> 
>>>>>> I'll update tomorrow.
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>> 
>>>>>> 
>>>>>> Hi Ralph,
>>>>>> 
>>>>>> This is a follow-up to "Segmentation fault in oob_tcp.c of
>>>>>> openmpi-1.7.4a1r29646".
>>>>>> 
>>>>>> I found the cause.
>>>>>> 
>>>>>> Firstly, I noticed that your hostfile works and mine does not.
>>>>>> 
>>>>>> Your host file:
>>>>>> cat hosts
>>>>>> bend001 slots=12
>>>>>> 
>>>>>> My host file:
>>>>>> cat hosts
>>>>>> node08
>>>>>> node08
>>>>>> ...(total 8 lines)
>>>>>> 
>>>>>> I modified my script file to add "slots=1" to each line of my
>>>>>> hostfile just before launching mpirun. Then it worked.
>>>>>> 
>>>>>> My host file(modified):
>>>>>> cat hosts
>>>>>> node08 slots=1
>>>>>> node08 slots=1
>>>>>> ...(total 8 lines)
>>>>>> 
>>>>>> Secondly, I confirmed that there's a slight difference between
>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>> 
>>>>>> $ diff hostfile.c.org \
>>>>>>     ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>> 394,401c394,399
>>>>>> <     if (got_count) {
>>>>>> <         node->slots_given = true;
>>>>>> <     } else if (got_max) {
>>>>>> <         node->slots = node->slots_max;
>>>>>> <         node->slots_given = true;
>>>>>> <     } else {
>>>>>> <         /* should be set by obj_new, but just to be clear */
>>>>>> <         node->slots_given = false;
>>>>>> ---
>>>>>>>   if (!got_count) {
>>>>>>>       if (got_max) {
>>>>>>>           node->slots = node->slots_max;
>>>>>>>       } else {
>>>>>>>           ++node->slots;
>>>>>>>       }
>>>>>> ....
>>>>>> 
>>>>>> Finally, as a tentative trial, I added line 402 below.
>>>>>> Then it worked.
>>>>>> 
>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>  ...
>>>>>>  394      if (got_count) {
>>>>>>  395          node->slots_given = true;
>>>>>>  396      } else if (got_max) {
>>>>>>  397          node->slots = node->slots_max;
>>>>>>  398          node->slots_given = true;
>>>>>>  399      } else {
>>>>>>  400          /* should be set by obj_new, but just to be clear */
>>>>>>  401          node->slots_given = false;
>>>>>>  402          ++node->slots; /* added by tmishima */
>>>>>>  403      }
>>>>>>  ...
>>>>>> 
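>>>>>> To illustrate my understanding of the intended rule (a toy C sketch
>>>>>> of the counting only, not the real ORTE parser): every slot-less line
>>>>>> for a node should add one slot, while an explicit slots=N overrides
>>>>>> the count.
>>>>>> 
>>>>>> #include <stdio.h>
>>>>>> 
>>>>>> int main(void)
>>>>>> {
>>>>>>     /* repeated slot-less lines for one node, as in my hostfile */
>>>>>>     const char *lines[] = { "node08", "node08", "node08" };
>>>>>>     int slots = 0, slots_given = 0, n;
>>>>>> 
>>>>>>     for (int i = 0; i < 3; i++) {
>>>>>>         if (sscanf(lines[i], "%*s slots=%d", &n) == 1) {
>>>>>>             slots = n;       /* got_count: slots_given = true */
>>>>>>             slots_given = 1;
>>>>>>         } else if (!slots_given) {
>>>>>>             ++slots;         /* the "++node->slots" path of 1.7.3 */
>>>>>>         }
>>>>>>     }
>>>>>>     printf("node08: %d slot(s)\n", slots);   /* prints 3 */
>>>>>>     return 0;
>>>>>> }
>>>>>> 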
>>>>>> Please fix the problem properly, because my change is based on just a
>>>>>> rough guess. The issue is related to how a hostfile is treated when no
>>>>>> slots information is given.
>>>>>> 
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>> 