Thanks - I'm just trying to reproduce one problem case so I can look at it. 
Given that I don't have access to a Torque machine, I need to "fake" it.


On Jun 20, 2014, at 9:15 AM, Brock Palen <bro...@umich.edu> wrote:

> In this case they are all on a single socket, but as you can see it could go 
> either way depending on the job.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jun 19, 2014, at 2:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> Sorry, I should have been clearer - I was asking if cores 8-11 are all on 
>> one socket, or span multiple sockets
>> 
>> 
>> On Jun 19, 2014, at 11:36 AM, Brock Palen <bro...@umich.edu> wrote:
>> 
>>> Ralph,
>>> 
>>> It was a large job spread across many nodes.  Our system allows users to 
>>> ask for 'procs', which can be laid out in any arrangement. 
>>> 
>>> The list:
>>> 
>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>> 
>>> This shows that nyx5406 had 2 cores, nyx5427 also had 2, and nyx5411 had 11.
>>> 
>>> They could be spread across any number of socket configurations.  We start 
>>> very lax ("user requests X procs"), and the user can then add stricter 
>>> requirements from there.  We support mostly serial users, and jobs can be 
>>> colocated on nodes.
>>> 
>>> That is good to know; I think we would want to change our default to 'bind 
>>> to core', except for our few users who run hybrid mode.
>>> 
>>> Our cpuset tells you which cores the job is assigned.  So in the problem 
>>> case provided, the cpuset/cgroup shows that only cores 8-11 are available 
>>> to this job on this node.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On Jun 18, 2014, at 11:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> The default binding option depends on the number of procs - it is bind-to 
>>>> core for np=2, and bind-to socket for np > 2. You never said, but should I 
>>>> assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
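
[Editor's note: a minimal sketch of the default binding policy described above, 
assuming np <= 2 also binds to core; the function name and structure are mine, 
not Open MPI code.]

```python
def default_binding(np: int) -> str:
    """Open MPI 1.8's default binding policy as described in this thread:
    bind-to core for np=2 (assumed to cover np=1 as well), bind-to socket
    for np > 2."""
    return "core" if np <= 2 else "socket"

# A 4-rank job like the one under discussion should default to socket binding.
print(default_binding(4))
```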
>>>> 
>>>> I'm not sure what your cpuset is telling us - are you binding us to a 
>>>> socket? Are some cpus in one socket, and some in another?
>>>> 
>>>> It could be that the cpuset + bind-to socket is resulting in some odd 
>>>> behavior, but I'd need a little more info to narrow it down.
>>>> 
>>>> 
>>>> On Jun 18, 2014, at 7:48 PM, Brock Palen <bro...@umich.edu> wrote:
>>>> 
>>>>> I have started using 1.8.1 for some codes (meep in this case).  It 
>>>>> usually works fine, but in a few cases I am seeing ranks being given 
>>>>> overlapping CPU assignments.
>>>>> 
>>>>> Example job, default binding options (so by core, right?):
>>>>> 
>>>>> Assigned nodes (the one in question is nyx5398); we use Torque cpusets 
>>>>> and use TM to spawn.
>>>>> 
>>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>> 
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
>>>>> 0x00000200
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
>>>>> 0x00000800
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
>>>>> 0x00000200
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
>>>>> 0x00000800
>>>>> 
>>>>> [root@nyx5398 ~]# cat 
>>>>> /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus 
>>>>> 8-11
>>>>> 
>>>>> So Torque claims the cpuset set up for the job has 4 cores, but as you 
>>>>> can see, pairs of ranks were given identical bindings. 
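
[Editor's note: to make the overlap concrete, a small sketch (mine, not part of 
hwloc) that decodes those bitmasks the same way `hwloc-calc --intersect PU` 
does: each set bit in the mask is one PU index.]

```python
def mask_to_pus(mask: int) -> list[int]:
    """Return the PU indices whose bits are set in a hwloc cpuset bitmask."""
    pus = []
    i = 0
    while mask:
        if mask & 1:
            pus.append(i)
        mask >>= 1
        i += 1
    return pus

# The four ranks' masks reported above: the pairs share one PU each,
# i.e. two ranks are pinned to PU 9 and two to PU 11.
for pid, mask in [(16065, 0x200), (16066, 0x800),
                  (16067, 0x200), (16068, 0x800)]:
    print(pid, mask_to_pus(mask))
```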
>>>>> 
>>>>> I checked the pids; they were part of the correct cpuset.  I also 
>>>>> checked orted:
>>>>> 
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
>>>>> 0x00000f00
>>>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
>>>>> ignored unrecognized argument 16064
>>>>> 
>>>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
>>>>> 8,9,10,11
>>>>> 
>>>>> Which is exactly what I would expect.
>>>>> 
>>>>> So, umm, I'm lost as to why this might happen.  What else should I check? 
>>>>> Like I said, not all jobs show this behavior.
>>>>> 
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24672.php
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/users/2014/06/24673.php
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/06/24675.php
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/06/24676.php
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24677.php
