Thanks - I'm just trying to reproduce one problem case so I can look at it. Given that I don't have access to a Torque machine, I need to "fake" it.
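One way to approximate it without TM would be to constrain mpirun from the outside the way the Torque cpuset would, and then watch where the ranks land. This is only a sketch - the core range matches Brock's case, ./a.out is just a placeholder application, and whether an external hwloc binding stands in faithfully for a cgroup/cpuset is part of what needs verifying:

  # pretend the "job" only owns cores 8-11, as in the nyx5398 case
  hwloc-bind core:8-11 -- mpirun -np 4 --report-bindings ./a.out

  # then check what each rank actually got, same as in the report below
  hwloc-bind --get --pid <rank pid>

If the masks come back overlapping under that restriction, the problem should be reproducible without a Torque machine.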
On Jun 20, 2014, at 9:15 AM, Brock Palen <bro...@umich.edu> wrote:

> In this case they are a single socket, but as you can see they could be
> either/or depending on the job.
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
>
> On Jun 19, 2014, at 2:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Sorry, I should have been clearer - I was asking if cores 8-11 are all on
>> one socket, or span multiple sockets.
>>
>> On Jun 19, 2014, at 11:36 AM, Brock Palen <bro...@umich.edu> wrote:
>>
>>> Ralph,
>>>
>>> It was a large job spread across nodes. Our system allows users to ask
>>> for 'procs', which are laid out in any format.
>>>
>>> The list:
>>>
>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>
>>> shows that nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.
>>>
>>> They could be spread across any number of socket configurations. We start
>>> very lax ("the user requests X procs") and the user can then add stricter
>>> requirements from there. We support mostly serial users, and users can
>>> colocate on nodes.
>>>
>>> That is good to know; I think we would want to change our default to
>>> 'bind to core', except for our few users who run in hybrid mode.
>>>
>>> Our CPU set tells you what cores the job is assigned. So in the problem
>>> case provided, the cpuset/cgroup shows only cores 8-11 are available to
>>> this job on this node.
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>>
>>> On Jun 18, 2014, at 11:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> The default binding option depends on the number of procs - it is
>>>> bind-to core for np=2 and bind-to socket for np > 2. You never said, but
>>>> should I assume you ran 4 ranks? If so, then we should be trying to
>>>> bind-to socket.
>>>>
>>>> I'm not sure what your cpuset is telling us - are you binding us to a
>>>> socket? Are some cpus in one socket and some in another?
>>>>
>>>> It could be that the cpuset + bind-to socket is resulting in some odd
>>>> behavior, but I'd need a little more info to narrow it down.
>>>>
>>>> On Jun 18, 2014, at 7:48 PM, Brock Palen <bro...@umich.edu> wrote:
>>>>
>>>>> I have started using 1.8.1 for some codes (meep in this case). It
>>>>> sometimes works fine, but in a few cases I am seeing ranks being given
>>>>> overlapping CPU assignments - not always, though.
>>>>>
>>>>> Example job, default binding options (so by-core, right?):
>>>>>
>>>>> Assigned nodes - the one in question is nyx5398. We use Torque CPU sets
>>>>> and use TM to spawn.
>>>>>
>>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>>
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
>>>>> 0x00000200
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
>>>>> 0x00000800
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
>>>>> 0x00000200
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
>>>>> 0x00000800
>>>>>
>>>>> [root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>>>>> 8-11
>>>>>
>>>>> So Torque claims the CPU set set up for the job has 4 cores, but as you
>>>>> can see the ranks were given identical bindings.
>>>>>
>>>>> I checked the pids - they were part of the correct CPU set. I also
>>>>> checked orted:
>>>>>
>>>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
>>>>> 0x00000f00
>>>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
>>>>> ignored unrecognized argument 16064
>>>>>
>>>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
>>>>> 8,9,10,11
>>>>>
>>>>> Which is exactly what I would expect.
>>>>>
>>>>> So umm, I'm lost as to why this might happen. What else should I check?
>>>>> Like I said, not all jobs show this behavior.
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> bro...@umich.edu
>>>>> (734)936-1985
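For what it's worth, when a suspect job turns up again the per-rank check above can be done in one pass rather than pid by pid. A small sketch, assuming the ranks are direct children of the orted and with ORTED_PID standing in for the orted's actual pid:

  for pid in $ORTED_PID $(pgrep -P $ORTED_PID); do
      mask=$(hwloc-bind --get --pid $pid)
      echo "$pid  $mask  PUs: $(hwloc-calc --intersect PU $mask)"
  done

Note that hwloc-calc wants the bitmask, not the pid - that is why the earlier "hwloc-calc --intersect PU 16064" call was rejected; feeding it the mask from hwloc-bind gives the PU list directly.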