I have started using 1.8.1 for some codes (meep in this case). It sometimes works fine, but in a few cases I am seeing ranks being given overlapping CPU assignments.
Example job, default binding options (so by-core, right?).

Assigned nodes are below; the one in question is nyx5398. We use torque CPU sets, and use TM to spawn.

[nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
[nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
[nyx5409:11][nyx5411:11][nyx5412:3]

[root@nyx5398 ~]# hwloc-bind --get --pid 16065
0x00000200
[root@nyx5398 ~]# hwloc-bind --get --pid 16066
0x00000800
[root@nyx5398 ~]# hwloc-bind --get --pid 16067
0x00000200
[root@nyx5398 ~]# hwloc-bind --get --pid 16068
0x00000800

[root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
8-11

So torque claims the CPU set set up for the job has 4 cores, but as you can see the ranks were given identical bindings: two ranks on 0x00000200 and two on 0x00000800.

I checked that the pids were part of the correct CPU set. I also checked orted:

[root@nyx5398 ~]# hwloc-bind --get --pid 16064
0x00000f00
[root@nyx5398 ~]# hwloc-calc --intersect PU 16064
ignored unrecognized argument 16064

[root@nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
8,9,10,11

Which is exactly what I would expect.

So, um, I'm lost as to why this might happen. What else should I check? Like I said, not all jobs show this behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985
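
P.S. For anyone who wants to check other jobs for the same thing, here is a rough sketch of how the masks above can be decoded and compared. It is only an illustration: the helper names (mask_to_cpus, find_overlaps) are made up for this example, the pid-to-mask table is copied by hand from the hwloc-bind output above, and it assumes single-word hex masks (nodes with more PUs may print comma-separated masks, which this does not handle).

```python
def mask_to_cpus(mask_hex):
    """Return the sorted list of PU indices set in a single-word hex bitmask."""
    value = int(mask_hex, 16)  # int() accepts the leading "0x" when base is 16
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def find_overlaps(bindings):
    """Map each PU index to the pids bound to it, keeping only PUs shared by 2+ pids."""
    owners = {}
    for pid, mask in bindings.items():
        for cpu in mask_to_cpus(mask):
            owners.setdefault(cpu, []).append(pid)
    return {cpu: pids for cpu, pids in owners.items() if len(pids) > 1}


# Masks copied by hand from the hwloc-bind --get output for the four ranks.
bindings = {
    16065: "0x00000200",
    16066: "0x00000800",
    16067: "0x00000200",
    16068: "0x00000800",
}

# Mask reported for orted (pid 16064), which matches the cpuset 8-11.
allowed = set(mask_to_cpus("0x00000f00"))

overlaps = find_overlaps(bindings)
used = {cpu for mask in bindings.values() for cpu in mask_to_cpus(mask)}
idle = sorted(allowed - used)

for cpu, pids in sorted(overlaps.items()):
    print(f"PU {cpu} is shared by pids {pids}")
print(f"PUs in the cpuset with no rank bound: {idle}")
```

Decoded this way, the four ranks land on PUs 9 and 11 only (two ranks each), while PUs 8 and 10 in the cpuset have no rank bound to them.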