I have started using 1.8.1 for some codes (meep in this case). It works fine 
in most cases, but in a few I am seeing ranks given overlapping CPU 
assignments.

Example job with default binding options (so by-core, right?):

Assigned nodes are listed below; the one in question is nyx5398. We use Torque 
CPU sets and use TM to spawn.

[nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
[nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
[nyx5409:11][nyx5411:11][nyx5412:3]

[root@nyx5398 ~]# hwloc-bind --get --pid 16065
0x00000200
[root@nyx5398 ~]# hwloc-bind --get --pid 16066
0x00000800
[root@nyx5398 ~]# hwloc-bind --get --pid 16067
0x00000200
[root@nyx5398 ~]# hwloc-bind --get --pid 16068
0x00000800
      
[root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus 
8-11

So Torque says the CPU set for the job has 4 cores, but as you can see the 
ranks were given identical bindings in pairs (0x00000200 twice and 0x00000800 
twice).
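
To make the overlap explicit, here is a minimal sketch that decodes the masks reported by hwloc-bind into CPU indices and lists any index held by more than one PID. The helper names (mask_to_cpus, find_overlaps) are made up for illustration and are not part of hwloc or Open MPI. It assumes single-word masks like the ones shown above; wider masks, which hwloc prints as comma-separated words, would need extra parsing.

```python
def mask_to_cpus(mask_hex):
    """Return the sorted CPU indices set in a single-word hex bitmask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def find_overlaps(bindings):
    """Map each CPU index to the PIDs bound to it, keeping only shared CPUs."""
    owners = {}
    for pid, mask_hex in bindings.items():
        for cpu in mask_to_cpus(mask_hex):
            owners.setdefault(cpu, []).append(pid)
    return {cpu: pids for cpu, pids in owners.items() if len(pids) > 1}


# Values copied from the hwloc-bind output above.
rank_bindings = {
    16065: "0x00000200",
    16066: "0x00000800",
    16067: "0x00000200",
    16068: "0x00000800",
}

overlaps = find_overlaps(rank_bindings)
for cpu, pids in sorted(overlaps.items()):
    print(f"CPU {cpu} is shared by PIDs {pids}")
```

With the four masks above, this reports that bit 9 and bit 11 are each held by two PIDs.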

I checked that the PIDs were part of the correct CPU set. I also checked orted:

[root@nyx5398 ~]# hwloc-bind --get --pid 16064
0x00000f00
[root@nyx5398 ~]# hwloc-calc --intersect PU 16064
ignored unrecognized argument 16064

[root@nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
8,9,10,11

Which is exactly what I would expect.
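
The same comparison can be scripted. This is only a sketch under the assumption that the cpuset file uses the usual Linux list syntax (for example "8-11" or "0,2-3") and that the masks fit in a single hex word. The function names are hypothetical helpers, not hwloc or Torque APIs.

```python
def parse_cpu_list(text):
    """Expand a cpuset list string such as '8-11' or '0,2-3' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def mask_to_set(mask_hex):
    """Return the set of CPU indices set in a single-word hex bitmask."""
    mask = int(mask_hex, 16)
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


# Values copied from the transcript: the Torque cpuset file, the orted mask,
# and the two distinct masks seen on the ranks.
cpuset_cpus = parse_cpu_list("8-11")
orted_cpus = mask_to_set("0x00000f00")
rank_cpus = mask_to_set("0x00000200") | mask_to_set("0x00000800")

print("orted matches cpuset:", orted_cpus == cpuset_cpus)
print("cpuset CPUs with no rank bound:", sorted(cpuset_cpus - rank_cpus))
```

For this job the orted mask matches the cpuset exactly, while two of the four CPUs in the cpuset have no rank bound to them.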

So I'm lost as to why this might happen. What else should I check? Like I 
said, not all jobs show this behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985


