>>> In your desired ordering you have rank 0 on (socket,core) (0,0) and 
>>> rank 1 on (0,2). Is there an architectural reason for that? Meaning 
>>> are cores 0 and 1 hardware threads in the same core, or is there a 
>>> cache level (say L2 or L3) connecting cores 0 and 1 separate from 
>>> cores 2 and 3? 

My thinking was that each MPI rank will run 2 OpenMP threads, and that there 
might be some benefit to having those threads execute on cores 0 and 1, since 
those cores might share some level of the memory hierarchy. No hardware 
threading is being used.
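
To make the layout I have in mind concrete, here is a rough Python sketch of 
the rank-to-core arithmetic. The function name and its defaults are only 
illustrative assumptions (2 threads per rank as described above, 10 cores per 
socket as in the output below); it is not tied to any particular MPI launcher.

```python
def rank_to_cores(rank, threads_per_rank=2, cores_per_socket=10):
    """Return (socket, [socket-local core ids]) for one MPI rank.

    Hypothetical helper: assumes ranks fill a socket before moving to the
    next one and that each rank owns a block of adjacent cores.
    """
    ranks_per_socket = cores_per_socket // threads_per_rank
    socket = rank // ranks_per_socket
    first_core = (rank % ranks_per_socket) * threads_per_rank
    return socket, list(range(first_core, first_core + threads_per_rank))


if __name__ == "__main__":
    for r in range(6):
        print(r, rank_to_cores(r))
```

With these defaults rank 0 gets cores 0 and 1 of socket 0, and rank 1 starts 
at core 2, which matches the (0,0) / (0,2) ordering quoted above.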

>>> hwloc's lstopo should give you that information if you don't have that 
>>> information handy. 

Here you go, first likwid output then hwloc, just for the first socket.

likwid output:
*************************************************************
Graphical:
*************************************************************
Socket 0:
+-----------------------------------------------------------------------------------------------------+
| +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ |
| |   0   | |   1   | |   2   | |   3   | |   4   | |   5   | |   6   | |   7   | |   8   | |   9   | |
| +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ |
| +-----------------+ +-----------------+ +-----------------+ +-----------------+ +-----------------+ |
| |       32kB      | |       32kB      | |       32kB      | |       32kB      | |       32kB      | |
| +-----------------+ +-----------------+ +-----------------+ +-----------------+ +-----------------+ |
| +-----------------+ +-----------------+ +-----------------+ +-----------------+ +-----------------+ |
| |      256kB      | |      256kB      | |      256kB      | |      256kB      | |      256kB      | |
| +-----------------+ +-----------------+ +-----------------+ +-----------------+ +-----------------+ |
| +-------------------------------------------------------------------------------------------------+ |
| |                                               30MB                                              | |
| +-------------------------------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------------------------+

hwloc output:

Machine (512GB)
  NUMANode L#0 (P#0 64GB) + Socket L#0 + L3 L#0 (30MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
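
In case it helps to read the hwloc listing mechanically, here is a minimal 
Python sketch that groups PUs by the cache instance they sit under. It assumes 
the indented lstopo text format shown above (children indented deeper than 
their parent, objects on one line joined by " + "); the function name 
`sharing_groups` is my own and is not part of hwloc.

```python
import re


def sharing_groups(lstopo_text):
    """Map each cache level (e.g. 'L2') to lists of PU physical ids,
    one list per cache instance, based on line indentation."""
    groups = {}
    stack = []  # entries: (indent, [(cache_level, logical_index), ...])
    for line in lstopo_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        # Drop ancestors that are not parents of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        caches = re.findall(r"\b(L\d[di]?) L#(\d+)", line)
        stack.append((indent, caches))
        pu = re.search(r"PU L#\d+ \(P#(\d+)\)", line)
        if pu:
            # Credit this PU to every cache on the path from the root.
            for _, cache_list in stack:
                for level, idx in cache_list:
                    per_level = groups.setdefault(level, {})
                    per_level.setdefault(int(idx), []).append(int(pu.group(1)))
    return {level: list(g.values()) for level, g in groups.items()}


if __name__ == "__main__":
    import sys
    for level, pus in sorted(sharing_groups(sys.stdin.read()).items()):
        print(level, pus)
```

Fed the hwloc listing above on stdin, this should report each L1 and L2 with a 
single PU and the L3 with all ten, i.e. hwloc itself shows nothing below the 
L3 shared between cores 0 and 1.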

Thanks again
