Hi Ben,
Thanks for your feedback. As described here
(https://www.open-mpi.org/faq/?category=running#oversubscribing),
OpenMPI detects that I'm oversubscribing and runs in degraded mode
(yielding the processor). In any case, I repeated the experiments with
the yield flag set explicitly, and I obtained the same puzzling results:
$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36
  -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36
  -> Time in seconds = 110.93
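For completeness, one way to double-check that both installs actually register this parameter (just a quick sketch; I'm assuming ompi_info lists it under exactly this name) is something like:
# Check that the mpi_yield_when_idle MCA parameter is registered in each install
$HOME/openmpi-bin-1.10.1/bin/ompi_info --all | grep -i yield_when_idle
$HOME/openmpi-bin-1.10.2/bin/ompi_info --all | grep -i yield_when_idle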
Given these results, it seems that spin-waiting is not causing the
issue. I also agree that this should not be caused by HyperThreading,
given that CPUs 0-27 correspond to single hardware threads on 28
distinct cores, as shown in the following lstopo output:
Machine (128GB total)
  NUMANode L#0 (P#0 64GB)
    Package L#0 + L3 L#0 (35MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#28)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#29)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#30)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#31)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#32)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#33)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#34)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#35)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#36)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#37)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#38)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#39)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#40)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#41)
    HostBridge L#0
      PCIBridge
        PCI 8086:24f0
          Net L#0 "ib0"
          OpenFabrics L#1 "hfi1_0"
      PCIBridge
        PCI 14e4:1665
          Net L#2 "eno1"
        PCI 14e4:1665
          Net L#3 "eno2"
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
                GPU L#4 "card0"
                GPU L#5 "controlD64"
  NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#42)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#43)
    L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
      PU L#32 (P#16)
      PU L#33 (P#44)
    L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
      PU L#34 (P#17)
      PU L#35 (P#45)
    L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
      PU L#36 (P#18)
      PU L#37 (P#46)
    L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
      PU L#38 (P#19)
      PU L#39 (P#47)
    L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
      PU L#40 (P#20)
      PU L#41 (P#48)
    L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
      PU L#42 (P#21)
      PU L#43 (P#49)
    L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
      PU L#44 (P#22)
      PU L#45 (P#50)
    L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
      PU L#46 (P#23)
      PU L#47 (P#51)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
      PU L#48 (P#24)
      PU L#49 (P#52)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
      PU L#50 (P#25)
      PU L#51 (P#53)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
      PU L#52 (P#26)
      PU L#53 (P#54)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
      PU L#54 (P#27)
      PU L#55 (P#55)
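As an extra sanity check, one could also verify that every launched rank inherits the 0-27 affinity mask, for example with something along these lines (just a sketch; the grep stands in for the real benchmark binary):
# Each rank prints the CPU mask it inherited from taskset
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27 grep Cpus_allowed_list /proc/self/status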
On 26/03/2017 9:37, Ben Menadue wrote:
On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
However, what is puzzling me is the performance difference between
OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later
versions) in my experiments with oversubscription, i.e. 82 seconds
vs. 111 seconds.
You’re oversubscribing while letting the OS migrate individual threads
between cores. That taskset will bind each MPI process to the same set
of 28 logical CPUs (i.e. hardware threads), so if you’re running 36
ranks there then you must have migration happening. Indeed, even when
you only launch 28 MPI ranks, you’ll probably still see migration
between the cores — but likely a lot less. But as soon as you
oversubscribe and spin-wait rather than yield you’ll be very sensitive
to small changes in behaviour — any minor changes in OpenMPI’s
behaviour, while not visible under normal circumstances, will lead to
small changes in how and when the kernel task scheduler runs the
tasks, and this can then multiply dramatically when you have
synchronisation between the tasks via e.g. MPI calls.
Just as a purely hypothetical example, the newer versions /might/
spin-wait in a slightly tighter loop and this /might/ make the Linux
task scheduler less likely to switch between waiting threads. This
delay in switching tasks /could/ appear as increased latency in any
synchronising MPI call. But this is very speculative — it would be
very hard to draw any conclusion about what’s happening if there’s no
clear causative change in the code.
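That said, if you want to see whether the kernel really is scheduling the two versions differently, you could compare context switches and migrations across a whole run. This is only a sketch (it assumes perf is installed, reuses your paths and command line from above, and relies on perf following the child processes that mpirun launches, which it does by default):
# Count scheduler activity for one complete run of each version
perf stat -e context-switches,cpu-migrations \
    $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36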
Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This
will make OpenMPI issue a sched_yield when waiting instead of
spin-waiting constantly. While it’s a performance hit when exactly- or
under-subscribing, I can see it helping a bit when there’s contention
for the cores from over-subscribing. In particular, a call to sched_yield
relinquishes the rest of that process's current time slice, and allows
the task scheduler to run another waiting task (i.e. another of your
MPI ranks) in its place.
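You can also set the same parameter through the environment instead of the mpirun command line, which is handy when launching through a script or a batch scheduler:
# Equivalent to passing --mca mpi_yield_when_idle 1 to mpirun
export OMPI_MCA_mpi_yield_when_idle=1
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36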
So in fact this has nothing to do with HyperThreading — assuming 0
through 27 correspond to a single hardware thread on 28 distinct
cores. Just keep in mind that this might not always be the case — we
have at least one platform where the logical processor number
enumerates the hardware threads before cores, so 0 to (n-1) are the n
threads of the first core, n to (2n-1) are those of the second, and so on.
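A quick way to check how a given machine enumerates them (assuming a reasonably recent util-linux for lscpu) is:
# Shows which core and socket each logical CPU number belongs to
lscpu --extended=CPU,CORE,SOCKET,NODE
# Or, for one core, list its sibling hardware threads
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list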
Cheers,
Ben