I was not expecting different results. I just wanted to respond to Ben's
suggestion and demonstrate that the problem (the performance difference
between v1.10.1 and v1.10.2) is not caused by spin-waiting.
On 27/03/2017 17:00, r...@open-mpi.org wrote:
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed”
setting. So why would you expect different results?
On Mar 27, 2017, at 3:52 AM, Jordi Guitart <jordi.guit...@bsc.es
<mailto:jordi.guit...@bsc.es>> wrote:
Hi Ben,
Thanks for your feedback. As described here
(https://www.open-mpi.org/faq/?category=running#oversubscribing),
OpenMPI detects that I'm oversubscribing and runs in degraded mode
(yielding the processor). Anyway, I repeated the experiments with the
yield flag set explicitly, and I obtained the same weird results:
$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
Given these results, it seems that spin-waiting is not causing the
issue. I also agree that this should not be caused by HyperThreading,
given that 0-27 correspond to single HW threads on distinct cores, as
shown in the following output returned by the lstopo command:
Machine (128GB total)
NUMANode L#0 (P#0 64GB)
Package L#0 + L3 L#0 (35MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#28)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#29)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#30)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#31)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#32)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#33)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#34)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#35)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#36)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#37)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#38)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#39)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#40)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#41)
HostBridge L#0
PCIBridge
PCI 8086:24f0
Net L#0 "ib0"
OpenFabrics L#1 "hfi1_0"
PCIBridge
PCI 14e4:1665
Net L#2 "eno1"
PCI 14e4:1665
Net L#3 "eno2"
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 102b:0534
GPU L#4 "card0"
GPU L#5 "controlD64"
NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#42)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#43)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#44)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#45)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#46)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#47)
L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#48)
L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#49)
L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#50)
L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#51)
L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#52)
L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#53)
L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#54)
L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#55)
On 26/03/2017 9:37, Ben Menadue wrote:
On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es
<mailto:jordi.guit...@bsc.es>> wrote:
However, what is puzzling me is the performance difference between
OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later
versions) in my experiments with oversubscription, i.e. 82 seconds
vs. 111 seconds.
You’re oversubscribing while letting the OS migrate individual
threads between cores. That taskset will bind each MPI process to
the same set of 28 logical CPUs (i.e. hardware threads), so if
you’re running 36 ranks there then you must have migration
happening. Indeed, even when you only launch 28 MPI ranks, you’ll
probably still see migration between the cores — but likely a lot
less. But as soon as you oversubscribe and spin-wait rather than
yield you’ll be very sensitive to small changes in behaviour — any
minor changes in OpenMPI’s behaviour, while not visible under normal
circumstances, will lead to small changes in how and when the kernel
task scheduler runs the tasks, and this can then multiply
dramatically when you have synchronisation between the tasks via
e.g. MPI calls.
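One way to observe that migration directly is with a small diagnostic along these lines (a sketch only, not part of the NPB benchmark; names are placeholders). Each rank reports the CPU it is currently running on and the CPUs it is allowed to use, so repeated runs under oversubscription show the ranks moving around within the taskset mask:

#define _GNU_SOURCE            /* for sched_getcpu() and the CPU_* macros */
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Affinity mask inherited from taskset (or from the MPI binding). */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = this process */

    char allowed[512] = "";
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &mask))
            snprintf(allowed + strlen(allowed),
                     sizeof(allowed) - strlen(allowed), "%d ", c);

    printf("rank %d currently on CPU %d, allowed CPUs: %s\n",
           rank, sched_getcpu(), allowed);

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it with the same "mpirun ... taskset -c 0-27" prefix as the benchmark to see which logical CPUs each rank lands on over time.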
Just as a purely hypothetical example, the newer versions /might/
spin-wait in a slightly tighter loop and this /might/ make the Linux
task scheduler less likely to switch between waiting threads. This
delay in switching tasks /could/ appear as increased latency in any
synchronising MPI call. But this is very speculative — it would be
very hard to draw any conclusion about what’s happening if there’s
no clear causative change in the code.
Try adding "--mca mpi_yield_when_idle 1" to your mpirun command.
This will make OpenMPI issue a sched_yield when waiting instead of
spin-waiting constantly. While it’s a performance hit when exactly-
or under-subscribing, I can see it helping a bit when there’s
contention for the cores from over-subscribing. In particular, a call
to sched_yield relinquishes the rest of that process's current time
slice and allows the task scheduler to run another waiting task (i.e.
another of your MPI ranks) in its place.
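To make the difference concrete, here is a minimal standalone sketch (not OpenMPI's actual progress loop) of the two waiting strategies. The spin-wait variant burns its entire time slice polling; the yielding variant hands the remainder of the slice back to the scheduler so another runnable task, such as one of the other oversubscribed ranks, can use the core:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int ready;                 /* stands in for "message arrived" */

static void *sender(void *arg) {
    (void)arg;
    sleep(1);                            /* pretend the peer is busy */
    atomic_store(&ready, 1);
    return NULL;
}

/* Pure spin-wait: keeps the core busy until the kernel preempts us. */
static void wait_spinning(void) {
    while (!atomic_load(&ready))
        ;
}

/* Yielding wait: sched_yield() gives up the rest of the time slice so the
 * scheduler can run another waiting task on this core. */
static void wait_yielding(void) {
    while (!atomic_load(&ready))
        sched_yield();
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);
    wait_yielding();                     /* swap in wait_spinning() to compare */
    pthread_join(t, NULL);
    puts("done");
    return 0;
}

Compile with "gcc -pthread".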
So in fact this has nothing to do with HyperThreading — assuming 0
through 27 correspond to a single hardware thread on 28 distinct
cores. Just keep in mind that this might not always be the case — we
have at least one platform where the logical processor numbering
enumerates the hardware threads before the cores, so 0 to (n-1) are the
n threads of the first core, n to (2n-1) are those of the second core,
and so on.
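If you want to double-check that mapping programmatically instead of reading the lstopo output by hand, a short hwloc program along these lines will print the PUs under each core (a sketch, assuming the hwloc development headers are installed):

#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* For each core, list its PUs with logical (L#) and OS/physical (P#)
     * indices, mirroring what lstopo prints. */
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        printf("Core L#%u:", core->logical_index);
        for (hwloc_obj_t pu = hwloc_get_next_obj_inside_cpuset_by_type(
                 topo, core->cpuset, HWLOC_OBJ_PU, NULL);
             pu != NULL;
             pu = hwloc_get_next_obj_inside_cpuset_by_type(
                 topo, core->cpuset, HWLOC_OBJ_PU, pu))
            printf("  PU L#%u (P#%u)", pu->logical_index, pu->os_index);
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}

Link with -lhwloc. If the P# numbers of the first core come out as 0 and 1 rather than 0 and 28, you are on a layout that enumerates hardware threads before cores, and a mask like 0-27 would then cover only 14 physical cores instead of 28.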
Cheers,
Ben