Hi,
> On 26 Mar 2017, at 1:13 am, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
> Here's an old post on this list where I cited a paper from the Intel
> Technology Journal.
Thanks for that link! I need to go through it in detail, but this paragraph did
jump out at me:
On a processor with Hyper-Threading Technology, executing HALT transitions the
processor from MT-mode to ST0- or ST1-mode, depending on which logical
processor executed the HALT. For example, if logical processor 0 executes HALT,
only logical processor 1 would be active; the physical processor would be in
ST1-mode and partitioned resources would be recombined giving logical processor
1 full use of all processor resources. If the remaining active logical
processor also executes HALT, the physical processor would then be able to go
to a lower-power mode.
Linux’s task scheduler will issue halt instructions when there are no runnable
tasks (unless you’re running with idle=poll, a nice way to make your datacentre
toasty warm). This suggests that as long as you don’t schedule tasks on the
second hardware thread, the first will have access to all the resources of the
CPU. Yes, you’ll halve your L1 cache and other partitioned resources as soon as
an interrupt wakes the sleeping hardware thread, but hopefully that doesn’t
happen too often. Turning off one hardware thread (via
/sys/devices/system/cpu/cpu?/online) should force it to issue halt instructions
whenever that thread gets woken by an interrupt, since that thread will then
never have anything scheduled on it.
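As a minimal sketch (assuming cpu29 happens to be the sibling hardware thread
you want to park; check the topology files mentioned further down before
picking a number):

    # take logical CPU 29 offline; the kernel stops scheduling on it
    echo 0 | sudo tee /sys/devices/system/cpu/cpu29/online
    # bring it back later
    echo 1 | sudo tee /sys/devices/system/cpu/cpu29/online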
> On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
> However, what is puzzling me is the performance difference between OpenMPI
> 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later versions) in my
> experiments with oversubscription, i.e. 82 seconds vs. 111 seconds.
You’re oversubscribing while letting the OS migrate individual threads between
cores. That taskset invocation binds each MPI process to the same set of 28
logical CPUs (i.e. hardware threads), so if you’re running 36 ranks there then
you must have migration happening. Indeed, even when you only launch 28 MPI
ranks, you’ll probably still see migration between the cores, though likely a
lot less.
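If you want to check where the ranks actually end up, Open MPI can do the
binding itself and report it at startup (the binary name here is just a
placeholder):

    # one rank per core, with the chosen binding printed at launch
    mpirun --bind-to core --report-bindings -np 28 ./your_app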
But as soon as you oversubscribe and spin-wait rather than yield, you’ll be
very sensitive to small changes in behaviour: any minor change in OpenMPI’s
behaviour, while not visible under normal circumstances, will lead to small
changes in how and when the kernel task scheduler runs the tasks, and these
can then be amplified dramatically when the tasks synchronise with each other
via e.g. MPI calls.
Just as a purely hypothetical example, the newer versions might spin-wait in a
slightly tighter loop and this might make the Linux task scheduler less likely
to switch between waiting threads. This delay in switching tasks could appear
as increased latency in any synchronising MPI call. But this is very
speculative — it would be very hard to draw any conclusion about what’s
happening if there’s no clear causative change in the code.
Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This will make
OpenMPI issue a sched_yield when waiting instead of spin-waiting constantly.
While it’s a performance hit when exactly- or under-subscribing, I can see it
helping a bit when there’s contention for the cores from over-subscribing. In
particular, a call to sched_yield() relinquishes the rest of that process’s
current time slice and allows the task scheduler to run another waiting task
(i.e. another of your MPI ranks) in its place.
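Concretely, something like this (same placeholder binary as above; 36 ranks to
match your oversubscribed case):

    mpirun --mca mpi_yield_when_idle 1 -np 36 ./your_app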
So in fact this has nothing to do with HyperThreading, assuming logical CPUs 0
through 27 correspond to a single hardware thread on 28 distinct cores. Just
keep in mind that this might not always be the case: we have at least one
platform where the logical processor numbering enumerates the hardware threads
before the cores, so 0 to (n-1) are the n threads of the first core, n to
(2n-1) are those of the second, and so on.
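You can check how your machine enumerates them, since the kernel exposes which
logical CPUs share a core:

    # hardware threads sharing a core with logical CPU 0
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    # e.g. "0,28" would mean CPUs 0 and 28 are siblings on one core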
Cheers,
Ben