Hi,

> On 26 Mar 2017, at 1:13 am, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> Here's an old post on this list where I cited a paper from the Intel 
> Technology Journal.


Thanks for that link! I need to go through it in detail, but this paragraph did 
jump out at me:
On a processor with Hyper-Threading Technology, executing HALT transitions the 
processor from MT-mode to ST0- or ST1-mode, depending on which logical 
processor executed the HALT. For example, if logical processor 0 executes HALT, 
only logical processor 1 would be active; the physical processor would be in 
ST1-mode and partitioned resources would be recombined giving logical processor 
1 full use of all processor resources. If the remaining active logical 
processor also executes HALT, the physical processor would then be able to go 
to a lower-power mode. 

Linux’s task scheduler will issue halt instructions when there are no runnable 
tasks (unless you’re running with idle=poll, a nice way to make your datacentre 
toasty warm). This suggests that as long as you don’t schedule tasks on the 
second hardware thread, the first will have access to all the resources of the 
CPU. Yes, you’ll halve your L1 cache etc. as soon as an interrupt wakes the 
sleeping hardware thread, but hopefully that doesn’t happen too often. Turning 
off one hardware thread (via /sys/devices/system/cpu/cpu?/online) should force 
it to issue halt instructions whenever it gets woken by an interrupt, since 
nothing will ever be scheduled on that thread again.
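
A minimal sketch of doing that by hand (the paths are the standard sysfs ones, 
but the sibling numbering in the example output is just an assumption for a 
28-core box, so check your own machine’s topology first):

    $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    0,28
    $ echo 0 | sudo tee /sys/devices/system/cpu/cpu28/online

Once cpu28 is offline it disappears from the scheduler entirely, so cpu0 keeps 
the whole core to itself.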

> On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
> However, what is puzzling me is the performance difference between OpenMPI 
> 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later versions) in my 
> experiments with oversubscription, i.e. 82 seconds vs. 111 seconds.

You’re oversubscribing while letting the OS migrate individual threads between 
cores. That taskset will bind each MPI process to the same set of 28 logical 
CPUs (i.e. hardware threads), so if you’re running 36 ranks there then you must 
have migration happening. Indeed, even when you only launch 28 MPI ranks, 
you’ll probably still see migration between the cores — but likely a lot less. 
But as soon as you oversubscribe and spin-wait rather than yield, you become 
very sensitive to small changes in behaviour: any minor change in OpenMPI’s 
behaviour, invisible under normal circumstances, will lead to small changes in 
how and when the kernel task scheduler runs the tasks, and those changes can 
then be amplified dramatically when the tasks synchronise with each other via 
e.g. MPI calls.
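
If you want to actually watch the migration, something like the following (with 
your_app standing in for whatever the binary is really called) shows which 
logical CPU each thread last ran on; the PSR column bouncing around between 
refreshes is the migration:

    $ watch -n 1 'ps -L -o pid,tid,psr,comm -C your_app'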

Just as a purely hypothetical example, the newer versions might spin-wait in a 
slightly tighter loop and this might make the Linux task scheduler less likely 
to switch between waiting threads. This delay in switching tasks could appear 
as increased latency in any synchronising MPI call. But this is very 
speculative; it would be very hard to draw any conclusion about what’s 
happening without pinpointing a clear causative change in the code.

Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This will make 
OpenMPI issue a sched_yield when waiting instead of spin-waiting constantly. 
While it’s a performance hit when exactly- or under-subscribing, I can see it 
helping a bit when there’s contention for the cores from over-subscribing. In 
particular, a call to sched_yield relinquishes the rest of that process's current 
time slice, and allows the task scheduler to run another waiting task (i.e. 
another of your MPI ranks) in its place.
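
For example (the rank count and executable name are just placeholders for 
whatever you’re actually running):

    $ mpirun --mca mpi_yield_when_idle 1 -np 36 ./your_app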

So in fact this has nothing to do with HyperThreading, assuming logical CPUs 0 
through 27 correspond to one hardware thread on each of 28 distinct cores. Just 
keep in mind that this might not always be the case: we have at least one 
platform where the logical processor numbering enumerates the hardware threads 
before the cores, so 0 to (n-1) are the n threads of the first core, n to 
(2n-1) are those of the second, and so on.
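
lscpu should make it easy to check which convention a given machine uses, e.g.:

    $ lscpu --extended=CPU,CORE,SOCKET

If consecutive CPU numbers share the same CORE value, you have the 
threads-before-cores numbering; if the CORE column only repeats after running 
through every core once, then 0 through 27 really are 28 separate cores.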

Cheers,
Ben