I was not expecting different results. I just wanted to respond to Ben's
suggestion and demonstrate that the problem (the performance difference
between v1.10.1 and v1.10.2) is not caused by spin-waiting.
On 27/03/2017 17:00, r...@open-mpi.org wrote:
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed”
setting. So why would you expect different results?
On Mar 27, 2017, at 3:52 AM, Jordi Guitart <jordi.guit...@bsc.es
<mailto:jordi.guit...@bsc.es>> wrote:
Hi Ben,
Thanks for your feedback. As described here
(https://www.open-mpi.org/faq/?category=running#oversubscribing),
OpenMPI detects that I'm oversubscribing and runs in degraded mode
(yielding the processor). Anyway, I repeated the experiments with the
yield flag set explicitly, and I obtained the same weird results:
$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
Given these results, it seems that spin-waiting is not causing the
issue. I also agree that this should not be caused by HyperThreading,
given that 0-27 correspond to single HW threads on distinct cores, as
shown in the following output returned by the lstopo command:
Machine (128GB total)
NUMANode L#0 (P#0 64GB)
Package L#0 + L3 L#0 (35MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#28)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#29)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#30)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#31)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#32)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#33)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#34)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#35)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#36)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#37)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#38)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#39)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#40)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#41)
HostBridge L#0
PCIBridge
PCI 8086:24f0
Net L#0 "ib0"
OpenFabrics L#1 "hfi1_0"
PCIBridge
PCI 14e4:1665
Net L#2 "eno1"
PCI 14e4:1665
Net L#3 "eno2"
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 102b:0534
GPU L#4 "card0"
GPU L#5 "controlD64"
NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#42)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#43)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#44)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#45)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#46)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#47)
L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#48)
L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#49)
L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#50)
L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#51)
L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#52)
L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#53)
L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#54)
L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#55)
On 26/03/2017 9:37, Ben Menadue wrote:
On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es
<mailto:jordi.guit...@bsc.es>> wrote:
However, what is puzzling me is the performance difference between
OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later
versions) in my experiments with oversubscription, i.e. 82 seconds
vs. 111 seconds.
You’re oversubscribing while letting the OS migrate individual
threads between cores. That taskset will bind each MPI process to
the same set of 28 logical CPUs (i.e. hardware threads), so if
you’re running 36 ranks there then you must have migration
happening. Indeed, even when you only launch 28 MPI ranks, you’ll
probably still see migration between the cores — but likely a lot
less. But as soon as you oversubscribe and spin-wait rather than
yield you’ll be very sensitive to small changes in behaviour — any
minor changes in OpenMPI’s behaviour, while not visible under normal
circumstances, will lead to small changes in how and when the kernel
task scheduler runs the tasks, and this can then multiply
dramatically when you have synchronisation between the tasks via
e.g. MPI calls.
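One way to observe that migration directly is with a small diagnostic along these lines (a sketch only, not part of the NPB benchmark; names are placeholders). Each rank reports the CPU it is currently running on and the CPUs it is allowed to use, so repeated runs under oversubscription show the ranks moving around within the taskset mask:

#define _GNU_SOURCE            /* for sched_getcpu() and the CPU_* macros */
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Affinity mask inherited from taskset (or from the MPI binding). */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = this process */

    char allowed[512] = "";
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &mask))
            snprintf(allowed + strlen(allowed),
                     sizeof(allowed) - strlen(allowed), "%d ", c);

    printf("rank %d currently on CPU %d, allowed CPUs: %s\n",
           rank, sched_getcpu(), allowed);

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it with the same "mpirun ... taskset -c 0-27" prefix as the benchmark to see which logical CPUs each rank lands on over time.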
Just as a purely hypothetical example, the newer versions /might/
spin-wait in a slightly tighter loop and this /might/ make the Linux
task scheduler less likely to switch between waiting threads. This
delay in switching tasks /could/ appear as increased latency in any
synchronising MPI call. But this is very speculative — it would be
very hard to draw any conclusion about what’s happening if there’s
no clear causative change in the code.
Try adding "--mca mpi_yield_when_idle 1" to your mpirun command.
This will make OpenMPI issue a sched_yield when waiting instead of
spin-waiting constantly. While it’s a performance hit when exactly-
or under-subscribing, I can see it helping a bit when there’s
contention for the cores from over-subscribing. In particular, a call
to sched_yield relinquishes the rest of that process's current time
slice and allows the task scheduler to run another waiting task (i.e.
another of your MPI ranks) in its place.
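To make the difference concrete, here is a minimal standalone sketch (not OpenMPI's actual progress loop) of the two waiting strategies. The spin-wait variant burns its entire time slice polling; the yielding variant hands the remainder of the slice back to the scheduler so another runnable task, such as one of the other oversubscribed ranks, can use the core:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int ready;                 /* stands in for "message arrived" */

static void *sender(void *arg) {
    (void)arg;
    sleep(1);                            /* pretend the peer is busy */
    atomic_store(&ready, 1);
    return NULL;
}

/* Pure spin-wait: keeps the core busy until the kernel preempts us. */
static void wait_spinning(void) {
    while (!atomic_load(&ready))
        ;
}

/* Yielding wait: sched_yield() gives up the rest of the time slice so the
 * scheduler can run another waiting task on this core. */
static void wait_yielding(void) {
    while (!atomic_load(&ready))
        sched_yield();
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);
    wait_yielding();                     /* swap in wait_spinning() to compare */
    pthread_join(t, NULL);
    puts("done");
    return 0;
}

Compile with "gcc -pthread".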
So in fact this has nothing to do with HyperThreading — assuming 0
through 27 correspond to a single hardware thread on 28 distinct
cores. Just keep in mind that this might not always be the case — we
have at least one platform where the logical processor numbering
enumerates the hardware threads before the cores, so 0 to (n-1) are the
n threads of the first core, n to (2n-1) are those of the second core,
and so on.
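If you want to double-check that mapping programmatically instead of reading the lstopo output by hand, a short hwloc program along these lines will print the PUs under each core (a sketch, assuming the hwloc development headers are installed):

#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* For each core, list its PUs with logical (L#) and OS/physical (P#)
     * indices, mirroring what lstopo prints. */
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        printf("Core L#%u:", core->logical_index);
        for (hwloc_obj_t pu = hwloc_get_next_obj_inside_cpuset_by_type(
                 topo, core->cpuset, HWLOC_OBJ_PU, NULL);
             pu != NULL;
             pu = hwloc_get_next_obj_inside_cpuset_by_type(
                 topo, core->cpuset, HWLOC_OBJ_PU, pu))
            printf("  PU L#%u (P#%u)", pu->logical_index, pu->os_index);
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}

Link with -lhwloc. If the P# numbers of the first core come out as 0 and 1 rather than 0 and 28, you are on a layout that enumerates hardware threads before cores, and a mask like 0-27 would then cover only 14 physical cores instead of 28.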
Cheers,
Ben