Hi Ben,
Thanks for your feedback. As described here
(https://www.open-mpi.org/faq/?category=running#oversubscribing),
OpenMPI detects that I'm oversubscribing and runs in degraded mode
(yielding the processor). In any case, I repeated the experiments with
the yield flag set explicitly, and I obtained the same puzzling results:
$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36
  -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36
  -> Time in seconds = 110.93
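For completeness, one way to double-check that both installs actually register this parameter (just a quick sketch; I'm assuming ompi_info lists it under exactly this name) is something like:
# Check that the mpi_yield_when_idle MCA parameter is registered in each install
$HOME/openmpi-bin-1.10.1/bin/ompi_info --all | grep -i yield_when_idle
$HOME/openmpi-bin-1.10.2/bin/ompi_info --all | grep -i yield_when_idle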
Given these results, it seems that spin-waiting is not causing the
issue. I also agree that this should not be caused by HyperThreading,
given that CPUs 0-27 correspond to single hardware threads on 28
distinct cores, as shown in the following lstopo output:
Machine (128GB total)
  NUMANode L#0 (P#0 64GB)
    Package L#0 + L3 L#0 (35MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#28)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#29)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#30)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#31)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#32)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#33)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#34)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#35)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#36)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#37)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#38)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#39)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#40)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#41)
    HostBridge L#0
      PCIBridge
        PCI 8086:24f0
          Net L#0 "ib0"
          OpenFabrics L#1 "hfi1_0"
      PCIBridge
        PCI 14e4:1665
          Net L#2 "eno1"
        PCI 14e4:1665
          Net L#3 "eno2"
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
                GPU L#4 "card0"
                GPU L#5 "controlD64"
  NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#42)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#43)
    L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
      PU L#32 (P#16)
      PU L#33 (P#44)
    L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
      PU L#34 (P#17)
      PU L#35 (P#45)
    L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
      PU L#36 (P#18)
      PU L#37 (P#46)
    L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
      PU L#38 (P#19)
      PU L#39 (P#47)
    L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
      PU L#40 (P#20)
      PU L#41 (P#48)
    L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
      PU L#42 (P#21)
      PU L#43 (P#49)
    L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
      PU L#44 (P#22)
      PU L#45 (P#50)
    L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
      PU L#46 (P#23)
      PU L#47 (P#51)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
      PU L#48 (P#24)
      PU L#49 (P#52)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
      PU L#50 (P#25)
      PU L#51 (P#53)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
      PU L#52 (P#26)
      PU L#53 (P#54)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
      PU L#54 (P#27)
      PU L#55 (P#55)
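As an extra sanity check, one could also verify that every launched rank inherits the 0-27 affinity mask, for example with something along these lines (just a sketch; the grep stands in for the real benchmark binary):
# Each rank prints the CPU mask it inherited from taskset
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27 grep Cpus_allowed_list /proc/self/status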
On 26/03/2017 9:37, Ben Menadue wrote:
On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
However, what is puzzling me is the performance difference between
OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later
versions) in my experiments with oversubscription, i.e. 82 seconds
vs. 111 seconds.
You’re oversubscribing while letting the OS migrate individual threads
between cores. That taskset will bind each MPI process to the same set
of 28 logical CPUs (i.e. hardware threads), so if you’re running 36
ranks there then you must have migration happening. Indeed, even when
you only launch 28 MPI ranks, you’ll probably still see migration
between the cores — but likely a lot less. But as soon as you
oversubscribe and spin-wait rather than yield you’ll be very sensitive
to small changes in behaviour — any minor changes in OpenMPI’s
behaviour, while not visible under normal circumstances, will lead to
small changes in how and when the kernel task scheduler runs the
tasks, and this can then multiply dramatically when you have
synchronisation between the tasks via e.g. MPI calls.
Just as a purely hypothetical example, the newer versions /might/
spin-wait in a slightly tighter loop and this /might/ make the Linux
task scheduler less likely to switch between waiting threads. This
delay in switching tasks /could/ appear as increased latency in any
synchronising MPI call. But this is very speculative — it would be
very hard to draw any conclusion about what’s happening if there’s no
clear causative change in the code.
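That said, if you want to see whether the kernel really is scheduling the two versions differently, you could compare context switches and migrations across a whole run. This is only a sketch (it assumes perf is installed, reuses your paths and command line from above, and relies on perf following the child processes that mpirun launches, which it does by default):
# Count scheduler activity for one complete run of each version
perf stat -e context-switches,cpu-migrations \
    $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36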
Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This
will make OpenMPI issue a sched_yield when waiting instead of
spin-waiting constantly. While it’s a performance hit when exactly- or
under-subscribing, I can see it helping a bit when there’s contention
for the cores from over-subscribing. In particular, a call to sched_yield
relinquishes the rest of that process's current time slice, and allows
the task scheduler to run another waiting task (i.e. another of your
MPI ranks) in its place.
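You can also set the same parameter through the environment instead of the mpirun command line, which is handy when launching through a script or a batch scheduler:
# Equivalent to passing --mca mpi_yield_when_idle 1 to mpirun
export OMPI_MCA_mpi_yield_when_idle=1
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36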
So in fact this has nothing to do with HyperThreading — assuming 0
through 27 correspond to a single hardware thread on 28 distinct
cores. Just keep in mind that this might not always be the case — we
have at least one platform where the logical processor number
enumerates the hardware threads before cores, so 0 to (n-1) are the n
threads of the first core, n to (2n-1) are those of the second, and so on.
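A quick way to check how a given machine enumerates them (assuming a reasonably recent util-linux for lscpu) is:
# Shows which core and socket each logical CPU number belongs to
lscpu --extended=CPU,CORE,SOCKET,NODE
# Or, for one core, list its sibling hardware threads
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list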
Cheers,
Ben