I was not expecting different results. I just wanted to follow up on Ben's suggestion and demonstrate that the problem (the performance difference between v1.10.1 and v1.10.2) is not caused by spin-waiting.

On 27/03/2017 17:00, r...@open-mpi.org wrote:
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. So why would you expect different results?

On Mar 27, 2017, at 3:52 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:

Hi Ben,

Thanks for your feedback. As described here (https://www.open-mpi.org/faq/?category=running#oversubscribing), OpenMPI detects that I'm oversubscribing and runs in degraded mode (yielding the processor). In any case, I repeated the experiments with the yield flag set explicitly, and I obtained the same weird results:

$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
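
(If needed, the parameter and its default in each build can be checked with something like the following, assuming ompi_info was installed alongside mpirun in the same prefix:

  $HOME/openmpi-bin-1.10.1/bin/ompi_info --all | grep -i yield
  $HOME/openmpi-bin-1.10.2/bin/ompi_info --all | grep -i yield
)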

Given these results, it seems that spin-waiting is not causing the issue. I also agree that this should not be caused by HyperThreading, given that CPUs 0-27 each correspond to a single hardware thread on a distinct core, as shown in the following lstopo output:

Machine (128GB total)
  NUMANode L#0 (P#0 64GB)
    Package L#0 + L3 L#0 (35MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#28)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#29)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#30)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#31)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#32)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#33)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#34)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#35)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#36)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#37)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#38)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#39)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#40)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#41)
    HostBridge L#0
      PCIBridge
        PCI 8086:24f0
          Net L#0 "ib0"
          OpenFabrics L#1 "hfi1_0"
      PCIBridge
        PCI 14e4:1665
          Net L#2 "eno1"
        PCI 14e4:1665
          Net L#3 "eno2"
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
                GPU L#4 "card0"
                GPU L#5 "controlD64"
  NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#42)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#43)
    L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
      PU L#32 (P#16)
      PU L#33 (P#44)
    L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
      PU L#34 (P#17)
      PU L#35 (P#45)
    L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
      PU L#36 (P#18)
      PU L#37 (P#46)
    L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
      PU L#38 (P#19)
      PU L#39 (P#47)
    L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
      PU L#40 (P#20)
      PU L#41 (P#48)
    L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
      PU L#42 (P#21)
      PU L#43 (P#49)
    L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
      PU L#44 (P#22)
      PU L#45 (P#50)
    L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
      PU L#46 (P#23)
      PU L#47 (P#51)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
      PU L#48 (P#24)
      PU L#49 (P#52)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
      PU L#50 (P#25)
      PU L#51 (P#53)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
      PU L#52 (P#26)
      PU L#53 (P#54)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
      PU L#54 (P#27)
      PU L#55 (P#55)

On 26/03/2017 9:37, Ben Menadue wrote:
On 26 Mar 2017, at 2:22 am, Jordi Guitart <jordi.guit...@bsc.es> wrote:
However, what is puzzling me is the performance difference between OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later versions) in my experiments with oversubscription, i.e. 82 seconds vs. 111 seconds.

You’re oversubscribing while letting the OS migrate individual threads between cores. That taskset will bind each MPI process to the same set of 28 logical CPUs (i.e. hardware threads), so if you’re running 36 ranks there then you must have migration happening. Indeed, even when you only launch 28 MPI ranks, you’ll probably still see migration between the cores — but likely a lot less. But as soon as you oversubscribe and spin-wait rather than yield you’ll be very sensitive to small changes in behaviour — any minor changes in OpenMPI’s behaviour, while not visible under normal circumstances, will lead to small changes in how and when the kernel task scheduler runs the tasks, and this can then multiply dramatically when you have synchronisation between the tasks via e.g. MPI calls.
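
If you want to see this directly, a rough check (assuming the ranks show up under the executable name bt.C.36, as in your command line) would be something like:

  ps -o pid,psr,pcpu,comm -p $(pgrep -d, -f bt.C.36)   # psr = logical CPU each rank last ran on
  taskset -cp $(pgrep -f bt.C.36 | head -1)            # affinity mask; should report 0-27

Re-running the ps command every second or so should show the psr column changing as the ranks get migrated.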

Just as a purely hypothetical example, the newer versions /might/ spin-wait in a slightly tighter loop and this /might/ make the Linux task scheduler less likely to switch between waiting threads. This delay in switching tasks /could/ appear as increased latency in any synchronising MPI call. But this is very speculative — it would be very hard to draw any conclusion about what’s happening if there’s no clear causative change in the code.

Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This will make OpenMPI issue a sched_yield when waiting instead of spin-waiting constantly. While it’s a performance hit when exactly- or under-subscribing, I can see it helping a bit when there’s contention for the cores from over-subscribing. In particular, a call to sched_yield relinquishes the rest of that process’s current time slice, and allows the task scheduler to run another waiting task (i.e. another of your MPI ranks) in its place.
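
One way to confirm whether the ranks really are yielding rather than spinning (just a rough, hypothetical check) is to attach strace to one rank for a few seconds and count the sched_yield calls:

  strace -c -e trace=sched_yield -p <pid of one rank>   # interrupt with Ctrl-C to print the call counts

With pure spin-waiting you should see essentially no sched_yield calls; with mpi_yield_when_idle set you should see a very large number.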

So in fact this has nothing to do with HyperThreading — assuming 0 through 27 correspond to a single hardware thread on 28 distinct cores. Just keep in mind that this might not always be the case — we have at least one platform where the logical processor numbering enumerates the hardware threads before the cores, so 0 to (n-1) are the n threads of the first core, n to (2n-1) are those of the second, and so on.
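
A quick way to check how a particular machine enumerates its logical processors is something like:

  lscpu --extended=CPU,CORE,SOCKET,NODE

which lists, for every logical CPU number, the core, socket, and NUMA node it belongs to (hwloc's lstopo gives the same information in more detail).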

Cheers,
Ben

