Thank you for answering. I executed the test with the following command in both setups:

mpirun --mca btl self,sm,tcp --machinefile machine_file cg.B.128

My machine file has 128 lines (each machine hostname is repeated 16 times). There is just one container per machine, and each container is configured with 16 cores, so the processes are able to use "sm". Everything is set up properly; I used LXC [1], and I don't observe any problem with the other benchmarks I executed.
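For reference, the machine file is a plain hostfile (hostnames below are placeholders), with each host listed 16 times; as far as I understand this should be equivalent to listing each host once with slots=16:

node-01
node-01
...   (16 lines for node-01, then the same for the other hosts)
node-02
...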

On the network I observe 2 dropped packets on almost all interfaces of the participating nodes. I think this is normal, because I observe the same thing when I use real machines, and the performance in that case is much better.

[1] https://linuxcontainers.org/



On 07/28/2015 02:31 PM, Gilles Gouaillardet wrote:
Cristian,

If the message takes some extra time to land at the receiver, then MPI_Wait will take more time. Or even worse, if the sender is late, the receiver will spend even more time in MPI_Wait.
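To illustrate the point (a minimal sketch, not taken from CG; the 1-second delay is artificial): the time the receiver spends in MPI_Wait includes however late the sender is.

/* wait_demo.c - sketch: the receiver's MPI_Wait absorbs the sender's lateness.
   Run with: mpirun -np 2 ./wait_demo */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        double t0 = MPI_Wtime();
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* blocks until the message has arrived */
        printf("MPI_Wait took %.3f s\n", MPI_Wtime() - t0);
    } else if (rank == 1) {
        sleep(1);                            /* simulate a late sender */
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}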

First, how do you run 128 tasks on 16 nodes?
If you do a simple mpirun, then you will use the sm or vader btl.
Containers can only use the tcp btl, even within the same physical node,
so I encourage you to run mpirun --mca btl tcp,self -np 128 ...
and see if you observe any degradation.

I know very little about containers, but if I remember correctly, you can do things such as cgroup limits (CPU capping, network bandwidth capping, memory limits).
Do you use such things?
A possible explanation is that a container reaches its limit and is given a very low priority.

Regardless of the containers, you end up having 16 tasks sharing the same interconnect.
I can imagine that an unfair share can lead to this kind of behaviour.

On the network, did you measure zero errors, or a few?
Even a few errors take some extra time to recover from, and if your application is communication intensive, these delays get propagated and you can end up with a huge performance hit.

Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ <cristian.r...@inria.fr> wrote:

    Hello,

    I'm measuring the overhead of using Linux containers for HPC
    applications. To do so, I compared the execution time of the NAS
    parallel benchmarks on two infrastructures:

    1) real: 16 real machines
    2) container: 16 containers distributed over 16 real machines

    Each machine used is equipped with two Intel Xeon E5-2630v3
    processors (with 8 cores each), 128 GB of RAM and a 10 Gigabit
    Ethernet adapter.

    In my results, I found a particular performance degradation for
    the CG.B benchmark:

         walltime  numprocess       type       ci1       ci2     overhead
    1     6615085          16     native   6473340   6756830    1.1271473
    2     6349030          32     native   6315947   6382112    2.2187747
    3     5811724          64     native   5771509   5851938    0.8983445
    4     4002865         128     native   3966314   4039416  180.7472715
    5     4077885         256     native   4044667   4111103  402.8036531

         walltime  numprocess       type       ci1       ci2     overhead
    6     6540523          16  container   6458503   6622543    0.0000000
    7     6208159          32  container   6184888   6231431    0.0000000
    8     5759514          64  container   5719453   5799575    0.0000000
    9    11237935         128  container  10762906  11712963    0.0000000
    10   20503755         256  container  19830425  21177085    0.0000000

    (16 MPI processes per machine/container)

    When I use containers, everything is fine below 128 MPI
    processes. I got 180% and 400% performance degradation with 128 and
    256 MPI processes respectively. I repeated the measurements
    and got statistically the same results. So, I decided to
    generate a trace of the execution using TAU. I discovered that the
    source of the overhead is MPI_Wait(), which sometimes
    takes around 0.2 seconds; this happens around 20 times, which
    adds around 4 seconds to the execution time. The routine is called
    25992 times and on average takes between 50 and 300 usecs (values
    obtained with profiling).
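
    For what it is worth, one way I am considering to double-check those
    outliers outside of TAU is a small PMPI wrapper around MPI_Wait (a rough
    sketch below; the 10 ms threshold, the file name and the build/LD_PRELOAD
    lines are assumptions about my setup, not something I have validated):

    /* waitlog.c - sketch of a PMPI interposer that reports slow MPI_Wait calls.
       Possible build/run (assumption):
         mpicc -shared -fPIC waitlog.c -o libwaitlog.so
         mpirun -x LD_PRELOAD=./libwaitlog.so ...  */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Wait(MPI_Request *request, MPI_Status *status)
    {
        double t0 = PMPI_Wtime();
        int ret = PMPI_Wait(request, status);    /* call the real implementation */
        double dt = PMPI_Wtime() - t0;

        if (dt > 0.010) {                        /* report only the slow calls */
            int rank;
            PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
            fprintf(stderr, "[rank %d] slow MPI_Wait: %.3f s\n", rank, dt);
        }
        return ret;
    }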
    This strange MPI_Wait behavior was reported in this paper [1] (page 10), which says:

    "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
    operations typically take few microseconds to less than a
    millisecond. Here they take 0.2 seconds"

    They attributed that strange behavior to packet loss and network
    malfunctioning. In my experiments I measured the number of dropped
    packets and nothing unusual happened.
    I used two versions of Open MPI, 1.6.5 and 1.8.5, and in both
    versions I got the same strange behavior. Any clues about what could
    be the source of that strange behavior? Could you please suggest
    a method to debug this problem?


    Thank you in advance

    [1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf




