Thanks for clarifying that there is only one container per host. Do you always
run 16 tasks per host/container, or do you always run on 16 hosts/containers?
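For example, here is a minimal sketch of how the layout could be pinned down
explicitly on the mpirun command line. The hostnames and the "hosts" file name
below are placeholders, and -npernode should be available in both 1.6.5 and
1.8.5:

    # hypothetical hostfile "hosts": declare 16 slots per host instead of
    # repeating each hostname 16 times
    #   node01 slots=16
    #   ...
    #   node16 slots=16

    # 128 ranks, 16 per host, with the sm and tcp BTLs as in your command:
    mpirun --mca btl self,sm,tcp --hostfile hosts -npernode 16 -np 128 cg.B.128

    # the same run forced onto the tcp BTL only, for comparison:
    mpirun --mca btl self,tcp --hostfile hosts -npernode 16 -np 128 cg.B.128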
Also, does LXC set up iptables rules when you start a container?

Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ <cristian.r...@inria.fr> wrote:
> Thank you for answering. I executed the test with the following command in
> both setups:
>
> mpirun --mca btl self,sm,tcp --machinefile machine_file cg.B.128
>
> My machine file is composed of 128 lines (each machine's hostname is
> repeated 16 times). There is just one container per machine and the
> container is configured with 16 cores, so the processes are able to use
> "sm". Everything is set up properly (I used LXC [1]); I don't observe any
> problem with the other benchmarks I executed.
>
> On the network I observe 2 dropped packets on almost all interfaces of the
> participating nodes. I think this is normal because I observe the same
> thing when I use real machines, and the performance in that case is much
> better.
>
> [1] https://linuxcontainers.org/
>
> On 07/28/2015 02:31 PM, Gilles Gouaillardet wrote:
>> Cristian,
>>
>> If the message takes some extra time to land at the receiver, then
>> MPI_Wait will take more time. Or even worse, if the sender is late, the
>> receiver will spend even more time in MPI_Wait.
>>
>> First, how do you run 128 tasks on 16 nodes? If you do a simple mpirun,
>> then you will use the sm or vader BTL. Containers can only use the tcp
>> BTL, even within the same physical node, so I encourage you to run
>> mpirun --mca btl tcp,self -np 128 ... and see if you observe any
>> degradation.
>>
>> I know very little about containers, but if I remember correctly, you can
>> do things such as cgroups (CPU capping, network bandwidth capping, memory
>> limits). Do you use such things? A possible explanation is that a
>> container reaches its limit and is given a very low priority.
>>
>> Regardless of the containers, you end up having 16 tasks sharing the same
>> interconnect. I can imagine that an unfair share can lead to this kind of
>> behaviour.
>>
>> On the network, did you measure zero errors or a few? A few errors take
>> some extra time to be corrected, and if your application is communication
>> intensive, these delays get propagated and you can end up with a huge
>> performance hit.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tuesday, July 28, 2015, Cristian RUIZ <cristian.r...@inria.fr> wrote:
>>
>>> Hello,
>>>
>>> I'm measuring the overhead of using Linux containers for HPC
>>> applications. To do so I was comparing the execution time of the NAS
>>> Parallel Benchmarks on two infrastructures:
>>>
>>> 1) real: 16 real machines
>>> 2) container: 16 containers distributed over 16 real machines
>>>
>>> Each machine used is equipped with two Intel Xeon E5-2630v3 processors
>>> (with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>>>
>>> In my results, I found a particular performance degradation for the
>>> CG.B benchmark:
>>>
>>>      walltime  numprocess  type       ci1       ci2       overhead
>>>  1   6615085   16          native     6473340   6756830   1.1271473
>>>  2   6349030   32          native     6315947   6382112   2.2187747
>>>  3   5811724   64          native     5771509   5851938   0.8983445
>>>  4   4002865   128         native     3966314   4039416   *180.7472715*
>>>  5   4077885   256         native     4044667   4111103   *402.8036531*
>>>
>>>      walltime  numprocess  type       ci1       ci2       overhead
>>>  6   6540523   16          container  6458503   6622543   0.0000000
>>>  7   6208159   32          container  6184888   6231431   0.0000000
>>>  8   5759514   64          container  5719453   5799575   0.0000000
>>>  9   11237935  128         container  10762906  11712963  0.0000000
>>> 10   20503755  256         container  19830425  21177085  0.0000000
>>>
>>> (16 MPI processes per machine/container)
>>>
>>> When I use containers everything is fine below 128 MPI processes. I got
>>> 180% and 400% performance degradation with 128 and 256 MPI processes,
>>> respectively. I repeated the measurements and got statistically the same
>>> results. So, I decided to generate a trace of the execution using TAU. I
>>> discovered that the source of the overhead is the MPI_Wait() call, which
>>> sometimes takes around 0.2 seconds; this happens around 20 times, which
>>> adds around 4 seconds to the execution time. The call occurs 25992 times
>>> and on average takes between 50 and 300 usecs (values obtained with
>>> profiling).
>>>
>>> This strange behavior was reported in this paper [1] (page 10), which
>>> says:
>>>
>>> "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
>>> operations typically take few microseconds to less than a millisecond.
>>> Here they take 0.2 seconds"
>>>
>>> They attributed that strange behavior to packet loss and network
>>> malfunction. In my experiments I measured the number of dropped packets
>>> and nothing unusual happened. I used two versions of Open MPI, 1.6.5 and
>>> 1.8.5, and with both I got the same strange behavior. Any clue as to
>>> what could be the source of that strange behavior? Could you please
>>> suggest a method to debug this problem?
>>>
>>> Thank you in advance
>>>
>>> [1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf
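A minimal sketch of the checks suggested in this thread (BTL selection,
iptables rules installed by LXC, and per-interface error counters), assuming a
Linux host whose 10 GbE interface is named eth0; the interface name and the
verbosity level are placeholders, not values taken from the thread:

    # report which BTLs the ranks actually negotiate (sm vs tcp), to confirm
    # whether ranks inside a container really use shared memory:
    mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 \
           --machinefile machine_file cg.B.128 2>&1 | grep -i btl

    # list iptables rules (LXC may install NAT/forwarding rules for its
    # bridge) together with their per-rule packet counters:
    iptables -L -n -v
    iptables -t nat -L -n -v

    # per-interface drop/error counters, sampled before and after a run,
    # plus TCP retransmission counters:
    ip -s link show eth0
    ethtool -S eth0 | grep -Ei 'drop|err'
    netstat -s | grep -i retrans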