Cristian,

If the message takes some extra time to reach the receiver, then MPI_Wait will take more time. Or even worse, if the sender is late, the receiver will spend even more time in MPI_Wait.
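Just to illustrate the point, here is a toy example I made up (it is not taken from CG, and the one second sleep is arbitrary): the receiver posts its MPI_Irecv right away, so all of the sender's delay is charged to MPI_Wait in the receiver's profile.

/* Toy sketch: a late sender inflates the receiver's MPI_Wait time.
 * Rank 0 sleeps before sending; rank 1 posts MPI_Irecv immediately
 * and times only the MPI_Wait. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    double t0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(1);                           /* simulate a late sender */
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        t0 = MPI_Wtime();
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* absorbs the sender's delay */
        printf("MPI_Wait took %f s\n", MPI_Wtime() - t0);
    }

    MPI_Finalize();
    return 0;
}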
First, how do you run 128 tasks on 16 nodes? If you do a simple mpirun, then you will use the sm or vader btl. Containers can only use the tcp btl, even within the same physical node. So I encourage you to run

mpirun --mca btl tcp,self -np 128 ...

and see if you observe any degradation. (There is also a small ping-pong sketch at the very end of this message, below your quoted text, that you could use for that comparison.)

I know very little about containers, but if I remember correctly, you can do things such as cgroups (CPU capping, network bandwidth capping, memory limits). Do you use such things? A possible explanation is that a container reaches its limit and is given a very low priority.

Regardless of the containers, you end up having 16 tasks sharing the same interconnect. I can imagine that an unfair share can lead to this kind of behaviour.

On the network, did you measure zero errors or a few? A few errors take some extra time to be recovered, and if your application is communication intensive, these delays get propagated and you can end up with a huge performance hit.

Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ <cristian.r...@inria.fr> wrote:
> Hello,
>
> I'm measuring the overhead of using Linux containers for HPC applications.
> To do so I was comparing the execution time of the NAS parallel benchmarks
> on two infrastructures:
>
> 1) real: 16 real machines
> 2) container: 16 containers distributed over 16 real machines
>
> Each machine used is equipped with two Intel Xeon E5-2630v3 processors
> (with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>
> In my results, I found a particular performance degradation for the CG.B
> benchmark:
>
>      walltime  numprocess  type         ci1       ci2     overhead
> 1     6615085          16  native   6473340   6756830    1.1271473
> 2     6349030          32  native   6315947   6382112    2.2187747
> 3     5811724          64  native   5771509   5851938    0.8983445
> 4     4002865         128  native   3966314   4039416  180.7472715
> 5     4077885         256  native   4044667   4111103  402.8036531
>
>      walltime  numprocess  type            ci1       ci2    overhead
> 6     6540523          16  container   6458503   6622543   0.0000000
> 7     6208159          32  container   6184888   6231431   0.0000000
> 8     5759514          64  container   5719453   5799575   0.0000000
> 9    11237935         128  container  10762906  11712963   0.0000000
> 10   20503755         256  container  19830425  21177085   0.0000000
>
> (16 MPI processes per machine/container)
>
> When I use containers everything is fine below 128 MPI processes. I got
> 180% and 400% performance degradation with 128 and 256 MPI processes,
> respectively. I repeated the measurements and got statistically the same
> results. So, I decided to generate a trace of the execution using TAU. I
> discovered that the source of the overhead is MPI_Wait(), which sometimes
> takes around 0.2 seconds; this happens around 20 times, which adds around
> 4 seconds to the execution time. The routine is called 25992 times and on
> average takes between 50 and 300 usecs (values obtained with profiling).
>
> This strange behavior was reported in this paper [1] (page 10), which says:
>
> "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
> operations typically take few microseconds to less than a millisecond.
> Here they take 0.2 seconds"
>
> They attributed that strange behavior to packet loss and network
> malfunctioning. In my experiments I measured the number of dropped packets
> and nothing unusual happened.
>
> I used two versions of Open MPI, 1.6.5 and 1.8.5, and in both I got the
> same strange behavior. Any clues about what could be the source of that
> strange behavior? Could you please suggest a method to debug this problem?
>
> Thank you in advance
>
> [1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf
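In case it helps with the tcp btl test mentioned above, here is the minimal ping-pong sketch I was referring to. It is only a rough illustration I wrote for this purpose, not part of the NAS suite; the message size (8 KB) and iteration count are arbitrary. Compile it with mpicc, run it once on the native nodes and once in the containers with the same --mca btl tcp,self option (e.g. -np 2, one rank per node or per container), and compare the reported maximum round-trip time: a healthy tcp path should stay far below 0.2 seconds.

/* Toy 2-rank ping-pong: reports the worst per-iteration round-trip time,
 * which is where outliers like the 0.2 s MPI_Wait in the TAU trace would
 * show up. Message size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>

#define NITER 10000
#define COUNT 1024              /* 1024 doubles = 8 KB per message */

int main(int argc, char **argv)
{
    int rank, i;
    double buf[COUNT], t0, dt, tmax = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < COUNT; i++)
        buf[i] = 0.0;

    for (i = 0; i < NITER; i++) {
        t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Isend(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Recv(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Isend(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        dt = MPI_Wtime() - t0;
        if (dt > tmax)
            tmax = dt;
    }

    if (rank == 0)
        printf("max round-trip over %d iterations: %f s\n", NITER, tmax);

    MPI_Finalize();
    return 0;
}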