Thanks for clarifying that there is only one container per host. Do you always
run 16 tasks per host/container, or do you always run on 16 hosts/containers?
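For example, here is a minimal sketch of how the layout could be pinned down
explicitly on the mpirun command line. The hostnames and the "hosts" file name
below are placeholders, and -npernode should be available in both 1.6.5 and
1.8.5:

    # hypothetical hostfile "hosts": declare 16 slots per host instead of
    # repeating each hostname 16 times
    #   node01 slots=16
    #   ...
    #   node16 slots=16

    # 128 ranks, 16 per host, with the sm and tcp BTLs as in your command:
    mpirun --mca btl self,sm,tcp --hostfile hosts -npernode 16 -np 128 cg.B.128

    # the same run forced onto the tcp BTL only, for comparison:
    mpirun --mca btl self,tcp --hostfile hosts -npernode 16 -np 128 cg.B.128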
Also, does LXC set up iptables rules when you start a container?

Cheers,

Gilles

On Tuesday, July 28, 2015, Cristian RUIZ <cristian.r...@inria.fr> wrote:
> Thank you for answering. I executed the test with the following command in
> both setups:
>
> mpirun --mca btl self,sm,tcp --machinefile machine_file cg.B.128
>
> My machine file is composed of 128 lines (each machine's hostname is
> repeated 16 times). There is just one container per machine and the
> container is configured with 16 cores, so the processes are able to use
> "sm". Everything is set up properly (I used LXC [1]); I don't observe any
> problem with the other benchmarks I executed.
>
> On the network I observe 2 dropped packets on almost all interfaces of the
> participating nodes. I think this is normal because I observe the same
> thing when I use real machines, and the performance in that case is much
> better.
>
> [1] https://linuxcontainers.org/
>
> On 07/28/2015 02:31 PM, Gilles Gouaillardet wrote:
>> Cristian,
>>
>> If the message takes some extra time to land at the receiver, then
>> MPI_Wait will take more time. Or even worse, if the sender is late, the
>> receiver will spend even more time in MPI_Wait.
>>
>> First, how do you run 128 tasks on 16 nodes? If you do a simple mpirun,
>> then you will use the sm or vader BTL. Containers can only use the tcp
>> BTL, even within the same physical node, so I encourage you to run
>> mpirun --mca btl tcp,self -np 128 ... and see if you observe any
>> degradation.
>>
>> I know very little about containers, but if I remember correctly, you can
>> do things such as cgroups (CPU capping, network bandwidth capping, memory
>> limits). Do you use such things? A possible explanation is that a
>> container reaches its limit and is given a very low priority.
>>
>> Regardless of the containers, you end up having 16 tasks sharing the same
>> interconnect. I can imagine that an unfair share can lead to this kind of
>> behaviour.
>>
>> On the network, did you measure zero errors or a few? A few errors take
>> some extra time to be corrected, and if your application is communication
>> intensive, these delays get propagated and you can end up with a huge
>> performance hit.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tuesday, July 28, 2015, Cristian RUIZ <cristian.r...@inria.fr> wrote:
>>
>>> Hello,
>>>
>>> I'm measuring the overhead of using Linux containers for HPC
>>> applications. To do so I was comparing the execution time of the NAS
>>> Parallel Benchmarks on two infrastructures:
>>>
>>> 1) real: 16 real machines
>>> 2) container: 16 containers distributed over 16 real machines
>>>
>>> Each machine used is equipped with two Intel Xeon E5-2630v3 processors
>>> (with 8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>>>
>>> In my results, I found a particular performance degradation for the
>>> CG.B benchmark:
>>>
>>>      walltime  numprocess  type       ci1       ci2       overhead
>>>  1   6615085   16          native     6473340   6756830   1.1271473
>>>  2   6349030   32          native     6315947   6382112   2.2187747
>>>  3   5811724   64          native     5771509   5851938   0.8983445
>>>  4   4002865   128         native     3966314   4039416   *180.7472715*
>>>  5   4077885   256         native     4044667   4111103   *402.8036531*
>>>
>>>      walltime  numprocess  type       ci1       ci2       overhead
>>>  6   6540523   16          container  6458503   6622543   0.0000000
>>>  7   6208159   32          container  6184888   6231431   0.0000000
>>>  8   5759514   64          container  5719453   5799575   0.0000000
>>>  9   11237935  128         container  10762906  11712963  0.0000000
>>> 10   20503755  256         container  19830425  21177085  0.0000000
>>>
>>> (16 MPI processes per machine/container)
>>>
>>> When I use containers everything is fine below 128 MPI processes. I got
>>> 180% and 400% performance degradation with 128 and 256 MPI processes,
>>> respectively. I repeated the measurements and got statistically the same
>>> results. So, I decided to generate a trace of the execution using TAU. I
>>> discovered that the source of the overhead is the MPI_Wait() call, which
>>> sometimes takes around 0.2 seconds; this happens around 20 times, which
>>> adds around 4 seconds to the execution time. The call occurs 25992 times
>>> and on average takes between 50 and 300 usecs (values obtained with
>>> profiling).
>>>
>>> This strange behavior was reported in this paper [1] (page 10), which
>>> says:
>>>
>>> "We can see two outstanding zones of MPI_Send and MPI_Wait. Such
>>> operations typically take few microseconds to less than a millisecond.
>>> Here they take 0.2 seconds"
>>>
>>> They attributed that strange behavior to packet loss and network
>>> malfunction. In my experiments I measured the number of dropped packets
>>> and nothing unusual happened. I used two versions of Open MPI, 1.6.5 and
>>> 1.8.5, and with both I got the same strange behavior. Any clue as to
>>> what could be the source of that strange behavior? Could you please
>>> suggest a method to debug this problem?
>>>
>>> Thank you in advance
>>>
>>> [1] https://hal.inria.fr/hal-00919507/file/smpi_pmbs13.pdf
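A minimal sketch of the checks suggested in this thread (BTL selection,
iptables rules installed by LXC, and per-interface error counters), assuming a
Linux host whose 10 GbE interface is named eth0; the interface name and the
verbosity level are placeholders, not values taken from the thread:

    # report which BTLs the ranks actually negotiate (sm vs tcp), to confirm
    # whether ranks inside a container really use shared memory:
    mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 \
           --machinefile machine_file cg.B.128 2>&1 | grep -i btl

    # list iptables rules (LXC may install NAT/forwarding rules for its
    # bridge) together with their per-rule packet counters:
    iptables -L -n -v
    iptables -t nat -L -n -v

    # per-interface drop/error counters, sampled before and after a run,
    # plus TCP retransmission counters:
    ip -s link show eth0
    ethtool -S eth0 | grep -Ei 'drop|err'
    netstat -s | grep -i retrans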