Hello,

The measurements were done with kernel 3.8.2.

Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64 x86_64 x86_64 GNU/Linux

What information would you like to see on the kernel?

Regards

Benoit

On 20/03/2013 01:29, "Eric W. Biederman" <ebied...@xmission.com> wrote:

>Serge Hallyn <serge.hal...@ubuntu.com> writes:
>
>> Hi,
>>
>> Benoit was kind enough to follow up on some scalability issues with
>> larger (but not huge imo) numbers of containers.  Running a script
>> to simply time the creation of veth pairs on a rather large (iiuc)
>> machine, he got the following numbers (time is for creation of the
>> full number, not the latest increment - so 1123 seconds to create
>> 5000 veth pairs)
>
>A kernel version and a profile would be interesting.
>
>At first glance it looks like things are dramatically slowing down as
>things get longer, which should not happen.
>
>There used to be quadratic issues in proc and sysfs that should have
>been reduced to O(N log N) as of 3.4 or so.  A comparison to the dummy
>device, which is a touch simpler than veth and is more frequently
>benchmarked, could also be revealing.
>
>>> >Quoting Benoit Lourdelet (blour...@juniper.net):
>>> >> Hello Serge,
>>> >>
>>> >> I put together a small table, running your script for various values.
>>> >> Times are in seconds.
>>> >>
>>> >> Number of veth, time to create, time to delete:
>>> >>
>>> >>  500    18    26
>>> >> 1000    57    70
>>> >> 2000   193   250
>>> >> 3000   435   510
>>> >> 4000   752   824
>>> >> 5000  1123  1185
>>> >>
>>> >> Benoit
>>
>> Ok.  Ran some tests on a tiny cloud instance.  When I simply run 2k
>> tasks in unshared new network namespaces, it flies by.
>>
>> #!/bin/sh
>> rm -f /tmp/timings3
>> date | tee -a /tmp/timings3
>> for i in `seq 1 2000`; do
>>     nsexec -n -- /bin/sleep 1000 &
>>     if [ $((i % 100)) -eq 0 ]; then
>>         echo $i | tee -a /tmp/timings3
>>         date | tee -a /tmp/timings3
>>     fi
>> done
>>
>> (all scripts run under sudo, and nsexec can be found at
>> https://code.launchpad.net/~serge-hallyn/+junk/nsexec)
>>
>> So that isn't an issue.
>>
>> When I run a script to just time veth pair creations like Benoit ran,
>> creating 2000 veth pairs and timing the results for each 100, the time
>> does degrade, from 1 second for the first 100 up to 8 seconds for the
>> last 100.
>>
>> (that script for me is:
>>
>> #!/bin/sh
>> rm -f /tmp/timings
>> for i in `seq 1 2000`; do
>>     ip link add type veth
>>     if [ $((i % 100)) -eq 0 ]; then
>>         echo $i | tee -a /tmp/timings
>>         date | tee -a /tmp/timings
>>         ls /sys/class/net > /dev/null
>>     fi
>> done
>> )
>>
>> But when I actually pass veth instances to those unshared network
>> namespaces:
>>
>> #!/bin/sh
>> rm -f /tmp/timings2
>> echo 0 | tee -a /tmp/timings2
>> date | tee -a /tmp/timings2
>> for i in `seq 1 2000`; do
>>     nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
>>     ip link add type veth
>>     dev2=`ls -d /sys/class/net/veth* | tail -1`
>>     dev=`basename $dev2`
>>     pid=`cat /tmp/pid.$i`
>>     ip link set $dev netns $pid
>>     if [ $((i % 100)) -eq 0 ]; then
>>         echo $i | tee -a /tmp/timings2
>>         date | tee -a /tmp/timings2
>>     fi
>>     rm -f /tmp/pid.*
>> done
>>
>> it goes from 4 seconds for the first hundred to 16 seconds for
>> the last hundred - a worse regression than simply creating the
>> veths.  Though I guess that could be accounted for simply by
>> sysfs actions when a veth is moved from the old netns to the
>> new?
>
>And network stack actions.  Creating one end of the veth in the desired
>network namespace is likely desirable.  "ip link add type veth peer netns
>..."
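A minimal sketch of that suggestion, combining explicit device names with creating the peer directly in the target namespace. Assumptions: an iproute2 new enough to accept "netns" on the veth peer at creation time, the nsexec pid-file convention from the timings2 script above, and the a$i/b$i names are only illustrative:

#!/bin/sh
# Sketch only: name both ends explicitly (avoids the O(N) search for a
# free "vethN" name) and create the peer end directly inside the new
# namespace (avoids a separate "ip link set ... netns" move afterwards).
for i in `seq 1 2000`; do
    nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
    # crude wait for nsexec to write the pid file before we read it
    while [ ! -s /tmp/pid.$i ]; do sleep 0.1; done
    pid=`cat /tmp/pid.$i`
    ip link add a$i type veth peer name b$i netns $pid
    rm -f /tmp/pid.$i
done

Whether this actually helps here is something only a profile can confirm; it simply removes the name lookup and the post-creation move from the loop.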
>
>rcu in the past has also played a critical role, as has what the network
>configuration is when devices are torn down.
>
>For device movement and device teardown there is at least one
>synchronize_rcu, which at scale can slow things down.  But if the
>synchronize_rcu dominates it should be mostly a constant-factor cost, not
>something that gets worse with each device creation.
>
>Oh, and to start with I would specify the name of each network device to
>create.  Last I looked, coming up with a network device name is an O(N)
>operation in the number of device names.
>
>Just to see what I am seeing, in 3.9-rc1 I did:
>
># time for i in $(seq 1 2000) ; do ip link add a$i type veth peer name b$i; done
>
>real    0m23.607s
>user    0m0.656s
>sys     0m18.132s
>
># time for i in $(seq 1 2000) ; do ip link del a$i ; done
>
>real    2m8.038s
>user    0m0.964s
>sys     0m18.688s
>
>Which is tremendously better than you are reporting below for device
>creation.
>
>Now the deletes are still slow because it is hard to batch that kind of
>delete; having a bunch of network namespaces exit all at once would
>likely be much faster, as they can be batched and the synchronize_rcu
>calls drastically reduced.

(A rough sketch of such a batched teardown is at the end of this message.)

>
>What is making you say there is a regression?  A regression compared to
>what?
>
>Hmm.
>
># time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
>
>real    2m11.007s
>user    0m3.508s
>sys     1m55.452s
>
>Ok, there is most definitely something non-linear about the cost of
>creating network devices.
>
>I am happy to comment from previous experience, but I'm not volunteering
>to profile and fix this one.
>
>Eric
>
>
>> 0
>> Tue Mar 19 20:15:26 UTC 2013
>> 100
>> Tue Mar 19 20:15:30 UTC 2013
>> 200
>> Tue Mar 19 20:15:35 UTC 2013
>> 300
>> Tue Mar 19 20:15:41 UTC 2013
>> 400
>> Tue Mar 19 20:15:47 UTC 2013
>> 500
>> Tue Mar 19 20:15:54 UTC 2013
>> 600
>> Tue Mar 19 20:16:02 UTC 2013
>> 700
>> Tue Mar 19 20:16:09 UTC 2013
>> 800
>> Tue Mar 19 20:16:17 UTC 2013
>> 900
>> Tue Mar 19 20:16:26 UTC 2013
>> 1000
>> Tue Mar 19 20:16:35 UTC 2013
>> 1100
>> Tue Mar 19 20:16:46 UTC 2013
>> 1200
>> Tue Mar 19 20:16:57 UTC 2013
>> 1300
>> Tue Mar 19 20:17:08 UTC 2013
>> 1400
>> Tue Mar 19 20:17:21 UTC 2013
>> 1500
>> Tue Mar 19 20:17:33 UTC 2013
>> 1600
>> Tue Mar 19 20:17:46 UTC 2013
>> 1700
>> Tue Mar 19 20:17:59 UTC 2013
>> 1800
>> Tue Mar 19 20:18:13 UTC 2013
>> 1900
>> Tue Mar 19 20:18:29 UTC 2013
>> 2000
>> Tue Mar 19 20:18:48 UTC 2013
>>
>> -serge
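Following up on Eric's point above about batched teardown: rather than running "ip link del" per device, killing the tasks that hold the test namespaces lets the kernel clean up many namespaces (and the veths inside them) together, with far fewer synchronize_rcu calls. A rough sketch, assuming the sleep tasks started by the timings2 script above are still running and are the only "sleep 1000" processes on the machine:

#!/bin/sh
# Rough sketch: tear down all the test namespaces at once instead of
# deleting veth devices one by one.  When the last task in a network
# namespace exits, the kernel removes that namespace's devices, and many
# namespaces exiting together can be batched.
# WARNING: pkill -f matches any command line containing "/bin/sleep 1000",
# so this is only illustrative for the isolated test setup above.
pkill -f '/bin/sleep 1000'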