Hi Han,

Thanks for taking the time to do some profiling work on this. In response to
this thread, I'd like to make a couple of points which I hope you'll find
helpful.
It's important to note that OVS 2.0 is the first release with multiple thread
support. In this release we focused on getting the basic structure down and on
correctness. There's going to be a ton of interesting work which can be done
to optimize from this point forward. Specifically, we're going to look at
structural changes to further improve performance and its ability to scale
across multiple threads. We'll also invest in improving the efficiency of each
thread, possibly by implementing something like RCU. Given that it's early
days, it's not surprising that these things aren't as efficient as they could
be yet.

With regard to the fmbs: these are a temporary solution which is clearly not
ideal. I'm working on some patches, which should be ready this week, that
remove them entirely and instead handle flow installation from the miss
handler threads directly. Obviously, this is going to have a significant
impact on the performance characteristics of the switch, so it may be worth
watching for these patches to be merged.

There's a tradeoff between CPU utilization and latency. When a new flow miss
enters userspace, we can either hold onto it in hopes of forming a batch, or
process it immediately, reducing the time required for new connections. We've
chosen the latter approach for a couple of reasons.

First, latency is extremely important for many applications, so it's worth
optimizing for. It'd be interesting to try the netperf TCP_CRR test with your
10 ms delay patch. My suspicion is that it'd be significantly worse than the
more responsive approach. Also, as the system becomes more loaded, batching
will happen naturally anyway, so the cost of this approach is relatively low
under load.

Second, any system with a large number of threads is likely used more or less
exclusively as a switch. For hypervisors, users can easily reduce the number
of threads, and thus the CPU utilization, if they're worried about it. Given
that in most cases these highly threaded systems are used just to forward
traffic, the amount of CPU they use should come second to their actual
performance moving packets around. Rhetorically, would you rather have a
top-of-rack switch with high latency and 500% CPU utilization, or low latency
and 1000%?

Ethan

On Mon, Dec 2, 2013 at 7:33 PM, Zhou, Han <hzh...@ebay.com> wrote:
> Hi Alex,
>
> Thanks for your kind feedback.
>
> On Tuesday, December 03, 2013 3:01 AM, Alex Wang wrote:
>> This is the case when the rate of incoming upcalls is slower than the
>> "dispatcher" reading speed. After the "dispatcher" breaks out of the for
>> loop, it is necessary to wake up the "handler" threads with upcalls, since
>> the processing latency matters.
>>
>> A batch mode may help, but more research needs to be done on reducing
>> the latency. Have you done any experiments on related issues?
>
> Yes, I tried replacing ovs_mutex_cond_wait() with pthread_cond_timedwait()
> in the handler with a 10 ms timeout, and removed the final cond_signal loop
> in the dispatcher. It reduced CPU cost significantly for vswitchd in the
> previous hping3 test at the same throughput; but in an idle situation with a
> simple ping test, the timeout mechanism introduces up to 20 ms of latency.
> It is hard to strike a balance with this approach ...
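>
> Roughly, the handler-side change looked like this (just a minimal sketch
> with made-up structure and function names, not the actual diff against the
> OVS code):
>
> /* Sketch only: wait for queued upcalls with a 10 ms batch window instead
>  * of blocking indefinitely on the condition variable.  "upcall_queue" and
>  * process_batch() are illustrative, not OVS code. */
> #include <errno.h>
> #include <pthread.h>
> #include <stddef.h>
> #include <time.h>
>
> struct upcall_queue {
>     pthread_mutex_t mutex;
>     pthread_cond_t cond;
>     size_t n_upcalls;           /* Upcalls queued by the dispatcher. */
> };
>
> static void
> process_batch(struct upcall_queue *q)
> {
>     q->n_upcalls = 0;           /* Placeholder: drain and handle upcalls. */
> }
>
> void *
> handler_main(void *q_)
> {
>     struct upcall_queue *q = q_;
>
>     for (;;) {
>         struct timespec deadline;
>
>         pthread_mutex_lock(&q->mutex);
>         clock_gettime(CLOCK_REALTIME, &deadline);
>         deadline.tv_nsec += 10 * 1000 * 1000;   /* 10 ms batch window. */
>         if (deadline.tv_nsec >= 1000000000L) {
>             deadline.tv_sec++;
>             deadline.tv_nsec -= 1000000000L;
>         }
>
>         /* Wake when signalled or when the batch window expires, whichever
>          * comes first. */
>         while (!q->n_upcalls
>                && pthread_cond_timedwait(&q->cond, &q->mutex,
>                                          &deadline) != ETIMEDOUT) {
>             continue;
>         }
>         if (q->n_upcalls) {
>             process_batch(q);
>         }
>         pthread_mutex_unlock(&q->mutex);
>     }
>     return NULL;
> }
>
> With the dispatcher's final cond_signal loop removed, the handlers wake up
> mostly on the timeout, which is presumably where the extra latency in the
> idle ping test comes from.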
>
> BTW, I noticed that there is a wasted cond_signal in the current code,
> introduced in a previous patch ("ofproto-dpif-upcall: reduce number of
> wakeup"), and suggested a patch:
> http://openvswitch.org/pipermail/dev/2013-November/034427.html
> Could you take a look and perhaps merge it with your patch for improving
> fairness?
>
>>
>> > > 2. why ovs-vswitchd occupies so much CPU in short-lived flow test
>> > > before my change? And why it drops so dramatically? What's the
>> > > contention between ovs-vswitchd and upcall_handler?
>>
>> Yes, we also know that. And we will start solving this soon.
>>
> It seems fmbs are still handled by ovs-vswitchd instead of by the multiple
> threads? What's the work division between the miss handlers and ovs-vswitchd
> (current and future)? Anyway, it is great that you are solving this.
>
>> > > A better solution for this bottleneck of the dispatcher, in my opinion,
>> > > could be that each handler thread receives the upcalls assigned to it
>> > > from the kernel directly, so that no condition wait and signal are
>> > > involved, which avoids unnecessary context switches and the futex
>> > > scaling problem in a multicore environment. The selection of the
>> > > handler can be done by the kernel with the same kind of hash, but put
>> > > into different per-handler queues, and this way packet order is
>> > > preserved. Can this be a valid proposal?
>>
>> Yes, I agree, this sounds like the direction we will go in the long term.
>> But for now, we are focusing on partially addressing this in userspace,
>> since:
>>
>> - we want to address the fairness issue as well, and it is much easier to
>>   model the solution in userspace first.
>> - the goal is to guarantee upcall handling fairness even under a DoS-type
>>   attack.
>
> Understood, and we will also do more testing on the behavior when the
> dispatcher becomes the bottleneck.
>
> Best regards,
> Han

_______________________________________________
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss