Hi,

We would like to resume our earlier discussion about how to support a simple, generic and efficient procedure for controllers to resync all OF forwarding state with OVS after a reconnect while maintaining non-stop forwarding (see http://openvswitch.org/pipermail/dev/2016-January/064925.html and following).
To briefly recap the earlier discussion, we have two main approaches:

A) A new OF experimenter procedure to resync state in three steps:

   1. The controller marks the current state in OVS as stale.
   2. The controller downloads/refreshes the latest state.
   3. The controller tells the switch to clean up all remaining stale state.

   The proposed procedure is described in more detail in
   https://docs.google.com/document/d/1JBwARjUKDH_r9LK_Zg92WjquAxHrOLcqze1W60rV3j4

   This procedure has been implemented and used between Ericsson's
   controller and OF switches for some years. A patch for OVS 2.5 is
   available and could be rebased to master.

B) Use the OF 1.4 bundle mechanism as follows:

   1. The controller opens a bundle for the resync.
   2. It clears all flows, groups and meters within the bundle.
   3. It downloads the latest state within the bundle.
   4. It commits the bundle to atomically swap the new state into the
      data path.

   The OF 1.4 bundle was implemented in OVS 2.5, but only for flows.
   Support for the bundle extension to OF 1.3 was added on master
   later. Groups and meters are not supported yet.

While we agree in principle that the bundle mechanism (with added support for groups and meters) would be a possible approach to the resync problem, our concern is that it was actually designed for a different use case, namely atomic incremental updates to the OF pipeline, and that the characteristics of the two approaches differ greatly in the resync scenario when a large volume of OpenFlow state is involved.

To analyze and quantify this difference, we have done some benchmarking comparing the two approaches. Due to the limitations of the current bundle implementation we had to limit the tests to flow entries. All tests were run on a VM with 6 cores and 3 GB RAM, without traffic. The tests were run using scripts that invoke ovs-ofctl to add flows from a file.

With the proposed hitless resync procedure we were able to resync 1 million flow entries without an increase in memory usage. Using the bundle procedure the VM ran out of memory for 1M and 500K flow entries; only for 250K flow entries were we able to obtain comparable measurements. At 250K flow entries the ovs-vswitchd process occupies 455 MB of virtual memory.

Measurements for resyncing 250K flow entries:

   Metric                             Resync (OF 1.3)   Bundle (OF 1.4)
   ---------------------------------  ----------------  ----------------
   Flow update time                   ~40 sec           ~7 sec
   Flow update rate                   ~6.25K flows/s    ~35K flows/s
   ovs-vswitchd CPU usage             ~140%             ~100%
   ovs-vswitchd virtual memory peak   457 MB            1905 MB

Refreshing the 250K flow entries using the proposed resync procedure requires 40 seconds at ~140% CPU usage, with memory stable at 457 MB. The download rate is ~6250 flows/s. The scan for stale flow entries at the end of the resync procedure takes the ovs-vswitchd process around 200 ms.

Refreshing the 250K flow entries using the bundle mechanism increases the ovs-vswitchd memory linearly up to 1.9 GB, significantly more than the 910 MB one would expect for accommodating two versions of each rule at the moment of the atomic activation. Somewhat to our surprise, the download and activation of the 250K bundled flow entries takes only 7 seconds at 100% CPU load, much faster than the non-bundled download. Instrumenting the code with some additional log entries showed that the download of the bundle takes about 5 seconds, while the activation consumes the remaining 2 seconds. The bundled download rate is thus ~50K flows/s. It appears that installing 250K flow entries individually in ofproto_dpif carries a significant processing overhead compared to the atomic activation of the same 250K entries in a bundle.
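For reference, both download paths can be approximated with stock ovs-ofctl. Below is a minimal sketch of how we drove the tests; the bridge name "br0" and the file name "flows.txt" are placeholders, and the per-line add/delete keywords in the flow file require --bundle and a sufficiently recent ovs-ofctl:

    # Bundle procedure: clear the old state and download the new state,
    # committed as one atomic OF 1.4 bundle.  flows.txt starts with a
    # bare "delete" line (matching all existing flows) followed by the
    # desired entries, e.g.:
    #
    #   delete
    #   add table=0,priority=100,in_port=1,actions=output:2
    #   add table=0,priority=100,in_port=2,actions=output:1
    #   ...
    ovs-ofctl -O OpenFlow14 --bundle add-flows br0 flows.txt

    # Non-bundled download, as in the refresh phase of the resync
    # procedure (here the file holds plain flow specs, without the
    # per-line keywords):
    ovs-ofctl -O OpenFlow13 add-flows br0 flows.txt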
What is the reason for this overhead? Can it be improved by batching these updates internally?

Conclusion: In their current form the two approaches indeed exhibit radically different characteristics. The bundle mechanism is more than 5 times faster, but it temporarily occupies 4 times the memory. Given that in many cases the delta between the actual and the desired flow state in OVS is small after a re-connect, we believe that the speed of the cleanup may not be so crucial, and that the ability to do it in place without requiring a lot of extra memory resources (reserved huge pages in the case of a DPDK datapath?) speaks in favor of the proposed resync procedure.

We would therefore like to ask the OVS community to reassess the proposed experimenter resync procedure in the light of the presented empirical data.

Regards,
Jan