On Mon, Jan 05, 2015 at 02:08:41PM -0800, Jarno Rajahalme wrote: > > On Dec 29, 2014, at 2:27 PM, Ben Pfaff <b...@nicira.com> wrote: > > > On Wed, Dec 17, 2014 at 10:30:42AM -0800, Jarno Rajahalme wrote: > >> So far the compressed flow data in struct miniflow has been in 32-bit > >> words with a 63-bit map, allowing for a maximum size of struct flow of > >> 252 bytes. With the forthcoming Geneve options this is not sufficient > >> any more. > >> > >> This patch solves the problem by changing the miniflow data to 64-bit > >> words, doubling the flow max size to 504 bytes. Since the word size > >> is doubled, there is some loss in compression efficiency. To counter > >> this some of the flow fields have been reordered to keep related > >> fields together (e.g., the source and destination IP addresses share > >> the same 64-bit word). > >> > >> This change should speed up flow data processing on 64-bit CPUs, which > >> may help counterbalance the impact of making the struct flow bigger in > >> the future. > >> > >> Classifier lookup stage boundaries are also changed to 64-bit > >> alignment, as the current algorithm depends on each miniflow word to > >> not be split between ranges. This has resulted in new padding (part > >> of the 'mpls_lse' field). > >> > >> The 'dp_hash' field is also moved to packet metadata to eliminate > >> otherwise needed padding there. This allows the L4 to fit into one > >> 64-bit word, and also makes matches on 'dp_hash' more efficient as > >> misses can be found already on stage 1. > >> > >> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com> > > > This seems mostly straightforward. Are there particular parts you'd > > like me to look over carefully? > > > > Maybe the changes to the miniflow push macros, which get a bit more > complicated...
OK, I'll do that in a bit. > > I have a suggestion for miniflow_extract(). It is getting more > > complex over time, and becoming more difficult to understand. I > > wonder whether a simpler approach would be just as fast and easier to > > understand. Suppose that, instead of constructing a miniflow > > initially, we instead construct a regular "struct flow" on the stack. > > All the zeroing and then later checking for nonzero values is what > > drove us earlier to move to building a miniflow directly, so we'd want > > to avoid that. But we can do that by not initializing the flow at > > all, and just keeping track in a map of the u32s (or u64s) we've > > initialized, and then copying those fields into a miniflow based on > > the map we assembled. (I made the same suggestion in November, but I > > didn't get a reply, so I think that you must have missed it in all the > > email that you get.) > > This is definitely worth experimenting with, but if may be that the > cost of scatter writing to most (?) of the cachelines is almost as > costly as memsetting them in the beginning. That said, parsing > geneve options will need support for out-of-order processing, so > what you propose here might be a good way of doing that, too. Where in memory was the "struct flow" that we were previously zeroing? I would guess that if we put our temporary struct flow on the stack, then there wouldn't be cache contention for it since stacks are (essentially) thread-local and accessed frequently enough to be cache-hot. When we copy that into a heap-based miniflow, that should access a minimal number of cache lines. I don't know whether, when we previously zeroed an entire struct flow, that struct flow was on the stack or the heap; presumably initializing a heap-based struct flow is more expensive. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev