Very helpful and interesting. Thank you.

On Fri, Jan 26, 2018 at 11:16 AM Dave Barach (dbarach) <dbar...@cisco.com> wrote:
> Dear David,
>
> A bit of history. We worked on vpp for a decade before making any serious effort to multi-thread it. The first scheme that I tried was to break up the graph into reconfigurable pipeline stages. Effective partitioning of the graph is highly workload-dependent, and it can change in a heartbeat. The resulting system runs at the speed of the slowest pipeline stage.
>
> In terms of easily measured inter-thread handoff cost, it’s not awful: 2-3 clocks/pkt. But handing vectors of packets between threads can cause a festival of cache-coherence traffic, and it can easily undo the positive effects of DDIO (packet data DMA into the cache hierarchy).
>
> We actually use the scheme you describe in a very fine-grained way: dual- and quad-loop graph dispatch functions process 2 or 4 packets at the same time. Until we run out of registers, a superscalar CPU can “do the same thing to 2 or 4 packets at the same time” pretty effectively. Including memory-hierarchy stalls, vpp averages more than two instructions retired per clock cycle.
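In shape, such a quad-loop dispatch function looks something like the sketch below. This is illustrative only: pkt_t, do_work(), and node_dispatch() are invented names for the example, not the actual VPP node API.

    #include <stddef.h>

    /* Stand-ins for VPP's real buffer type and per-node work. */
    typedef struct { unsigned char data[64]; } pkt_t;

    static inline void do_work (pkt_t *p) { p->data[0] ^= 1; /* placeholder */ }

    static void
    node_dispatch (pkt_t **pkts, size_t n)
    {
      size_t i = 0;

      /* Quad loop: apply the same operation to 4 packets per iteration,
         prefetching the next group so memory latency overlaps with work. */
      while (i + 4 <= n)
        {
          if (i + 8 <= n)
            {
              __builtin_prefetch (pkts[i + 4]);
              __builtin_prefetch (pkts[i + 5]);
              __builtin_prefetch (pkts[i + 6]);
              __builtin_prefetch (pkts[i + 7]);
            }
          do_work (pkts[i + 0]);
          do_work (pkts[i + 1]);
          do_work (pkts[i + 2]);
          do_work (pkts[i + 3]);
          i += 4;
        }

      /* Single-loop cleanup for the remainder. */
      while (i < n)
        do_work (pkts[i++]);
    }

With four independent packets in flight per iteration, an out-of-order core gets four dependency chains to overlap, which is the sense in which it “does the same thing to 2 or 4 packets at the same time.”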
> At the graph node level, I can’t see how to leverage this technique. Presenting [identical] vectors to 2 (or more) nodes running on multiple threads would mean that (1) the parallelized subgraph would run at the speed of the slowest node; (2) you’d pay the handoff costs already discussed above; (3) you’d need an expensive algorithm to make sure that all vector replicas were finished before reentering sequential processing; and (4) none of the graph nodes we’ve ever constructed are free of ordering constraints. Every node alters packet state in a meaningful way, or it wouldn’t be worth having. (😉)…
>
> We’ve had considerable success with flow-hashing across a set of identical graph replicas [worker threads], even when available hardware RSS hashing is not useful [think about NATted UDP traffic].
>
> Hope this is of some interest.
>
> Thanks… Dave
>
> *From:* vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] *On Behalf Of* David Bainbridge
> *Sent:* Friday, January 26, 2018 12:39 PM
> *To:* vpp-dev@lists.fd.io
> *Subject:* [vpp-dev] VPP Graph Optimization
>
> I have just started to read up on VPP/FD.io, and I have a question about graph optimization. I was wondering whether (as I suspect) this has already been thought about and either planned or decided against.
>
> The documentation I found on VPP essentially says that VPP uses batch processing: it processes all packets in a vector in one step before proceeding to the next step. The claim is that this provides better overall throughput because of instruction caching.
>
> I was wondering whether anyone has considered optimizing the graph to identify where concurrency can be leveraged, and where the vector could be processed by two steps at an offset. If this is possible, then steps could be pinned to cores, and perhaps both concurrency and instruction caching could be leveraged.
>
> For example, assume the following graph:
>
> [image: image.png (example graph: step A, then steps B and C in parallel, then D, then E)]
>
> In this graph, steps B and C can be done concurrently, as they don't "modify" the vector. Steps D and E can't be done concurrently, but as they don't require look-back/look-ahead, they can be done at an offset.
>
> What I am suggesting is that, if there are enough cores, steps could be pinned to cores to achieve the benefits of instruction caching, and after step A is complete, steps B and C could be done concurrently. After B and C are complete, D can be started, and as D completes processing on a packet, it can then be processed by E (i.e., the entire vector does not need to be processed by D before processing by E is started).
>
> I make no argument that this doesn't increase complexity; it also introduces coordination costs that don't exist today. To be fair, offset processing could be viewed as splitting the original large vector into smaller vectors and processing the smaller vectors from start to finish (almost dynamic optimization based on dynamic vector resizing); see the sketch after this message.
>
> Just curious to hear others' thoughts, and whether some of this has been thought through or experimented with. As I said, just thinking off the cuff and wondering; not fully thought through.
>
> With respect,
>
> /david
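The split-vector reading of that idea can be sketched as below. Again purely illustrative: pkt_t, node_D, node_E, and the chunk size are invented for the example, and this is not how VPP schedules nodes today.

    #include <stddef.h>

    typedef struct { unsigned char data[64]; } pkt_t;

    /* Stand-ins for two consecutive graph nodes. */
    static void node_D (pkt_t **pkts, size_t n) { (void) pkts; (void) n; }
    static void node_E (pkt_t **pkts, size_t n) { (void) pkts; (void) n; }

    #define CHUNK 64   /* smaller chunks: E starts sooner, less i-cache reuse */

    /* "Offset" processing recast as split vectors: run each chunk through
       D and then E, so E begins long before D has seen the whole vector. */
    static void
    process_offset (pkt_t **pkts, size_t n)
    {
      for (size_t i = 0; i < n; i += CHUNK)
        {
          size_t m = (n - i < CHUNK) ? (n - i) : CHUNK;
          node_D (&pkts[i], m);
          node_E (&pkts[i], m);
        }
    }

The chunk size is the knob: as it shrinks, the scheme approaches per-packet pipelining and gives up the instruction-cache warmth that the large vector was buying in the first place, which is exactly the trade-off the thread is weighing.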