Very helpful and interesting. Thank you.

On Fri, Jan 26, 2018 at 11:16 AM Dave Barach (dbarach) <dbar...@cisco.com>
wrote:

> Dear David,
>
>
>
> A bit of history. We worked on vpp for a decade before making any serious
> effort to multi-thread it. The first scheme that I tried was to break up
> the graph into reconfigurable pipeline stages. Effective partitioning of
> the graph is highly workload-dependent, and it can change in a heartbeat.
> The resulting system runs at the speed of the slowest pipeline stage.
>
>
>
> In terms of easily measured inter-thread handoff cost, it’s not awful: 2-3
> clocks/pkt. But handing vectors of packets between threads can cause a festival
> of cache-coherence traffic, and it can easily undo the positive effects of
> DDIO (packet data DMA into the cache hierarchy).
>
>
>
> We actually use the scheme you describe in a very fine-grained way: dual
> and quad loop graph dispatch functions process 2 or 4 packets at the same
> time. Until we run out of registers, a superscalar CPU can “do the same
> thing to 2 or 4 packets at the same time” pretty effectively. Including
> memory hierarchy stalls, vpp averages more than two instructions retired
> per clock cycle.
>
>
>
> At the graph node level, I can’t see how to leverage this technique.
> Presenting [identical] vectors to 2 (or more) nodes running on multiple
> threads would mean that (1) the parallelized subgraph would run at the speed
> of the slowest node; (2) you’d pay the handoff costs already discussed above;
> (3) you’d need an expensive algorithm to make sure that all vector
> replicas were finished before reentering sequential processing; and (4) none
> of the graph nodes we’ve ever constructed are free of ordering constraints.
> Every node alters packet state in a meaningful way, or it wouldn’t be
> worth having. (😉)…
>
>
>
> We’ve had considerable success with flow-hashing across a set of identical
> graph replicas [worker threads], even when available hardware RSS hashing
> is not useful [think about NATted UDP traffic].
>
>
>
> Hope this is of some interest.
>
>
>
> Thanks… Dave
>
>
>
> *From:* vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] *On
> Behalf Of *David Bainbridge
> *Sent:* Friday, January 26, 2018 12:39 PM
> *To:* vpp-dev@lists.fd.io
> *Subject:* [vpp-dev] VPP Graph Optimization
>
>
>
> I have just started to read up on VPP/FD.io, and I have a question about
> graph optimization and was wondering if (as I suspect) this has already
> been thought about and either planned or decided against.
>
>
>
> The documentation I found on VPP essentially says that VPP uses batch
> processing: all packets in a vector are processed by one step before
> proceeding to the next step. The claim is that this provides better overall
> throughput because of instruction caching.
>
>
>
> I was wondering whether optimization of the graph has been considered, both
> to understand where concurrency can be leveraged and where the vector could
> be processed by two steps with an offset. If this is possible, then steps
> could be pinned to cores and perhaps both concurrency and instruction
> caching could be leveraged.
>
>
>
> For example assume the following graph:
>
>
>
> [image: image.png — example graph: A → {B, C} → D → E]
>
>
>
> In this graph, steps B and C can be done concurrently, as they don't "modify"
> the vector. Steps D and E can't be done concurrently, but since they don't
> require look-back/look-ahead, they can be done with an offset.
>
>
>
> What I am suggesting is that, if there are enough cores, steps could be
> pinned to cores to achieve the benefits of instruction caching. After
> step A is complete, steps B and C could be done concurrently; after B and C
> are complete, D can be started, and as D completes processing on a packet it
> can then be processed by E (i.e., the entire vector does not need to be
> processed by D before processing by E starts).
>
>
>
> I make no argument that this doesn't increase complexity and introduce
> coordination costs that don't exist today. To be fair, offset
> processing could be viewed as splitting the original large vector into
> smaller vectors and processing the smaller vectors from start to finish
> (almost dynamic optimization based on dynamic vector resizing).
>
> Just curious to hear others' thoughts and whether some of this has been
> thought through or experimented with. As I said, just thinking off the
> cuff and wondering; not fully thought through.
>
>
>
> with respect,
>
> /david
>
>
>
_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
