Mandatory laundry-analogy slide for pipelining: https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html
On Tue, 26 Jul 2022 at 12:41, Lawrence Wobker <ljwob...@gmail.com> wrote:
>
> "Pipeline", in the context of networking chips, is not a terribly
> well-defined term! In some chips, you'll have an almost-literal pipeline
> that is built from very rigid hardware logic blocks. The first block does
> exactly one part of the packet forwarding, then hands the packet (or just
> the header and metadata) to the second block, which does another portion
> of the forwarding. You build the pipeline out of as many blocks as you
> need to solve your particular networking problem, and voila!
>
> The advantage here is that you can make things very fast and power
> efficient, but they aren't all that flexible, and deity help you if you
> ever need to do something in a different order than your pipeline!
>
> You can also build a "pipeline" out of software functions: write up some
> Python code (because everyone loves Python, right?) where function A
> calls function B and so on. At some level, you've just built a pipeline
> out of different software functions. This is going to be a lot slower
> (C code will be faster, but nowhere near as fast as dedicated hardware),
> but it's WAY more flexible. You can more or less dynamically build your
> "pipeline" on a packet-by-packet basis, depending on what features and
> packet data you're dealing with. (There's a toy sketch of this after the
> message.)
>
> "Microcode" is really just a term we use for something like "really
> optimized and limited instruction sets for packet forwarding". Just like
> an x86 or an ARM has some finite set of instructions that it can execute,
> so do current networking chips. The larger that instruction space is, and
> the more combinations of those instructions you can store, the more
> flexible your code is. Of course, you can't make that part of the chip
> bigger without making something else smaller, so there's another
> tradeoff.
>
> MOST current chips are really a hybrid/combination of these two
> extremes. You have some set of fixed logic blocks that do exactly One Set
> Of Things, and you have some other logic blocks that can be reconfigured
> to do A Few Different Things. The degree to which the programmable stuff
> is programmable is a major input to how many different features you can
> do on the chip, and at what speeds. Sometimes you can use the same
> hardware block to do multiple things on a packet if you're willing to
> sacrifice some packet rate and/or bandwidth. The constant "law of
> physics" is that you can always do a given function in less
> power/space/cost if you're willing to optimize for that specific thing --
> but you're sacrificing flexibility to do it. The more flexibility
> ("programmability") you want to add to a chip, the more logic and memory
> you need to add.
>
> From a performance standpoint, on current "fast" chips, many (but
> certainly not all) of the "pipelines" are designed to forward one packet
> per clock cycle for "normal" use cases. (Of course, we sneaky vendors get
> to decide what is normal and what's not, but that's a separate issue...)
> So if I have a chip that has one pipeline and it's clocked at 1.25 GHz,
> that means it can forward 1.25 billion packets per second. Note that this
> does NOT mean that I can forward a packet in "a
> one-point-two-five-billionth of a second" -- but it does mean that every
> clock cycle I can start on a new packet and finish another one. The
> length of the pipeline impacts the latency of the chip, although this
> part of the latency is often a rounding error compared to the number of
> times I have to read and write the packet into different memories as it
> goes through the system.
>
> So if this pipeline can do 1.25 billion PPS and I want to be able to
> forward 10 BPPS, I can build a chip that has 8 of these pipelines and get
> my performance target that way. I could also build a "pipeline" that
> processes multiple packets per clock: if I have one that does 2
> packets/clock, then I only need 4 of said pipelines... and so on and so
> forth (sketched below). The exact details of how the pipelines are
> constructed, and how much parallelism I build INSIDE a pipeline as
> opposed to replicating pipelines, are sort of Gooky Implementation
> Details, but they're a very, very important part of the chip-level
> architecture, as those sorts of decisions drive lots of Other Important
> Decisions in the silicon design...
>
> --lj
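LJ's "pipeline out of software functions" point is easy to make concrete. Below is a minimal Python sketch of that idea; every stage name and packet field here is made up for illustration, not taken from any real forwarding code:

```python
# A toy software "pipeline": each stage is just a function that takes a
# packet (here, a dict of header fields/metadata) and returns it, possibly
# modified. All stage names and fields are hypothetical.

def parse(pkt):
    pkt["parsed"] = True          # pretend we extracted the headers
    return pkt

def lookup(pkt):
    pkt["egress_port"] = 7        # pretend we did a forwarding-table lookup
    return pkt

def decrement_ttl(pkt):
    pkt["ttl"] -= 1
    return pkt

def rewrite(pkt):
    pkt["dst_mac"] = "next-hop"   # pretend we rewrote the L2 header
    return pkt

def forward(pkt, stages):
    """Run a packet through a list of stages, one after another."""
    for stage in stages:
        pkt = stage(pkt)
    return pkt

# The flexibility LJ describes: the "pipeline" is just a list, so it can
# be reassembled per packet.
pkt = {"ttl": 64}
stages = [parse, lookup, decrement_ttl, rewrite]
if pkt["ttl"] <= 1:               # e.g., short-circuit expiring packets
    stages = [parse]
print(forward(pkt, stages))
```

The last few lines are the whole point of the toy: because the pipeline is just a list of functions, its shape can change packet by packet, which is exactly the flexibility a hard-wired pipeline of fixed logic blocks gives up.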
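And a quick sketch of the throughput arithmetic in the last two paragraphs. The 1.25 GHz clock, the 10 BPPS target, and the one-packet-per-clock behavior are LJ's numbers; the 100-stage pipeline in the latency line is a hypothetical to show why throughput and latency differ:

```python
# Back-of-the-envelope version of the throughput math: a pipeline that
# starts/finishes a packet every clock forwards clock_hz * packets_per_clock
# packets per second; replicate pipelines until you hit the target rate.

import math

clock_hz = 1.25e9            # 1.25 GHz pipeline clock (LJ's example)
target_pps = 10e9            # 10 billion packets per second

def pipelines_needed(clock_hz, packets_per_clock, target_pps):
    per_pipeline_pps = clock_hz * packets_per_clock
    return math.ceil(target_pps / per_pipeline_pps)

print(pipelines_needed(clock_hz, 1, target_pps))   # -> 8
print(pipelines_needed(clock_hz, 2, target_pps))   # -> 4

# Throughput is not latency: a packet still spends (stages / clock) inside
# the pipeline, e.g. a hypothetical 100-stage pipeline at 1.25 GHz:
stages = 100
print(stages / clock_hz)     # -> 8e-08 seconds (80 ns) in the pipeline alone
```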