You're getting to the core of the question (sorry, I could not resist...) -- 
and again the complexity is as much in the terminology as anything else.

In EZChip, at least as we used it on the ASR9k, the chip had a bunch of 
processing cores, and each core performed some of the work on each packet.  I 
honestly don't know if the cores themselves were different or if they were all 
the same physical design, but they were definitely attached to different 
memories, and they definitely ran different microcode.  These cores were 
allocated to separate stages, and had names along the lines of {parse, search, 
encap, transmit} etc.  I'm sure these aren't 100% correct but you get the 
point.  Importantly, there were NOT the same number of cores for each stage, so 
when a packet went from stage A to stage B there was some kind of mux in 
between.  If you knew precisely that each stage had the same number of cores, 
you could choose to arrange it such that the packet always followed a 
"straight-line" through the processing pipe, which would make some parts of the 
implementation cheaper/easier.  
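To make that stage/mux structure concrete, here's a toy model (purely illustrative Python; the stage names and per-stage core counts are made up and are not the real EZchip layout):

    # Toy model of a staged NPU pipeline where each stage owns a different
    # number of cores, so packets have to be re-distributed ("muxed")
    # between stages instead of following a fixed straight line.
    from itertools import cycle

    # Hypothetical stage names and core counts -- not real EZchip numbers.
    STAGES = [("parse", 4), ("search", 8), ("encap", 4), ("transmit", 2)]

    # One round-robin "mux" per stage hands each packet to the next core.
    muxes = {name: cycle(range(cores)) for name, cores in STAGES}

    def forward():
        path = []
        for name, _cores in STAGES:
            core = next(muxes[name])        # mux picks a core in this stage
            path.append(f"{name}[{core}]")  # that core runs its stage's ucode
        return " -> ".join(path)

    for pkt in range(3):
        print(f"pkt{pkt}: {forward()}")

Because the per-stage core counts differ, the core index a packet lands on drifts apart from stage to stage -- which is exactly why the "straight-line" arrangement only works if every stage has the same number of cores.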

You're correct that the instruction set for stuff like this is definitely not 
ARM (nor x86 nor anything else standard), because the problem space you're 
optimizing for is a lot smaller than what you'd have on a more general purpose 
CPU.

The (enormous) challenge for running the same ucode on multiple targets is that 
networking has exceptionally high performance requirements  -- billions of 
packets per second is where this whole thread started!  Fortunately, we also 
have a much smaller problem space to solve than general purpose compute, 
although in a lot of places that's because we vendors have told operators 
"Look, if you want something that can forward a couple hundred terabits in a 
single system, you're going to have to constrain what features you need, 
because otherwise the current hardware just can't do it".  

To get that kind of performance without breaking the bank requires -- or at 
least has required up until this point in time -- some very tight integration 
between the hardware forwarding design and the microcode.  I was at Barefoot 
when P4 was released, and Tofino was the closest thing I've seen to a "general 
purpose network ucode machine" -- and even that was still very much optimized 
in terms of how the hardware was designed and built, and it VERY much required 
the P4 programmer to have a deep understanding of what hardware resources were 
available.  When you write a P4 program and compile it for an x86 machine, you 
can basically create as many tables and lookup stages as you want -- you just 
have to eat more CPU and memory accesses for more complex programs and they run 
slower.  But on a chip like Tofino (or any other NPU-like target) you're going 
to have finite limits on how many processing stages and memory tables exist... 
so it's more the case that when your program gets bigger it no longer "just 
runs slower" but rather it "doesn't run at all". 

The industry would greatly benefit from some magical abstraction layer that 
would let people write forwarding code that's both target-independent AND 
high-performance, but at least so far the performance penalty for making such 
code target independent has been waaaaay more than the market is willing to 
bear.

--lj

-----Original Message-----
From: Saku Ytti <s...@ytti.fi> 
Sent: Sunday, August 7, 2022 4:44 AM
To: ljwob...@gmail.com
Cc: Jeff Tantsura <jefftant.i...@gmail.com>; NANOG <nanog@nanog.org>; Jeff 
Doyle <jdo...@juniper.net>
Subject: Re: 400G forwarding - how does it work?

On Sat, 6 Aug 2022 at 17:08, <ljwob...@gmail.com> wrote:


> For a while, GSR and CRS type systems had linecards where each card had a 
> bunch of chips that together built the forwarding pipeline.  You had chips 
> for the L1/L2 interfaces, chips for the packet lookups, chips for the 
> QoS/queueing math, and chips for the fabric interfaces.  Over time, we 
> integrated more and more of these things together until you (more or less) 
> had a linecard where everything was done on one or two chips, instead of a 
> half dozen or more.  Once we got here, the next step was to build linecards 
> where you actually had multiple independent things doing forwarding -- on the 
> ASR9k we called these "slices".  This again multiplies the performance you 
> can get, but now both the software and the operators have to deal with the 
> complexity of having multiple things running code where you used to only have 
> one.  Now let's jump into the 2010's where the silicon integration allows you 
> to put down multiple cores or pipelines on a single chip, each of these is 
> now (more or less) its own forwarding entity.  So now you've got yet ANOTHER 
> layer of abstraction.  If I can attempt to draw out the tree, it looks like 
> this now:

> 1) you have a chassis or a system, which has a bunch of linecards.
> 2) each of those linecards has a bunch of NPUs/ASICs
> 3) each of those NPUs has a bunch of cores/pipelines

Thank you for this. I think we may have some ambiguity here. For now I'll ignore 
multichassis designs, as those went out of fashion, and describe only the 'NPU' 
style, not the Express/BRCM-style pipeline.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices, unsure if 
EZchip vocabulary is same here)
4) each NPU has 1 or more identical cores (well, I can't really name any with a 
single core; an NPU, like a GPU, pretty inherently has many, many cores, and 
unlike some in this thread I don't think they ever use the ARM instruction set, 
that makes no sense -- you create an instruction set targeting the application 
at hand, which ARM is not. Maybe some day we'll have some forwarding-IA, 
allowing customers to provide ucode that runs on multiple targets, but this 
would reduce the pace of innovation)

Some of those NPU core architectures are flat, like Trio, where a single core 
handles the entire packet. Other core architectures, like FP, are matrices: you 
have multiple lines, and a packet picks one of the lines and traverses each 
core in that line. (FP has many more cores per line, compared to the 
Leaba/Pacific stuff.)

--
  ++ytti
