> It wasn't a CPU analysis, because switching ASICs != CPUs.
>
> I am aware of the x86 architecture, but know little of network ASICs,
> so I was deliberately trying not to apply my x86 knowledge here, in
> case it sent me down the wrong path. You made references to typical
> CPU features;
A CPU is 'jack of all trades, master of none'. An ASIC is 'master of
one specific thing'. If a given feature or design paradigm found in a
CPU fits the use case the ASIC is being designed for, there is no
reason it cannot be used.

On Mon, Jul 25, 2022 at 2:52 PM James Bensley
<jwbensley+na...@gmail.com> wrote:
> Thanks for the responses Chris, Saku…
>
> On Mon, 25 Jul 2022 at 15:17, Chris Adams <c...@cmadams.net> wrote:
> > Once upon a time, James Bensley <jwbensley+na...@gmail.com> said:
> > > The obvious answer is that it's not magic and my understanding is
> > > fundamentally flawed, so please enlighten me.
> >
> > So I can't answer your specific question, but I just wanted to say
> > that your CPU analysis is simplistic and doesn't really match how
> > CPUs work now.
>
> It wasn't a CPU analysis, because switching ASICs != CPUs.
>
> I am aware of the x86 architecture, but know little of network ASICs,
> so I was deliberately trying not to apply my x86 knowledge here, in
> case it sent me down the wrong path. You made references to typical
> CPU features;
>
> > For example, it might take 4 times as long to process the first
> > packet, but as long as the hardware can handle 4 packets in a
> > queue, you'll get a packet result every cycle after that, without
> > dropping anything. So maybe the first result takes 12 cycles, but
> > then you can keep getting a result every 3 cycles as long as the
> > pipeline is kept full.
>
> Yes, in the x86/x64 CPU world keeping the instruction cache and data
> cache hot indeed results in optimal performance. As you say, modern
> CPUs use parallel pipelines amongst other techniques like branch
> prediction, SIMD, and (N)UMA, but I would assume (because I don't
> know) that not all of the x86 feature set maps nicely to packet
> processing in ASICs (VPP uses these techniques on COTS CPUs to
> emulate a fixed pipeline, rather than a run-to-completion model).
>
> You and Saku both suggest that heavy parallelism is the magic sauce;
>
> > Something can be "line rate" but not push the first packet
> > through in the shortest time.
>
> On Mon, 25 Jul 2022 at 15:16, Saku Ytti <s...@ytti.fi> wrote:
> > I.e. say the JNPR Trio PPE has many threads, and only one thread is
> > running; the rest of the threads are waiting for answers from
> > memory. That is, once we start pushing packets through the device,
> > it takes a long-ass time (like single-digit microseconds) before we
> > see any packets out. 1000x longer than your calculated single-digit
> > nanoseconds.
>
> In principle I accept this idea, but let's try to do the maths, as
> I'd like to properly understand;
>
> The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, and my
> example scenario was a single J2 chip in a 12x400G device. If each
> port is receiving 400G @ 284 bytes (164,473,684 pps), that's one
> packet every 6.08 nanoseconds coming in. What kind of parallelism is
> required to stop ingress drops?
>
> Say it takes 5 microseconds to process and forward a packet (which
> seems reasonable looking at some Arista data sheets that use J2
> variants); that means we need to be operating on 5,000ns / 6.08ns ==
> 822 packets per port simultaneously, so 9,868 packets being processed
> across all 12 ports simultaneously, to stop ingress drops on all
> interfaces.
>
> I think the latest generation of Trio has 160 PPEs per PFE, but I'm
> not sure how many threads per PPE. Older generations had 20
> threads/contexts per PPE, so if that hasn't increased it would make
> for 3,200 threads in total. That is a 1.6Tbps FD chip, although it's
> not apples to apples of course, as Trio is run-to-completion.
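James's arithmetic checks out, and it is easy to reproduce. A minimal
sketch in Python, with one assumption of mine: the quoted 164,473,684
pps figure only works out if it counts 20 bytes of preamble plus
inter-frame gap per frame (304 bytes on the wire), since 284 bytes
alone would give ~176 Mpps. The 5 us latency and the 160 x 20 Trio
figures are taken at face value from the thread:

# Back-of-the-envelope check of the numbers quoted above.
# Assumption (mine): 284B frames carry 20B of wire overhead
# (8B preamble + 12B inter-frame gap), i.e. 304B per frame.

LINE_RATE_BPS = 400e9    # one 400G port
FRAME_BYTES   = 284
WIRE_OVERHEAD = 20       # preamble + IFG (assumed)
PORTS         = 12
LATENCY_NS    = 5_000    # assumed time to process and forward a packet

wire_bits       = (FRAME_BYTES + WIRE_OVERHEAD) * 8   # 2,432 bits
pps_per_port    = LINE_RATE_BPS / wire_bits           # ~164,473,684
interarrival_ns = 1e9 / pps_per_port                  # ~6.08 ns

in_flight_per_port = LATENCY_NS / interarrival_ns     # ~822
in_flight_total    = in_flight_per_port * PORTS       # ~9,868

trio_threads = 160 * 20                               # 3,200 contexts

print(f"{pps_per_port:,.0f} pps/port, one every {interarrival_ns:.2f} ns")
print(f"~{in_flight_per_port:,.0f} packets in flight per port, "
      f"~{in_flight_total:,.0f} across {PORTS} ports")
print(f"Trio threads (assumed 160 x 20): {trio_threads:,}")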
> The Nokia FP5 has 1,200 cores (I have no idea how many threads per
> core) and is rated for 4.8Tbps FD. Again, it is doing something quite
> different to a J2 chip, and again it's RTC.
>
> J2 is a partially-fixed pipeline, but slightly programmable if I have
> understood correctly, and definitely at the other end of the spectrum
> compared to RTC. So are we to surmise that a J2 chip has circa 10k
> parallel pipelines, in order to process 9,868 packets in parallel?
>
> I have no frame of reference here, but in comparison to Gen 6 Trio or
> the FP5, that seems very high to me (to the point where I assume I am
> wrong).
>
> Cheers,
> James.
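One more way to frame James's closing question: the required
concurrency is just Little's law, L = lambda x W (packets in flight ==
aggregate packet rate x time each packet spends in the chip), and it
is the same whether the chip is run-to-completion or pipelined. A
rough comparison sketch under the same assumed 5 us time-in-chip, with
per-chip pps derived from the quoted FD ratings at 284B + 20B wire
overhead, and a placeholder of 1 thread per FP5 core since the real
number is unknown:

# Little's law: packets in flight L = lambda * W.
# lambda: aggregate pps derived from the quoted FD rating at
# 284B frames + 20B wire overhead (2,432 bits per packet).
# W: time in chip, assumed 5us for all three (a big assumption).
# FP5 threads/core is unknown; 1 is a placeholder, not a spec.

BITS_PER_PKT = (284 + 20) * 8
LATENCY_S    = 5e-6

chips = {
    # name: (FD rating in bps, known parallel contexts or None)
    "J2":        (4.8e12, None),       # the 12x400G example
    "Trio gen6": (1.6e12, 160 * 20),   # 3,200 thread contexts
    "FP5":       (4.8e12, 1200 * 1),   # 1,200 cores, threads unknown
}

for name, (fd_bps, contexts) in chips.items():
    pps = fd_bps / BITS_PER_PKT
    needed = pps * LATENCY_S           # packets that must be in flight
    known = f"{contexts:,}" if contexts else "?"
    print(f"{name}: ~{needed:,.0f} in flight needed, "
          f"known contexts: {known}")

On these assumptions, Trio gen 6 needs ~3,289 packets in flight
against its ~3,200 known thread contexts, which suggests the 5 us
guess is in the right ballpark, and that ~10k packets in flight for a
4.8T chip is not an absurd number. Whether J2 realises that as circa
10k parallel pipelines, or as fewer, deeper pipelines with many
packets in flight per stage, is exactly the part the data sheets
don't say.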