FYI https://community.juniper.net/blogs/nicolas-fevrier/2022/07/27/voq-and-dnx-pipeline
Cheers, Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale <l...@interlink.com.au> wrote:
>
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+na...@gmail.com>
>> wrote:
>>
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwob...@gmail.com> wrote:
>> > This is the parallelism part. I can take multiple instances of these
>> > memory/logic pipelines, and run them in parallel to increase the
>> > throughput.
>> ...
>> > I work on/with a chip that can forward about 10B packets per second… so
>> > if we go back to the order-of-magnitude number that I’m doing about “tens”
>> > of memory lookups for every one of those packets, we’re talking about
>> > something like a hundred BILLION total memory lookups… and since memory
>> > does NOT give me answers in 1 picosecond… we get back to pipelining and
>> > parallelism.
>>
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
>> my J2 example :)
>
> I suspect many folks know the exact answer for J2, but it's likely under NDA
> to talk about said specific answer for a given thing.
>
> Without being platform- or device-specific, the core clock rate of many
> network devices is often in a "goldilocks" zone of (today) 1 to 1.5 GHz, with
> a goal of 1 packet forwarded per clock. As LJ described, the pipeline doesn't
> mean a latency of 1 clock ingress-to-egress, but rather that every clock
> there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet
> rate is achieved by having enough pipelines in parallel to achieve that.
> The number here is often "1" or "0.5", so you can work the numbers backwards
> (e.g. it emits a packet every clock, or every 2nd clock).
>
> It's possible to build an ASIC/NPU to run a faster clock rate, but that gets
> back to what I'm hand-wavingly describing as "goldilocks". Look up power vs
> frequency and you'll see it's non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), the
> same holds true on network silicon: you can go wider, with multiple
> pipelines. But it's not 10K parallel slices; there are some parallel parts,
> but there are multiple 'stages' on each doing different things.
>
> Using your CPU comparison, there are some analogies here that do work:
> - you have multiple CPU cores that can do things in parallel -- analogous to
>   pipelines
> - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some
>   DRAM or LLC) -- maybe some lookup engines, or a centralized buffer/memory
> - most modern CPUs are out-of-order execution machines, where under the
>   covers a cache miss or DRAM fetch has a disproportionate hit on
>   performance, so it's hidden away from you as much as possible by
>   speculative, out-of-order execution
>   -- no direct analogy to this one; it's unlikely most forwarding pipelines
>   do speculative execution like a general-purpose CPU does, but they
>   definitely do 'other work' while waiting for a lookup to happen
>
> A common-garden x86 is unlikely to achieve such a rate for a few different
> reasons:
> - if packets-in or packets-out go via DRAM, then you need sufficient DRAM
>   (page opens/sec, DRAM bandwidth) to sustain at least one write and one read
>   per packet. Look closer at DRAM and its speed; pay attention to page
>   opens/sec and what that consumes.
> - one 'trick' is to not DMA packets to DRAM but instead have them go into
>   SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing - which at least
>   potentially saves you that DRAM write+read per packet
> - ... but then do e.g. an LPM lookup, and best case that is back to one
>   memory access per packet. Maybe it's in L1/L2/L3 cache, but at large table
>   sizes it likely isn't.
> - ... do more things to the packet (uRPF lookups, counters) and it's yet more
>   lookups.
>
> Software can achieve high rates, but note that a typical ASIC/NPU does on the
> order of >100 separate lookups per packet, and 100 counter updates per
> packet.
> Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in
> software on generic CPUs is also a series of tradeoffs.
>
>
> cheers,
>
> lincoln.
>
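
To put rough numbers on the "work the numbers backwards" point above, here is
a quick back-of-envelope sketch in Python. The clock rates and
packets-per-clock figures are illustrative assumptions, not vendor
specifications:

    # Rough sketch: how many parallel pipelines are needed to reach a target
    # packet rate, given a core clock and packets forwarded per clock per
    # pipeline? All inputs are assumptions for illustration only.
    import math

    def pipelines_needed(target_pps, clock_hz, pkts_per_clock):
        """Smallest n with clock_hz * pkts_per_clock * n >= target_pps."""
        return math.ceil(target_pps / (clock_hz * pkts_per_clock))

    # ~10 Bpps chip at an assumed 1.25 GHz, 1 packet per clock per pipeline:
    print(pipelines_needed(10e9, 1.25e9, 1.0))  # 8 pipelines
    # ~2 Bpps (the J2-class example) at an assumed 1 GHz, 0.5 packets/clock:
    print(pipelines_needed(2e9, 1.0e9, 0.5))    # 4 pipelines

The point is only the shape of the arithmetic: at a ~1 GHz-class clock and
roughly one packet per clock, single-digit pipeline counts already reach the
Bpps range.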
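
The "at least one write and one read per packet" DRAM point can be made
concrete the same way. Packet size, packet rate and per-channel bandwidth
below are assumptions for illustration:

    # Rough sketch: DRAM traffic just to land and fetch every packet once,
    # ignoring descriptors, table lookups and counter updates.
    pkt_bytes   = 64      # assume worst-case small packets
    pps         = 2e9     # the 2 Bpps example rate
    per_channel = 25e9    # assume roughly 25 GB/s usable per DDR channel

    bytes_per_sec = pps * pkt_bytes * 2  # one write in + one read out per packet
    print(bytes_per_sec / 1e9)           # 256.0 GB/s
    print(bytes_per_sec / per_channel)   # ~10 memory channels, before any lookups

Which is why DDIO / cache stashing matters: it takes that write+read out of
DRAM, but an LPM or other table lookup that misses the cache still puts you
back onto a per-packet memory access.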
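
And the closing ">100 lookups and ~100 counter updates per packet" figure is
easiest to appreciate as a per-packet cycle budget. Core count, clock and miss
latency below are generic assumptions:

    # Rough sketch: per-packet cycle budget on a generic x86 vs. the cost of a
    # DRAM miss. Core count, clock and miss latency are assumptions.
    pps      = 2e9     # target packet rate
    cores    = 32      # assumed cores dedicated to forwarding
    clock_hz = 3.0e9   # assumed core clock
    dram_ns  = 80      # assumed DRAM miss latency, in nanoseconds

    cycles_per_pkt = clock_hz / (pps / cores)   # 48 cycles per packet per core
    miss_cycles    = dram_ns * 1e-9 * clock_hz  # 240 cycles per DRAM miss
    print(cycles_per_pkt, miss_cycles)

A single DRAM miss already costs several times the whole per-packet budget,
which is the gap the NPU closes with pipelining, on-chip memories, and doing
'other work' while lookups are in flight.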