As Lincoln said - all of us directly working with BCM and other silicon vendors have signed numerous NDAs. However, if you ask a well-crafted question, there's always a way to talk about it ;-)
In general, if we look at the whole spectrum, on one side there are massively parallelized "many-core" run-to-completion (RTC) ASICs, such as Trio, Lightspeed, and similar (as the last gasp of the Redback/Ericsson venture we built a 1400-HW-thread ASIC, Spider). On the other side of the spectrum are fixed-pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix, min features) moving through BCM Trident, Innovium, Barefoot (quite a different animal with respect to programmability), etc. - usually with only a shallow on-chip buffer (100-200M). In between we have the so-called programmable-pipeline silicon; BCM DNX and Juniper Express are in this category - usually a combination of OCB plus off-chip memory (most often HBM, 2-6G), and usually with line-rate, high-scale security and overlay encap/decap capabilities. They usually have highly optimized RTC blocks within the pipeline (RTC within a macro). The way (and speed at which) databases and memories are accessed evolves with each generation, and the number/speed of non-networking cores (usually ARM) keeps growing - OAM, INT, and local optimizations are their primary users.

Cheers,
Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale <l...@interlink.com.au> wrote:
>
>
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+na...@gmail.com> wrote:
>
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwob...@gmail.com> wrote:
>> > This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput.
>> ...
>> > I work on/with a chip that can forward about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picosecond… we get back to pipelining and parallelism.
>>
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like my J2 example :)
>
> I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about said specific answer for a given thing.
>
> Without being platform- or device-specific, the core clock rate of many network devices is often in a "goldilocks" zone of (today) 1 to 1.5 GHz, with a goal of 1 packet forwarded per clock. As LJ described the pipeline, that doesn't mean a latency of 1 clock ingress-to-egress, but rather that every clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that.
> The number here is often "1" or "0.5", so you can work the number backwards (e.g. it emits a packet every clock, or every 2nd clock).
>
> It's possible to build an ASIC/NPU to run at a faster clock rate, but that gets back to what I'm hand-wavingly describing as "goldilocks". Look up power vs frequency and you'll see it's non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), much the same holds true for network silicon: you can go wider, with multiple pipelines. But it's not 10K parallel slices; there are some parallel parts, but there are multiple 'stages' on each doing different things.
>
> Using your CPU comparison, there are some analogies here that do work:
> - you have multiple CPU cores that can do things in parallel -- analogous to pipelines
> - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC) -- maybe some lookup engines, or centralized buffer/memory
> - most modern CPUs are out-of-order execution, where under the covers a cache miss or DRAM fetch has a disproportionate hit on performance, so it's hidden away from you as much as possible by speculative out-of-order execution -- no direct analogy to this one; it's unlikely most forwarding pipelines do speculative execution like a general-purpose CPU does, but they definitely do 'other work' while waiting for a lookup to happen
>
> A common-or-garden x86 is unlikely to achieve such a rate for a few different reasons:
> - if packets-in or packets-out go via DRAM, then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closely at DRAM and its speed; pay attention to page opens/sec and what that consumes.
> - one 'trick' is to not DMA packets to DRAM but instead have them go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing - which at least potentially saves you that DRAM write+read per packet
> - ... but then do e.g. an LPM lookup, and best case that is back to a memory access per packet. Maybe it's in L1/L2/L3 cache, but at large table sizes it likely isn't.
> - ... do more things to the packet (uRPF lookups, counters) and it's yet more lookups.
>
> Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet. Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.
>
>
> cheers,
>
> lincoln.
>
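To put rough numbers on the "work the number backwards" point in Lincoln's reply, here is a back-of-the-envelope sketch in Python. The 1.25 GHz clock and the packets-per-clock-per-pipeline values are illustrative assumptions, not any specific device's figures:

import math

# Back-of-the-envelope: how many parallel pipelines does it take to hit a
# target packet rate, given a core clock and packets forwarded per clock?
# The clock and per-pipeline figures below are assumptions for illustration.

CLOCK_HZ = 1.25e9  # assumed "goldilocks" core clock of 1.25 GHz

def pipelines_needed(target_pps: float, pkts_per_clock: float) -> int:
    """Each pipeline forwards pkts_per_clock packets per clock cycle."""
    per_pipeline_pps = CLOCK_HZ * pkts_per_clock
    return math.ceil(target_pps / per_pipeline_pps)

for label, target in (("10 Bpps", 10e9), ("2 Bpps (the J2-style example)", 2e9)):
    for ppc in (1.0, 0.5):  # a packet every clock, or every 2nd clock
        print(f"{label}, {ppc} pkt/clock/pipeline -> "
              f"{pipelines_needed(target, ppc)} pipelines")

# Prints roughly:
#   10 Bpps, 1.0 pkt/clock/pipeline -> 8 pipelines
#   10 Bpps, 0.5 pkt/clock/pipeline -> 16 pipelines
#   2 Bpps (the J2-style example), 1.0 pkt/clock/pipeline -> 2 pipelines
#   2 Bpps (the J2-style example), 0.5 pkt/clock/pipeline -> 4 pipelines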
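And a similar sketch for the memory-side arithmetic quoted above: the DRAM bandwidth an x86 consumes if every packet takes one DRAM write and one DRAM read, plus the aggregate lookup and counter-update rates implied by the per-packet figures in the thread. The 300-byte packet size and the 100 Mpps software rate are assumed values for illustration only:

PKT_BYTES = 300    # assumed average packet size
PPS_SW    = 100e6  # assumed software forwarding target, 100 Mpps
PPS_ASIC  = 10e9   # the 10 Bpps chip discussed in the thread

# (a) x86 path where every packet takes one DRAM write and one DRAM read
dram_gbytes_per_sec = PPS_SW * PKT_BYTES * 2 / 1e9
print(f"DRAM traffic for packet data alone: ~{dram_gbytes_per_sec:.0f} GB/s")

# (b) aggregate lookup / counter-update rate in the ASIC/NPU case
for per_pkt in (10, 100):  # "tens" of lookups vs ~100 lookups per packet
    print(f"{per_pkt} lookups/packet at 10 Bpps -> {PPS_ASIC * per_pkt:.0e} lookups/sec")
print(f"100 counter updates/packet at 10 Bpps -> {PPS_ASIC * 100:.0e} updates/sec")

# Prints roughly:
#   DRAM traffic for packet data alone: ~60 GB/s
#   10 lookups/packet at 10 Bpps -> 1e+11 lookups/sec
#   100 lookups/packet at 10 Bpps -> 1e+12 lookups/sec
#   100 counter updates/packet at 10 Bpps -> 1e+12 updates/sec

Crude as these numbers are, they show why the thread keeps coming back to memory: the lookup rate, not the packet rate, is what the parallel pipelines and on-chip memories have to absorb.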