The Broadcom KBP -- often called an "external TCAM" -- is really closer to a 
completely separate NPU than to just an external TCAM.  "Back in the day" we 
used external TCAMs to store forwarding state (FIB tables, ACL tables, 
whatever) on devices that were pretty much just a bunch of TCAM memory plus an 
interface for the "main" NPU to ask for a lookup.  Modern KBP devices have WAY 
more functionality: they have lots of different databases and tables 
available, which can be sliced and diced into different widths and depths, and 
they can store lots of different kinds of state, from counters to LPM prefixes 
and ACLs.  At the risk of correcting Ohta-san: note that most ACLs are 
implemented using TCAMs with wildcard/masking support, as opposed to an 
exact-match lookup.  Exact-match lookups are generally used for things that do 
not require masking or wildcard bits: MAC addresses and MPLS label values are 
the canonical examples here.  
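To make the masked-vs-exact distinction concrete, here's a tiny Python sketch 
of the matching semantics.  It's purely illustrative -- a real TCAM compares 
the key against every entry in parallel in a single cycle, and the entry 
values here are made up:

```python
# Illustrative model of TCAM (value + mask) matching vs. exact match.
# A real TCAM does all comparisons in parallel in one cycle; we loop
# here only to show the semantics.

def tcam_match(key, entries):
    """Return the result of the first (highest-priority) entry whose
    value matches the key on every bit where the mask is 1 ("care" bits)."""
    for value, mask, result in entries:
        if (key & mask) == (value & mask):
            return result
    return None

def exact_match(key, table):
    """Exact match: every bit must match (think MAC address or MPLS label)."""
    return table.get(key)

# ACL-style TCAM table: match any key whose top 4 bits are 0b1010.
acl = [(0b10100000, 0b11110000, "permit"),
       (0b00000000, 0b00000000, "deny")]   # catch-all, mask of all zeros

print(tcam_match(0b10101111, acl))  # -> permit (low 4 bits are "don't care")
print(tcam_match(0b01101111, acl))  # -> deny

# Exact-match table: MPLS label -> action.
labels = {100: "pop", 200: "swap"}
print(exact_match(100, labels))     # -> pop
```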

The SRAM memories used in fast networking chips are almost always built such 
that they provide one lookup per clock, although hardware designers often use 
multiple banks of these to increase the number of *effective* lookups per 
clock.  TCAMs are also generally built such that they provide one lookup/result 
per clock, but again you can stack up multiple devices to increase this.
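As a back-of-the-envelope illustration of why banking matters, the lookup 
budget scales linearly with banks.  All the numbers below (clock rate, bank 
count, lookups per packet) are made-up assumptions, not any real device's 
specs:

```python
# Rough lookup-budget arithmetic: one lookup per clock, per bank.
# All figures are illustrative assumptions, not real device specs.

clock_hz = 1.0e9        # assume a 1 GHz memory clock
banks = 4               # assume 4 independent banks/devices in parallel
lookups_per_packet = 8  # assume each packet needs 8 table lookups

effective_lookups_per_sec = clock_hz * banks
packets_per_sec = effective_lookups_per_sec / lookups_per_packet
print(f"{packets_per_sec / 1e6:.0f} Mpps")  # -> 500 Mpps with these numbers
```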

Many hardware designs also allow for more flexibility in how the various 
memories are utilized by the software -- almost everyone is familiar with the 
idea of "I can have a million entries of X bits, or half a million entries of 
2*X bits".  If the hardware and software complexity were free, we'd design 
memories that could be arbitrarily chopped into exactly the sizes we need, but 
that complexity is Absolutely Not Free... so we end up picking a few discrete 
sizes and the software/forwarding code has to figure out how to use those bits 
efficiently.  And you can bet your life that as soon as you have a memory that 
can function using either 80b or 160b entries, you will immediately come across 
a use case that really really needs to use entries of 81b.
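A sketch of that capacity math, with made-up numbers: a memory with a fixed 
pool of raw bits trades depth for width, and an 81b entry that doesn't fit an 
80b slot burns a whole 160b slot:

```python
# Depth-vs-width tradeoff over a fixed pool of raw memory bits.
# Sizes are illustrative, not any real device's capacity.

RAW_BITS = 1_000_000 * 80          # raw storage for 1M entries of 80b

def entries(width_bits):
    """How many entries of a given width fit in the raw bit pool."""
    return RAW_BITS // width_bits

print(entries(80))    # -> 1,000,000 entries at 80b
print(entries(160))   # ->   500,000 entries at 2*80b

# The painful case: hardware only offers 80b or 160b slots, so an 81b
# entry must occupy a 160b slot, wasting 79 bits per entry.
def slot_width(needed_bits, supported=(80, 160)):
    """Smallest supported slot width that holds the entry."""
    return min(w for w in supported if w >= needed_bits)

print(slot_width(81))           # -> 160
print(entries(slot_width(81)))  # -> 500,000: one extra bit halved capacity
```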

FYI: There's nothing particularly magical about 40b memory widths.  When 
building these chips you can (more or less) pick whatever width of SRAM you 
want to build, and the memory libraries that you use spit out the corresponding 
physical design.

Ohta-san correctly mentions that a critical part of the performance analysis is 
how fast the different parts of the pipeline can talk to each other.  Note that 
this concept applies whether we're talking about the connection between very 
small blocks within the ASIC/NPU, or the interface between the NPU and an 
external KBP/TCAM, or for that matter between multiple NPUs/fabric chips within 
a system.  At some point you'll always be constrained by whatever the slowest 
link in the pipeline is, so balancing all that stuff out is Yet One More Thing 
for the system designer to deal with.
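The "slowest link" point reduces to trivial arithmetic: the sustained rate of 
the whole pipeline is the minimum over every stage and interconnect.  The 
stage names and rates below are made-up for illustration:

```python
# End-to-end throughput is bounded by the slowest stage or interconnect.
# Stage rates are illustrative assumptions, in packets per second.

stages = {
    "parser":          800e6,
    "NPU<->KBP link":  450e6,   # external lookup interface
    "lookup pipeline": 600e6,
    "fabric link":     700e6,
}

bottleneck = min(stages, key=stages.get)
print(f"{bottleneck} caps the system at {stages[bottleneck] / 1e6:.0f} Mpps")
# -> the NPU<->KBP link at 450 Mpps limits the whole pipeline
```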



--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail....@nanog.org> On Behalf Of Masataka 
Ohta
Sent: Wednesday, July 27, 2022 9:09 AM
To: nanog@nanog.org
Subject: Re: 400G forwarding - how does it work?

James Bensley wrote:

> The BCM16K documentation suggests that it uses TCAM for exact matching 
> (e.g.,for ACLs) in something called the "Database Array"
> (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in 
> something called the "User Data Array" (with 16M 32b entries?).

Which documentation?

According to:

        https://docs.broadcom.com/docs/16000-DS1-PUB

figure 1 and related explanations:

        Database records 40b: 2048k/1024k.
        Table width configurable as 80/160/320/480/640 bits.
        User Data Array for associated data, width configurable as
        32/64/128/256 bits.

means that the header extracted by the 88690 is analyzed by the 16K, finally 
resulting in 40b (a lot shorter than IPv6 addresses, but still may be enough 
for an IPv6 backbone to identify sites) of information by a "database" 
lookup -- obviously by CAM, because 40b is painful for SRAM -- which is then 
converted to "32/64/128/256 bits data".

> 1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which 
> is within the access time of TCAM and SRAM

As high-speed TCAM and SRAM should be pipelined, the cycle time, which is what 
matters, is shorter than the access time.

Finally, it should be pointed out that most, if not all, performance figures 
such as MIPS and FLOPS are merely guaranteed not to be exceeded.

In this case, if deep packet inspection of lengthy headers is required for 
some complicated routing schemes or to satisfy NSA requirements, the 
communication speed between the 88690 and the 16K will be the limiting factor 
for PPS, resulting in a lot less than the maximum possible PPS.

                                                Masataka Ohta
