If your 95th percentile utilization is at 80% capacity...
s/80/60/
s/60/40/
I would suggest that the reason each of you has a different number is that there's a different best number for each case. Looking for any single number to fit all cases, rather than understanding the underlying process, is unlikely to yield good results.

First, different people have different requirements. Some people need lowest possible cost, some people need lowest cost per volume of bits delivered, some people need lowest cost per burst capacity, some need low latency, some need low jitter, some want good customer service, some want flexible payment terms, and undoubtedly there are a thousand other possible qualities.

Second, this is a binary digital network. It's never 80% full, it's never
60% full, and it's never 40% full. It's always exactly 100% full or exactly 0% full. If SNMP tells you that you've moved 800 megabits in a second on a one-gigabit pipe, then, modulo any bad implementations of SNMP, your pipe was 100% full for eight-tenths of that second. SNMP does not "hide" anything. Applying any percentile function to your data, on the other hand, does hide data. Specifically, it discards all of your data except a single point, irreversibly. So if you want to know anything about your network, you won't be looking at percentiles.

Having your circuit be 100% full is a good thing, presuming you're paying
for it and the traffic has some value to you. Having it be 100% full as much of the time as possible is a good thing, because that gives you a high ratio of value to cost. Dropping packets, on the other hand, is likely to be a bad thing, both because each packet putatively had value, and because many dropped packets are likely to be resent, and a resent packet is one you've paid for twice, and one that's precluded the sending of another new, paid-for packet in that timeframe. The cost of not dropping packets is not having buffers overflow, and the cost of not having buffers overflow is either having deep buffers, which means high latency, or having customers with a predictable flow of traffic.

Which brings me to item three. In my experience, the single biggest contributor to buffer overflow is having in-feeding (or downstream customer) circuits whose burst capacity is too close to that of the out-feeding (or upstream transit) circuits. Let's say that your outbound
circuit is a gigabit, you have two inbound circuits that are a gigabit and run at 100% utilization 10% of the time each, and you have a megabit of buffer memory allocated to the outbound circuit. Assuming the two bursts are independent, 1% of the time both of the inbound circuits will be at 100% utilization simultaneously. When that's happening, you'll have data flowing in at the rate of two gigabits per second and out at one, so the buffer gains a net gigabit per second, which will fill a megabit of buffer in about a millisecond, if it persists. And, just like Rosencrantz and Guildenstern flipping coins, such a run will inevitably persist longer than you'd desire, frequently enough.

On the other hand, if you have twenty inbound circuits of 100 megabits each, which are transmitting at 100% of capacity 10% of the time each, you're looking at exactly the same amount of data; however, it arrives _much more predictably_, since the full 2-gigabit inflow would only occur about 10^-18 percent of the time (0.1 to the twentieth power), rather than 1% of the time. And it would also be proportionally unlikely to persist for the longer periods of time necessary to overflow the buffer.

Thus Kevin's ESnet customers, who are much more likely to be 10gb or 40gb downstream circuits feeding into his 40gb upstream circuits, are much more
likely to overflow buffers than a consumer Internet provider who's feeding 1mb circuits into a gigabit circuit, even if the aggregation ratio of the latter is hundreds of times higher.

So, in summary: Your dropped packet counters are the ones to be looking at as a measure of goodput, more than your utilization counters. And keep the size of your aggregation pipes as much bigger than the size of the pipes you aggregate into them as you can afford to.

As always, my apologies to those of you for whom this is unnecessarily remedial, for using NANOG bandwidth and a portion of your Sunday morning.
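
For anyone who wants to put rough numbers on the two aggregation cases described earlier, here is a minimal Python sketch. It assumes each inbound circuit bursts independently of the others (real traffic is often correlated, so treat the results as optimistic), and it treats buffer fill time as buffer size divided by net inflow. The function names `prob_at_least` and `fill_time_seconds` are made up for illustration. The "any excess" figure for the twenty-circuit case is an extra assumption of this sketch: it counts every moment when eleven or more circuits burst at once, since that is already enough to exceed a gigabit.

```python
from math import comb

GIGABIT = 1_000_000_000
MEGABIT = 1_000_000


def prob_at_least(n_circuits: int, p_burst: float, k: int) -> float:
    """Chance that at least k of n circuits burst at the same instant.

    Assumes independent bursts (binomial tail); this is an assumption of
    the sketch, not a property of real traffic.
    """
    return sum(
        comb(n_circuits, i) * p_burst**i * (1 - p_burst) ** (n_circuits - i)
        for i in range(k, n_circuits + 1)
    )


def fill_time_seconds(buffer_bits: float, inflow_bps: float, outflow_bps: float) -> float:
    """Seconds until the buffer is full at a sustained inflow, or infinity if it never fills."""
    excess_bps = inflow_bps - outflow_bps
    if excess_bps <= 0:
        return float("inf")
    return buffer_bits / excess_bps


# Case A: two 1-gigabit inbound circuits, each bursting 10% of the time.
p_two_big = prob_at_least(2, 0.10, 2)

# Case B: twenty 100-megabit inbound circuits, each bursting 10% of the time.
p_twenty_all = prob_at_least(20, 0.10, 20)          # full 2-gigabit inflow
p_twenty_any_excess = prob_at_least(20, 0.10, 11)   # anything over 1 gigabit

# One megabit of buffer, 2 gigabits in, 1 gigabit out.
t_fill = fill_time_seconds(MEGABIT, 2 * GIGABIT, GIGABIT)

if __name__ == "__main__":
    print(f"two 1G circuits, both bursting:      {p_two_big:.3g}")
    print(f"twenty 100M circuits, all bursting:  {p_twenty_all:.3g}")
    print(f"twenty 100M circuits, 11+ bursting:  {p_twenty_any_excess:.3g}")
    print(f"time to fill 1 Mb buffer at 2G in:   {t_fill:.3g} s")
```

Even counting every case where the smaller circuits exceed the outbound gigabit at all, the coincidence is orders of magnitude rarer than with two large circuits, which is the point of keeping aggregation pipes much bigger than what feeds them.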
-Bill