[ ... Moved to -net ... ]

"James E. Housley" wrote:
> I am just trying to count bytes in and out, to keep track of usage and
> head off a large overage and a larger bill than necessary.  Counting
> packets is worthless.  But just do the math.  With a GigE NIC, at what
> data rate do you start overflowing the counters too quickly.  I suppose
> there is another possibility, that the ti GigE driver is counting the
> data multiple times.  But I don't think so, because at 200Mbits/sec the
> counter should overflow in 172 seconds.  And this machine is easily
> doing this most of the day.
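
(For what it's worth, that arithmetic checks out: a 32-bit byte
counter wraps at 2^32 = 4,294,967,296 bytes, and 200Mbits/sec is
25 megabytes/sec, so 4,294,967,296 / 25,000,000 is roughly 172
seconds per wrap.)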

Do you get billed on retransmits?  I'm pretty sure that they are
not counted, unless you have a Tigon II, and have rewritten the
firmware, or have a non-disclosure with another vendor, a
license for their firmware, and have rewritten the firmware.

I think the place to count this stuff is at the router.

If your router *is* the FreeBSD box, then it makes sense for you
to do the counting; but it doesn't make sense for the rest of
us to do the counting.

Is your problem with packet granularity that it gives you no
better than an estimate, based on an average packet size, of
the amount of data you send?

You can keep a modular counter based on kilobytes (or even on
megabytes), where you keep an exact byte count internally, but
don't reflect that byte-level granularity out into the counter
itself.

In other words, make it accurate, but not precise.
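
Roughly what I mean, as an untested sketch (the struct and
function names here are made up for illustration, not the real
struct ifnet/if_data fields):

#include <stdint.h>

/*
 * Never drop a byte, but export the count in KB units, so a
 * 32-bit counter is good for ~4TB before wrapping instead of
 * ~4GB (or ~4PB, if you count in MB instead).
 */
struct coarse_counter {
	uint32_t cc_kbytes;	/* whole kilobytes transferred */
	uint32_t cc_residue;	/* leftover bytes, always < 1024 */
};

static void
coarse_count(struct coarse_counter *cc, uint32_t nbytes)
{
	cc->cc_residue += nbytes;		/* nbytes is a per-packet length */
	cc->cc_kbytes += cc->cc_residue >> 10;	/* fold in whole KB */
	cc->cc_residue &= 0x3ff;		/* keep the sub-KB remainder */
}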

> That all sounds reasonable.  And it makes sense to move the counters
> under existing locks.  But, 32-bit machines are going to be around for
> awhile longer and fast network connections are going to get faster and
> more common.  Maybe the counters should be completely removed from the
> 32-bit archs since they give such misleading results and only have them
> on the 64-bit machines.  That way no one will be confused by the data.
> 
> Of course I am not completely serious about removing the counters, but
> it is not hard to make them very wrong.

The problem with this is that it appears that you have a very
specific problem domain that, if fully mapped, will damage the
performance for the rest of us: a system running in the
neighborhood of gigabit throughput, for which every byte is
counted against you
as part of a cost metric (most people at that level have an
optical cable in from a NAP, and really don't care about bytes
transited because they are one of the top tier backbone
providers).

I would be much more comfortable with you slowing yourself down, and
not the rest of us.  To my mind, the ability to meter based on
this type of metric acts against flat rate pipes, and comes
down on the wrong side of the technology wall between the users,
who want to buy based on size of water pipe, and the providers,
who want to charge based on how much water goes through the pipe,
so that they can get their tax on every drop.  In other words,
if I had my way, it would be technologically impossible to meter
based on a metric like this (it's the one merit to a direct ATM
interconnect, IMO: inability to even store accounting records
fast enough without a supercomputer).

"...Of course I am not completely serious about removing the
 counters..."  8-)

Frankly, I have a hard time believing that you really have the
problem that you think you have.  Specifically, I have worked
on Gigabit equipment, and while it's nice and impressive
sounding to be able to say "GigaBit!" in an "I've just had my
cake frosted!" excited voice, in practice, there's not a real
colocation center on the planet that would let you talk out
their pipes at anywhere near that rate, and you would be really
hard put to find one that could talk fast enough to allow you
to pump a fully saturated 100Mbit interface out of your box.

I helped put a single Gigabit box in front of three of the top
ten porn sites in the U.S., and even while the damn thing was
starting up under load and before startup had fully completed,
the thing never got over 7% load, and in operation, it ran at
around 4% load steady-state, and that's *CPU load* on a 1GHz
Pentium, not even network load (which was a hell of a lot less).

--

This is getting way off topic, but here is a business case
illustration.

Are you perhaps doing what the Q/A people at a previous job were
doing, and stress-testing the crap out of a machine on a Gigabit
LAN, at or near wire speeds, when in the field, the equipment
is *NEVER*, *EVER* going to have to handle anywhere near even
1/40th of the load you are placing on it?


While it is natural -- even, in some ways, admirable -- for
Q/A people to want to test their products to destruction,
you are going to manufacture sev-1 bugs where none will exist
in deployment at customer sites, and these putative "show
stoppers" will cost you in time to market and other areas
where you really can't afford to be pissing away time over
nothing (e.g. if your sales force gets wind of them, they
will lose confidence in the product's ability to make your
customers happy, when in fact no such problem really exists).

While customers are likely to set up a test network like
yours, and stress test it, they are unlikely to be able to
duplicate your load.  In practice, this means anything that
can't be repeated with standard test tools (e.g. http_load,
etc.) in under 24 hours will not show up, even under their
"stress test" scenarios.

Customers care about equipment not failing, not about equipment
being infallible.

For example, if you have a problem that occurs once a day
at that level of amplification, in the field it will perhaps
occur once every month and a half, assuming that your customer
keeps the load up at that level, and the problem is unrelated
to resource starvation (e.g. a small memory leak in an uncommon
failure mode, where an allocation is not freed, when the
machine is under stress).  In other words, if someone were to
DDOS them for a month and a half at a dual OC3 facility like
Exodus or UUNet in San Jose, and they did absolutely nothing to
stop or curb the attack, then you might expect to see the problem
in the field within that month and a half.
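
(The amplification arithmetic, if it really were linear: a
once-a-day failure at 40x the field load becomes roughly a
once-every-40-days failure in the field, i.e. your month and a
half.)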

If your problem occurs once a week, then that grows to almost
a year before the problem is seen in the field, assuming that
there are no upgrades or anything else requiring a "bugfix"...
so half-life that: you have six months to come up with a fix
for it, and push it off as a "security upgrade" -- technically,
it is one -- assuming the customer plugs in your box and then
forgets it.  If they reboot or reconfigure, requiring a reboot,
then the clock starts all over again.

In fact, this assumes linear amplification: "multiply the data
rate by 40, and you multiply the failure rate by 40"; in the
real world, this relationship is exponential: something you see
at the stress breaking point of your product will be almost
impossible to repeat in the field.

-- Terry
