On 15.04.2013 23:43, Poul-Henning Kamp wrote:
In message <516c515a.9090...@freebsd.org>, Alexander Motin writes:

I propose to switch those
statistics from using binuptime() to getbinuptime() to solve the problem
globally.

No objections here, but I wonder if you were able to compare the results
somehow before and after the change so we have some hard numbers to show
that we don't lose much by applying the change.

I haven't tested it statistically, but I haven't noticed any visual
difference in gstat output with its 0.1ms displayed resolution.

I have tested it statistically, back when I wrote GEOM:  It leads
to very significant statistical bias.

Just about the only thing in devstat that has any predictive power
with respect to filesystem performance is the latency, which measures
how long it takes to satisfy each I/O request.

If you run gstat(8), this is the "ms/*" numbers:  milliseconds per
this or that.

The rest of what's in devstat, with the exception of the queue-length
("L(q)"), has almost no predictive power and is, IMO, practically
pointless.  In particular the %busy is totally misleading and I
deeply regret that I didn't fight to kill it back then.

If you switch to getbinuptime(), the latency measurements will only
be precise if the I/O operations take much longer than the timecounter
update period, which is not guaranteed to be 1000 Hz btw.

For measuring how much USB-sticks suck, that will work fine.

For tuning anything on a non-ridiculous SSD device or modern
harddisks, it will be useless because the bias you introduce is
*not* one which averages out over many operations.

Could you please explain why? Unless disk I/O is somehow aliased to hardclock(), each request should get a random error from 0 to max(1ms, 1s/HZ). With a large number of I/Os that error should be hidden when calculating the average time. I am not talking about microseconds, but I think a fraction of a millisecond should be realistic to get.

The fundamental problem is that on a busy system, getbinuptime()
does not get called at random times, it will be heavily affected
by the I/O traffic, because of the interrupts, the bus-traffic
itself, the cache-effects of I/O transfers and the context-switches
by the processes causing the I/O.

I'm sorry, but I am not sure I understand the above paragraphs. Do you mean that in some realistic conditions (not counting entering the debugger with interrupts disabled, etc.) hardclock() can be delayed by more than some significant percentage of its period, and that the delay depends on the I/O traffic itself? Or do you mean that disk I/Os are somehow aliased with hardclock(), making it impossible to hide the error by averaging?
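
To make the two positions concrete, here is a minimal sketch (not kernel code) of a clock that only advances at tick boundaries, which is roughly how getbinuptime() behaves. The helper names, the 1000 us period and the 300 us latency are assumptions chosen for illustration. If the start phase of an I/O is uniformly distributed over the tick period, the mean measured latency comes out right; if the start phase is confined to part of the period (aliasing), the error no longer averages out.

```c
#include <stdint.h>

/*
 * Time as a tick-updated clock would report it: rounded down to the
 * last tick boundary.  period_us is a hypothetical update period.
 */
static uint64_t
coarse_time(uint64_t now_us, uint64_t period_us)
{
	return ((now_us / period_us) * period_us);
}

/*
 * Mean measured latency, in microseconds, when every start phase in
 * [phase_lo, phase_hi) within the tick period is equally likely.
 * Sweeping the phases exhaustively stands in for "many I/Os".
 */
static double
mean_measured_latency(uint64_t latency_us, uint64_t period_us,
    uint64_t phase_lo, uint64_t phase_hi)
{
	uint64_t s, sum = 0;

	for (s = phase_lo; s < phase_hi; s++)
		sum += coarse_time(s + latency_us, period_us) -
		    coarse_time(s, period_us);
	return ((double)sum / (double)(phase_hi - phase_lo));
}
```

With a 300 us request and a 1000 us period, mean_measured_latency(300, 1000, 0, 1000) gives 300.0, while restricting the start phase to the first half of the period, mean_measured_latency(300, 1000, 0, 500), gives 0.0. Whether real I/O start times are closer to the first case or the second is exactly the point under discussion.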

So yes, you can switch to getbinuptime(), but the only statistically
responsible way to do so would be to suppress latency measurements
on all I/O operations which complete in less than 5-10 timecounter
interrupts.

Sure, getbinuptime() won't let us answer how many requests completed within 0.5ms, but the present API doesn't allow calculating that anyway, since it provides only total/average times. And why "_5-10_ timecounter interrupts"?

Apart from some practical issues implementing it, the numbers
that came out would be pretty useless.

The right idea is probably to bucketize the latencies, so that
rather than having to keep track of devstat in real time to find
out, you could get a histogram at any time showing past
performance something like:

        Latency distribution:

                <5msec:      92.12 %
                <10msec:      0.17 %
                <20msec:      1.34 %
                <50msec:      6.37 %
                >50msec:      0.00 %

Doing that with getbinuptime() would be statistically defensible
provided the top bucket is "<5msec" and it would very clearly tell
people if they have I/O trouble or not, which IMO is what people
want to know.

The cost is 20 64-bit counters in struct devstat, (N|R|W|E)*5*8 = 160
bytes, but since devstat is already 288 bytes, that isn't a major
catastrophe.
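
As a rough sketch of what such bucketing could look like, assuming four operation classes and the five buckets from the example above. The names lat_hist, lat_bucket and lat_record are hypothetical and are not part of the existing devstat code.

```c
#include <stdint.h>

/* Hypothetical sizes: 4 operation classes (N|R|W|E), 5 buckets each. */
enum { LAT_OPS = 4, LAT_BUCKETS = 5 };

/* Upper bounds in milliseconds for all buckets except the last. */
static const uint32_t lat_bound_ms[LAT_BUCKETS - 1] = { 5, 10, 20, 50 };

/* 4 * 5 counters of 8 bytes each: 160 bytes in total. */
struct lat_hist {
	uint64_t count[LAT_OPS][LAT_BUCKETS];
};

/* Index of the first bucket whose upper bound exceeds the latency. */
static int
lat_bucket(uint32_t latency_ms)
{
	int i;

	for (i = 0; i < LAT_BUCKETS - 1; i++)
		if (latency_ms < lat_bound_ms[i])
			return (i);
	/* Open-ended last bucket; in this sketch it includes exactly 50 ms. */
	return (LAT_BUCKETS - 1);
}

/* Account one completed request of class op (0 .. LAT_OPS - 1). */
static void
lat_record(struct lat_hist *h, int op, uint32_t latency_ms)
{
	h->count[op][lat_bucket(latency_ms)]++;
}
```

The percentages in the example output would then be each counter divided by the sum of its row.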

I agree that such functionality could be interesting. The only worry is which buckets should be there. For modern HDDs the above buckets could be fine. For high-end SSDs it may be more about microseconds than milliseconds. I doubt that 5 buckets will be universal enough, unless they are separated by a factor of 5-10.
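
For illustration, geometrically spaced buckets could be computed along these lines. The function name geo_bucket and its parameters are hypothetical; a first bound of 10 us and a factor of 10 are just one possible choice.

```c
#include <stdint.h>

/*
 * Bucket 0 holds latencies below first_us; each later upper bound is
 * factor times the previous one; the last bucket is open-ended.
 */
static int
geo_bucket(uint64_t latency_us, uint64_t first_us, uint64_t factor,
    int nbuckets)
{
	uint64_t bound = first_us;
	int i;

	for (i = 0; i < nbuckets - 1; i++) {
		if (latency_us < bound)
			return (i);
		bound *= factor;
	}
	return (nbuckets - 1);
}
```

With first_us = 10, factor = 10 and nbuckets = 5 the upper bounds are 10 us, 100 us, 1 ms and 10 ms, so five buckets span three orders of magnitude.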

The ability to measure latency precisely should be retained, but it
could be made a sysctl-enabled debugging facility.

The %busy crap should be killed, all it does is confuse people.

I agree that it lies heavily, especially for cached writes, but at least it allows making some very basic estimates. The value has a valid explanation, and the only problem is that users misinterpret it.

--
Alexander Motin