> In article <[EMAIL PROTECTED]>,
> Dominic Marks  <[EMAIL PROTECTED]> wrote:
> > On Mon, Feb 04, 2002 at 01:21:25PM -0800, John Polstra wrote:
> > > I'm trying to understand the timecounter code, and in particular the
> > > reason for the "microuptime went backwards" messages which I see on
> > > just about every machine I have, whether running -stable or -current.
> > 
> > I see them everywhere with -CURRENT, but not at all with -STABLE. This is
> > with two separate machines. Perhaps that may add clues.
> 
> I'm looking for something less empirical than that.  When somebody
> says this problem is caused by too much interrupt latency, I assume
> they have a mental model of what is going wrong when this excessive
> latency occurs.

It's not necessarily caused by interrupt latency.  Here's the assumption 
that's being made.

There is a ring of timecounter structures, of some size.  In testing,
I've used sizes of a thousand or more, but still seen this problem.

There is a pointer to the "current" timecounter structure.

When the "current" time is updated, the following procedure is followed:

 - Find the "next" timecounter in the ring.
 - Update its contents with the new current time.
 - Move the "current" pointer.

When one wishes to read the current time, one proceeds as follows:

 - Get the "current" pointer and save it locally.
 - Read the timecounter structure via the local "current" pointer.

Since the operations on the "current" pointer are atomic, there is no 
need to lock the structure.
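
To make the two procedures above concrete, here is a minimal userland sketch of the scheme. It is not the code from kern/kern_tc.c; the names (tc_slot, RING_SIZE, ring_update, ring_read) are made up for illustration, and C11 atomics stand in for whatever the kernel relies on to make the pointer store atomic.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8                 /* hypothetical; plays the role of NTIMECOUNTER */

struct tc_slot {                    /* hypothetical stand-in for a timecounter */
    uint64_t sec;
    uint64_t frac;
    struct tc_slot *next;           /* next slot in the ring */
};

static struct tc_slot ring[RING_SIZE];
static _Atomic(struct tc_slot *) current;

/* Link the slots into a ring and publish slot 0 as "current". */
static void ring_init(void)
{
    for (int i = 0; i < RING_SIZE; i++) {
        ring[i].sec = 0;
        ring[i].frac = 0;
        ring[i].next = &ring[(i + 1) % RING_SIZE];
    }
    atomic_store(&current, &ring[0]);
}

/* Updater: find the next slot, fill it in, then move the pointer. */
static void ring_update(uint64_t sec, uint64_t frac)
{
    struct tc_slot *nxt = atomic_load(&current)->next;

    nxt->sec = sec;
    nxt->frac = frac;
    atomic_store(&current, nxt);    /* publish only after the contents are written */
}

/* Reader: save the pointer locally once, then read through the local copy. */
static void ring_read(uint64_t *sec, uint64_t *frac)
{
    struct tc_slot *tc = atomic_load(&current);

    *sec = tc->sec;
    *frac = tc->frac;
}
```

The reader never takes a lock; it depends entirely on the slot it picked not being rewritten before it has finished reading it.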

There are a couple of possible problems with this mechanism.

One is that the ring "catches up" with your saved copy of the
"current" pointer, i.e. in between fetching the pointer and reading the
timecounter contents, the updater goes all the way around the ring and
rewrites the structure you are still reading, so that you get garbage
out of it.
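
The "catches up" case can be simulated in a single thread. The sketch below is only an illustration with hypothetical names: the reader picks a slot, reads half of it, "stalls" while a given number of updates happen, and then reads the other half. Every update writes the same value to both fields, so a mismatch means the read was torn.

```c
#include <stdint.h>

#define RING_SIZE 4                 /* deliberately tiny so the ring laps quickly */

struct tc_slot {                    /* hypothetical stand-in for a timecounter */
    uint64_t sec;
    uint64_t usec;
};

static struct tc_slot ring[RING_SIZE];
static unsigned cur_idx;            /* index of the "current" slot */
static uint64_t tick;               /* value written by each update */

/* Updater: fill the next slot, then advance "current". */
static void ring_update(void)
{
    unsigned nxt = (cur_idx + 1) % RING_SIZE;

    tick++;
    ring[nxt].sec = tick;
    ring[nxt].usec = tick;
    cur_idx = nxt;
}

/*
 * Reader that is delayed in the middle of its read while
 * updates_during_stall updates take place.
 * Returns 1 if the two halves of the read disagree.
 */
static int stalled_read_is_torn(unsigned updates_during_stall)
{
    ring_update();                              /* make sure the slot holds data */

    struct tc_slot *tc = &ring[cur_idx];        /* saved "current" pointer */
    uint64_t sec = tc->sec;                     /* first half of the read */

    for (unsigned i = 0; i < updates_during_stall; i++)
        ring_update();                          /* updater runs during the stall */

    uint64_t usec = tc->usec;                   /* second half of the read */
    return sec != usec;
}
```

With RING_SIZE - 1 updates during the stall the saved slot is untouched; with RING_SIZE updates it has been rewritten and the read is torn. That is why a very large ring should make this failure mode effectively unreachable.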

Another is that there is a race between multiple updaters of the
timecounter; if two parties are both updating the "next" timecounter
along with another party trying to get the "current" time, this could
cause corruption.

All that interrupt latency will do is make the updates late; I can't
actually see how it could cause corruption.  Corruption has to be
caused by mishandling of the timecounter ring in some fashion.

Note that you can probably eliminate the ring loop theory by
allocating a very large number of entries in the ring by setting
NTIMECOUNTER (kern/kern_tc.c) higher.  The structures are small; try
100,000 or so.

If you can still reproduce it under these circumstances, try adding
some checks to make sure the "current" timecounter pointer is behaving
monotonically; just save the last timecounter pointer in microtime()
et al.
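
A check of that kind might look roughly like the sketch below. It is written as plain userland C with hypothetical names (tc_check, tc_check_read); in the kernel the same logic would sit inside microtime() and friends and would print both pointers when it trips. It is not safe against concurrent callers as written, so treat it as a debugging aid only.

```c
#include <stddef.h>
#include <stdint.h>

struct tc_slot {                    /* hypothetical stand-in for a timecounter */
    uint64_t sec;
    uint64_t usec;
};

struct tc_check {                   /* state remembered between reads */
    const struct tc_slot *last_tc;  /* pointer used by the previous read */
    uint64_t last_sec;
    uint64_t last_usec;
    unsigned long backwards;        /* how many times time went backwards */
};

/*
 * Read the time through tc and compare it with the previous read.
 * Returns 1 if this read is earlier than the previous one.
 */
static int tc_check_read(struct tc_check *chk, const struct tc_slot *tc)
{
    uint64_t sec = tc->sec;
    uint64_t usec = tc->usec;
    int went_back = chk->last_tc != NULL &&
        (sec < chk->last_sec ||
         (sec == chk->last_sec && usec < chk->last_usec));

    if (went_back)
        chk->backwards++;           /* kernel version: log tc and chk->last_tc here */

    chk->last_tc = tc;
    chk->last_sec = sec;
    chk->last_usec = usec;
    return went_back;
}
```

Keeping the previous pointer alongside the previous time lets you tell whether a backwards step came from a different slot in the ring or from the same slot being read twice.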

Another test worth performing is to look at the tco_delta function for
the timecounter and make sure that it returns a sane value, and one
that doesn't behave out of synch with the interrupt handler that updates
the timecounter proper.  If you save the delta value in the timecounter 
and zero it when it's updated, you can catch this.
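
One way to do that is sketched below. The structure and function names are hypothetical and only loosely modelled on the idea of a masked counter delta; the limit passed to the updater is an assumption you would have to derive from the counter frequency and the update interval.

```c
#include <stdint.h>

struct tc_slot {                    /* hypothetical stand-in for a timecounter */
    uint32_t offset_count;          /* hardware count captured at the last update */
    uint32_t counter_mask;          /* valid bits of the hardware counter */
    uint32_t max_delta_seen;        /* debugging field: largest delta since update */
};

/* Reader side: compute the delta and remember the largest one seen. */
static uint32_t tc_delta(struct tc_slot *tc, uint32_t hw_count)
{
    uint32_t delta = (hw_count - tc->offset_count) & tc->counter_mask;

    if (delta > tc->max_delta_seen)
        tc->max_delta_seen = delta;
    return delta;
}

/*
 * Updater side: report whether any delta since the last update
 * exceeded limit, then zero the saved delta and take a new offset.
 * Returns 1 if a suspicious delta was seen.
 */
static int tc_update(struct tc_slot *tc, uint32_t hw_count, uint32_t limit)
{
    int suspicious = tc->max_delta_seen > limit;

    tc->max_delta_seen = 0;
    tc->offset_count = hw_count;
    return suspicious;
}
```

A hardware count that appears to be behind the saved offset shows up as a huge masked delta, which is exactly the kind of value this check is meant to catch.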

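You can rule out a bad delta by using getmicrouptime() rather than
microuptime(), since it returns the time recorded at the last update
without reading the hardware counter; it may return the same value
twice, which isn't desirable, but that would be better than nothing.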

Hope this helps a bit.

Regards,
Mike
