"J.C. Roberts" <[EMAIL PROTECTED]> writes:

> Since I don't have the skill to fix it myself, my goal is simply to
> figure out when "The" alpha bug entered the tree. If I can just figure
> out the `when' hopefully someone a lot smarter than me can figure out
> the `what' of the problem. Basically I'm going to turn loose a half
> dozen alpha systems compiling various versions of OpenBSD until I find
> where the bug stops occurring.

Good luck. I spent two months doing this (out of 8-9 months of chasing
the bug). What will you change? gcc?  binutils? libc used to build
gcc/binutils? /usr/bin/config?

The bug isn't necessarily in the kernel code.

> As far as I can tell, the bug smells like a race condition of some sort
> and if my wild guess is correct, it will be difficult to reproduce
> consistently. With some (but not all) race conditions, you can increase
> the chance of triggering them by increasing loads. Since I want the race
> condition to occur, what is the best way to stress the systems while
> also doing make build?

Good luck. I never found a reliable way to reproduce it. Sometimes it
showed up seconds after boot, sometimes after a few weeks uptime.

> http://www.holm.cc/stress/
> http://www.openbsd.org/cgi-bin/cvsweb/ports/sysutils/stress/

Stress tests never increased the probability of the bug popping up.
It often popped up when the machine was completely idle.

> I simply don't know and I'm only guessing but the prime suspects for
> where the race might live seem to be physical memory management,
> PAL/interrupt handling or even the scheduler. 

Yawn.

> Are there better ways to stress the system?
> Are there better ways to increase the odds of a race occurring?

No.

> Since I needed to find a starting point, I went searching and reading
> through the archives of misc@, tech@, alpha@ and bugs@ even the netbsd
> archives in hopes of finding a "patient zero" where the bug was first
> reported. I found something interesting, namely a (more than once)
> reported bug that looks very similar to "The" alpha bug. The primary
> difference is you get "cpu_switch_queuescan" rather than "cpu_switch" in
> the trace output.

cpu_switch is just where it shows most often nowadays. When I debugged it,
it was all over the place. Any debug printf I added to detect the condition
that caused the crash just moved the bug to another place.

> 2003-10-01 21:40:00
> http://marc.theaimsgroup.com/?l=openbsd-alpha&m=106504464724168&w=2
> 
> 2003-08-03 12:00:14
> http://marc.theaimsgroup.com/?l=openbsd-alpha&m=105999853009839&w=2

It was definitely happening before that. At least since summer 2002, or
even earlier.

> From other bug reports in the archive I know 3.8, 3.7 and 3.6 are all
> affected by "The" alpha bug. If my hunch is correct and the bugs linked
> above are related to "The" alpha bug, then I should start the
> compile-a-thon at OpenBSD v3.3 and work backwards.

Good luck. Since there is no way to reproduce the problem, there is also no
way to know that you have successfully found the bug unless you run
every compile for at least a few weeks under normal load.
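
To put rough numbers on that cost, here is a minimal sketch. It assumes
crashes on a buggy build arrive at random with some average uptime (a
Poisson model, which is an assumption for illustration and not something
measured in this thread). The function names and all the input numbers
are hypothetical.

```python
import math

def idle_run_needed(mean_days_between_crashes, confidence):
    """Days a build must run crash-free before calling it clean with the
    given confidence, assuming crashes follow a Poisson process."""
    # P(no crash in t days | build is buggy) = exp(-t / mean).
    # Solve exp(-t / mean) = 1 - confidence for t.
    return -mean_days_between_crashes * math.log(1.0 - confidence)

def bisect_steps(candidate_builds):
    """Number of builds a binary search must test to isolate one change."""
    return math.ceil(math.log2(candidate_builds))

# Hypothetical inputs, not measurements from the thread.
mean_days = 7.0    # assumed average uptime before a buggy build crashes
confidence = 0.95  # how sure we want to be that a quiet build is clean
builds = 64        # assumed number of candidate source snapshots

days_per_build = idle_run_needed(mean_days, confidence)
steps = bisect_steps(builds)
print(f"{days_per_build:.1f} days per build, {steps} builds to test, "
      f"{days_per_build * steps:.0f} days if tested one after another")
```

Under these assumptions each "good" verdict takes about three times the
average time to crash, and a single wrong verdict sends the whole search
down the wrong half.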

//art
