Re: My problems with stability on -current

Doug Barton Mon, 09 May 2011 19:15:06 -0700

New symptom: today (still running r221566) I compiled a small port, and that
worked without any freezes or interactivity problems. Then I tried
compiling a larger port (java/openjdk6, if anyone cares) and still saw no
interactivity problems, but I got the "system wedge requiring power
cycle" problem that I was seeing previously and had tracked to the one-shot
timer update.


More below.

On 05/07/2011 02:43, Alexander Motin wrote:

Doug Barton wrote:

On 05/05/2011 13:55, Alexander Motin wrote:

I see several possibly unrelated problems there:
   - crashes are always crashes. They should be debugged.
   - calcru going backwards could have the same roots as lost wall clock
time.


I think you're right about that. What usually happens when the load
maxes out is that the system visibly freezes for a minute or 2, and when
it comes back to life the log is flooded with calcru messages. If it
stays up long enough after that the wall clock drift becomes noticeable.
This is in spite of running ntpd.


These system freezes are very suspicious. Most time counters need only a
few seconds to overflow, some even less, so a freeze of a few minutes will
easily overflow most of them. The freezes are therefore probably the cause of
the time problems, but the question now is what causes the freezes. You
should try to investigate what is going on during the freezes. Does the
system do anything, are any interrupts working (`vmstat -i` just
before and after), are there any interrupt storms, etc.?


Here is the output on a mostly-idle system, shortly after reboot:

vmstat -i
interrupt                          total       rate
irq1: atkbd0                        1784          0
irq9: acpi0                            1          0
irq14: ata0                       213355         89
irq15: ata1                           58          0
irq17: wpi0                        74331         31
irq20: hpet0 uhci0+               787767        331
irq22: uhci2                       21453          9
irq256: hdac0                         11          0
Total                            1098760        462

At a more opportune time I'll try crashing it again and get another result.
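
One way to make the before/after `vmstat -i` comparison easier to read is to
save both snapshots to files and subtract the totals per interrupt source.
The following is only a minimal sketch: the helper name `irq_delta` and the
file names are made up for illustration and are not part of any FreeBSD tool.

```shell
# irq_delta BEFORE AFTER
# Print, for each interrupt source, how many interrupts arrived between
# two saved `vmstat -i` snapshots. Hypothetical helper for illustration.
irq_delta() {
    awk '
        # Skip blank lines, the header line and the trailing "Total" line.
        NF < 3 { next }
        $1 == "interrupt" || $1 == "Total" { next }
        {
            # The last two fields are total and rate; everything before
            # them is the interrupt name (it may contain spaces).
            name = $1
            for (i = 2; i <= NF - 2; i++) name = name " " $i
            total = $(NF - 1)
            if (FNR == NR) {
                before[name] = total          # first file: remember count
            } else {
                printf "%s %d\n", name, total - before[name]
            }
        }
    ' "$1" "$2"
}

# Example use on the affected machine:
#   vmstat -i > before.txt
#   ... reproduce the freeze ...
#   vmstat -i > after.txt
#   irq_delta before.txt after.txt
```

A timer source whose count barely moves across a freeze would be worth a
closer look, though that alone does not identify the cause.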

If there are some problems with timer interrupts, timecounters
could wrap unnoticed, which will cause random time jumps.
   - interactivity problems. I can't prove it is unrelated, but have no
real ideas now.

I would start with the most obvious problems. I need to know more about
the crashes. As usual: how to trigger them, stack backtraces, etc.


Triggering is easy: I can start a buildworld with -j2 and a build of
ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
will reboot. I posted a panic message relative to r220282 (-current
archives, 4/4), but kib said it didn't make any sense. Usually I don't
get a panic at all.


Could you point me to the thread?


Go to http://www.FreeBSD.org/
Click 'mailing lists'
Click 'listed in the FreeBSD Handbook.'
Click freebsd-current
Click freebsd-current Archives
Click April 2011
search for r220282
Voila! :)

As for the time problems, I would try to collect more data:
   - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
dmesg outputs;


http://people.freebsd.org/~dougb/dougb-current-r221566.txt

   - which eventtimer is used now, and does it help to switch to another
one with the kern.eventtimer.timer sysctl?


When I was trying to track down the problems last summer I vaguely
remember trying RTC, but eventually we realized that the real problem
was throttling, so I stopped specifying RTC and let it go back to the
default. What do you suggest I try?


As I see, you are now using HPET (chosen automatically). I would try
switching to the LAPIC. Just make sure to disable C-states if you have
enabled them, to be sure that the LAPIC timer won't stop.


Ok, so kern.eventtimer.timer="LAPIC" in /boot/loader.conf should do
that, right?
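
Before forcing a particular event timer, it may help to confirm that the
kernel actually offers it. The sketch below assumes the
`kern.eventtimer.choice` sysctl lists the available timers as
`NAME(quality)` pairs; the helper name `timer_available` and the sample
choice string in the comments are illustrative only, not taken from this
machine.

```shell
# timer_available NAME CHOICES
# Succeed if NAME appears in a kern.eventtimer.choice style list such as
# "HPET(450) LAPIC(400) i8254(100) RTC(0)" (sample values, assumed format).
timer_available() {
    for entry in $2; do
        # Strip the "(quality)" suffix and compare the bare timer name.
        [ "${entry%%\(*}" = "$1" ] && return 0
    done
    return 1
}

# Example use on the affected machine:
#   if timer_available LAPIC "$(sysctl -n kern.eventtimer.choice)"; then
#       echo 'kern.eventtimer.timer="LAPIC"' >> /boot/loader.conf
#   fi
```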

I don't use C-states (in part as a result of previous investigation), but
I do use powerd as follows:

powerd_flags="-a adaptive -b adaptive -n adaptive"

   - does the timer run in periodic or one-shot mode, and does it help to
switch to the other one?


How could I tell, and how would I switch?


`sysctl kern.eventtimer.periodic`.


kern.eventtimer.periodic: 0

And read eventtimers(4) please.


I did that, but I don't see anything in there as to which choice is
one-shot, and how to change to periodic. I assume 0 is the default,
which I also assume is one-shot. Does setting that to 1 change to
periodic? Also, can I safely do this while the system is running, or
should it be in /boot/loader.conf as well?
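
As a rough illustration of the question above, the sketch below maps the
value of `kern.eventtimer.periodic` to a mode name. It assumes that 0 means
one-shot and 1 means periodic, which matches the reading given here but
should be checked against eventtimers(4) on the system in question. The
helper name `eventtimer_mode` is made up.

```shell
# eventtimer_mode VALUE
# Translate a kern.eventtimer.periodic value into a mode name.
# Assumption: 0 = one-shot, 1 = periodic.
eventtimer_mode() {
    case "$1" in
        0) echo "one-shot" ;;
        1) echo "periodic" ;;
        *) echo "unknown" ;;
    esac
}

# Example use on the affected machine:
#   eventtimer_mode "$(sysctl -n kern.eventtimer.periodic)"
# Trying the other mode, assuming the sysctl is writable at runtime:
#   sysctl kern.eventtimer.periodic=1
```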

   - if full CPU load makes time stop, try to track what is going on
with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
CPU load in one-shot mode you should have a stable timer interrupt rate
of about hz+stathz.


Ok, I'll do that tomorrow, tired now.
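
The hz+stathz check can be done with a little arithmetic. The sketch below
assumes `sysctl -n kern.clockrate` prints a value of the form
`{ hz = ..., tick = ..., profhz = ..., stathz = ... }`; the helper names
and the numbers in the comments are illustrative, not measurements from
this machine.

```shell
# expected_rate CLOCKRATE
# Print hz + stathz from a kern.clockrate style string.
expected_rate() {
    echo "$1" | awk '{
        for (i = 1; i <= NF; i++) {
            # Each value follows its name and an "=" sign.
            if ($i == "hz")     hz     = $(i + 2)
            if ($i == "stathz") stathz = $(i + 2)
        }
        print hz + stathz
    }'
}

# observed_rate COUNT_BEFORE COUNT_AFTER SECONDS
# Print the average interrupts per second between two counter readings.
observed_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Example use on the affected machine:
#   expected_rate "$(sysctl -n kern.clockrate)"
#   observed_rate <timer total before> <timer total after> <seconds elapsed>
```

The two counter readings would come from the timer line of `vmstat -i`,
taken a known number of seconds apart while the machine is under full load.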

   - if timer interrupts are not working well, you can build a kernel with
options                KTR
options                ALQ
options                KTR_ALQ
options                KTR_COMPILE=(KTR_SPARE2)
options                KTR_ENTRIES=131072
options                KTR_MASK=(KTR_SPARE2)
to track event timer operation, and use ktrdump to save the trace when
the problem exists (preferably when it begins).

And let's experiment with fresh CURRENT.


Done and done. I'm up to r221566, and I added those options to my kernel
config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
http://people.freebsd.org/~dougb/ktrdumpfile.txt  This was shortly after
boot, with no load. Not sure if it helps, but there you go.


The dump looks fine, but I need a dump specifically from the time of the
problem. Since time probably can't be trusted here, it would be
nice to make the dump as localized as possible: clear the buffer with `sysctl
debug.ktr.clear=1`, trigger a freeze for a few seconds, stop collecting with
`sysctl debug.ktr.mask=0` and do the dump.


Ok, I'll give that a try after work.
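
The localized capture described above could be wrapped in a small script so
the steps run back to back. This is only a sketch: the function name
`ktr_capture`, the `DRY_RUN` switch and the default window length are
invented for illustration, while the sysctl and ktrdump invocations are the
ones quoted in this thread.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical switch).
_run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# ktr_capture OUTFILE [SECONDS]
# Clear the KTR buffer, wait while the freeze is reproduced, stop tracing
# and save the trace to OUTFILE.
ktr_capture() {
    out=$1
    secs=${2:-5}                        # capture window, assumed default
    _run sysctl debug.ktr.clear=1       # empty the trace buffer
    _run sleep "$secs"                  # reproduce the freeze in this window
    _run sysctl debug.ktr.mask=0        # stop collecting new entries
    _run ktrdump -cH -o "$out"          # write the trace to a file
}

# Example use on the affected machine (as root):
#   ktr_capture ktrdumpfile 10
```

Running it once with `DRY_RUN=1` prints the commands without touching the
system, which is a cheap way to check the sequence first.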


Thanks,

Doug

--

        Nothin' ever doesn't change, but nothin' changes much.
                        -- OK Go

        Breadth of IT experience, and depth of knowledge in the DNS.
        Yours for the right price.  :)  http://SupersetSolutions.com/

