On Sat, 17 Jul 2010 14:35, Markus Gebert wrote:
In Message-Id: <f744f475-3d2b-4bc6-856a-a5d302aa8...@hostpoint.ch>
On 13.07.2010, at 16:02, Markus Gebert wrote:
Unfortunately, I have not been able to get anything useful out of the svn
commit logs that could explain this. Maybe someone else has an idea
what could have changed between 7 and 8 to break it, and again between
8 and CURRENT to magically fix it again.
I tracked this down further. I couldn't easily downgrade my 8.1
installation to see when the problem was introduced because the zpool
version used is 14. So I tried to figure out instead when the problem was
solved in CURRENT.
I started with the first possible revision that can boot off my v14 pool
(r201143, Dec 28, zfs v14 commit). With this revision, I was able to
trigger the MCE.
Then I took some later revision (r206010, Apr 1, chosen randomly), and
I couldn't reproduce the problem. I kept narrowing the revisions down
until I found that, while I'm still able to trigger the MCE on r202386,
r202387 seems to solve the problem on CURRENT:
http://svn.freebsd.org/viewvc/base?view=revision&revision=202387
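
The narrowing-down described above amounts to a binary search between a
revision that still triggers the MCE and one that does not. A minimal
sketch in Python, assuming the behaviour flips exactly once in the range;
first_fixed_revision and the reproduces callback are hypothetical names,
and in practice each probe means building, booting and testing a kernel
by hand:

```python
from typing import Callable


def first_fixed_revision(last_bad: int, first_good: int,
                         reproduces: Callable[[int], bool]) -> int:
    """Return the lowest revision at which the problem no longer reproduces.

    last_bad   -- a revision known to still trigger the problem
    first_good -- a later revision known not to trigger it
    reproduces -- hypothetical callback: True if the given revision
                  still shows the problem
    Assumes every revision before the fix reproduces the problem and
    every revision from the fix onwards does not.
    """
    if last_bad >= first_good:
        raise ValueError("last_bad must be older than first_good")
    while first_good - last_bad > 1:
        mid = (last_bad + first_good) // 2
        if reproduces(mid):
            last_bad = mid      # still broken: the fix is later
        else:
            first_good = mid    # already fixed: the fix is here or earlier
    return first_good


# Simulated check using the revisions from this thread: the MCE is
# reproducible up to and including r202386.
print(first_fixed_revision(201143, 206010, lambda rev: rev <= 202386))
```

With the two end points from this thread (r201143 and r206010) that is
about 13 kernels to build and test instead of several thousand.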
Since John Baldwin mentioned this problem could be timing related, it
seems reasonable that a clock-related change could fix it. But this
commit seems to have been MFC'd to 8-STABLE and 8.1 (at least as far as
I can tell) along with some other changes to amd64-specific code. I
thought that maybe these other MFC'd changes could have
reintroduced the problem later on, but so far I could not reproduce the
problem with newer CURRENT revisions. So I actually nailed this one
down to a single commit on CURRENT, but still cannot tell what the
actual difference is compared to 8-STABLE/8.1.
Any ideas how to proceed?
Adding to this, I remembered some specific commits that caught my attention
when they happened. Specifically, they were to mca.c ("locate mca" on my
machine provided the file paths, and "svn log" provided the commit logs).
When you mentioned April and I saw the log, it rang a bell.
These may be of interest to you:
------------------------------------------------------------------------
r210079 | jhb | 2010-07-14 17:10:14 -0400 (Wed, 14 Jul 2010) | 13 lines
MFC 208507,208556,208621:
Add support for corrected machine check interrupts. CMCI is a new local
APIC interrupt that fires when a threshold of corrected machine check
events is reached. CMCI also includes a count of events when reporting
corrected errors in the bank's status register. Note that individual
banks may or may not support CMCI. If they do, each bank includes its own
threshold register that determines when the interrupt fires. Currently
the code uses a very simple strategy where it doubles the threshold on
each interrupt until it succeeds in throttling the interrupt to occur only
once a minute (this interval can be tuned via sysctl). The threshold is
also adjusted on each hourly poll which will lower the threshold once
events stop occurring.
------------------------------------------------------------------------
r206183 | alc | 2010-04-05 12:11:42 -0400 (Mon, 05 Apr 2010) | 6 lines
MFC r204907, r204913, r205402, r205573, r205573
Implement AMD's recommended workaround for Erratum 383 on Family 10h
processors.
Enable machine check exceptions by default.
------------------------------------------------------------------------
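
For what it's worth, the throttling strategy described in the r210079 log
above can be sketched roughly as follows. This is only a toy model in
Python, not the kernel code: the class and method names, the 15-bit
threshold cap and the halving on a quiet poll are assumptions made for
illustration. The log itself only says that the threshold is doubled on
each interrupt until interrupts occur no more than once a minute, and
that the hourly poll lowers it once events stop occurring.

```python
class CmciThrottle:
    """Toy model of a per-bank CMCI threshold (all names hypothetical)."""

    def __init__(self, min_interval: float = 60.0,
                 max_threshold: int = 0x7FFF) -> None:
        self.min_interval = min_interval    # seconds; tunable in the real code
        self.max_threshold = max_threshold  # assumed cap on the threshold
        self.threshold = 1
        self.last_interrupt = None

    def on_interrupt(self, now: float) -> int:
        """Double the threshold if interrupts arrive too close together."""
        too_soon = (self.last_interrupt is not None
                    and now - self.last_interrupt < self.min_interval)
        if too_soon:
            self.threshold = min(self.threshold * 2, self.max_threshold)
        self.last_interrupt = now
        return self.threshold

    def on_poll(self, events_since_last_poll: int) -> int:
        """Lower the threshold again once events have stopped (assumed: halve)."""
        if events_since_last_poll == 0:
            self.threshold = max(1, self.threshold // 2)
        return self.threshold


bank = CmciThrottle()
for t in (0.0, 10.0, 20.0, 100.0):
    print(t, bank.on_interrupt(t))
```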
And a list of the *mca.c files within the stable/8 src tree:
/usr/src/sbin/mca/mca.c
/usr/src/sys/amd64/amd64/mca.c
/usr/src/sys/dev/aha/aha_mca.c
/usr/src/sys/dev/buslogic/bt_mca.c
/usr/src/sys/dev/ep/if_ep_mca.c
/usr/src/sys/i386/i386/mca.c
/usr/src/sys/ia64/ia64/mca.c
Regards & Good luck,
--
jhell