Bjorn,

It looks like we need to revisit the patch that makes the NMI watchdog use
the 2nd counter (PERFEVTSEL1/PMC1) for architectural perfmon.
The Intel Core Duo processor has a bug (AE49): the enable bit of the 2nd
counter does not work. The Core 2 Duo/Quad processors also implement
architectural perfmon, but they do not have the bug. Furthermore, they
implement PEBS, which requires the use of the 1st counter.

I think we need to distinguish the 2 situations: for Core Duo, use the 1st
counter; for other architectural perfmon processors, use the 2nd. That means
we cannot simply rely on X86_FEATURE_ARCH_PERFMON; we also need to check the
family/model number.

Thanks.

On Tue, Aug 28, 2007 at 11:30:35AM -0700, Daniel Walker wrote:
> On Tue, 2007-08-28 at 10:05 -0700, Stephane Eranian wrote:
> > Daniel,
> >
> > On Tue, Aug 28, 2007 at 07:34:44AM -0700, Daniel Walker wrote:
> > > On Tue, 2007-08-28 at 02:12 -0700, Stephane Eranian wrote:
> > > > Daniel,
> > > >
> > > > On Mon, Aug 27, 2007 at 04:07:54PM -0700, Daniel Walker wrote:
> > > > > On Mon, 2007-08-27 at 15:55 -0700, Stephane Eranian wrote:
> > > > >
> > > > > > Yet the model name looks strange. So we need to run one more test,
> > > > > > as the fam/model is not enough. What we need to check is whether or
> > > > > > not this processor implements architectural perfmon or not.
> > > > > >
> > > > > > Could you please compile and run the attached program and send me
> > > > > > the output?
> > > > >
> > > > > The output below is all the output ..
> > > > >
> > > > > eax=0x7280201: version=1 num_cnt=2
> > > > >
> > > > Then you have a Core Duo processor and the commit from Bjorn should
> > > > fix the problem. If it does not, then there is something else wrong.
> > > > Unfortunately, I do not have a Core Duo machine to try and reproduce.
> > >
> > > There must be something else wrong, cause the problem persists .. As I
> > > said in past emails to Bjorn, I tested his commit in git, as well as the
> > > latest git all with the same issue (as well as bisecting git)..
> > >
> > > If the hardware is buggy then we need some way to determine that..
> > >
> > Could you instrument check_nmi_watchdog() to verify that you terminate
> > this function? Normally there is a safety mechanism in there.
> >
> > Another possibility is that you get flooded with NMI interrupts and
> > do not make forward progress.
>
> Here is some output from my boot log,
>
> md: raid1 personality registered for level 1
> device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: [EMAIL PROTECTED]
> ip_tables: (C) 2000-2006 Netfilter Core Team
> TCP cubic registered
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> Testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
> CPU#1: NMI appears to be stuck (0->0)!
> CPU#2: NMI appears to be stuck (0->0)!
> CPU#3: NMI appears to be stuck (0->0)!
> Starting balanced_irq
> <hangs>
>
> So it does appear to make it out of the check_nmi_watchdog() function
> even tho there is a problem with the watchdog .. "Starting balance_irq"
> is in balanced_irq_init(). It usually hangs at this spot , but with
> initcall_debug on it once hung a little further down ..
>
> It seems like there might be some other interrupt going off too early,
> that causes the system to hang..
>
> As you can see from the log above, this system is a quad.. Two physical
> cpus each with two cores.
>
> Daniel

--
-Stephane
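
P.S. For reference, here is a rough userspace sketch of the check described at the top of this mail. It is not the actual patch. It decodes the EAX value of CPUID leaf 0xA the same way the test program output above does (version in bits 7:0, number of counters in bits 15:8), then picks a counter from family/model. The struct and function names are made up for illustration. Identifying Core Duo as family 6, model 14 is my assumption and should be checked against the specification update, and falling back to the 1st counter when fewer than two counters are reported is likewise only a guess at sensible behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of CPUID leaf 0xA EAX that matter here. */
struct arch_perfmon_info {
	unsigned int version;	/* bits 7:0, 0 = no architectural perfmon */
	unsigned int num_cnt;	/* bits 15:8, counters per logical CPU */
};

/* Hypothetical helper: split the raw EAX value into its fields. */
static struct arch_perfmon_info decode_cpuid_0xa_eax(uint32_t eax)
{
	struct arch_perfmon_info info;

	info.version = eax & 0xff;
	info.num_cnt = (eax >> 8) & 0xff;
	return info;
}

/*
 * Hypothetical helper: choose the counter for the NMI watchdog.
 * Returns 0 for PERFEVTSEL0/PMC0, 1 for PERFEVTSEL1/PMC1,
 * or -1 if architectural perfmon is not available.
 */
static int nmi_watchdog_counter(uint32_t cpuid_0xa_eax,
				unsigned int family, unsigned int model)
{
	struct arch_perfmon_info info = decode_cpuid_0xa_eax(cpuid_0xa_eax);

	if (info.version == 0 || info.num_cnt == 0)
		return -1;

	/* ASSUMPTION: Core Duo is family 6, model 14 (bug AE49 applies). */
	if (family == 6 && model == 14)
		return 0;

	/* ASSUMPTION: with a single counter, fall back to the 1st one. */
	if (info.num_cnt < 2)
		return 0;

	/* Everything else: leave the 1st counter free for PEBS. */
	return 1;
}

/* Sanity check using the EAX value reported earlier in this thread. */
static void self_check(void)
{
	struct arch_perfmon_info info = decode_cpuid_0xa_eax(0x7280201);

	assert(info.version == 1);
	assert(info.num_cnt == 2);
	assert(nmi_watchdog_counter(0x7280201, 6, 14) == 0);
	assert(nmi_watchdog_counter(0x7280201, 6, 15) == 1);
}
```

In the kernel, the equivalent test would presumably look at boot_cpu_data in addition to X86_FEATURE_ARCH_PERFMON rather than re-reading CPUID, but the decision logic would be the same.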