On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostik...@gmail.com> wrote:
> On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostik...@gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostik...@gmail.com>
> > > > wrote:
> > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > assert.
> > > > >
> > > > > Also might be the output of the
> > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > >
> > > >
> > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > > what happens.  Here is what your command shows on the server that
> > > > panicked:
> > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > >
> > > It probably explains it, but it would be more telling if you left the
> > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > >
> >
> > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > don't.
> >
> > >
> > > I suspect that your machine has two sockets, and processor in one socket
> > > has CPUs reporting MCG_CMCI_P, while other processor does not.  Your SMP
> > > is not quite symmetric, perhaps processors were from different bins?
>

I found 2 other servers that exhibit the same problem: the first 16 cores
have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.
I have other examples of X11DPU motherboards that don't exhibit the problem,
but they all have both different CPUs and different BIOS revisions.  So I
can't be sure whether the bug follows the CPU model or the BIOS version.

> > >
> >
> > Could be.  Is there some MSR that reports a more specific version number?
> There are CPUID %eax=1 values returned in %eax, but then it requires
> some interpretation.
> # cpucontrol -i 1 /dev/cpuctl$x
> for $x iterating over the cpus.
>

Apart from the Local APIC ID field, that returns the same value for all
processors.

Your second patch doesn't cause any obvious problems on my dev system.

_______________________________________________
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
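
For anyone following along, here is a rough sh sketch of how the two kinds
of values discussed above can be read.  It assumes the usual layout of
MCG_CAP (bank count in bits 7:0, MCG_CMCI_P in bit 10) and the usual
family/model/stepping encoding of CPUID leaf 1 %eax.  The function names
mcg_cap_decode and cpuid1_decode are made up for this example and are not
part of cpucontrol(8) or the kernel, and 0x50654 below is only a sample
signature, not a value reported in this thread.

```shell
#!/bin/sh
# Sketch only, not from the thread: helper names are hypothetical.

# mcg_cap_decode LOW32
# LOW32 is the second word printed by "cpucontrol -m 0x179".
mcg_cap_decode() {
    val=$(($1))
    banks=$(( $val & 0xff ))          # bits 7:0: number of MC banks
    cmci=$(( ($val >> 10) & 1 ))      # bit 10: MCG_CMCI_P
    echo "banks=$banks cmci_p=$cmci"
}

# cpuid1_decode EAX
# EAX is the %eax value returned by CPUID leaf 1.
cpuid1_decode() {
    eax=$(($1))
    stepping=$(( $eax & 0xf ))
    model=$(( ($eax >> 4) & 0xf ))
    family=$(( ($eax >> 8) & 0xf ))
    ext_model=$(( ($eax >> 16) & 0xf ))
    ext_family=$(( ($eax >> 20) & 0xff ))
    # The extended model only applies when the base family is 6 or 15.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(( $model + ($ext_model << 4) ))
    fi
    # The extended family only applies when the base family is 15.
    if [ "$family" -eq 15 ]; then
        family=$(( $family + $ext_family ))
    fi
    echo "family=$family model=$model stepping=$stepping"
}

# The two MCG_CAP values seen on the server that panicked.
mcg_cap_decode 0x0f000c14
mcg_cap_decode 0x0f000814
```

Run as is, this prints "banks=20 cmci_p=1" and then "banks=20 cmci_p=0",
which matches the observation that only the first 16 CPUs report bit 10.
cpuid1_decode can be fed the %eax word from "cpucontrol -i 1" in the same
way.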