Hello,
Recently I installed Debian Linux 6 (Squeeze, kernel 2.6.32-5-amd64 #1
SMP) via netinst on an IBM eServer platform. The system has dual AMD
Opteron processors.
While transferring a large amount of data from the old server that this
machine is meant to replace, I noticed error messages appearing in my
ssh sessions every few minutes (4 to 9 minutes apart, going by the
syslog excerpt at the end of this message):
Message from syslogd@jupiter at Jul 24 07:30:07 ...
kernel:[43618.440106] Northbridge Error, node 0
Message from syslogd@jupiter at Jul 24 07:30:07 ...
kernel:[43618.440304] Invalid GART PTE entry during table walk.
The errors appeared regularly, and seemingly only during very large data
transfers across the network. As soon as the file transfers (using
rsync) completed, the errors stopped. The messages showed up in every
ssh session I had open to that server.
After some searching, I found a Linux kernel patch from Borislav Petkov
at AMD in which the exact error message is listed.
I also searched the Debian lists and found bug report 600487, but that
one seems related to X, which I don't use on this particular machine.
Besides, the symptoms I see are triggered by data transfers over the
network interface.
The following document from AMD gave me the best information, but it
does not explain why the errors show up in the ssh sessions, much less
why they appear during bulk data transfers. AMD states that these
messages should be suppressed.
http://support.amd.com/us/Processor_TechDocs/26094.PDF
On page 333 I read:
------------------------------
12.10.1 GART Table Walk Error Reporting
This error is typically caused by a software graphics driver that
improperly reserves or allocates aperture pages in the GART, resulting
in benign visual artifacts which are often undetected on other
platforms. Setting MC4_CTL[10] allows software developers to debug this
error; the resulting benign machine check errors can, however, confuse
an end user. For this reason, AMD recommends that the BIOS developers
disable this function by setting bit 10 of MC4_CTL_MASK register (MSR
C001_0048h) to a value of 1. This bit must be set before MC4_CTL[10]
bit is set. AMD also recommends adding a setup option to the BIOS setup
menu. The following should be displayed in the setup option:
Gart Table Walk Error MC reporting: Disabled/Enabled.
The default setting is disabled. The device driver developer may enable
this function for implementation and testing purposes. Also, a help
message should be added with this setup option. An example of the help
message is:
This option should remain disabled for normal operation.
-----------------------------------
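To see whether the BIOS actually set the mask bit the AMD text describes,
the register could be read from Linux. What follows is only a rough
sketch, not something I have run on this machine. It assumes the msr
kernel module is loaded (modprobe msr) so that /dev/cpu/N/msr exists, and
that it is run as root. The MSR address and bit number are taken from the
quoted AMD section; the function names are made up for illustration.

```python
import os
import struct

# Values quoted from section 12.10.1 of the AMD document above.
MC4_CTL_MASK = 0xC0010048      # MSR C001_0048h
GART_TABLE_WALK_BIT = 10       # bit 10 masks GART table walk error reporting


def read_msr(address, cpu=0):
    """Read one 64-bit MSR via the Linux msr driver.

    Assumption: /dev/cpu/<cpu>/msr is present (msr module loaded) and
    the caller is root. The driver uses the file offset as MSR number.
    """
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def gart_reporting_masked(mask_value):
    """Return True if bit 10 of an MC4_CTL_MASK value is set."""
    return bool((mask_value >> GART_TABLE_WALK_BIT) & 1)
```

Usage would be something like gart_reporting_masked(read_msr(MC4_CTL_MASK,
cpu=0)). Since each Opteron has its own northbridge, it is probably worth
checking one core on each node rather than only CPU 0.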
It doesn't seem to be a real problem to me, but does anyone here have
any further knowledge on this issue?
root@jupiter:~# lspci
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB OHCI (rev 0b)
01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB OHCI (rev 0b)
01:05.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
01:06.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)
02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
02:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
Below is the output of grep -i gart /var/log/syslog. The times listed
coincide with the data transfers I performed:
Jul 23 19:23:16 jupiter kernel: [ 0.558019] PCI-DMA: using GART IOMMU.
Jul 23 19:23:16 jupiter kernel: [ 0.572745] Linux agpgart interface v0.103
Jul 24 01:45:57 jupiter kernel: [22969.172109] Invalid GART PTE entry during table walk.
Jul 24 01:53:13 jupiter kernel: [23405.177159] Invalid GART PTE entry during table walk.
Jul 24 01:59:48 jupiter kernel: [23800.198627] Invalid GART PTE entry during table walk.
Jul 24 02:07:56 jupiter kernel: [24288.213272] Invalid GART PTE entry during table walk.
Jul 24 02:16:05 jupiter kernel: [24777.264098] Invalid GART PTE entry during table walk.
Jul 24 02:25:04 jupiter kernel: [25316.292251] Invalid GART PTE entry during table walk.
Jul 24 02:33:59 jupiter kernel: [25850.396117] Invalid GART PTE entry during table walk.
Jul 24 02:41:05 jupiter kernel: [26276.428134] Invalid GART PTE entry during table walk.
Jul 24 02:47:47 jupiter kernel: [26678.485158] Invalid GART PTE entry during table walk.
Jul 24 02:56:09 jupiter kernel: [27180.511717] Invalid GART PTE entry during table walk.
Jul 24 03:04:01 jupiter kernel: [27652.540110] Invalid GART PTE entry during table walk.
Jul 24 03:10:59 jupiter kernel: [28070.573139] Invalid GART PTE entry during table walk.
Jul 24 03:18:24 jupiter kernel: [28515.604177] Invalid GART PTE entry during table walk.
Jul 24 03:25:39 jupiter kernel: [28950.669266] Invalid GART PTE entry during table walk.
Jul 24 03:32:57 jupiter kernel: [29388.704171] Invalid GART PTE entry during table walk.
Jul 24 03:41:07 jupiter kernel: [29878.756135] Invalid GART PTE entry during table walk.
Jul 24 03:48:17 jupiter kernel: [30308.769289] Invalid GART PTE entry during table walk.
Jul 24 03:54:45 jupiter kernel: [30696.800299] Invalid GART PTE entry during table walk.
Jul 24 04:01:56 jupiter kernel: [31127.809561] Invalid GART PTE entry during table walk.
Jul 24 04:09:51 jupiter kernel: [31602.856368] Invalid GART PTE entry during table walk.
Jul 24 04:16:45 jupiter kernel: [32016.900212] Invalid GART PTE entry during table walk.
Jul 24 04:23:34 jupiter kernel: [32425.968297] Invalid GART PTE entry during table walk.
Jul 24 04:31:26 jupiter kernel: [32897.973248] Invalid GART PTE entry during table walk.
Jul 24 04:36:19 jupiter kernel: [33191.004099] Invalid GART PTE entry during table walk.
Jul 24 04:42:34 jupiter kernel: [33566.077240] Invalid GART PTE entry during table walk.
Jul 24 04:49:56 jupiter kernel: [34008.116101] Invalid GART PTE entry during table walk.
Jul 24 04:57:44 jupiter kernel: [34476.165189] Invalid GART PTE entry during table walk.
Jul 24 05:03:41 jupiter kernel: [34833.209961] Invalid GART PTE entry during table walk.
Jul 24 05:08:47 jupiter kernel: [35139.248310] Invalid GART PTE entry during table walk.
Jul 24 05:14:12 jupiter kernel: [35464.297150] Invalid GART PTE entry during table walk.
Jul 24 05:19:58 jupiter kernel: [35810.314438] Invalid GART PTE entry during table walk.
Jul 24 05:23:57 jupiter kernel: [36048.324108] Invalid GART PTE entry during table walk.
Jul 24 05:30:57 jupiter kernel: [36468.349242] Invalid GART PTE entry during table walk.
Jul 24 05:38:34 jupiter kernel: [36925.376167] Invalid GART PTE entry during table walk.
Jul 24 07:25:22 jupiter kernel: [43333.409857] Invalid GART PTE entry during table walk.
Jul 24 07:30:07 jupiter kernel: [43618.440304] Invalid GART PTE entry during table walk.
Jul 24 07:34:10 jupiter kernel: [43861.464551] Invalid GART PTE entry during table walk.
Jul 24 07:38:12 jupiter kernel: [44103.473285] Invalid GART PTE entry during table walk.
Jul 24 07:42:30 jupiter kernel: [44361.496401] Invalid GART PTE entry during table walk.
Jul 24 07:47:11 jupiter kernel: [44642.644176] Invalid GART PTE entry during table walk.
Jul 24 07:51:15 jupiter kernel: [44886.660174] Invalid GART PTE entry during table walk.
Jul 24 07:57:50 jupiter kernel: [45281.744116] Invalid GART PTE entry during table walk.
Jul 24 08:03:00 jupiter kernel: [45591.768321] Invalid GART PTE entry during table walk.
Jul 24 08:09:55 jupiter kernel: [46006.780104] Invalid GART PTE entry during table walk.
Jul 24 08:15:23 jupiter kernel: [46334.804131] Invalid GART PTE entry during table walk.
Jul 24 08:19:38 jupiter kernel: [46589.828094] Invalid GART PTE entry during table walk.
Jul 24 11:23:24 jupiter kernel: [ 0.562777] PCI-DMA: using GART IOMMU.
Jul 24 11:23:24 jupiter kernel: [ 0.577640] Linux agpgart interface v0.103
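For anyone who wants to check how regular the errors are, the gaps
between them can be computed from the kernel timestamps in square
brackets. This is a minimal sketch that assumes each syslog entry sits on
a single line, as it does in /var/log/syslog itself; the function name
gart_intervals is made up for illustration.

```python
import re

# Matches the kernel uptime stamp in front of the GART error text,
# e.g. "[22969.172109] Invalid GART PTE entry during table walk."
GART_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+Invalid GART PTE entry")


def gart_intervals(lines):
    """Return the gaps in seconds between consecutive GART errors.

    A stamp that is not larger than the previous one (the uptime
    counter restarts after a reboot) produces no gap.
    """
    stamps = []
    for line in lines:
        match = GART_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return [later - earlier
            for earlier, later in zip(stamps, stamps[1:])
            if later > earlier]
```

It could be called as gart_intervals(open("/var/log/syslog")). Long gaps
in the result would correspond to pauses between transfers rather than
to the error period itself.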
Thanks, kind regards,
Jaap Hoetmer