Your message dated Mon, 04 May 2015 01:20:59 +0100 with message-id <1430698859.4113.122.ca...@decadent.org.uk> and subject line Re: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?! has caused the Debian Bug report #417853, regarding kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?! to be marked as done.
This means that you claim that the problem has been dealt with. If this is not the case it is now your responsibility to reopen the Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this message is talking about, this may indicate a serious mail system misconfiguration somewhere. Please contact ow...@bugs.debian.org immediately.)

-- 
417853: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=417853
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
--- Begin Message ---
Package: kernel
Severity: critical
Justification: causes serious data loss

Hi everybody.

I'm currently (together with others) investigating a severe data corruption problem that many users might suffer from.

A short description: when you verify many GBs of data over and over with md5sums (or another hash), errors are found. We do not yet know the real reason for the problem, but it might relate to Opteron (and perhaps Athlon) CPUs and/or Nvidia chipsets (mainboards). So it might be a hardware design error (but even a kernel error could be possible).

This is definitely not an isolated hardware issue of my system, as many other users on lkml reported the problem (and we all did very extensive hardware tests).

The error occurs only if one has so much memory that the system uses memory hole mapping (and the hardware iommu).

At lkml we have so far found two "solutions" (I consider them more workarounds, as we don't know exactly why they solve the problem):

1) Disabling memory hole mapping in the system BIOS. The downside is that there is no memory hole mapping at all, and the user loses much of his main memory (in my case 1.5 GB).
2) Setting iommu=soft. The user keeps his full memory, and in all our tests (at least as far as I am informed; we do a lot of tests, as I and someone else administer some big Linux clusters) the error did _not_ occur.

Windows users generally do not suffer from this corruption, as Windows (at least until Vista) was not able to make use of the hardware iommu and always uses the software iommu. The Intel CPUs with EM64T/Intel64 don't suffer from that problem either, as they don't have a hardware iommu (at least as far as I know).

We are not yet sure if this is a large-scale problem or affects only some special hardware combinations. We do however think that the issue occurs only with PCI DMA accesses. (Tests showed that when disabling DMA, or at least using slower DMA modes on the disks, the issue disappeared.)
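
The repeated checksum test described in the report can be sketched roughly as follows. This is only an illustrative sketch, not the script the reporters used: the function names `file_digest` and `count_mismatches`, the chunk size, and the default of 10 passes are all assumptions made for the example.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so files larger than RAM can be hashed.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def count_mismatches(path, passes=10, algorithm="md5"):
    """Hash `path` `passes` times; count passes that differ from the first."""
    reference = file_digest(path, algorithm)
    mismatches = 0
    for _ in range(passes - 1):
        if file_digest(path, algorithm) != reference:
            mismatches += 1
    return mismatches
```

Calling `count_mismatches` on a large file should return 0 on a healthy system. Note that repeated reads of a file small enough to stay in the page cache may never touch the disk again, so a test of this kind is probably only meaningful on data sets considerably larger than main memory.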
The problem is that vendors (at least Nvidia) do not help very much; they didn't even answer my mails. And most "normal" users won't notice this problem, as they don't have enough main memory, and even if they have, the error occurs very rarely (perhaps some 100 bytes every 30 GB <- only a very imprecise scale).

What I suggest now: As this is a very grave issue, I suggest
- to configure all the default kernels for etch that may be affected (as far as I know these are the amd64-k8 and amd64-generic kernels; perhaps the i386 packages too, have a look at lkml for this) to use iommu=soft.
- to update all packages in sarge and woody (as far as they might be affected).
- to put some warnings in the packages where users might configure their own kernel, and in the boot-loaders.

Have a look at this thread at lkml http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2 for in-depth information. It also contains links to some previous threads. There are also some posts to lkml about this topic in separate threads (e.g. "amd64 iommu causing corruption? (was Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!)").

Best wishes,
Chris.

btw: please CC me as I'm off-list at the moment.
PS: I'll also write this to the debian-kernel mailing list.

-- System Information:
Debian Release: 4.0
APT prefers unstable
APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)
Shell: /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18
Locale: LANG=en...@scientia.net, LC_CTYPE=en...@scientia.net (charmap=UTF-8)
--- End Message ---
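
For readers who want to check whether a running system was booted with the iommu=soft workaround mentioned in the report, a minimal sketch follows. It assumes a Linux system that exposes the boot command line at /proc/cmdline and that iommu= values are comma-separated; the helper names `iommu_options`, `uses_soft_iommu`, and `read_cmdline` are hypothetical and chosen only for this example.

```python
def iommu_options(cmdline):
    """Return the values of every iommu= parameter found in `cmdline`."""
    options = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            # Assumption: several values may be combined with commas.
            options.extend(token[len("iommu="):].split(","))
    return options


def uses_soft_iommu(cmdline):
    """True if the command line requests the software iommu."""
    return "soft" in iommu_options(cmdline)


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system (Linux only)."""
    with open(path) as f:
        return f.read()
```

On such a system, `uses_soft_iommu(read_cmdline())` reports whether the parameter was passed at boot. It does not show which iommu implementation the kernel actually ended up using.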
--- Begin Message ---
The kernel bug was fixed in 2.6.18.dfsg.1-13 so this doesn't need documenting.

Ben.

-- 
Ben Hutchings
If you seem to know what you are doing, you'll be given more to do.
--- End Message ---