** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1352995
Title:
  ERAT Multihit machine checks

Status in “linux” package in Ubuntu:
  Confirmed

Bug description:

-- Problem Description --

Our project involves porting a 3rd-party out-of-tree module to LE Ubuntu on Power. We've been seeing occasional ERAT Multihit machine checks with kernels ranging from the LE Ubuntu 14.04 3.13-based kernel through the very latest 3.16-rc5 mainline kernels. Our kernels are running directly on top of OPAL/Sapphire in PowerNV mode, with no intervening KVM host.

FSP dumps captured at the time of the ERAT detection show that there are duplicate mappings in force for the same real page, with the duplicate mappings being for different-sized pages. So, for example, the same 4K real page will be referred to by a 4K mapping and an overlapping 16M mapping.

Aneesh has been working with us on this. We are currently testing this patchset (git format-patch --stdout format). We are still seeing ERAT machine checks with these changes. Most of these changes have already been posted externally; some of them were updated after that.

Current status: when hitting the multihit ERAT, I don't find duplicate hash pte entries, so it possibly indicates a missed flush or a race.

Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Dump the rest of 256 entries
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Done..
Dump the rest of 256 entries
Done..
Found hugepage shift 0 for ea 3fff7d0f0000 with ptep 1f283d8000383
Severe Machine check interrupt [Recovered]
  Initiator: CPU
  Error type: ERAT [Multihit]
    Effective address: 00003fff7d0f0000

That is what I am finding on machine check. I am searching the hash pte with base page size 4K and 64K and printing matching hash table entries. b_size = 15 and a_size = -1 both indicate 4K.

-aneesh

I guess we now have a race in the unmap path. I am auditing the hpte_slot_array usage. We do check for hpte_slot_array != NULL in invalidate.
But if we hit two pmdp_splitting flushes, one will skip the invalidate as per the current code and will go ahead and mark hpte_slot_array NULL. I have a patch in the repo which tries to work around that. But I am not sure whether we can really have two pmdp_splitting flushes simultaneously, because we call that under the pmd lock. Still need to look at the details.

-aneesh

I added more debug prints, and this is what I found. Before a hugepage flush, I added debug prints to dump the hash table to see if we are failing to clear any hash table entries. After every update we seem to have correctly updated the hash table. On one MCE, some of the relevant parts of the logs are:

pmd_hugepage_update dumping entries for 0x3fff71000000 with clr = 0xffffffffffffffff set = 0x0
.....
.....
dump_hash_pte_group dumping entries for 0x3fff7191da8c with clr = 0x0 set = 0x0
func = dump_hash_pte_group, addr = 3fff7191da8c psize = 0 slot = 1174024 v = 4001a9245cff7181 r = 7dfb5d193 with b_size = 0 a_size = 0 count = 2333
func = dump_hash_pte_group, addr = 3fff71000000 psize = 0 slot = 1155808 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 0
func = dump_hash_pte_group, addr = 3fff710a2000 psize = 0 slot = 1157104 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 162
func = dump_hash_pte_group, addr = 3fff710e6000 psize = 0 slot = 1156560 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 230
func = dump_hash_pte_group, addr = 3fff71378000 psize = 0 slot = 1161504 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 888

So we end up clearing the huge pmd with 0x3fff71000000, and at that point we didn't have anything in the hash table. That is the last pmdp_splitting_flush or pmd_hugepage_update event on that address.

Can we audit the driver code to understand its large/huge page usage and whether it is making any x86 assumptions around the page table accessors? For example, ppc64 rules around page table access are stricter than those of x86.
We don't have flush_tlb_* functions, and we need to make sure we hold the ptl while updating the page table and also flush the hash pte while holding the lock. Attaching the log also.

-aneesh

Aneesh writes:
> Can we audit the driver code to understand its large/huge page usage
> and whether it is making any x86 assumptions around the page table
> accessors? For example, ppc64 rules around page table access are
> stricter than those of x86. We don't have flush_tlb_* functions, and
> we need to make sure we hold the ptl while updating the page table
> and also flush the hash pte while holding the lock.

Yes, we can do that (all the driver code that's specific to Linux is in the kernel-interface subdirectory, so you can take a look as well). But I'm not quite sure what we'd be looking for. The driver doesn't have any explicit awareness of huge pages; it doesn't intend or expect to interact with them in any way. And I wouldn't expect the driver to be updating the kernel's page tables itself, but rather to use some set of (relatively safe) services to do that. So if you can tell us what we might want to look for in the driver code, we'll be happy to do that.

I do notice a couple of uses of __flush_tlb() and global_flush_tlb(), but those are under x86 ifdefs and won't be compiled in for Power. The intent of the code using those is to flush the caches when the driver changes the cache attribute of memory regions between cached and uncached.

The driver's Linux kernel interface code does contain references to updating "pte", but those should all be the PTEs that are used by the adapter, not the Linux kernel page table entries. After some additional looking, I see that there are some code paths in the driver's kernel interface layer that at least refer to the kernel page table structures (see the references to pte_t, pmd_t, pgd_t, etc.) in kernel_interface/nv-linux.h and nv.c. But again, these are code paths that should only be compiled in for x86 (and in this case for kernel versions < 2.6.1) as far as I can see.
Can you try the new patchset? I was able to run recreat1.sh in a loop more than 8 times now. I will leave it running for the rest of the day and will check again tomorrow morning. I still need to get clarification from the hardware guys on calling tlbie in a loop for huge pages.

-aneesh

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1352995/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp