Re: mm: BUG_ON with NUMA_BALANCING (kernel BUG at include/linux/swapops.h:131!)
On Wed, May 13, 2015 at 01:17:54AM -0700, Haren Myneni wrote:
> Hi,
>
> I am getting BUG_ON in migration_entry_to_page() with 4.1.0-rc2
> kernel on powerpc system which has 512 CPUs (64 cores - 16 nodes) and
> 1.6 TB memory. We can easily recreate this issue with kernel compile
> (make -j500). But I could not reproduce with numa_balancing=disable.
>

Is this patched in any way? I ask because line 134 on 4.1.0-rc2 does not
match up with a BUG_ON. It's close to a PageLocked check but I want to be
sure there are no other modifications. Otherwise, when was the last time
this worked? Was 4.0 ok? As it can be easily reproduced, can the problem
be bisected please?

-- 
Mel Gorman
SUSE Labs
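For readers following the thread, the check being discussed lives in
migration_entry_to_page(); roughly, from a 4.x-era include/linux/swapops.h
(exact line numbers vary between trees):

static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
	struct page *p = pfn_to_page(swp_offset(entry));
	/*
	 * Any use of migration entries may only occur while the
	 * corresponding page is locked
	 */
	BUG_ON(!PageLocked(p));
	return p;
}

Hitting the BUG_ON means a migration swap entry was observed for a page
that is no longer locked, i.e. the entry outlived the migration it was
supposed to describe.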
Re: mm: BUG_ON with NUMA_BALANCING (kernel BUG at include/linux/swapops.h:131!)
On Mon, May 18, 2015 at 12:32:29AM -0700, Haren Myneni wrote:
> Mel,
> I am hitting this issue with 4.0 kernel and even with 3.19 and
> 3.17 kernels. I will also try with previous versions. Please let me
> know any suggestions on the debugging.
>

Please keep going further back in time to see if there was a point where
this was ever working. It could be a ppc64-specific bug but right now,
I'm still drawing a blank.

-- 
Mel Gorman
SUSE Labs
Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote: > Mel Gorman writes: > > > This is follow up from the "pipe/page fault oddness" thread. > > > > Automatic NUMA balancing depends on being able to protect PTEs to trap a > > fault and gather reference locality information. Very broadly speaking it > > would mark PTEs as not present and use another bit to distinguish between > > NUMA hinting faults and other types of faults. It was universally loved > > by everybody and caused no problems whatsoever. That last sentence might > > be a lie. > > > > This series is very heavily based on patches from Linus and Aneesh to > > replace the existing PTE/PMD NUMA helper functions with normal change > > protections. I did alter and add parts of it but I consider them relatively > > minor contributions. Note that the signed-offs here need addressing. I > > couldn't use "From" or Signed-off-by from the original authors as the > > patches had to be broken up and they were never signed off. I expect the > > two people involved will just stick their signed-off-by on it. > > > How about the additional change listed below for ppc64 ? One part of the > patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd > because we find the _PAGE_PRESENT bit set in case of numa fault. I > ended up relaxing the check there. > I folded the set_pte_at and set_pmd_at changes into the patch "mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa" with one change -- both set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency. > Second part of the change is to add a WARN_ON to make sure we are > not depending on DSISR_PROTFAULT for anything else. We ideally should not > get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check > whether the access is allowed by pte before inserting a pte into hash > page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes > in hash page table. But it is good to run with VM_WARN_ON ? > Due to the nature of the check and when they are hit, I converted it to a WARN_ON_ONCE. Due to the exceptional circumstance the overhead should be non-existant and shouldn't need to be hidden below VM_WARN_ON. I also noted that with the patch the kernel potentially no longer recovers from this exceptional cirsumstance and instead falls through. To avoid this, I preserved the "goto out_unlock". Is this still ok? ---8<--- ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Requires-signed-off-by: Aneesh Kumar K.V Signed-off-by: Mel Gorman --- arch/powerpc/mm/copro_fault.c | 7 ++- arch/powerpc/mm/fault.c | 20 +--- 2 files changed, 15 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 5a236f0..46152aa 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock; } else { - if (dsisr & DSISR_PROTFAULT) + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. 
+*/ + if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT)) goto out_unlock; if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto out_unlock; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5007497..9d6e0b3 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -396,17 +396,6 @@ good_area: #endif /* CONFIG_8xx */ if (is_exec) { -#ifdef CONFIG_PPC_STD_MMU - /* Protection fault on exec go straight to failure on -* Hash based MMUs as they either don't support per-page -* execute permission, or if they do, it's handled already -* at the hash level. This test would probably have to -* be removed if we change the way this works to make hash -* processors use the same I/D cache coherency mechanism -* as embedded. -*/ -#endif /* CONFIG_PPC_STD_MMU */ - /* * Allow execution from readable areas if the MMU does not * provide separate controls over reading and executing. @@ -421,6 +410,14 @@ good_area: (cpu_
Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
On Tue, Nov 18, 2014 at 10:03:30PM +0530, Aneesh Kumar K.V wrote:
> > diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
> > index 5a236f0..46152aa 100644
> > --- a/arch/powerpc/mm/copro_fault.c
> > +++ b/arch/powerpc/mm/copro_fault.c
> > @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned
> > long ea,
> > 		if (!(vma->vm_flags & VM_WRITE))
> > 			goto out_unlock;
> > 	} else {
> > -		if (dsisr & DSISR_PROTFAULT)
> > +		/*
> > +		 * protfault should only happen due to us
> > +		 * mapping a region readonly temporarily. PROT_NONE
> > +		 * is also covered by the VMA check above.
> > +		 */
> > +		if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
> > 			goto out_unlock;
> > 		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> > 			goto out_unlock;
>
> We should do that DSISR_PROTFAULT check after the vma->vm_flags check. It
> is not that we will not hit DSISR_PROTFAULT; what we want to ensure here is
> that we get a prot fault only for cases covered by that vma check. So
> everything should be taking the if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> branch if it is a protfault. If not, we would like to know about that, and
> hence the idea of not using WARN_ON_ONCE. I was also not sure whether we
> want to enable that always. The reason for keeping that within
> CONFIG_DEBUG_VM is to make sure that nobody ends up depending on PROTFAULT
> outside the cases covered by the vma check. So the expectation is that
> developers working on the feature will run with DEBUG_VM enabled and find
> this warning. We don't expect to hit this otherwise.
>

/me slaps self. It's clear now and updated accordingly. Thanks.

-- 
Mel Gorman
SUSE Labs
[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault
A transhuge NUMA hinting fault may find the page is migrating and should wait until migration completes. The check is race-prone because the pmd is deferenced outside of the page lock and while the race is tiny, it'll be larger if the PMD is cleared while marking PMDs for hinting fault. This patch closes the race. Signed-off-by: Mel Gorman --- include/linux/migrate.h | 4 mm/huge_memory.c| 3 ++- mm/migrate.c| 6 -- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 01aad3e..a3edcdf 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #ifdef CONFIG_NUMA_BALANCING extern bool pmd_trans_migrating(pmd_t pmd); -extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); extern bool migrate_ratelimited(int node); @@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd) { return false; } -static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index de98415..38fa6cc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1284,8 +1284,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * check_same as the page may no longer be mapped. */ if (unlikely(pmd_trans_migrating(*pmdp))) { + page = pmd_page(*pmdp); spin_unlock(ptl); - wait_migrate_huge_page(vma->anon_vma, pmdp); + wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index 0143995..baaf80e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd) return PageLocked(page); } -void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ - struct page *page = pmd_page(*pmd); - wait_on_page_locked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2
V1 failed while running under kvm-tools very quickly and a second report indicated that it happens on bare metal as well. This version survived an overnight run of trinity running under kvm-tools here but verification from Sasha would be appreciated. Changelog since V1 o ppc64 paranoia checks and clarifications (aneesh) o Fix trinity regression (hopefully) o Reduce unnecessary TLB flushes(mel) Automatic NUMA balancing depends on being able to protect PTEs to trap a fault and gather reference locality information. Very broadly speaking it would mark PTEs as not present and use another bit to distinguish between NUMA hinting faults and other types of faults. It was universally loved by everybody and caused no problems whatsoever. That last sentence might be a lie. This series is very heavily based on patches from Linus and Aneesh to replace the existing PTE/PMD NUMA helper functions with normal change protections. I did alter and add parts of it but I consider them relatively minor contributions. At their suggestion, acked-bys are in there but I've no problem converting them to Signed-off-by if requested. AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for that. I tested trinity under kvm-tool and passed and ran a few other basic tests. In most cases I'm leaving out detail as it's not that interesting. specjbb single JVM: There was negligible performance difference in the benchmark itself for short and long runs. However, system activity is higher and interrupts are much higher over time -- possibly TLB flushes. Migrations are also higher. Overall, this is more overhead but considering the problems faced with the old approach I think we just have to suck it up and find another way of reducing the overhead. specjbb multi JVM: Negligible performance difference to the actual benchmarm but like the single JVM case, the system overhead is noticably higher. Again, interrupts are a major factor. autonumabench: This was all over the place and about all that can be reasonably concluded is that it's different but not necessarily better or worse. autonumabench 3.18.0-rc43.18.0-rc4 vanilla protnone-v2r5 UserNUMA01 32806.01 ( 0.00%)20250.67 ( 38.27%) UserNUMA01_THEADLOCAL23910.28 ( 0.00%)22734.37 ( 4.92%) UserNUMA023176.85 ( 0.00%) 3082.68 ( 2.96%) UserNUMA02_SMT1600.06 ( 0.00%) 1547.08 ( 3.31%) System NUMA01 719.07 ( 0.00%) 1344.39 (-86.96%) System NUMA01_THEADLOCAL 916.26 ( 0.00%) 180.90 ( 80.26%) System NUMA02 20.92 ( 0.00%) 17.34 ( 17.11%) System NUMA02_SMT 8.76 ( 0.00%)7.24 ( 17.35%) Elapsed NUMA01 728.27 ( 0.00%) 519.28 ( 28.70%) Elapsed NUMA01_THEADLOCAL 589.15 ( 0.00%) 554.73 ( 5.84%) Elapsed NUMA02 81.20 ( 0.00%) 81.72 ( -0.64%) Elapsed NUMA02_SMT 80.49 ( 0.00%) 79.58 ( 1.13%) CPU NUMA014603.00 ( 0.00%) 4158.00 ( 9.67%) CPU NUMA01_THEADLOCAL 4213.00 ( 0.00%) 4130.00 ( 1.97%) CPU NUMA023937.00 ( 0.00%) 3793.00 ( 3.66%) CPU NUMA02_SMT1998.00 ( 0.00%) 1952.00 ( 2.30%) System CPU usage of NUMA01 is worse but it's an adverse workload on this machine so I'm reluctant to conclude that it's a problem that matters. On the other workloads that are sensible on this machine, system CPU usage is great. 
Overall time to complete the benchmark is comparable 3.18.0-rc4 3.18.0-rc4 vanillaprotnone-v2r5 User61493.3847615.01 System 1665.17 1550.07 Elapsed 1480.79 1236.74 NUMA alloc hit 4739774 5328362 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 4664980 5328351 NUMA base PTE updates556489407 444119981 NUMA huge PMD updates 1086000 866680 NUMA page range updates 1112521407 887860141 NUMA hint faults 1538964 1242142 NUMA hint local faults 835871 814313 NUMA hint local percent 54 65 NUMA pages migrated732921259883854 The NUMA pages migrated look terrible but when I looked at a graph of the activity over time I see that the massive spike in migration activity was during NUMA01. This correlates with high system CPU usage and could be simply down to bad luck but any modifications that affect that workload would be related to scan rates and migrations, not the protection mechanism. For all other workloads, migration activity was comparable. Overall, headline performance figures are co
[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing
This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/pgtable.h | 11 +++ arch/x86/include/asm/pgtable.h | 16 include/asm-generic/pgtable.h | 19 +++ 3 files changed, 46 insertions(+) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 316f9a5..452c3b4 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -39,6 +39,17 @@ static inline int pte_none(pte_t pte){ return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING +static inline int pte_protnone_numa(pte_t pte) +{ + return (pte_val(pte) & + (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT; +} + +static inline int pmd_protnone_numa(pmd_t pmd) +{ + return pte_protnone_numa(pmd_pte(pmd)); +} + static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_NUMA_MASK; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index aa97a07..613cd00 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -497,6 +497,22 @@ static inline int pmd_present(pmd_t pmd) _PAGE_NUMA); } +#ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h + */ +static inline int pte_protnone_numa(pte_t pte) +{ + return pte_flags(pte) & _PAGE_PROTNONE; +} + +static inline int pmd_protnone_numa(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_PROTNONE; +} +#endif /* CONFIG_NUMA_BALANCING */ + static inline int pmd_none(pmd_t pmd) { /* Only check low word on 32-bit platforms, since it might be diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 752e30d..7e74122 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -677,6 +677,25 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +#ifndef CONFIG_NUMA_BALANCING +/* + * Technically a PTE can be PROTNONE even when not doing NUMA balancing but + * the only case the kernel cares is for NUMA balancing. By default, + * implement the helper as "always no". Note that this does not check VMA + * protection bits meaning that it is up to the caller to distinguish between + * PROT_NONE protections and NUMA hinting fault protections. + */ +static inline int pte_protnone_numa(pte_t pte) +{ + return 0; +} + +static inline int pmd_protnone_numa(pmd_t pmd) +{ + return 0; +} +#endif /* CONFIG_NUMA_BALANCING */ + #ifdef CONFIG_NUMA_BALANCING /* * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
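As an aside for readers unfamiliar with the ppc64 bits: the helper above
works because, with this series, a PROT_NONE (and therefore NUMA-hinting)
PTE on ppc64 stays _PAGE_PRESENT but drops _PAGE_USER, so userspace
accesses fault while the kernel still sees a valid entry. Restating the
hunk above with that spelled out (annotated copy, not additional code):

/*
 * ppc64 encoding assumed by this series:
 *
 *   _PAGE_PRESENT  _PAGE_USER   meaning
 *   0              x            not present (none, swap or migration entry)
 *   1              1            ordinary user-accessible mapping
 *   1              0            PROT_NONE / NUMA hinting entry
 */
static inline int pte_protnone_numa(pte_t pte)
{
	return (pte_val(pte) &
		(_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
}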
[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/fault.c | 5 - arch/powerpc/mm/gup.c | 4 ++-- arch/powerpc/mm/pgtable.c | 8 +++- arch/powerpc/mm/pgtable_64.c| 3 ++- arch/x86/mm/gup.c | 4 ++-- include/uapi/linux/mempolicy.h | 2 +- mm/gup.c| 8 mm/huge_memory.c| 16 +++ mm/memory.c | 4 ++-- mm/mprotect.c | 39 ++--- mm/pgtable-generic.c| 2 +- 12 files changed, 40 insertions(+), 57 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 084ad54..bbd8499 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte) && !pte_numa(pte)) { + if (pte_present(pte) && !pte_protnone_numa(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 08d659a..5007497 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -405,8 +405,6 @@ good_area: * processors use the same I/D cache coherency mechanism * as embedded. */ - if (error_code & DSISR_PROTFAULT) - goto bad_area; #endif /* CONFIG_PPC_STD_MMU */ /* @@ -430,9 +428,6 @@ good_area: flags |= FAULT_FLAG_WRITE; /* a read */ } else { - /* protection fault */ - if (error_code & 0x0800) - goto bad_area; if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; } diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c index d874668..d870d93 100644 --- a/arch/powerpc/mm/gup.c +++ b/arch/powerpc/mm/gup.c @@ -39,7 +39,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, /* * Similar to the PMD case, NUMA hinting must take slow path */ - if (pte_numa(pte)) + if (pte_protnone_numa(pte)) return 0; if ((pte_val(pte) & mask) != result) @@ -85,7 +85,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone_numa(pmd)) return 0; if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next, diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index c90e602..b5d58d3 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -173,7 +173,13 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT); + /* +* When handling numa faults, we already have the pte marked +* _PAGE_PRESENT, but we can be sure that it is not in hpte. +* Hence we can use set_pte_at for them. 
+*/ + WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); #endif /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index c8d709a..c721c5e 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -710,7 +710,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT); + WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); assert_spin_locked(&mm->page_table_lock); WARN_ON(!pmd_tra
[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Signed-off-by: Mel Gorman Acked-by: Aneesh Kumar K.V --- arch/powerpc/mm/copro_fault.c | 8 ++-- arch/powerpc/mm/fault.c | 20 +--- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 5a236f0..0450d68 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock; } else { - if (dsisr & DSISR_PROTFAULT) - goto out_unlock; if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto out_unlock; + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(dsisr & DSISR_PROTFAULT); } ret = 0; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5007497..9d6e0b3 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -396,17 +396,6 @@ good_area: #endif /* CONFIG_8xx */ if (is_exec) { -#ifdef CONFIG_PPC_STD_MMU - /* Protection fault on exec go straight to failure on -* Hash based MMUs as they either don't support per-page -* execute permission, or if they do, it's handled already -* at the hash level. This test would probably have to -* be removed if we change the way this works to make hash -* processors use the same I/D cache coherency mechanism -* as embedded. -*/ -#endif /* CONFIG_PPC_STD_MMU */ - /* * Allow execution from readable areas if the MMU does not * provide separate controls over reading and executing. @@ -421,6 +410,14 @@ good_area: (cpu_has_feature(CPU_FTR_NOEXECUTE) || !(vma->vm_flags & (VM_READ | VM_WRITE goto bad_area; +#ifdef CONFIG_PPC_STD_MMU + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); +#endif /* CONFIG_PPC_STD_MMU */ /* a write */ } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) @@ -430,6 +427,7 @@ good_area: } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); } /* -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations
With PROT_NONE, the traditional page table manipulation functions are sufficient. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar --- include/linux/huge_mm.h | 3 +-- mm/huge_memory.c| 33 +++-- mm/memory.c | 10 ++ mm/mempolicy.c | 2 +- mm/migrate.c| 2 +- mm/mprotect.c | 2 +- mm/pgtable-generic.c| 2 -- 7 files changed, 17 insertions(+), 37 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ad9051b..554bbe3 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ec0b424..668f1a3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1367,9 +1367,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - pmd = pmd_mknonnuma(pmd); + pmd = pmd_modify(pmd, vma->vm_page_prot); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_protnone_numa(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1503,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; ret = 1; - if (!prot_numa) { - entry = pmdp_get_and_clear(mm, addr, pmd); - if (pmd_protnone_numa(entry)) - entry = pmd_mknonnuma(entry); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); - } else { - struct page *page = pmd_page(*pmd); - - /* -* Do not trap faults against the zero page. The -* read-only data is likely to be read-cached on the -* local CPU cache and it is less useful to know about -* local vs remote hits on the zero page. -*/ - if (!is_huge_zero_page(page) && - !pmd_protnone_numa(*pmd)) { - pmdp_set_numa(mm, addr, pmd); - ret = HPAGE_PMD_NR; - } - } + entry = pmdp_get_and_clear(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index 96ceb0a..900127b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3120,9 +3120,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * validation through pte_unmap_same(). It's of NUMA type but * the pfn may be screwed if the read is non atomic. * - * ptep_modify_prot_start is not called as this is clearing - * the _PAGE_NUMA bit and it is not really expected that there - * would be concurrent hardware modifications to the PTE. + * We can safely just do a "set_pte_at()", because the old + * page table entry is not accessible, so there would be no + * concurrent hardware modifications to the PTE. 
*/ ptl = pte_lockptr(mm, pmd); spin_lock(ptl); @@ -3131,7 +3131,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index e58725a..9d61dce 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy
[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page
Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. Signed-off-by: Mel Gorman --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c| 13 - mm/memory.c | 1 - mm/mprotect.c | 15 ++- 4 files changed, 28 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 554bbe3..ad9051b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot); + unsigned long addr, pgprot_t newprot, + int prot_numa); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 668f1a3..3013eb8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1502,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot) + unsigned long addr, pgprot_t newprot, int prot_numa) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + + /* +* Avoid trapping faults against the zero page. The read-only +* data is likely to be read-cached on the local CPU and +* local/remote hits to the zero page are not interesting. +*/ + if (prot_numa && is_huge_zero_pmd(*pmd)) { + spin_unlock(ptl); + return 0; + } + ret = 1; entry = pmdp_get_and_clear(mm, addr, pmd); entry = pmd_modify(entry, newprot); diff --git a/mm/memory.c b/mm/memory.c index 900127b..a725c08 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3142,7 +3142,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(ptep, ptl); return 0; } - BUG_ON(is_zero_pfn(page_to_pfn(page))); /* * Avoid grouping on DSO/COW pages in specific and RO pages diff --git a/mm/mprotect.c b/mm/mprotect.c index dc65c0f..33dfafb 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, oldpte = *pte; if (pte_present(oldpte)) { pte_t ptent; + + /* +* Avoid trapping faults against the zero or KSM +* pages. See similar comment in change_huge_pmd. +*/ + if (prot_numa) { + struct page *page; + + page = vm_normal_page(vma, addr, oldpte); + if (!page || PageKsm(page)) + continue; + } + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); @@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot); + newprot, prot_numa); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers
This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting updateparallel fault -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting updateparallel migration -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: Mel Gorman --- arch/powerpc/include/asm/pgtable.h| 54 +--- arch/powerpc/include/asm/pte-common.h | 5 -- arch/powerpc/include/asm/pte-hash64.h | 6 -- arch/x86/include/asm/pgtable.h| 22 + arch/x86/include/asm/pgtable_64.h | 5 -- arch/x86/include/asm/pgtable_types.h | 41 + include/asm-generic/pgtable.h | 155 -- include/linux/swapops.h | 2 +- 8 files changed, 7 insertions(+), 283 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 452c3b4..2e074e7 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -49,64 +49,12 @@ static inline int pmd_protnone_numa(pmd_t pmd) { return pte_protnone_numa(pmd_pte(pmd)); } - -static inline int pte_present(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -#define pte_present_nonuma pte_present_nonuma -static inline int pte_present_nonuma(pte_t pte) -{ - return pte_val(pte) & (_PAGE_PRESENT); -} - -#define ptep_set_numa ptep_set_numa -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, -pte_t *ptep) -{ - if ((pte_val(*ptep) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0); - return; -} - -#define pmdp_set_numa pmdp_set_numa -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, -pmd_t *pmdp) -{ - if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA); - return; -} - -/* - * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist - * which was inherited from x86. 
For the purposes of powerpc pte_basic_t and - * pmd_t are equivalent - */ -#define pteval_t pte_basic_t -#define pmdval_t pmd_t -static inline pteval_t ptenuma_flags(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -static inline pmdval_t pmdnuma_flags(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_NUMA_MASK; -} - -# else +#endif /* CONFIG_NUMA_BALANCING */ static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_PRESENT; } -#endif /* CONFIG_NUMA_BALANCING */ /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h index e040c35..8d1569c 100644 -
[PATCH 08/10] x86: mm: Restore original pte_special check
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted the pte_special check to take into account that a special pte had SPECIAL and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no longer modifying _PAGE_PRESENT it should be safe to restore the original pte_special behaviour. Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index f8799e0..5241332 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -131,13 +131,7 @@ static inline int pte_exec(pte_t pte) static inline int pte_special(pte_t pte) { - /* -* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h. -* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 == -* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL. -*/ - return (pte_flags(pte) & _PAGE_SPECIAL) && - (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); + return pte_flags(pte) & _PAGE_SPECIAL; } static inline unsigned long pte_pfn(pte_t pte) -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result in strangeness so add a check for it. BUG_ON looks like overkill but if this is hit then it's a serious bug that could result in corruption so do not even try recovering. It would have been more comprehensive to check VMA flags in pte_protnone_numa but it would have made the API ugly just for a debugging check. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 3 +++ mm/memory.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3013eb8..6458229f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1274,6 +1274,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + ptl = pmd_lock(mm, pmdp); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; diff --git a/mm/memory.c b/mm/memory.c index a725c08..42d652d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3115,6 +3115,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + /* * The "pte" at this point cannot be used safely without * validation through pte_unmap_same(). It's of NUMA type but -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
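For completeness, the user-visible contract that check defends can be
demonstrated from userspace; an illustrative test program (not part of the
patch) is below. An access to a PROT_NONE mapping must keep delivering
SIGSEGV and must never be quietly resolved as though it were a NUMA
hinting fault.

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static void handler(int sig)
{
	(void)sig;
	/* Expected outcome: the PROT_NONE access faults. */
	write(STDOUT_FILENO, "SIGSEGV as expected\n", 20);
	_exit(0);
}

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	signal(SIGSEGV, handler);
	return p[0];	/* must fault: the VMA is inaccessible */
}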
[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries
If a PTE or PMD is already marked NUMA when scanning to mark entries for NUMA hinting then it is not necessary to update the entry and incur a TLB flush penalty. Avoid the avoidhead where possible. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 14 -- mm/mprotect.c| 4 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6458229f..a7ea9b8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, return 0; } - ret = 1; - entry = pmdp_get_and_clear(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + if (!prot_numa || !pmd_protnone_numa(*pmd)) { + ret = 1; + entry = pmdp_get_and_clear(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); + } spin_unlock(ptl); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 33dfafb..eb890d0 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) continue; + + /* Avoid TLB flush if possible */ + if (pte_protnone_numa(oldpte)) + continue; } ptent = ptep_modify_prot_start(mm, addr, pte); -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
On Thu, Nov 20, 2014 at 10:38:56AM +0000, David Laight wrote:
> From: Mel Gorman
> > Convert existing users of pte_numa and friends to the new helper. Note
> > that the kernel is broken after this patch is applied until the other
> > page table modifiers are also altered. This patch layout is to make
> > review easier.
>
> Doesn't that break bisection?
>

Yes, for automatic NUMA balancing at least. The patch structure is to make
the reviewers' job easier and, besides, bisecting within patches 2-6 is
pointless. If desired, I can collapse patches 2-6 together for the final
submission.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2
On Thu, Nov 20, 2014 at 04:50:25PM -0500, Sasha Levin wrote:
> On 11/20/2014 05:19 AM, Mel Gorman wrote:
> > V1 failed while running under kvm-tools very quickly and a second report
> > indicated that it happens on bare metal as well. This version survived
> > an overnight run of trinity running under kvm-tools here but verification
> > from Sasha would be appreciated.
>
> Hi Mel,
>
> I tried giving it a spin, but it won't apply at all on the latest -mm
> tree:
>
> $ git am -3 numa/*
> Applying: mm: numa: Do not dereference pmd outside of the lock during NUMA
> hinting fault
> Applying: mm: Add p[te|md] protnone helpers for use by NUMA balancing
> Applying: mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
> fatal: sha1 information is lacking or useless (mm/huge_memory.c).
> Repository lacks necessary blobs to fall back on 3-way merge.
> Cannot fall back to three-way merge.
>
> Did I miss a prerequisite?
>

No. V2 was still against 3.18-rc4 as that was what I had vanilla kernel
test data for. V3 will be against latest mmotm.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing
On Thu, Nov 20, 2014 at 11:54:06AM -0800, Linus Torvalds wrote:
> On Thu, Nov 20, 2014 at 2:19 AM, Mel Gorman wrote:
> > This is a preparatory patch that introduces protnone helpers for automatic
> > NUMA balancing.
>
> Oh, I hadn't noticed that you had renamed these things. It was
> probably already true in your V1 version.
>
> I do *not* think that "pte_protnone_numa()" makes sense as a name. It
> only confuses people to think that there is still/again something
> NUMA-special about the PTE. The whole point of the protnone changes
> was to make it really very very clear that from a hardware standpoint,
> this is *exactly* about protnone, and nothing else.
>
> The fact that we then use protnone PTEs for NUMA faults is a VM
> internal issue; it should *not* show up in the architecture page table
> helpers.
>
> I'm not NAK'ing this name, but I really think it's a very important
> part of the whole patch series - to stop the stupid confusion about
> NUMA entries. As far as the page tables are concerned, this has
> absolutely _zero_ to do with NUMA.
>
> We made that mistake once. We're fixing it. Let the naming *show* that
> it's fixed, and this is "pte_protnone()".
>
> The places that use this for NUMA handling might have a comment or
> something. But they'll be in the VM where this matters, not in the
> architecture page table description files. The comment would be
> something like "if the vma is accessible, but the PTE is marked
> protnone, this is an autonuma entry".
>

I feared that people would eventually make the mistake of thinking that
pte_protnone() would return true for PROT_NONE VMAs that do *not* have the
page table bit set. I'll use the old name as you suggest and expand the
comment. It'll be in v3.

-- 
Mel Gorman
SUSE Labs
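To make that closing comment concrete, here is a minimal illustrative
sketch of the disambiguation the VM side has to do once the architecture
helper is plain pte_protnone(). The helper name below is made up for
illustration only; patch 09 of the series enforces the same invariant with
a BUG_ON in do_numa_page() and do_huge_pmd_numa_page().

/*
 * Illustrative only: a real PROT_NONE mapping and a NUMA hinting entry
 * look identical at the PTE level, so the VMA protections disambiguate.
 */
static bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
{
	if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
		return false;		/* genuine PROT_NONE mapping */

	return pte_protnone(pte);	/* accessible VMA: NUMA hinting */
}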
[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault
A transhuge NUMA hinting fault may find the page is migrating and should wait until migration completes. The check is race-prone because the pmd is deferenced outside of the page lock and while the race is tiny, it'll be larger if the PMD is cleared while marking PMDs for hinting fault. This patch closes the race. Signed-off-by: Mel Gorman --- include/linux/migrate.h | 4 mm/huge_memory.c| 3 ++- mm/migrate.c| 6 -- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 01aad3e..a3edcdf 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #ifdef CONFIG_NUMA_BALANCING extern bool pmd_trans_migrating(pmd_t pmd); -extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); extern bool migrate_ratelimited(int node); @@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd) { return false; } -static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 817a875..a2cd021 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1283,8 +1283,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * check_same as the page may no longer be mapped. */ if (unlikely(pmd_trans_migrating(*pmdp))) { + page = pmd_page(*pmdp); spin_unlock(ptl); - wait_migrate_huge_page(vma->anon_vma, pmdp); + wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index 41945cb..11d86b4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd) return PageLocked(page); } -void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ - struct page *page = pmd_page(*pmd); - wait_on_page_locked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v3
The main change here is to rebase on mmotm-20141119 as the series had significant conflicts that were non-obvious to resolve. The main blockers for merging are independent testing from Sasha (trinity), independent testing from Aneesh (ppc64 support) and acks from Ben and Paul on the powerpc patches. Changelog since V2 o Rename *_protnone_numa to _protnone and extend docs (linus) o Rebase to mmotm-20141119 for pre-merge testing(mel) o Conver WARN_ON to VM_WARN_ON (aneesh) Changelog since V1 o ppc64 paranoia checks and clarifications (aneesh) o Fix trinity regression (hopefully) o Reduce unnecessary TLB flushes(mel) Automatic NUMA balancing depends on being able to protect PTEs to trap a fault and gather reference locality information. Very broadly speaking it would mark PTEs as not present and use another bit to distinguish between NUMA hinting faults and other types of faults. It was universally loved by everybody and caused no problems whatsoever. That last sentence might be a lie. This series is very heavily based on patches from Linus and Aneesh to replace the existing PTE/PMD NUMA helper functions with normal change protections. I did alter and add parts of it but I consider them relatively minor contributions. At their suggestion, acked-bys are in there but I've no problem converting them to Signed-off-by if requested. AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for that. I tested trinity under kvm-tool and passed and ran a few other basic tests. At the time of writing, only the short-lived tests have completed but testing of V2 indicated that long-term testing had no surprises. In most cases I'm leaving out detail as it's not that interesting. specjbb single JVM: There was negligible performance difference in the benchmark itself for short runs. However, system activity is higher and interrupts are much higher over time -- possibly TLB flushes. Migrations are also higher. Overall, this is more overhead but considering the problems faced with the old approach I think we just have to suck it up and find another way of reducing the overhead. specjbb multi JVM: Negligible performance difference to the actual benchmark but like the single JVM case, the system overhead is noticeably higher. Again, interrupts are a major factor. autonumabench: This was all over the place and about all that can be reasonably concluded is that it's different but not necessarily better or worse. autonumabench 3.18.0-rc53.18.0-rc5 mmotm-20141119 protnone-v3r3 UserNUMA01 32380.24 ( 0.00%)21642.92 ( 33.16%) UserNUMA01_THEADLOCAL22481.02 ( 0.00%)22283.22 ( 0.88%) UserNUMA023137.00 ( 0.00%) 3116.54 ( 0.65%) UserNUMA02_SMT1614.03 ( 0.00%) 1543.53 ( 4.37%) System NUMA01 322.97 ( 0.00%) 1465.89 (-353.88%) System NUMA01_THEADLOCAL 91.87 ( 0.00%) 49.32 ( 46.32%) System NUMA02 37.83 ( 0.00%) 14.61 ( 61.38%) System NUMA02_SMT 7.36 ( 0.00%)7.45 ( -1.22%) Elapsed NUMA01 716.63 ( 0.00%) 599.29 ( 16.37%) Elapsed NUMA01_THEADLOCAL 553.98 ( 0.00%) 539.94 ( 2.53%) Elapsed NUMA02 83.85 ( 0.00%) 83.04 ( 0.97%) Elapsed NUMA02_SMT 86.57 ( 0.00%) 79.15 ( 8.57%) CPU NUMA014563.00 ( 0.00%) 3855.00 ( 15.52%) CPU NUMA01_THEADLOCAL 4074.00 ( 0.00%) 4136.00 ( -1.52%) CPU NUMA023785.00 ( 0.00%) 3770.00 ( 0.40%) CPU NUMA02_SMT1872.00 ( 0.00%) 1959.00 ( -4.65%) System CPU usage of NUMA01 is worse but it's an adverse workload on this machine so I'm reluctant to conclude that it's a problem that matters. On the other workloads that are sensible on this machine, system CPU usage is great. 
Overall time to complete the benchmark is comparable 3.18.0-rc5 3.18.0-rc5 mmotm-20141119protnone-v3r3 User59612.5048586.44 System460.22 1537.45 Elapsed 1442.20 1304.29 NUMA alloc hit 5075182 5743353 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 5075174 5743339 NUMA base PTE updates637061448 443106883 NUMA huge PMD updates 1243434 864747 NUMA page range updates 1273699656 885857347 NUMA hint faults 1658116 1214277 NUMA hint local faults 959487 754113 NUMA hint local percent 57 62 NUMA pages migrated546705661676398 The NUMA pages migrated look terrible but when I looked at
[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing
This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/pgtable.h | 15 +++ arch/x86/include/asm/pgtable.h | 16 include/asm-generic/pgtable.h | 20 3 files changed, 51 insertions(+) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index a8805fe..490bd6d 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -39,6 +39,21 @@ static inline int pte_none(pte_t pte){ return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h + */ +static inline int pte_protnone(pte_t pte) +{ + return (pte_val(pte) & + (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pte_protnone(pmd_pte(pmd)); +} + static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_NUMA_MASK; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 081d6f4..2e25780 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -502,6 +502,22 @@ static inline int pmd_present(pmd_t pmd) _PAGE_NUMA); } +#ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h + */ +static inline int pte_protnone(pte_t pte) +{ + return pte_flags(pte) & _PAGE_PROTNONE; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_PROTNONE; +} +#endif /* CONFIG_NUMA_BALANCING */ + static inline int pmd_none(pmd_t pmd) { /* Only check low word on 32-bit platforms, since it might be diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 177d597..d497d08 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -688,6 +688,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +#ifndef CONFIG_NUMA_BALANCING +/* + * Technically a PTE can be PROTNONE even when not doing NUMA balancing but + * the only case the kernel cares is for NUMA balancing and is only ever set + * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked + * _PAGE_PROTNONE so by by default, implement the helper as "always no". It + * is the responsibility of the caller to distinguish between PROT_NONE + * protections and NUMA hinting fault protections. + */ +static inline int pte_protnone(pte_t pte) +{ + return 0; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return 0; +} +#endif /* CONFIG_NUMA_BALANCING */ + #ifdef CONFIG_NUMA_BALANCING /* * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/fault.c | 5 - arch/powerpc/mm/pgtable.c | 11 --- arch/powerpc/mm/pgtable_64.c| 3 ++- arch/x86/mm/gup.c | 4 ++-- include/uapi/linux/mempolicy.h | 2 +- mm/gup.c| 10 +- mm/huge_memory.c| 16 +++ mm/memory.c | 4 ++-- mm/mprotect.c | 39 ++--- mm/pgtable-generic.c| 2 +- 11 files changed, 40 insertions(+), 58 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 084ad54..3e6ad3f 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte) && !pte_numa(pte)) { + if (pte_present(pte) && !pte_protnone(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index eb79907..b434153 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -398,8 +398,6 @@ good_area: * processors use the same I/D cache coherency mechanism * as embedded. */ - if (error_code & DSISR_PROTFAULT) - goto bad_area; #endif /* CONFIG_PPC_STD_MMU */ /* @@ -423,9 +421,6 @@ good_area: flags |= FAULT_FLAG_WRITE; /* a read */ } else { - /* protection fault */ - if (error_code & 0x0800) - goto bad_area; if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; } diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index c90e602..83dfcb5 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma, void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { -#ifdef CONFIG_DEBUG_VM - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT); -#endif + /* +* When handling numa faults, we already have the pte marked +* _PAGE_PRESENT, but we can be sure that it is not in hpte. +* Hence we can use set_pte_at for them. +*/ + VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); + /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this * is called. 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 87ff0c1..435ebf7 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT); + WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); assert_spin_locked(&mm->page_table_lock); WARN_ON(!pmd_trans_huge(pmd)); #endif diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index 207d9aef..f32e12c 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, struct page *page; /* Similar to the PMD case, NUMA hinting must take slow path */ - if (pte_numa(pte)) { + if (pte_protnone(pte)) { pte_unmap(ptep); return 0; } @@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd)) return 0;
[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Signed-off-by: Mel Gorman Acked-by: Aneesh Kumar K.V --- arch/powerpc/mm/copro_fault.c | 8 ++-- arch/powerpc/mm/fault.c | 20 +--- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 5a236f0..0450d68 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock; } else { - if (dsisr & DSISR_PROTFAULT) - goto out_unlock; if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto out_unlock; + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(dsisr & DSISR_PROTFAULT); } ret = 0; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index b434153..1bcd378 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -389,17 +389,6 @@ good_area: #endif /* CONFIG_8xx */ if (is_exec) { -#ifdef CONFIG_PPC_STD_MMU - /* Protection fault on exec go straight to failure on -* Hash based MMUs as they either don't support per-page -* execute permission, or if they do, it's handled already -* at the hash level. This test would probably have to -* be removed if we change the way this works to make hash -* processors use the same I/D cache coherency mechanism -* as embedded. -*/ -#endif /* CONFIG_PPC_STD_MMU */ - /* * Allow execution from readable areas if the MMU does not * provide separate controls over reading and executing. @@ -414,6 +403,14 @@ good_area: (cpu_has_feature(CPU_FTR_NOEXECUTE) || !(vma->vm_flags & (VM_READ | VM_WRITE goto bad_area; +#ifdef CONFIG_PPC_STD_MMU + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); +#endif /* CONFIG_PPC_STD_MMU */ /* a write */ } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) @@ -423,6 +420,7 @@ good_area: } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); } /* -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations
With PROT_NONE, the traditional page table manipulation functions are sufficient. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar --- include/linux/huge_mm.h | 3 +-- mm/huge_memory.c| 33 +++-- mm/memory.c | 10 ++ mm/mempolicy.c | 2 +- mm/migrate.c| 2 +- mm/mprotect.c | 2 +- mm/pgtable-generic.c| 2 -- 7 files changed, 17 insertions(+), 37 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ad9051b..554bbe3 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f81fddf..5618e22 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - pmd = pmd_mknonnuma(pmd); + pmd = pmd_modify(pmd, vma->vm_page_prot); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1503,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; ret = 1; - if (!prot_numa) { - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); - } else { - struct page *page = pmd_page(*pmd); - - /* -* Do not trap faults against the zero page. The -* read-only data is likely to be read-cached on the -* local CPU cache and it is less useful to know about -* local vs remote hits on the zero page. -*/ - if (!is_huge_zero_page(page) && - !pmd_protnone(*pmd)) { - pmdp_set_numa(mm, addr, pmd); - ret = HPAGE_PMD_NR; - } - } + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index eaa46f1..2100e0f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3114,9 +3114,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * validation through pte_unmap_same(). It's of NUMA type but * the pfn may be screwed if the read is non atomic. * - * ptep_modify_prot_start is not called as this is clearing - * the _PAGE_NUMA bit and it is not really expected that there - * would be concurrent hardware modifications to the PTE. + * We can safely just do a "set_pte_at()", because the old + * page table entry is not accessible, so there would be no + * concurrent hardware modifications to the PTE. 
*/ ptl = pte_lockptr(mm, pmd); spin_lock(ptl); @@ -3125,7 +3125,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f22c559..530a0da 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy
[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers
This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting updateparallel fault -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting updateparallel migration -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: Mel Gorman --- arch/powerpc/include/asm/pgtable.h| 54 +--- arch/powerpc/include/asm/pte-common.h | 5 -- arch/powerpc/include/asm/pte-hash64.h | 6 -- arch/x86/include/asm/pgtable.h| 22 + arch/x86/include/asm/pgtable_64.h | 5 -- arch/x86/include/asm/pgtable_types.h | 41 + include/asm-generic/pgtable.h | 155 -- include/linux/swapops.h | 2 +- 8 files changed, 7 insertions(+), 283 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 490bd6d..7e4cb3d 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -53,64 +53,12 @@ static inline int pmd_protnone(pmd_t pmd) { return pte_protnone(pmd_pte(pmd)); } - -static inline int pte_present(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -#define pte_present_nonuma pte_present_nonuma -static inline int pte_present_nonuma(pte_t pte) -{ - return pte_val(pte) & (_PAGE_PRESENT); -} - -#define ptep_set_numa ptep_set_numa -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, -pte_t *ptep) -{ - if ((pte_val(*ptep) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0); - return; -} - -#define pmdp_set_numa pmdp_set_numa -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, -pmd_t *pmdp) -{ - if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA); - return; -} - -/* - * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist - * which was inherited from x86. 
For the purposes of powerpc pte_basic_t and - * pmd_t are equivalent - */ -#define pteval_t pte_basic_t -#define pmdval_t pmd_t -static inline pteval_t ptenuma_flags(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -static inline pmdval_t pmdnuma_flags(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_NUMA_MASK; -} - -# else +#endif /* CONFIG_NUMA_BALANCING */ static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_PRESENT; } -#endif /* CONFIG_NUMA_BALANCING */ /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h index e040c35..8d1569c 100644 --- a/arch
[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page
Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. Signed-off-by: Mel Gorman --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c| 13 - mm/memory.c | 1 - mm/mprotect.c | 15 ++- 4 files changed, 28 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 554bbe3..ad9051b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot); + unsigned long addr, pgprot_t newprot, + int prot_numa); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5618e22..ad2a3ee 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1502,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot) + unsigned long addr, pgprot_t newprot, int prot_numa) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + + /* +* Avoid trapping faults against the zero page. The read-only +* data is likely to be read-cached on the local CPU and +* local/remote hits to the zero page are not interesting. +*/ + if (prot_numa && is_huge_zero_pmd(*pmd)) { + spin_unlock(ptl); + return 0; + } + ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); diff --git a/mm/memory.c b/mm/memory.c index 2100e0f..2ec07a9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3136,7 +3136,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(ptep, ptl); return 0; } - BUG_ON(is_zero_pfn(page_to_pfn(page))); /* * Avoid grouping on DSO/COW pages in specific and RO pages diff --git a/mm/mprotect.c b/mm/mprotect.c index dc65c0f..33dfafb 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, oldpte = *pte; if (pte_present(oldpte)) { pte_t ptent; + + /* +* Avoid trapping faults against the zero or KSM +* pages. See similar comment in change_huge_pmd. +*/ + if (prot_numa) { + struct page *page; + + page = vm_normal_page(vma, addr, oldpte); + if (!page || PageKsm(page)) + continue; + } + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); @@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot); + newprot, prot_numa); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 08/10] x86: mm: Restore original pte_special check
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted the pte_special check to take into account that a special pte had SPECIAL and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no longer modifying _PAGE_PRESENT it should be safe to restore the original pte_special behaviour. Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index cf428a7..0dd5be3 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -136,13 +136,7 @@ static inline int pte_exec(pte_t pte) static inline int pte_special(pte_t pte) { - /* -* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h. -* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 == -* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL. -*/ - return (pte_flags(pte) & _PAGE_SPECIAL) && - (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); + return pte_flags(pte) & _PAGE_SPECIAL; } static inline unsigned long pte_pfn(pte_t pte) -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result in strangeness so add a check for it. BUG_ON looks like overkill but if this is hit then it's a serious bug that could result in corruption so do not even try recovering. It would have been more comprehensive to check VMA flags in pte_protnone_numa but it would have made the API ugly just for a debugging check. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 3 +++ mm/memory.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ad2a3ee..8546654 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1273,6 +1273,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + ptl = pmd_lock(mm, pmdp); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; diff --git a/mm/memory.c b/mm/memory.c index 2ec07a9..7d97af5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3109,6 +3109,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + /* * The "pte" at this point cannot be used safely without * validation through pte_unmap_same(). It's of NUMA type but -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries
If a PTE or PMD is already marked NUMA when scanning to mark entries for NUMA hinting then it is not necessary to update the entry and incur a TLB flush penalty. Avoid the overhead where possible. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 14 -- mm/mprotect.c| 4 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8546654..f2bf521 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, return 0; } - ret = 1; - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + if (!prot_numa || !pmd_protnone(*pmd)) { + ret = 1; + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); + } spin_unlock(ptl); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 33dfafb..109e7aa 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) continue; + + /* Avoid TLB flush if possible */ + if (pte_protnone(oldpte)) + continue; } ptent = ptep_modify_prot_start(mm, addr, pte); -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing
On Tue, Dec 02, 2014 at 09:38:39AM +1100, Benjamin Herrenschmidt wrote: > On Fri, 2014-11-21 at 13:57 +0000, Mel Gorman wrote: > > > #ifdef CONFIG_NUMA_BALANCING > > +/* > > + * These work without NUMA balancing but the kernel does not care. See the > > + * comment in include/asm-generic/pgtable.h > > + */ > > +static inline int pte_protnone(pte_t pte) > > +{ > > + return (pte_val(pte) & > > + (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT; > > +} > > I would add a comment clarifying that this only works for user pages, > ie, this accessor will always return "true" for a kernel page on ppc. > diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 490bd6d..7b889a3 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -41,7 +41,8 @@ static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PA #ifdef CONFIG_NUMA_BALANCING /* * These work without NUMA balancing but the kernel does not care. See the - * comment in include/asm-generic/pgtable.h + * comment in include/asm-generic/pgtable.h . On powerpc, this will only + * work for user pages and always return true for kernel pages. */ static inline int pte_protnone(pte_t pte) { -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
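To make Ben's caveat concrete, here is a small standalone model of the bit test above. The _PAGE_PRESENT/_PAGE_USER values are placeholders rather than the real ppc64 definitions, so treat it only as an illustration: a kernel pte (present, no _PAGE_USER) is indistinguishable from a protected user pte as far as this test goes, which is why the helper is only meaningful for user addresses.

#include <stdio.h>

#define _PAGE_PRESENT 0x1UL    /* placeholder value, not the ppc64 bit */
#define _PAGE_USER    0x2UL    /* placeholder value, not the ppc64 bit */

static int pte_protnone(unsigned long pte)
{
        return (pte & (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
}

int main(void)
{
        unsigned long user_numa   = _PAGE_PRESENT;                /* user pte protected for a hinting fault */
        unsigned long user_normal = _PAGE_PRESENT | _PAGE_USER;   /* ordinary user pte */
        unsigned long kernel_pte  = _PAGE_PRESENT;                /* kernel ptes never carry _PAGE_USER */

        printf("protected user pte: %d\n", pte_protnone(user_numa));   /* 1 */
        printf("ordinary user pte:  %d\n", pte_protnone(user_normal)); /* 0 */
        printf("kernel pte:         %d\n", pte_protnone(kernel_pte));  /* 1 -- Ben's caveat */
        return 0;
}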
Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
On Wed, Dec 03, 2014 at 08:53:37PM +0530, Aneesh Kumar K.V wrote: > Benjamin Herrenschmidt writes: > > > On Tue, 2014-12-02 at 12:57 +0530, Aneesh Kumar K.V wrote: > >> Now, hash_preload can possibly insert an hpte in hash page table even if > >> the access is not allowed by the pte permissions. But i guess even that > >> is ok. because we will fault again, end-up calling hash_page_mm where we > >> handle that part correctly. > > > > I think we need a test case... > > > > I ran the subpageprot test that Paul had written. I modified it to ran > with selftest. > It's implied but can I assume it passed? If so, Ben and Paul, can I consider the series to be acked by you other than the minor comment updates? Thanks. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
On Thu, Dec 04, 2014 at 08:01:57AM +1100, Benjamin Herrenschmidt wrote: > On Wed, 2014-12-03 at 15:52 +0000, Mel Gorman wrote: > > > > It's implied but can I assume it passed? If so, Ben and Paul, can I > > consider the series to be acked by you other than the minor comment > > updates? > > Yes. Assuming it passed :-) > > Acked-by: Benjamin Herrenschmidt > Sweet, thanks. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
On Wed, Dec 03, 2014 at 10:50:35PM +0530, Aneesh Kumar K.V wrote: > Mel Gorman writes: > > > On Wed, Dec 03, 2014 at 08:53:37PM +0530, Aneesh Kumar K.V wrote: > >> Benjamin Herrenschmidt writes: > >> > >> > On Tue, 2014-12-02 at 12:57 +0530, Aneesh Kumar K.V wrote: > >> >> Now, hash_preload can possibly insert an hpte in hash page table even if > >> >> the access is not allowed by the pte permissions. But i guess even that > >> >> is ok. because we will fault again, end-up calling hash_page_mm where we > >> >> handle that part correctly. > >> > > >> > I think we need a test case... > >> > > >> > >> I ran the subpageprot test that Paul had written. I modified it to ran > >> with selftest. > >> > > > > It's implied but can I assume it passed? > > Yes. > > -bash-4.2# ./subpage_prot > test: subpage_prot > tags: git_version:v3.17-rc3-13511-g0cd3756 > allocated malloc block of 0x400 bytes at 0x0x3fffb0d1 > testing malloc block... > OK > success: subpage_prot > -bash-4.2# > Thanks for adding that and double checking. I won't pick up the patch as part of this series because it's not directly related but I would strongly suggest sending the patch separately. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v4
There are no functional changes here and I kept the mmotm-20141119 baseline as that is what got tested but it rebases cleanly to current mmotm. The series makes architectural changes but splitting this on a per-arch basis would cause bisect-related brain damage. I'm hoping this can go through Andrew without conflict. It's been tested by myself (standard tests), Aneesh (ppc64) and Sasha (trinity) so there is some degree of confidence that it's ok. Changelog since V3 o Minor comment update (benh) o Add acked-bys Changelog since V2 o Rename *_protnone_numa to _protnone and extend docs (linus) o Rebase to mmotm-20141119 for pre-merge testing (mel) o Convert WARN_ON to VM_WARN_ON (aneesh) Changelog since V1 o ppc64 paranoia checks and clarifications (aneesh) o Fix trinity regression (hopefully) o Reduce unnecessary TLB flushes (mel) Automatic NUMA balancing depends on being able to protect PTEs to trap a fault and gather reference locality information. Very broadly speaking it would mark PTEs as not present and use another bit to distinguish between NUMA hinting faults and other types of faults. It was universally loved by everybody and caused no problems whatsoever. That last sentence might be a lie. This series is very heavily based on patches from Linus and Aneesh to replace the existing PTE/PMD NUMA helper functions with normal change protections. I did alter and add parts of it but I consider them relatively minor contributions. At their suggestion, acked-bys are in there but I've no problem converting them to Signed-off-by if requested. AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for that. I tested trinity under kvm-tool and passed and ran a few other basic tests. At the time of writing, only the short-lived tests have completed but testing of V2 indicated that long-term testing had no surprises. In most cases I'm leaving out detail as it's not that interesting. specjbb single JVM: There was negligible performance difference in the benchmark itself for short runs. However, system activity is higher and interrupts are much higher over time -- possibly TLB flushes. Migrations are also higher. Overall, this is more overhead but considering the problems faced with the old approach I think we just have to suck it up and find another way of reducing the overhead. specjbb multi JVM: Negligible performance difference to the actual benchmark but like the single JVM case, the system overhead is noticeably higher. Again, interrupts are a major factor. autonumabench: This was all over the place and about all that can be reasonably concluded is that it's different but not necessarily better or worse.
autonumabench
                                     3.18.0-rc5             3.18.0-rc5
                                 mmotm-20141119          protnone-v3r3
User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this machine so I'm reluctant to conclude that it's a problem that matters. On the other workloads that are sensible on this machine, system CPU usage is great. Overall time to complete the benchmark is comparable

                              3.18.0-rc5      3.18.0-rc5
                          mmotm-20141119   protnone-v3r3
User                            59612.50        48586.44
System                            460.22         1537.45
Elapsed                          1442.20         1304.29

NUMA alloc hit                   5075182         5743353
NUMA alloc miss                        0               0
NUMA interleave hit                    0               0
NUMA alloc local                 5075174         5743339
NUMA base PTE updates          637061448       443106883
NUMA huge PMD updates            1243434          864747
NUMA page range updates       1273699656       885857347
NUMA h
[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault
A transhuge NUMA hinting fault may find the page is migrating and should wait until migration completes. The check is race-prone because the pmd is dereferenced outside of the page lock and while the race is tiny, it'll be larger if the PMD is cleared while marking PMDs for hinting fault. This patch closes the race. Signed-off-by: Mel Gorman --- include/linux/migrate.h | 4 mm/huge_memory.c| 3 ++- mm/migrate.c| 6 -- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 01aad3e..a3edcdf 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #ifdef CONFIG_NUMA_BALANCING extern bool pmd_trans_migrating(pmd_t pmd); -extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); extern bool migrate_ratelimited(int node); @@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd) { return false; } -static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 817a875..a2cd021 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1283,8 +1283,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * check_same as the page may no longer be mapped. */ if (unlikely(pmd_trans_migrating(*pmdp))) { + page = pmd_page(*pmdp); spin_unlock(ptl); - wait_migrate_huge_page(vma->anon_vma, pmdp); + wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index 41945cb..11d86b4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd) return PageLocked(page); } -void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ - struct page *page = pmd_page(*pmd); - wait_on_page_locked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
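The pattern the patch switches to can be sketched in user space: take a snapshot of the object you intend to wait on while the lock still keeps the pointer stable, then sleep without the lock. The names below are stand-ins, none of this is a kernel interface, and it is only a rough analogue of pmd_page() under ptl followed by wait_on_page_locked().

#include <pthread.h>
#include <stdio.h>

struct item { int busy; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct item *shared;     /* may be changed by other threads under 'lock' */

static void wait_for_shared(void)
{
        struct item *it;

        pthread_mutex_lock(&lock);
        it = shared;            /* snapshot under the lock, like page = pmd_page(*pmdp) under ptl */
        pthread_mutex_unlock(&lock);

        /* Sleeping on 'it' here is fine; re-reading 'shared' (or *pmdp) after
         * the unlock is the kind of race the patch closes. */
        if (it && it->busy)
                puts("would wait on the snapshotted object here");
}

int main(void)
{
        struct item a = { .busy = 1 };

        pthread_mutex_lock(&lock);
        shared = &a;
        pthread_mutex_unlock(&lock);

        wait_for_shared();
        return 0;
}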
[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing
This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar K.V Tested-by: Sasha Levin --- arch/powerpc/include/asm/pgtable.h | 16 arch/x86/include/asm/pgtable.h | 16 include/asm-generic/pgtable.h | 20 3 files changed, 52 insertions(+) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index a8805fe..7b889a3 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -39,6 +39,22 @@ static inline int pte_none(pte_t pte){ return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h . On powerpc, this will only + * work for user pages and always return true for kernel pages. + */ +static inline int pte_protnone(pte_t pte) +{ + return (pte_val(pte) & + (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pte_protnone(pmd_pte(pmd)); +} + static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_NUMA_MASK; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 081d6f4..2e25780 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -502,6 +502,22 @@ static inline int pmd_present(pmd_t pmd) _PAGE_NUMA); } +#ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h + */ +static inline int pte_protnone(pte_t pte) +{ + return pte_flags(pte) & _PAGE_PROTNONE; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_PROTNONE; +} +#endif /* CONFIG_NUMA_BALANCING */ + static inline int pmd_none(pmd_t pmd) { /* Only check low word on 32-bit platforms, since it might be diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 177d597..d497d08 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -688,6 +688,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +#ifndef CONFIG_NUMA_BALANCING +/* + * Technically a PTE can be PROTNONE even when not doing NUMA balancing but + * the only case the kernel cares is for NUMA balancing and is only ever set + * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked + * _PAGE_PROTNONE so by by default, implement the helper as "always no". It + * is the responsibility of the caller to distinguish between PROT_NONE + * protections and NUMA hinting fault protections. + */ +static inline int pte_protnone(pte_t pte) +{ + return 0; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return 0; +} +#endif /* CONFIG_NUMA_BALANCING */ + #ifdef CONFIG_NUMA_BALANCING /* * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
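For anyone who has not looked at NUMA balancing before, the idea these helpers support can be sketched entirely in user space: make a mapping inaccessible, let the next touch fault, record the access and restore the protection so the faulting instruction retries. This is only an analogy for the concept; the kernel does not use mprotect() or SIGSEGV for hinting faults, and the snippet below is a minimal demonstration for Linux, not part of the series.

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;
static size_t pagesz;

static void handler(int sig, siginfo_t *si, void *uc)
{
        static const char msg[] = "trapped access\n";

        (void)sig; (void)si; (void)uc;
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        /* "Hinting fault" noted: make the page usable again so the faulting
         * instruction can be restarted and succeed. */
        mprotect(region, pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
        struct sigaction sa = { .sa_flags = SA_SIGINFO };

        sa.sa_sigaction = handler;
        sigemptyset(&sa.sa_mask);

        pagesz = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED || sigaction(SIGSEGV, &sa, NULL))
                return 1;

        mprotect(region, pagesz, PROT_NONE);    /* arm the trap */
        region[0] = 1;                          /* faults, handler restores access, retry succeeds */

        printf("write completed after the trap\n");
        return 0;
}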
[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Acked-by: Benjamin Herrenschmidt Tested-by: Sasha Levin --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/fault.c | 5 - arch/powerpc/mm/pgtable.c | 11 --- arch/powerpc/mm/pgtable_64.c| 3 ++- arch/x86/mm/gup.c | 4 ++-- include/uapi/linux/mempolicy.h | 2 +- mm/gup.c| 10 +- mm/huge_memory.c| 16 +++ mm/memory.c | 4 ++-- mm/mprotect.c | 39 ++--- mm/pgtable-generic.c| 2 +- 11 files changed, 40 insertions(+), 58 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 084ad54..3e6ad3f 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte) && !pte_numa(pte)) { + if (pte_present(pte) && !pte_protnone(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index eb79907..b434153 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -398,8 +398,6 @@ good_area: * processors use the same I/D cache coherency mechanism * as embedded. */ - if (error_code & DSISR_PROTFAULT) - goto bad_area; #endif /* CONFIG_PPC_STD_MMU */ /* @@ -423,9 +421,6 @@ good_area: flags |= FAULT_FLAG_WRITE; /* a read */ } else { - /* protection fault */ - if (error_code & 0x0800) - goto bad_area; if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; } diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index c90e602..83dfcb5 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma, void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { -#ifdef CONFIG_DEBUG_VM - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT); -#endif + /* +* When handling numa faults, we already have the pte marked +* _PAGE_PRESENT, but we can be sure that it is not in hpte. +* Hence we can use set_pte_at for them. +*/ + VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); + /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this * is called. 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 87ff0c1..435ebf7 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT); + WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); assert_spin_locked(&mm->page_table_lock); WARN_ON(!pmd_trans_huge(pmd)); #endif diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index 207d9aef..f32e12c 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, struct page *page; /* Similar to the PMD case, NUMA hinting must take slow path */ - if (pte_numa(pte)) { + if (pte_protnone(pte)) { pte_unmap(ptep); return 0; } @@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd))
[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Signed-off-by: Mel Gorman Acked-by: Aneesh Kumar K.V Tested-by: Sasha Levin --- arch/powerpc/mm/copro_fault.c | 8 ++-- arch/powerpc/mm/fault.c | 20 +--- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 5a236f0..0450d68 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock; } else { - if (dsisr & DSISR_PROTFAULT) - goto out_unlock; if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto out_unlock; + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(dsisr & DSISR_PROTFAULT); } ret = 0; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index b434153..1bcd378 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -389,17 +389,6 @@ good_area: #endif /* CONFIG_8xx */ if (is_exec) { -#ifdef CONFIG_PPC_STD_MMU - /* Protection fault on exec go straight to failure on -* Hash based MMUs as they either don't support per-page -* execute permission, or if they do, it's handled already -* at the hash level. This test would probably have to -* be removed if we change the way this works to make hash -* processors use the same I/D cache coherency mechanism -* as embedded. -*/ -#endif /* CONFIG_PPC_STD_MMU */ - /* * Allow execution from readable areas if the MMU does not * provide separate controls over reading and executing. @@ -414,6 +403,14 @@ good_area: (cpu_has_feature(CPU_FTR_NOEXECUTE) || !(vma->vm_flags & (VM_READ | VM_WRITE goto bad_area; +#ifdef CONFIG_PPC_STD_MMU + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); +#endif /* CONFIG_PPC_STD_MMU */ /* a write */ } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) @@ -423,6 +420,7 @@ good_area: } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); } /* -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations
With PROT_NONE, the traditional page table manipulation functions are sufficient. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Tested-by: Sasha Levin --- include/linux/huge_mm.h | 3 +-- mm/huge_memory.c| 33 +++-- mm/memory.c | 10 ++ mm/mempolicy.c | 2 +- mm/migrate.c| 2 +- mm/mprotect.c | 2 +- mm/pgtable-generic.c| 2 -- 7 files changed, 17 insertions(+), 37 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ad9051b..554bbe3 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f81fddf..5618e22 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - pmd = pmd_mknonnuma(pmd); + pmd = pmd_modify(pmd, vma->vm_page_prot); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1503,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; ret = 1; - if (!prot_numa) { - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); - } else { - struct page *page = pmd_page(*pmd); - - /* -* Do not trap faults against the zero page. The -* read-only data is likely to be read-cached on the -* local CPU cache and it is less useful to know about -* local vs remote hits on the zero page. -*/ - if (!is_huge_zero_page(page) && - !pmd_protnone(*pmd)) { - pmdp_set_numa(mm, addr, pmd); - ret = HPAGE_PMD_NR; - } - } + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index eaa46f1..2100e0f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3114,9 +3114,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * validation through pte_unmap_same(). It's of NUMA type but * the pfn may be screwed if the read is non atomic. * - * ptep_modify_prot_start is not called as this is clearing - * the _PAGE_NUMA bit and it is not really expected that there - * would be concurrent hardware modifications to the PTE. + * We can safely just do a "set_pte_at()", because the old + * page table entry is not accessible, so there would be no + * concurrent hardware modifications to the PTE. 
*/ ptl = pte_lockptr(mm, pmd); spin_lock(ptl); @@ -3125,7 +3125,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f22c559..530a0da 100644 --- a/mm/mempol
[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers
This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting updateparallel fault -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting updateparallel migration -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: Mel Gorman Tested-by: Sasha Levin --- arch/powerpc/include/asm/pgtable.h| 54 +--- arch/powerpc/include/asm/pte-common.h | 5 -- arch/powerpc/include/asm/pte-hash64.h | 6 -- arch/x86/include/asm/pgtable.h| 22 + arch/x86/include/asm/pgtable_64.h | 5 -- arch/x86/include/asm/pgtable_types.h | 41 + include/asm-generic/pgtable.h | 155 -- include/linux/swapops.h | 2 +- 8 files changed, 7 insertions(+), 283 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 7b889a3..0a85e33 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -54,64 +54,12 @@ static inline int pmd_protnone(pmd_t pmd) { return pte_protnone(pmd_pte(pmd)); } - -static inline int pte_present(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -#define pte_present_nonuma pte_present_nonuma -static inline int pte_present_nonuma(pte_t pte) -{ - return pte_val(pte) & (_PAGE_PRESENT); -} - -#define ptep_set_numa ptep_set_numa -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, -pte_t *ptep) -{ - if ((pte_val(*ptep) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0); - return; -} - -#define pmdp_set_numa pmdp_set_numa -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, -pmd_t *pmdp) -{ - if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA); - return; -} - -/* - * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist - * which was inherited from x86. 
For the purposes of powerpc pte_basic_t and - * pmd_t are equivalent - */ -#define pteval_t pte_basic_t -#define pmdval_t pmd_t -static inline pteval_t ptenuma_flags(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -static inline pmdval_t pmdnuma_flags(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_NUMA_MASK; -} - -# else +#endif /* CONFIG_NUMA_BALANCING */ static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_PRESENT; } -#endif /* CONFIG_NUMA_BALANCING */ /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h index e040c35..
[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page
Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. Signed-off-by: Mel Gorman --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c| 13 - mm/memory.c | 1 - mm/mprotect.c | 15 ++- 4 files changed, 28 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 554bbe3..ad9051b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot); + unsigned long addr, pgprot_t newprot, + int prot_numa); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5618e22..ad2a3ee 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1502,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot) + unsigned long addr, pgprot_t newprot, int prot_numa) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + + /* +* Avoid trapping faults against the zero page. The read-only +* data is likely to be read-cached on the local CPU and +* local/remote hits to the zero page are not interesting. +*/ + if (prot_numa && is_huge_zero_pmd(*pmd)) { + spin_unlock(ptl); + return 0; + } + ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); diff --git a/mm/memory.c b/mm/memory.c index 2100e0f..2ec07a9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3136,7 +3136,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(ptep, ptl); return 0; } - BUG_ON(is_zero_pfn(page_to_pfn(page))); /* * Avoid grouping on DSO/COW pages in specific and RO pages diff --git a/mm/mprotect.c b/mm/mprotect.c index dc65c0f..33dfafb 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, oldpte = *pte; if (pte_present(oldpte)) { pte_t ptent; + + /* +* Avoid trapping faults against the zero or KSM +* pages. See similar comment in change_huge_pmd. +*/ + if (prot_numa) { + struct page *page; + + page = vm_normal_page(vma, addr, oldpte); + if (!page || PageKsm(page)) + continue; + } + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); @@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot); + newprot, prot_numa); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 08/10] x86: mm: Restore original pte_special check
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted the pte_special check to take into account that a special pte had SPECIAL and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no longer modifying _PAGE_PRESENT it should be safe to restore the original pte_special behaviour. Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index cf428a7..0dd5be3 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -136,13 +136,7 @@ static inline int pte_exec(pte_t pte) static inline int pte_special(pte_t pte) { - /* -* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h. -* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 == -* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL. -*/ - return (pte_flags(pte) & _PAGE_SPECIAL) && - (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); + return pte_flags(pte) & _PAGE_SPECIAL; } static inline unsigned long pte_pfn(pte_t pte) -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
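The history is easier to see with a toy model. The bit values below are made up (on real x86 the NUMA bit aliased _PAGE_SPECIAL, as the removed comment says), but it shows why the old check had to qualify SPECIAL with PRESENT|PROTNONE while _PAGE_NUMA existed, and why the plain check is enough once NUMA hinting ptes keep _PAGE_PRESENT.

#include <stdio.h>

#define _PAGE_PRESENT  0x01u    /* toy values, not the x86 bit layout */
#define _PAGE_PROTNONE 0x02u
#define _PAGE_SPECIAL  0x04u
#define _PAGE_NUMA     _PAGE_SPECIAL    /* toy version of the bit aliasing */

static int old_pte_special(unsigned int flags)
{
        return (flags & _PAGE_SPECIAL) &&
               (flags & (_PAGE_PRESENT | _PAGE_PROTNONE));
}

static int new_pte_special(unsigned int flags)
{
        return !!(flags & _PAGE_SPECIAL);
}

int main(void)
{
        unsigned int special = _PAGE_PRESENT | _PAGE_SPECIAL;   /* a genuine special pte */
        unsigned int numa    = _PAGE_NUMA;                      /* old-style NUMA pte: present bit cleared */

        printf("special pte: old=%d new=%d\n", old_pte_special(special), new_pte_special(special));
        printf("numa pte:    old=%d new=%d\n", old_pte_special(numa), new_pte_special(numa));
        /* The second line is the case the old qualification existed for; it
         * can no longer occur after this series, so the simple check is safe. */
        return 0;
}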
[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result in strangeness so add a check for it. BUG_ON looks like overkill but if this is hit then it's a serious bug that could result in corruption so do not even try recovering. It would have been more comprehensive to check VMA flags in pte_protnone_numa but it would have made the API ugly just for a debugging check. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 3 +++ mm/memory.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ad2a3ee..8546654 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1273,6 +1273,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + ptl = pmd_lock(mm, pmdp); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; diff --git a/mm/memory.c b/mm/memory.c index 2ec07a9..7d97af5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3109,6 +3109,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + /* * The "pte" at this point cannot be used safely without * validation through pte_unmap_same(). It's of NUMA type but -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries
If a PTE or PMD is already marked NUMA when scanning to mark entries for NUMA hinting then it is not necessary to update the entry and incur a TLB flush penalty. Avoid the overhead where possible. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 14 -- mm/mprotect.c| 4 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8546654..f2bf521 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, return 0; } - ret = 1; - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + if (!prot_numa || !pmd_protnone(*pmd)) { + ret = 1; + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); + } spin_unlock(ptl); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 33dfafb..109e7aa 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) continue; + + /* Avoid TLB flush if possible */ + if (pte_protnone(oldpte)) + continue; } ptent = ptep_modify_prot_start(mm, addr, pte); -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
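The optimisation is the usual "do not rewrite (and flush) what is already in the target state" pattern. A minimal sketch with made-up types standing in for the page table entry and the flush decision, not the kernel code itself:

#include <stdio.h>

struct entry { unsigned long flags; };
#define F_PROTNONE 0x1UL        /* stand-in for the protnone protection */

/* Returns 1 if the entry was changed and a TLB flush is therefore needed. */
static int make_protnone(struct entry *e)
{
        if (e->flags & F_PROTNONE)
                return 0;       /* already protected: skip the write and the flush */
        e->flags |= F_PROTNONE;
        return 1;
}

int main(void)
{
        struct entry e = { 0 };

        printf("first pass:  flush=%d\n", make_protnone(&e));   /* 1 */
        printf("second pass: flush=%d\n", make_protnone(&e));   /* 0 */
        return 0;
}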
[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault
A transhuge NUMA hinting fault may find the page is migrating and should wait until migration completes. The check is race-prone because the pmd is dereferenced outside of the page lock and while the race is tiny, it'll be larger if the PMD is cleared while marking PMDs for hinting fault. This patch closes the race. Signed-off-by: Mel Gorman --- include/linux/migrate.h | 4 mm/huge_memory.c| 3 ++- mm/migrate.c| 6 -- 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index fab9b32..78baed5 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -67,7 +67,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #ifdef CONFIG_NUMA_BALANCING extern bool pmd_trans_migrating(pmd_t pmd); -extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); extern bool migrate_ratelimited(int node); @@ -76,9 +75,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd) { return false; } -static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 817a875..a2cd021 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1283,8 +1283,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * check_same as the page may no longer be mapped. */ if (unlikely(pmd_trans_migrating(*pmdp))) { + page = pmd_page(*pmdp); spin_unlock(ptl); - wait_migrate_huge_page(vma->anon_vma, pmdp); + wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index 344cdf6..e6a5ff1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1685,12 +1685,6 @@ bool pmd_trans_migrating(pmd_t pmd) return PageLocked(page); } -void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ - struct page *page = pmd_page(*pmd); - wait_on_page_locked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Acked-by: Benjamin Herrenschmidt Tested-by: Sasha Levin --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/fault.c | 5 - arch/powerpc/mm/pgtable.c | 11 --- arch/powerpc/mm/pgtable_64.c| 3 ++- arch/x86/mm/gup.c | 4 ++-- include/uapi/linux/mempolicy.h | 2 +- mm/gup.c| 10 +- mm/huge_memory.c| 16 +++ mm/memory.c | 4 ++-- mm/mprotect.c | 39 ++--- mm/pgtable-generic.c| 2 +- 11 files changed, 40 insertions(+), 58 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 510bdfb..625407e 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -212,7 +212,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte) && !pte_numa(pte)) { + if (pte_present(pte) && !pte_protnone(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index eb79907..b434153 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -398,8 +398,6 @@ good_area: * processors use the same I/D cache coherency mechanism * as embedded. */ - if (error_code & DSISR_PROTFAULT) - goto bad_area; #endif /* CONFIG_PPC_STD_MMU */ /* @@ -423,9 +421,6 @@ good_area: flags |= FAULT_FLAG_WRITE; /* a read */ } else { - /* protection fault */ - if (error_code & 0x0800) - goto bad_area; if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; } diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index c90e602..83dfcb5 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma, void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { -#ifdef CONFIG_DEBUG_VM - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT); -#endif + /* +* When handling numa faults, we already have the pte marked +* _PAGE_PRESENT, but we can be sure that it is not in hpte. +* Hence we can use set_pte_at for them. +*/ + VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); + /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this * is called. 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 4fe5f64..91bb883 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT); + WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); assert_spin_locked(&mm->page_table_lock); WARN_ON(!pmd_trans_huge(pmd)); #endif diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index d754782..3b1d819 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, struct page *page; /* Similar to the PMD case, NUMA hinting must take slow path */ - if (pte_numa(pte)) { + if (pte_protnone(pte)) { pte_unmap(ptep); return 0; } @@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd)) return 0;
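The conversion changes what callers may assume: after it, a present entry is not necessarily accessible. A minimal illustrative helper showing the check pattern the gup and KVM hunks above now use; pte_fast_accessible() is a hypothetical name, not part of the series.

	/* Can a lockless fast path use this pte directly?  Sketch only. */
	static inline bool pte_fast_accessible(pte_t pte)
	{
		if (!pte_present(pte))
			return false;	/* not mapped at all */
		if (pte_protnone(pte))
			return false;	/* NUMA hinting entry: take the slow path
					 * so the hinting fault is observed */
		return true;
	}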
[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing
This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar K.V Tested-by: Sasha Levin --- arch/powerpc/include/asm/pgtable.h | 16 arch/x86/include/asm/pgtable.h | 16 include/asm-generic/pgtable.h | 20 3 files changed, 52 insertions(+) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index a8805fe..7b889a3 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -39,6 +39,22 @@ static inline int pte_none(pte_t pte){ return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h . On powerpc, this will only + * work for user pages and always return true for kernel pages. + */ +static inline int pte_protnone(pte_t pte) +{ + return (pte_val(pte) & + (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pte_protnone(pmd_pte(pmd)); +} + static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_NUMA_MASK; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e8a5454..8b92203 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -503,6 +503,22 @@ static inline int pmd_present(pmd_t pmd) _PAGE_NUMA); } +#ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h + */ +static inline int pte_protnone(pte_t pte) +{ + return pte_flags(pte) & _PAGE_PROTNONE; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_PROTNONE; +} +#endif /* CONFIG_NUMA_BALANCING */ + static inline int pmd_none(pmd_t pmd) { /* Only check low word on 32-bit platforms, since it might be diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 177d597..d497d08 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -688,6 +688,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +#ifndef CONFIG_NUMA_BALANCING +/* + * Technically a PTE can be PROTNONE even when not doing NUMA balancing but + * the only case the kernel cares is for NUMA balancing and is only ever set + * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked + * _PAGE_PROTNONE so implement the helper as "always no" by default. It is + * the responsibility of the caller to distinguish between PROT_NONE + * protections and NUMA hinting fault protections. + */ +static inline int pte_protnone(pte_t pte) +{ + return 0; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return 0; +} +#endif /* CONFIG_NUMA_BALANCING */ + #ifdef CONFIG_NUMA_BALANCING /* * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
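As the asm-generic comment notes, the helpers cannot tell a real PROT_NONE mapping from a NUMA hinting entry on their own; only the VMA can. A hedged sketch of how a caller disambiguates the two; pte_is_numa_hint() is a hypothetical name, and the VM_READ|VM_WRITE|VM_EXEC test is the same one the later paranoia patch asserts in the fault paths.

	static inline bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
	{
		if (!pte_protnone(pte))
			return false;
		/* A hinting fault can only be taken against an accessible VMA;
		 * a true PROT_NONE VMA has none of these bits set. */
		return !!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
	}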
[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v5
Changelog since V4 o Rebase to 3.19-rc2(mel) Changelog since V3 o Minor comment update (benh) o Add ack'ed bys Changelog since V2 o Rename *_protnone_numa to _protnone and extend docs (linus) o Rebase to mmotm-20141119 for pre-merge testing(mel) o Conver WARN_ON to VM_WARN_ON (aneesh) Changelog since V1 o ppc64 paranoia checks and clarifications (aneesh) o Fix trinity regression (hopefully) o Reduce unnecessary TLB flushes(mel) Automatic NUMA balancing depends on protecting PTEs to trap a fault and gather reference locality information. Very broadly speaking it marks PTEs as not present and uses another bit to distinguish between NUMA hinting faults and other types of faults. This approach is not universally loved, ultimately resulted in swap space shrinking and has had a number of problems with Xen support. This series is very heavily based on patches from Linus and Aneesh to replace the existing PTE/PMD NUMA helper functions with normal change protections that should be less problematic. This was tested on a few different workloads that showed automatic NUMA balancing was still active with mostly comparable results. specjbb single JVM: There was negligible performance difference in the benchmark itself for short runs. However, system activity is higher and interrupts are much higher over time -- possibly TLB flushes. Migrations are also higher. Overall, this is more overhead but considering the problems faced with the old approach I think we just have to suck it up and find another way of reducing the overhead. specjbb multi JVM: Negligible performance difference to the actual benchmark but like the single JVM case, the system overhead is noticeably higher. Again, interrupts are a major factor. autonumabench: This was all over the place and about all that can be reasonably concluded is that it's different but not necessarily better or worse. autonumabench 3.19.0-rc23.19.0-rc2 vanilla protnone-v5r1 Time System-NUMA01 268.99 ( 0.00%) 1350.70 (-402.14%) Time System-NUMA01_THEADLOCAL 110.14 ( 0.00%) 50.68 ( 53.99%) Time System-NUMA02 20.14 ( 0.00%) 31.12 (-54.52%) Time System-NUMA02_SMT7.40 ( 0.00%)6.57 ( 11.22%) Time Elapsed-NUMA01 687.57 ( 0.00%) 528.51 ( 23.13%) Time Elapsed-NUMA01_THEADLOCAL 540.29 ( 0.00%) 554.36 ( -2.60%) Time Elapsed-NUMA02 84.98 ( 0.00%) 78.87 ( 7.19%) Time Elapsed-NUMA02_SMT 77.32 ( 0.00%) 87.07 (-12.61%) System CPU usage of NUMA01 is worse but it's an adverse workload on this machine so I'm reluctant to conclude that it's a problem that matters. Overall time to complete the benchmark is comparable 3.19.0-rc2 3.19.0-rc2 vanillaprotnone-v5r1 User58100.8948351.17 System407.74 1439.22 Elapsed 1411.44 1250.55 NUMA alloc hit 5398081 5536696 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 5398073 5536668 NUMA base PTE updates62271 442576477 NUMA huge PMD updates 1215268 863690 NUMA page range updates 1244939437 884785757 NUMA hint faults 1696858 1221541 NUMA hint local faults 1046842 791219 NUMA hint local percent 61 64 NUMA pages migrated604443059291698 The NUMA pages migrated look terrible but when I looked at a graph of the activity over time I see that the massive spike in migration activity was during NUMA01. This correlates with high system CPU usage and could be simply down to bad luck but any modifications that affect that workload would be related to scan rates and migrations, not the protection mechanism. For all other workloads, migration activity was comparable. 
Overall, headline performance figures are comparable but the overhead is higher, mostly in interrupts. To some extent, higher overhead from this approach was anticipated but not to this degree. It's going to be necessary to reduce this again with a separate series in the future. It's still worth going ahead with this series though as it's likely to avoid constant headaches with Xen and is probably easier to maintain. arch/powerpc/include/asm/pgtable.h| 54 ++-- arch/powerpc/include/asm/pte-common.h | 5 -- arch/powerpc/include/asm/pte-hash64.h | 6 -- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/copro_fault.c | 8 +- arch/powerpc/mm/fault.c |
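The scanner side that creates these faults boils down to a normal protection change. A simplified sketch based on change_prot_numa(); the exact code in the tree this series is against may differ slightly.

	/* task_numa_work() -> change_prot_numa(): write the accessible VMA
	 * down to PROT_NONE-style protections so the next touch traps as a
	 * NUMA hinting fault. */
	nr_updated = change_protection(vma, start, end, PAGE_NONE, 0, 1);
	if (nr_updated)
		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);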
[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Signed-off-by: Mel Gorman Acked-by: Aneesh Kumar K.V Tested-by: Sasha Levin --- arch/powerpc/mm/copro_fault.c | 8 ++-- arch/powerpc/mm/fault.c | 20 +--- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 5a236f0..0450d68 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock; } else { - if (dsisr & DSISR_PROTFAULT) - goto out_unlock; if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto out_unlock; + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(dsisr & DSISR_PROTFAULT); } ret = 0; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index b434153..1bcd378 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -389,17 +389,6 @@ good_area: #endif /* CONFIG_8xx */ if (is_exec) { -#ifdef CONFIG_PPC_STD_MMU - /* Protection fault on exec go straight to failure on -* Hash based MMUs as they either don't support per-page -* execute permission, or if they do, it's handled already -* at the hash level. This test would probably have to -* be removed if we change the way this works to make hash -* processors use the same I/D cache coherency mechanism -* as embedded. -*/ -#endif /* CONFIG_PPC_STD_MMU */ - /* * Allow execution from readable areas if the MMU does not * provide separate controls over reading and executing. @@ -414,6 +403,14 @@ good_area: (cpu_has_feature(CPU_FTR_NOEXECUTE) || !(vma->vm_flags & (VM_READ | VM_WRITE goto bad_area; +#ifdef CONFIG_PPC_STD_MMU + /* +* protfault should only happen due to us +* mapping a region readonly temporarily. PROT_NONE +* is also covered by the VMA check above. +*/ + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); +#endif /* CONFIG_PPC_STD_MMU */ /* a write */ } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) @@ -423,6 +420,7 @@ good_area: } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); } /* -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers
This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting updateparallel fault -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting updateparallel migration -- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: Mel Gorman Tested-by: Sasha Levin --- arch/powerpc/include/asm/pgtable.h| 54 +--- arch/powerpc/include/asm/pte-common.h | 5 -- arch/powerpc/include/asm/pte-hash64.h | 6 -- arch/x86/include/asm/pgtable.h| 22 + arch/x86/include/asm/pgtable_64.h | 5 -- arch/x86/include/asm/pgtable_types.h | 41 + include/asm-generic/pgtable.h | 155 -- include/linux/swapops.h | 2 +- 8 files changed, 7 insertions(+), 283 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 7b889a3..0a85e33 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -54,64 +54,12 @@ static inline int pmd_protnone(pmd_t pmd) { return pte_protnone(pmd_pte(pmd)); } - -static inline int pte_present(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -#define pte_present_nonuma pte_present_nonuma -static inline int pte_present_nonuma(pte_t pte) -{ - return pte_val(pte) & (_PAGE_PRESENT); -} - -#define ptep_set_numa ptep_set_numa -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, -pte_t *ptep) -{ - if ((pte_val(*ptep) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0); - return; -} - -#define pmdp_set_numa pmdp_set_numa -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, -pmd_t *pmdp) -{ - if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA); - return; -} - -/* - * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist - * which was inherited from x86. 
For the purposes of powerpc pte_basic_t and - * pmd_t are equivalent - */ -#define pteval_t pte_basic_t -#define pmdval_t pmd_t -static inline pteval_t ptenuma_flags(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -static inline pmdval_t pmdnuma_flags(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_NUMA_MASK; -} - -# else +#endif /* CONFIG_NUMA_BALANCING */ static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_PRESENT; } -#endif /* CONFIG_NUMA_BALANCING */ /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h index e040c35..
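The serialisation described in the changelog comes down to one pattern: revalidate the pmd under the page table lock before acting on a value that was read without it. A stripped-down fragment of the fault side, matching do_huge_pmd_numa_page().

	ptl = pmd_lock(mm, pmdp);
	if (unlikely(!pmd_same(pmd, *pmdp))) {
		/* A parallel protection update or migration changed the pmd;
		 * drop the lock and let the fault be retried. */
		spin_unlock(ptl);
		goto out;
	}
	/* The pmd is stable for as long as ptl is held */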
[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations
With PROT_NONE, the traditional page table manipulation functions are sufficient. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Tested-by: Sasha Levin --- include/linux/huge_mm.h | 3 +-- mm/huge_memory.c| 33 +++-- mm/memory.c | 10 ++ mm/mempolicy.c | 2 +- mm/migrate.c| 2 +- mm/mprotect.c | 2 +- mm/pgtable-generic.c| 2 -- 7 files changed, 17 insertions(+), 37 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ad9051b..554bbe3 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f81fddf..5618e22 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - pmd = pmd_mknonnuma(pmd); + pmd = pmd_modify(pmd, vma->vm_page_prot); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1503,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; ret = 1; - if (!prot_numa) { - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); - } else { - struct page *page = pmd_page(*pmd); - - /* -* Do not trap faults against the zero page. The -* read-only data is likely to be read-cached on the -* local CPU cache and it is less useful to know about -* local vs remote hits on the zero page. -*/ - if (!is_huge_zero_page(page) && - !pmd_protnone(*pmd)) { - pmdp_set_numa(mm, addr, pmd); - ret = HPAGE_PMD_NR; - } - } + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index 47aa715..debe3f4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3113,9 +3113,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * validation through pte_unmap_same(). It's of NUMA type but * the pfn may be screwed if the read is non atomic. * - * ptep_modify_prot_start is not called as this is clearing - * the _PAGE_NUMA bit and it is not really expected that there - * would be concurrent hardware modifications to the PTE. + * We can safely just do a "set_pte_at()", because the old + * page table entry is not accessible, so there would be no + * concurrent hardware modifications to the PTE. 
*/ ptl = pte_lockptr(mm, pmd); spin_lock(ptl); @@ -3124,7 +3124,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0e0961b..4fcbf12 100644 --- a/mm/mempol
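With this patch the do_numa_page() fixup is an ordinary protection restore. An annotated fragment (declarations omitted) showing why plain set_pte_at() is sufficient here.

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*ptep, pte))) {
		pte_unmap_unlock(ptep, ptl);
		goto out;
	}

	/* Make it present again: the old entry was inaccessible, so no
	 * concurrent hardware update of the PTE is possible and
	 * ptep_modify_prot_start() is not needed. */
	pte = pte_modify(pte, vma->vm_page_prot);
	pte = pte_mkyoung(pte);
	set_pte_at(mm, addr, ptep, pte);
	update_mmu_cache(vma, addr, ptep);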
[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page
Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. Signed-off-by: Mel Gorman --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c| 13 - mm/memory.c | 1 - mm/mprotect.c | 15 ++- 4 files changed, 28 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 554bbe3..ad9051b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot); + unsigned long addr, pgprot_t newprot, + int prot_numa); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5618e22..ad2a3ee 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1502,7 +1502,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot) + unsigned long addr, pgprot_t newprot, int prot_numa) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + + /* +* Avoid trapping faults against the zero page. The read-only +* data is likely to be read-cached on the local CPU and +* local/remote hits to the zero page are not interesting. +*/ + if (prot_numa && is_huge_zero_pmd(*pmd)) { + spin_unlock(ptl); + return 0; + } + ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); diff --git a/mm/memory.c b/mm/memory.c index debe3f4..3c50046 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3135,7 +3135,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(ptep, ptl); return 0; } - BUG_ON(is_zero_pfn(page_to_pfn(page))); /* * Avoid grouping on DSO/COW pages in specific and RO pages diff --git a/mm/mprotect.c b/mm/mprotect.c index dc65c0f..33dfafb 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, oldpte = *pte; if (pte_present(oldpte)) { pte_t ptent; + + /* +* Avoid trapping faults against the zero or KSM +* pages. See similar comment in change_huge_pmd. +*/ + if (prot_numa) { + struct page *page; + + page = vm_normal_page(vma, addr, oldpte); + if (!page || PageKsm(page)) + continue; + } + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); @@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot); + newprot, prot_numa); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
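The base-page filter in compact form, with the reasoning spelled out as comments; this mirrors the change_pte_range() hunk above and is shown only to make the intent explicit.

	if (prot_numa) {
		struct page *page;

		page = vm_normal_page(vma, addr, oldpte);
		/* No struct page (e.g. the zero page) or a KSM page shared
		 * across processes: a hinting fault here gives no useful
		 * per-task locality information, so do not trap it. */
		if (!page || PageKsm(page))
			continue;
	}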
[PATCH 08/10] x86: mm: Restore original pte_special check
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted the pte_special check to take into account that a special pte had SPECIAL and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no longer modifying _PAGE_PRESENT it should be safe to restore the original pte_special behaviour. Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index b9a13e9..4673d6e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -137,13 +137,7 @@ static inline int pte_exec(pte_t pte) static inline int pte_special(pte_t pte) { - /* -* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h. -* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 == -* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL. -*/ - return (pte_flags(pte) & _PAGE_SPECIAL) && - (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); + return pte_flags(pte) & _PAGE_SPECIAL; } static inline unsigned long pte_pfn(pte_t pte) -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries
If a PTE or PMD is already marked NUMA when scanning to mark entries for NUMA hinting then it is not necessary to update the entry and incur a TLB flush penalty. Avoid the overhead where possible. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 14 -- mm/mprotect.c| 4 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8546654..f2bf521 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, return 0; } - ret = 1; - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + if (!prot_numa || !pmd_protnone(*pmd)) { + ret = 1; + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); + } spin_unlock(ptl); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 33dfafb..109e7aa 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) continue; + + /* Avoid TLB flush if possible */ + if (pte_protnone(oldpte)) + continue; } ptent = ptep_modify_prot_start(mm, addr, pte); -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
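The reason skipping already-protnone entries saves a flush is that the expensive part is the clear performed by ptep_modify_prot_start(). An annotated fragment of the prot_numa path after this patch, mirroring change_pte_range() above.

	/* Already trapping hinting faults: rewriting the entry would only
	 * clear and flush it without changing any protection. */
	if (pte_protnone(oldpte))
		continue;

	ptent = ptep_modify_prot_start(mm, addr, pte);	/* clears, may flush */
	ptent = pte_modify(ptent, newprot);
	ptep_modify_prot_commit(mm, addr, pte, ptent);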
[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result in strangeness so add a check for it. BUG_ON looks like overkill but if this is hit then it's a serious bug that could result in corruption so do not even try recovering. It would have been more comprehensive to check VMA flags in pte_protnone_numa but it would have made the API ugly just for a debugging check. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 3 +++ mm/memory.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ad2a3ee..8546654 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1273,6 +1273,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + ptl = pmd_lock(mm, pmdp); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; diff --git a/mm/memory.c b/mm/memory.c index 3c50046..9df2d09 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3108,6 +3108,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + /* * The "pte" at this point cannot be used safely without * validation through pte_unmap_same(). It's of NUMA type but -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote: > > On 09.07.14 00:59, Stewart Smith wrote: > >Hi! > > > >Thanks for review, much appreciated! > > > >Alexander Graf writes: > >>On 08.07.14 07:06, Stewart Smith wrote: > >>>@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc) > >>> int i, need_vpa_update; > >>> int srcu_idx; > >>> struct kvm_vcpu *vcpus_to_update[threads_per_core]; > >>>+ phys_addr_t phy_addr, tmp; > >>Please put the variable declarations into the if () branch so that the > >>compiler can catch potential leaks :) > >ack. will fix. > > > >>>@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc) > >>> srcu_idx = srcu_read_lock(&vc->kvm->srcu); > >>>+ /* If we have a saved list of L2/L3, restore it */ > >>>+ if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) { > >>>+ phy_addr = virt_to_phys((void *)vc->mpp_buffer); > >>>+#if defined(CONFIG_PPC_4K_PAGES) > >>>+ phy_addr = (phy_addr + 8*4096) & ~(8*4096); > >>get_free_pages() is automatically aligned to the order, no? > >That's what Paul reckoned too, and then we've attempted to find anywhere > >that documents that behaviour. Happen to be able to point to docs/source > >that say this is part of API? > > Phew - it's probably buried somewhere. I could only find this > document saying that we always get order-aligned allocations: > > http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html > > Mel, do you happen to have any pointer to something that explicitly > (or even properly implicitly) says that get_free_pages() returns > order-aligned memory? > I did not read the whole thread so I lack context and will just answer this part. There is no guarantee that pages are returned in PFN order for multiple requests to the page allocator. This is the relevant comment in rmqueue_bulk /* * Split buddy pages returned by expand() are received here * in physical page order. The page is added to the callers and * list and the list head then moves forward. From the callers * perspective, the linked list is ordered by page number in * some conditions. This is useful for IO devices that can * merge IO requests if the physical pages are ordered * properly. */ It will probably be true early in the lifetime of the system but the milage will vary on systems with a lot of uptime. If you depend on this behaviour for correctness then you will have a bad day. High-order page requests to the page allocator are guaranteed to be in physical order. However, this does not apply to vmalloc() where allocations are only guaranteed to be virtually contiguous. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
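To make the two guarantees concrete, a minimal kernel-side sketch (assumes a context where sleeping allocations are allowed):

	/* A single order-4 request: 16 physically contiguous pages, and the
	 * block is aligned to its own size (64K with 4K pages). */
	unsigned long buf = __get_free_pages(GFP_KERNEL, 4);
	if (buf) {
		WARN_ON(buf & ((PAGE_SIZE << 4) - 1));	/* order-aligned */
		free_pages(buf, 4);
	}

	/* Sixteen separate order-0 requests: each page is valid on its own,
	 * but nothing guarantees they are adjacent or even in PFN order,
	 * especially on a system with a lot of uptime. */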
Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote: > > On 10.07.14 15:07, Mel Gorman wrote: > >On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote: > >>On 09.07.14 00:59, Stewart Smith wrote: > >>>Hi! > >>> > >>>Thanks for review, much appreciated! > >>> > >>>Alexander Graf writes: > >>>>On 08.07.14 07:06, Stewart Smith wrote: > >>>>>@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore > >>>>>*vc) > >>>>> int i, need_vpa_update; > >>>>> int srcu_idx; > >>>>> struct kvm_vcpu *vcpus_to_update[threads_per_core]; > >>>>>+phys_addr_t phy_addr, tmp; > >>>>Please put the variable declarations into the if () branch so that the > >>>>compiler can catch potential leaks :) > >>>ack. will fix. > >>> > >>>>>@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore > >>>>>*vc) > >>>>> srcu_idx = srcu_read_lock(&vc->kvm->srcu); > >>>>>+/* If we have a saved list of L2/L3, restore it */ > >>>>>+if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) { > >>>>>+phy_addr = virt_to_phys((void *)vc->mpp_buffer); > >>>>>+#if defined(CONFIG_PPC_4K_PAGES) > >>>>>+phy_addr = (phy_addr + 8*4096) & ~(8*4096); > >>>>get_free_pages() is automatically aligned to the order, no? > >>>That's what Paul reckoned too, and then we've attempted to find anywhere > >>>that documents that behaviour. Happen to be able to point to docs/source > >>>that say this is part of API? > >>Phew - it's probably buried somewhere. I could only find this > >>document saying that we always get order-aligned allocations: > >> > >>http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html > >> > >>Mel, do you happen to have any pointer to something that explicitly > >>(or even properly implicitly) says that get_free_pages() returns > >>order-aligned memory? > >> > >I did not read the whole thread so I lack context and will just answer > >this part. > > > >There is no guarantee that pages are returned in PFN order for multiple > >requests to the page allocator. This is the relevant comment in > >rmqueue_bulk > > > > /* > > * Split buddy pages returned by expand() are received here > > * in physical page order. The page is added to the callers > > and > > * list and the list head then moves forward. From the > > callers > > * perspective, the linked list is ordered by page number in > > * some conditions. This is useful for IO devices that can > > * merge IO requests if the physical pages are ordered > > * properly. > > */ > > > >It will probably be true early in the lifetime of the system but the milage > >will vary on systems with a lot of uptime. If you depend on this behaviour > >for correctness then you will have a bad day. > > > >High-order page requests to the page allocator are guaranteed to be in > >physical > >order. However, this does not apply to vmalloc() where allocations are > >only guaranteed to be virtually contiguous. > > Hrm, ok to be very concrete: > > Does __get_free_pages(..., 4); on a 4k page size system give me a > 64k aligned pointer? :) > Yes. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: mm: BUG in unmap_page_range
On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote: > > -#define pmd_mknonnuma pmd_mknonnuma > > -static inline pmd_t pmd_mknonnuma(pmd_t pmd) > > +/* > > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist > > + * which was inherited from x86. For the purposes of powerpc pte_basic_t is > > + * equivalent > > + */ > > +#define pteval_t pte_basic_t > > +#define pmdval_t pmd_t > > +static inline pteval_t pte_flags(pte_t pte) > > { > > - return pte_pmd(pte_mknonnuma(pmd_pte(pmd))); > > + return pte_val(pte) & PAGE_PROT_BITS; > > PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have > to check further to find out why the mask doesn't include > _PAGE_PRESENT. > Dumb of me, not sure how I managed that. For the purposes of what is required it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask that defines what bits are of interest to the generic helpers which is what this version attempts to do. It's not tested on powerpc at all unfortunately. ---8<--- mm: Remove misleading ARCH_USES_NUMA_PROT_NONE ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting fault scanner. This was found to be conceptually confusing with a lot of implicit assumptions and it was asked that an alternative be found. Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE bits and shrunk the maximum possible swap size but it did not go far enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA but the relics still exist. This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary duplication in powerpc vs the generic implementation by defining the types the core NUMA helpers expected to exist from x86 with their ppc64 equivalent. This necessitated that a PTE bit mask be created that identified the bits that distinguish present from NUMA pte entries but it is expected this will only differ between arches based on _PAGE_PROTNONE. The naming for the generic helpers was taken from x86 originally but ppc64 has types that are equivalent for the purposes of the helper so they are mapped instead of duplicating code. 
Signed-off-by: Mel Gorman --- arch/powerpc/include/asm/pgtable.h| 57 --- arch/powerpc/include/asm/pte-common.h | 5 +++ arch/x86/Kconfig | 1 - arch/x86/include/asm/pgtable_types.h | 7 + include/asm-generic/pgtable.h | 27 ++--- init/Kconfig | 11 --- 6 files changed, 33 insertions(+), 75 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index d98c1ec..beeb09e 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte){ return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING - static inline int pte_present(pte_t pte) { - return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA); + return pte_val(pte) & _PAGE_NUMA_MASK; } #define pte_present_nonuma pte_present_nonuma @@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte) return pte_val(pte) & (_PAGE_PRESENT); } -#define pte_numa pte_numa -static inline int pte_numa(pte_t pte) -{ - return (pte_val(pte) & - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA; -} - -#define pte_mknonnuma pte_mknonnuma -static inline pte_t pte_mknonnuma(pte_t pte) -{ - pte_val(pte) &= ~_PAGE_NUMA; - pte_val(pte) |= _PAGE_PRESENT | _PAGE_ACCESSED; - return pte; -} - -#define pte_mknuma pte_mknuma -static inline pte_t pte_mknuma(pte_t pte) -{ - /* -* We should not set _PAGE_NUMA on non present ptes. Also clear the -* present bit so that hash_page will return 1 and we collect this -* as numa fault. -*/ - if (pte_present(pte)) { - pte_val(pte) |= _PAGE_NUMA; - pte_val(pte) &= ~_PAGE_PRESENT; - } else - VM_BUG_ON(1); - return pte; -} - #define ptep_set_numa ptep_set_numa static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, return; } -#define pmd_numa pmd_numa -static inline int pmd_numa(pmd_t pmd) -{ - return pte_numa(pmd_pte(pmd)); -} - #define pmdp_set_numa pmdp_set_numa
Re: [PATCH -V3] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
On Fri, Dec 06, 2013 at 12:08:22AM +0530, Aneesh Kumar K.V wrote: > From: "Aneesh Kumar K.V" > > change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE. > On archs like ppc64 that don't use _PAGE_PROTNONE and also have > a separate page table outside linux pagetable, we just need to > make sure that when calling change_prot_numa we flush the > hardware page table entry so that next page access result in a numa > fault. > > We still need to make sure we use the numa faulting logic only > when CONFIG_NUMA_BALANCING is set. This implies the migrate-on-fault > (Lazy migration) via mbind will only work if CONFIG_NUMA_BALANCING > is set. > > Signed-off-by: Aneesh Kumar K.V You're right on that there is no direct dependance on numa balancing and use of prot_none. The BUILD_BUG_ON was to flag very clearly that arches wanting to support automatic NUMA balancing must ensure such things as o _PAGE_NUMA is defined o setting _PAGE_NUMA traps a fault and the fault can be uniquely identified as being a numa hinting fault o that pte_present still returns true for pte_numa pages even though the underlying present bit may be cleared. Otherwise operations like following and copying ptes will get confused o shortly, arches will also need to avoid taking references on pte_numa pages in get_user_pages to account for hinting faults properly I guess the _PAGE_NUMA parts will already be caught by other checks and the rest will fall out during testing so it's ok to remove. Acked-by: Mel Gorman -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/3] mm: dirty accountable change only apply to non prot numa case
On Tue, Feb 11, 2014 at 04:04:54PM +0530, Aneesh Kumar K.V wrote: > From: "Aneesh Kumar K.V" > > So move it within the if loop > > Signed-off-by: Aneesh Kumar K.V Acked-by: Mel Gorman -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/3] mm: Use ptep/pmdp_set_numa for updating _PAGE_NUMA bit
On Tue, Feb 11, 2014 at 04:04:55PM +0530, Aneesh Kumar K.V wrote: > From: "Aneesh Kumar K.V" > > Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions. ppc64 also > doesn't implement > flush_tlb_range. ppc64 require the tlb flushing to be batched within ptl > locks. The reason > to do that is to ensure that the hash page table is in sync with linux page > table. > We track the hpte index in linux pte and if we clear them without flushing > hash and drop the > ptl lock, we can have another cpu update the pte and can end up with double > hash. We also want > to keep set_pte_at simpler by not requiring them to do hash flush for > performance reason. > Hence cannot use them while updating _PAGE_NUMA bit. Add new functions for > marking pte/pmd numa > > Signed-off-by: Aneesh Kumar K.V Acked-by: Mel Gorman -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: mm: Add new set flag argument to pte/pmd update function
On Tue, Feb 11, 2014 at 04:04:53PM +0530, Aneesh Kumar K.V wrote: > From: "Aneesh Kumar K.V" > > We will use this later to set the _PAGE_NUMA bit. > > Signed-off-by: Aneesh Kumar K.V Acked-by: Mel Gorman -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap
On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote: > > On a powerpc machine with CONFIG_NUMA_BALANCING=y and CONFIG_SPARSEMEM_VMEMMAP > not enabled, kernel panics. > This? ---8<--- sched: numa: Do not group tasks if last cpu is not set On configurations with vmemmap disabled, the following partial is observed [ 299.268623] CPU: 47 PID: 4366 Comm: numa01 Tainted: G D 3.14.0-rc5-vanilla #4 [ 299.278295] Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012 [ 299.287452] task: 880c670bc110 ti: 880c66db6000 task.ti: 880c66db6000 [ 299.296642] RIP: 0010:[] [] task_numa_fault+0x50f/0x8b0 [ 299.306778] RSP: :880c66db7670 EFLAGS: 00010282 [ 299.313769] RAX: 33ee RBX: 880c670bc110 RCX: 0001 [ 299.322590] RDX: 0001 RSI: 0003 RDI: [ 299.331394] RBP: 880c66db76c8 R08: R09: 000166b0 [ 299.340203] R10: 880c7ffecd80 R11: R12: 01ff [ 299.348989] R13: 00ff R14: R15: 0003 [ 299.357763] FS: 7f5a60a3f700() GS:88106f2c() knlGS: [ 299.367510] CS: 0010 DS: ES: CR0: 80050033 [ 299.374913] CR2: 37da CR3: 000868ed4000 CR4: 07e0 [ 299.383726] Stack: [ 299.387414] 0003 00010003 00010003 [ 299.396564] 811888f4 880c66db7698 0003 880c7f9b3ac0 [ 299.405730] 880c66ccebd8 0003 880c66db7718 [ 299.414907] Call Trace: [ 299.419095] [] ? migrate_misplaced_page+0xb4/0x140 [ 299.427301] [] do_numa_page+0x18c/0x1f0 [ 299.434554] [] handle_mm_fault+0x617/0xf70 [ ..] SNIPPED The oops occurs in task_numa_group looking up cpu_rq(LAST__CPU_MASK). The bug exists for all configurations but will manifest differently. On vmemmap configurations, it looks up garbage and on !vmemmap configuraitons it will oops. This patch adds the necessary check and also fixes the type for LAST__PID_MASK and LAST__CPU_MASK which are currently signed instead of unsigned integers. Signed-off-by: Mel Gorman Cc: sta...@vger.kernel.org diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index da52366..6f661d9 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -63,10 +63,10 @@ #ifdef CONFIG_NUMA_BALANCING #define LAST__PID_SHIFT 8 -#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1) +#define LAST__PID_MASK ((1UL << LAST__PID_SHIFT)-1) #define LAST__CPU_SHIFT NR_CPUS_BITS -#define LAST__CPU_MASK ((1 << LAST__CPU_SHIFT)-1) +#define LAST__CPU_MASK ((1UL << LAST__CPU_SHIFT)-1) #define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT) #else diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7815709..b44a8b1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1463,6 +1463,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, int cpu = cpupid_to_cpu(cpupid); int i; + if (unlikely(cpu == LAST__CPU_MASK && !cpu_online(cpu))) + return; + if (unlikely(!p->numa_group)) { unsigned int size = sizeof(struct numa_group) + 2*nr_node_ids*sizeof(unsigned long); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
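The defensive decode in one place, as an illustrative fragment mirroring the check added above; the surrounding task_numa_group() code is omitted.

	int cpu = cpupid_to_cpu(cpupid);

	/* A cpupid that was never set decodes to LAST__CPU_MASK; indexing
	 * cpu_rq() with it reads garbage (or oopses without vmemmap), so
	 * bail out before dereferencing anything. */
	if (unlikely(cpu == LAST__CPU_MASK && !cpu_online(cpu)))
		return;
	/* cpu_rq(cpu) is now safe to use */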
Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap
On Tue, Mar 04, 2014 at 12:45:19AM +0530, Aneesh Kumar K.V wrote: > Mel Gorman writes: > > > On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote: > >> > >> On a powerpc machine with CONFIG_NUMA_BALANCING=y and > >> CONFIG_SPARSEMEM_VMEMMAP > >> not enabled, kernel panics. > >> > > > > This? > > This one fixed that crash on ppc64 > > http://mid.gmane.org/1393578122-6500-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com > Thanks. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: hugetlb: ensure hugepage access is denied if hugepages are not supported
On Thu, Apr 03, 2014 at 04:14:13PM -0700, Nishanth Aravamudan wrote: > In KVM guests on Power, in a guest not backed by hugepages, we see the > following: > > AnonHugePages: 0 kB > HugePages_Total: 0 > HugePages_Free:0 > HugePages_Rsvd:0 > HugePages_Surp:0 > Hugepagesize: 64 kB > > HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages > are not supported at boot-time, but this is only checked in > hugetlb_init(). Extract the check to a helper function, and use it in a > few relevant places. > > This does make hugetlbfs not supported (not registered at all) in this > environment. I believe this is fine, as there are no valid hugepages and > that won't change at runtime. > > Signed-off-by: Nishanth Aravamudan Acked-by: Mel Gorman This patch looks ok but the changelog misses important information from the original report which is probably why it fell through the cracks. Add the fact that you encountered a problem during mount to the changelog and resend it directly to Andrew. This part from your original report; Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is properly not correctly setting itself up in this state?: Unable to handle kernel paging request for data at address 0x0031 Faulting instruction address: 0xc0245710 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries It probably slipped through the cracks because from the changelog this looks like a minor formatting issue and not a functional fix. > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index d19b30a..cc8fcc7 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -1017,6 +1017,11 @@ static int __init init_hugetlbfs_fs(void) > int error; > int i; > > + if (!hugepages_supported()) { > + printk(KERN_ERR "hugetlbfs: Disabling because there are no > supported hugepage sizes\n"); > + return -ENOTSUPP; > + } > + > error = bdi_init(&hugetlbfs_backing_dev_info); > if (error) > return error; KERN_ERR feels like overkill for this type of issue. KERN_INFO? -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
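The mail does not show the helper itself, but based on the description it presumably wraps the existing boot-time check, roughly along these lines (sketch only, not the submitted code):

	static inline bool hugepages_supported(void)
	{
		/* HPAGE_SHIFT == 0 means no huge page size was found at boot */
		return HPAGE_SHIFT != 0;
	}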
[RFC PATCH 0/2] Automatic NUMA balancing and PROT_NONE handling followup
Dave Chinner reported a problem due to excessive NUMA balancing activity and bisected it. These are two patches that address two major issues with that series. The first patch is almost certainly unrelated to what he saw due to the fact that his vmstats showed no huge page activity but the fix is important. The second patch restores performance of one benchmark to levels similar to 3.19-vanilla but it still has to be tested on his workload. While I have a test configuration for his workload, I don't have either the KVM setup or suitable storage to test against. It also needs to be reviewed and tested on ppc64. arch/powerpc/include/asm/pgtable-ppc64.h | 16 arch/x86/include/asm/pgtable.h | 14 ++ include/asm-generic/pgtable.h| 19 +++ mm/huge_memory.c | 23 ++- mm/mprotect.c| 5 + 5 files changed, 72 insertions(+), 5 deletions(-) -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] mm: thp: Return the correct value for change_huge_pmd
The wrong value is being returned by change_huge_pmd since commit 10c1045f28e8 ("mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries") which allows a fallthrough that tries to adjust non-existent PTEs. This patch corrects it. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fc00c8cb5a82..194c0f019774 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1482,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + ret = 1; /* * Avoid trapping faults against the zero page. The read-only @@ -1490,11 +1491,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, */ if (prot_numa && is_huge_zero_pmd(*pmd)) { spin_unlock(ptl); - return 0; + return ret; } if (!prot_numa || !pmd_protnone(*pmd)) { - ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); ret = HPAGE_PMD_NR; -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
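The return-value contract this restores, seen from the caller's side in change_pmd_range(); a fragment mirroring the hunk in the earlier huge-zero-page patch.

	int nr_ptes = change_huge_pmd(vma, pmd, addr, newprot, prot_numa);

	if (nr_ptes) {
		/* 1: the huge pmd was locked but left untouched (e.g. the huge
		 *    zero page under prot_numa) - nothing to flush or count.
		 * HPAGE_PMD_NR: protections changed, account the pages and a
		 *    TLB flush is necessary. */
		if (nr_ptes == HPAGE_PMD_NR) {
			pages += HPAGE_PMD_NR;
			nr_huge_updates++;
		}
		/* The huge pmd was handled either way: do not fall through to
		 * the pte-level loop for this range. */
		continue;
	}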
[PATCH 2/2] mm: numa: Do not clear PTEs or PMDs for NUMA hinting faults
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226 Across the board the 4.0-rc1 numbers are much slower, and the degradation is far worse when using the large memory footprint configs. Perf points straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config: - 56.07%56.07% [kernel][k] default_send_IPI_mask_sequence_phys - default_send_IPI_mask_sequence_phys - 99.99% physflat_send_IPI_mask - 99.37% native_send_call_func_ipi smp_call_function_many - native_flush_tlb_others - 99.85% flush_tlb_page ptep_clear_flush try_to_unmap_one rmap_walk try_to_unmap migrate_pages migrate_misplaced_page - handle_mm_fault - 99.73% __do_page_fault trace_do_page_fault do_async_page_fault + async_page_fault 0.63% native_send_call_func_single_ipi generic_exec_single smp_call_function_single This was bisected to commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations") which clears PTEs and PMDs to make them PROT_NONE. This is tidy but tests on some benchmarks indicate that there are many more hinting faults trapped resulting in excessive migration. This is the result for the old autonuma benchmark for example. autonumabench 4.0.0-rc1 4.0.0-rc1 3.19.0 vanillanoclear-v1 vanilla Time User-NUMA01 32883.59 ( 0.00%)27401.21 ( 16.67%) 25695.96 ( 21.86%) Time User-NUMA01_THEADLOCAL 17453.20 ( 0.00%)17491.98 ( -0.22%) 17404.36 ( 0.28%) Time User-NUMA02 2063.70 ( 0.00%) 2059.94 ( 0.18%) 2037.65 ( 1.26%) Time User-NUMA02_SMT983.70 ( 0.00%) 967.95 ( 1.60%) 981.02 ( 0.27%) Time System-NUMA01 602.44 ( 0.00%) 182.16 ( 69.76%) 194.70 ( 67.68%) Time System-NUMA01_THEADLOCAL78.10 ( 0.00%) 84.84 ( -8.63%) 98.52 (-26.15%) Time System-NUMA026.47 ( 0.00%)9.74 (-50.54%) 9.28 (-43.43%) Time System-NUMA02_SMT5.06 ( 0.00%)3.97 ( 21.54%) 3.79 ( 25.10%) Time Elapsed-NUMA01 755.96 ( 0.00%) 602.20 ( 20.34%) 558.84 ( 26.08%) Time Elapsed-NUMA01_THEADLOCAL 382.22 ( 0.00%) 384.98 ( -0.72%) 382.54 ( -0.08%) Time Elapsed-NUMA02 49.38 ( 0.00%) 49.23 ( 0.30%) 49.83 ( -0.91%) Time Elapsed-NUMA02_SMT 47.70 ( 0.00%) 46.82 ( 1.84%) 46.59 ( 2.33%) Time CPU-NUMA014429.00 ( 0.00%) 4580.00 ( -3.41%) 4632.00 ( -4.58%) Time CPU-NUMA01_THEADLOCAL 4586.00 ( 0.00%) 4565.00 ( 0.46%) 4575.00 ( 0.24%) Time CPU-NUMA024191.00 ( 0.00%) 4203.00 ( -0.29%) 4107.00 ( 2.00%) Time CPU-NUMA02_SMT2072.00 ( 0.00%) 2075.00 ( -0.14%) 2113.00 ( -1.98%) Note the system CPU usage with the patch applied and how it's similar to 3.19-vanilla. The NUMA hinting activity is also restored to similar levels. 4.0.0-rc1 4.0.0-rc1 3.19.0 vanillanoclear-v1r13 vanilla NUMA alloc hit 1437560 1241466 1202922 NUMA alloc miss 0 0 0 NUMA interleave hit 0 0 0 NUMA alloc local 1436781 1240849 1200683 NUMA base PTE updates304513172 223926293 222840103 NUMA huge PMD updates 594467 437025 434894 NUMA page range updates 608880276 447683093 445505831 NUMA hint faults733491 598990 601358 NUMA hint local faults 511530 314936 371571 NUMA hint local percent 69 52 61 NUMA pages migrated 26366701 5424102 7073177 Signed-off-by: Mel Gorman --- arch/powerpc/include/asm/pgtable-ppc64.h | 16 arch/x86/include/asm/pgtable.h | 14 ++ include/asm-generic/pgtable.h| 19 +++ mm/huge_memory.c | 19 --- mm/mprotect.c| 5 + 5 files changed, 70 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 43e6ad424c7f..bb654c192028 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/in
[PATCH 2/4] mm: numa: Remove migrate_ratelimited
This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE scanning even if migrate rate limited") so remove it. Signed-off-by: Mel Gorman --- include/linux/migrate.h | 5 - mm/migrate.c| 20 2 files changed, 25 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 78baed5f2952..cac1c0904d5f 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -69,7 +69,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, extern bool pmd_trans_migrating(pmd_t pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); -extern bool migrate_ratelimited(int node); #else static inline bool pmd_trans_migrating(pmd_t pmd) { @@ -80,10 +79,6 @@ static inline int migrate_misplaced_page(struct page *page, { return -EAGAIN; /* can't migrate now */ } -static inline bool migrate_ratelimited(int node) -{ - return false; -} #endif /* CONFIG_NUMA_BALANCING */ #if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) diff --git a/mm/migrate.c b/mm/migrate.c index 85e042686031..6aa9a4222ea9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1554,30 +1554,10 @@ static struct page *alloc_misplaced_dst_page(struct page *page, * page migration rate limiting control. * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs * window of time. Default here says do not migrate more than 1280M per second. - * If a node is rate-limited then PTE NUMA updates are also rate-limited. However - * as it is faults that reset the window, pte updates will happen unconditionally - * if there has not been a fault since @pteupdate_interval_millisecs after the - * throttle window closed. */ static unsigned int migrate_interval_millisecs __read_mostly = 100; -static unsigned int pteupdate_interval_millisecs __read_mostly = 1000; static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT); -/* Returns true if NUMA migration is currently rate limited */ -bool migrate_ratelimited(int node) -{ - pg_data_t *pgdat = NODE_DATA(node); - - if (time_after(jiffies, pgdat->numabalancing_migrate_next_window + - msecs_to_jiffies(pteupdate_interval_millisecs))) - return false; - - if (pgdat->numabalancing_migrate_nr_pages < ratelimit_pages) - return false; - - return true; -} - /* Returns true if the node is migrate rate-limited after the update */ static bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages) -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
-xfsrepair 13.98 ( 0.00%) 19.94 (-42.60%)5.69 ( 59.31%) CoeffVar real-fsmark 0.56 ( 0.00%)0.34 ( 38.74%)0.12 ( 77.97%) CoeffVar syst-fsmark 0.40 ( 0.00%)0.38 ( 6.57%)0.24 ( 39.93%) CoeffVar real-xfsrepair2.20 ( 0.00%)0.74 ( 66.22%)1.24 ( 43.47%) CoeffVar syst-xfsrepair2.69 ( 0.00%)7.08 (-163.23%) 2.80 ( -4.23%) Max real-fsmark1171.98 ( 0.00%) 1159.25 ( 1.09%) 1167.96 ( 0.34%) Max syst-fsmark4033.84 ( 0.00%) 4024.53 ( 0.23%) 4039.20 ( -0.13%) Max real-xfsrepair 523.40 ( 0.00%) 464.40 ( 11.27%) 455.42 ( 12.99%) Max syst-xfsrepair 533.37 ( 0.00%) 309.38 ( 42.00%) 207.94 ( 61.01%) The key point is that system CPU usage for xfsrepair (syst-xfsrepair) is almost cut in half. It's still not as low as 3.19-vanilla but it's much closer 4.0.0-rc1 4.0.0-rc1 3.19.0 vanilla slowscan-v2 vanilla NUMA alloc hit 146138883 121929782 104019526 NUMA alloc miss 1314632811456356 7806370 NUMA interleave hit 0 0 0 NUMA alloc local 146060848 121865921 103953085 NUMA base PTE updates242201535 117237258 216624143 NUMA huge PMD updates 113270 52121 127782 NUMA page range updates 300195775 143923210 282048527 NUMA hint faults 18038802587299060 147235021 NUMA hint local faults727845323293925861866265 NUMA hint local percent 40 37 42 NUMA pages migrated 711752624139530223237799 Note the big differences in faults trapped and pages migrated. 3.19-vanilla still migrated fewer pages but if necessary the threshold at which we start throttling migrations can be lowered. Signed-off-by: Mel Gorman --- include/linux/sched.h | 9 + kernel/sched/fair.c | 8 ++-- mm/huge_memory.c | 3 ++- mm/memory.c | 3 ++- 4 files changed, 15 insertions(+), 8 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d77432e14ff..a419b65770d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1625,11 +1625,11 @@ struct task_struct { /* * numa_faults_locality tracks if faults recorded during the last -* scan window were remote/local. The task scan period is adapted -* based on the locality of the faults with different weights -* depending on whether they were shared or private faults +* scan window were remote/local or failed to migrate. The task scan +* period is adapted based on the locality of the faults with different +* weights depending on whether they were shared or private faults */ - unsigned long numa_faults_locality[2]; + unsigned long numa_faults_locality[3]; unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ @@ -1719,6 +1719,7 @@ struct task_struct { #define TNF_NO_GROUP 0x02 #define TNF_SHARED 0x04 #define TNF_FAULT_LOCAL0x08 +#define TNF_MIGRATE_FAIL 0x10 #ifdef CONFIG_NUMA_BALANCING extern void task_numa_fault(int last_node, int node, int pages, int flags); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7ce18f3c097a..bcfe32088b37 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct *p, /* * If there were no record hinting faults then either the task is * completely idle or all activity is areas that are not of interest -* to automatic numa balancing. Scan slower +* to automatic numa balancing. Related to that, if there were failed +* migration then it implies we are migrating too quickly or the local +* node is overloaded. 
In either case, scan slower */ - if (local + shared == 0) { + if (local + shared == 0 || p->numa_faults_locality[2]) { p->numa_scan_period = min(p->numa_scan_period_max, p->numa_scan_period << 1); @@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) if (migrated) p->numa_pages_migrated += pages; + if (flags & TNF_MIGRATE_FAIL) + p->numa_faults_locality[2] += pages; p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ae13ad31e113..f508fda07d34 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1353,7 +1353,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, if (migrated) {
[PATCH 3/4] mm: numa: Mark huge PTEs young when clearing NUMA hinting faults
Base PTEs are marked young when the NUMA hinting information is cleared, but the same does not happen for huge pages; this patch addresses that. Note that migrated pages are not marked young as the base page migration code does not assume that migrated pages have been referenced. This could be addressed but is beyond the scope of this series, which is aimed at Dave Chinner's shrink workload that is unlikely to be affected by this issue.

Signed-off-by: Mel Gorman
---
 mm/huge_memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 194c0f019774..ae13ad31e113 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1359,6 +1359,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 clear_pmdnuma:
 	BUG_ON(!PageLocked(page));
 	pmd = pmd_modify(pmd, vma->vm_page_prot);
+	pmd = pmd_mkyoung(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
--
2.1.2
[RFC PATCH 0/4] Automatic NUMA balancing and PROT_NONE handling followup v2r8
Dave Chinner reported a problem due to excessive NUMA balancing activity and bisected it. The first patch in this series corrects a major problem that is unlikely to affect Dave but is still serious. Patch 2 is a minor cleanup that was spotted while looking at scan rate control. Patch 3 is minor and unlikely to make a difference but fixes an inconsistency between base and THP handling. Patch 4 is the important one: it slows PTE scan updates if migrations are failing or throttled. Details of the performance impact on local tests are included in that patch.

 include/linux/migrate.h |  5 -
 include/linux/sched.h   |  9 +
 kernel/sched/fair.c     |  8 ++--
 mm/huge_memory.c        |  8 +---
 mm/memory.c             |  3 ++-
 mm/migrate.c            | 20
 6 files changed, 18 insertions(+), 35 deletions(-)

--
2.1.2
[PATCH 1/4] mm: thp: Return the correct value for change_huge_pmd
The wrong value is being returned by change_huge_pmd since commit 10c1045f28e8 ("mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries") which allows a fallthrough that tries to adjust non-existent PTEs. This patch corrects it. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fc00c8cb5a82..194c0f019774 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1482,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + ret = 1; /* * Avoid trapping faults against the zero page. The read-only @@ -1490,11 +1491,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, */ if (prot_numa && is_huge_zero_pmd(*pmd)) { spin_unlock(ptl); - return 0; + return ret; } if (!prot_numa || !pmd_protnone(*pmd)) { - ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); ret = HPAGE_PMD_NR; -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sat, Mar 07, 2015 at 05:36:58PM +0100, Ingo Molnar wrote: > > * Mel Gorman wrote: > > > Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226 > > > > Across the board the 4.0-rc1 numbers are much slower, and the > > degradation is far worse when using the large memory footprint > > configs. Perf points straight at the cause - this is from 4.0-rc1 on > > the "-o bhash=101073" config: > > > > [...] > > >4.0.0-rc1 4.0.0-rc1 3.19.0 > > vanilla slowscan-v2 vanilla > > User53384.2956093.1146119.12 > > System692.14 311.64 306.41 > > Elapsed 1236.87 1328.61 1039.88 > > > > Note that the system CPU usage is now similar to 3.19-vanilla. > > Similar, but still worse, and also the elapsed time is still much > worse. User time is much higher, although it's the same amount of work > done on every kernel, right? > Elapsed time is primarily worse on one benchmark -- numa01 which is an adverse workload. The user time differences are also dominated by that benchmark 4.0.0-rc1 4.0.0-rc1 3.19.0 vanilla slowscan-v2r7 vanilla Time User-NUMA01 32883.59 ( 0.00%)35288.00 ( -7.31%) 25695.96 ( 21.86%) Time User-NUMA01_THEADLOCAL 17453.20 ( 0.00%)17765.79 ( -1.79%) 17404.36 ( 0.28%) Time User-NUMA02 2063.70 ( 0.00%) 2063.22 ( 0.02%) 2037.65 ( 1.26%) Time User-NUMA02_SMT983.70 ( 0.00%) 976.01 ( 0.78%) 981.02 ( 0.27%) > > I also tested with a workload very similar to Dave's. The machine > > configuration and storage is completely different so it's not an > > equivalent test unfortunately. It's reporting the elapsed time and > > CPU time while fsmark is running to create the inodes and when > > runnig xfsrepair afterwards > > > > xfsrepair > > 4.0.0-rc1 4.0.0-rc1 > >3.19.0 > > vanilla slowscan-v2 > > vanilla > > Min real-fsmark1157.41 ( 0.00%) 1150.38 ( 0.61%) > > 1164.44 ( -0.61%) > > Min syst-fsmark3998.06 ( 0.00%) 3988.42 ( 0.24%) > > 4016.12 ( -0.45%) > > Min real-xfsrepair 497.64 ( 0.00%) 456.87 ( 8.19%) > > 442.64 ( 11.05%) > > Min syst-xfsrepair 500.61 ( 0.00%) 263.41 ( 47.38%) > > 194.97 ( 61.05%) > > Ameanreal-fsmark1166.63 ( 0.00%) 1155.97 ( 0.91%) > > 1166.28 ( 0.03%) > > Ameansyst-fsmark4020.94 ( 0.00%) 4004.19 ( 0.42%) > > 4025.87 ( -0.12%) > > Ameanreal-xfsrepair 507.85 ( 0.00%) 459.58 ( 9.50%) > > 447.66 ( 11.85%) > > Ameansyst-xfsrepair 519.88 ( 0.00%) 281.63 ( 45.83%) > > 202.93 ( 60.97%) > > Stddev real-fsmark 6.55 ( 0.00%)3.97 ( 39.30%) > > 1.44 ( 77.98%) > > Stddev syst-fsmark 16.22 ( 0.00%) 15.09 ( 6.96%) > > 9.76 ( 39.86%) > > Stddev real-xfsrepair 11.17 ( 0.00%)3.41 ( 69.43%) > > 5.57 ( 50.17%) > > Stddev syst-xfsrepair 13.98 ( 0.00%) 19.94 (-42.60%) > > 5.69 ( 59.31%) > > CoeffVar real-fsmark 0.56 ( 0.00%)0.34 ( 38.74%) > > 0.12 ( 77.97%) > > CoeffVar syst-fsmark 0.40 ( 0.00%)0.38 ( 6.57%) > > 0.24 ( 39.93%) > > CoeffVar real-xfsrepair2.20 ( 0.00%)0.74 ( 66.22%) > > 1.24 ( 43.47%) > > CoeffVar syst-xfsrepair2.69 ( 0.00%)7.08 (-163.23%) > > 2.80 ( -4.23%) > > Max real-fsmark1171.98 ( 0.00%) 1159.25 ( 1.09%) > > 1167.96 ( 0.34%) > > Max syst-fsmark4033.84 ( 0.00%) 4024.53 ( 0.23%) > > 4039.20 ( -0.13%) > > Max real-xfsrepair 523.40 ( 0.00%) 464.40 ( 11.27%) > > 455.42 ( 12.99%) > > Max syst-xfsrepair 533.37 ( 0.00%) 309.38 ( 42.00%) > > 207.94 ( 61.01%) > > > > The key point is that system CPU usage for xfsrepair (syst-xfsrepair) > > is almost cut in half. It's still not as low as 3.19-vanilla but it's > > much closer > > > > 4.0.0-rc1 4.0.0-rc1 3.19.0 > >vanilla slowscan
Re: [PATCH 1/4] mm: thp: Return the correct value for change_huge_pmd
On Sat, Mar 07, 2015 at 12:31:03PM -0800, Linus Torvalds wrote: > On Sat, Mar 7, 2015 at 7:20 AM, Mel Gorman wrote: > > > > if (!prot_numa || !pmd_protnone(*pmd)) { > > - ret = 1; > > entry = pmdp_get_and_clear_notify(mm, addr, pmd); > > entry = pmd_modify(entry, newprot); > > ret = HPAGE_PMD_NR; > > Hmm. I know I acked this already, but the return value - which correct > - is still potentially something we could improve upon. > > In particular, we don't need to flush the TLB's if the old entry was > not present. Sadly, we don't have a helper function for that. > > But the code *could* do something like > > entry = pmdp_get_and_clear_notify(mm, addr, pmd); > ret = pmd_tlb_cacheable(entry) ? HPAGE_PMD_NR : 1; > entry = pmd_modify(entry, newprot); > > where pmd_tlb_cacheable() on x86 would test if _PAGE_PRESENT (bit #0) is set. > I agree with you in principle. pmd_tlb_cacheable looks and sounds very similar to pte_accessible(). > In particular, that would mean that as we change *from* a protnone > (whether NUMA or really protnone) we wouldn't need to flush the TLB. > > In fact, we could make it even more aggressive: it's not just an old > non-present TLB entry that doesn't need flushing - we can avoid the > flushing whenever we strictly increase the access rigths. So we could > have something that takes the old entry _and_ the new protections into > account, and avoids the TLB flush if the new entry is strictly more > permissive. > > This doesn't explain the extra TLB flushes Dave sees, though, because > the old code didn't make those kinds of optimizations either. But > maybe something like this is worth doing. > I think it is worth doing although it'll be after LSF/MM before I do it. I severely doubt this is what Dave is seeing because the vmstats indicated there was no THP activity but it's still a good idea. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
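For illustration, the helper Linus sketches above could look something like this on x86. The name pmd_tlb_cacheable() is only his suggestion and no such helper exists at this point, so treat it as a hypothetical sketch:

/* Hypothetical helper as suggested above: only an entry that was
 * present can have been cached by the TLB, so only then is a remote
 * flush needed. Similar in spirit to pte_accessible(). */
static inline bool pmd_tlb_cacheable(pmd_t pmd)
{
	return pmd_flags(pmd) & _PAGE_PRESENT;
}

The call site in change_huge_pmd() would then compute the return value as "ret = pmd_tlb_cacheable(entry) ? HPAGE_PMD_NR : 1", as in the quoted snippet.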
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 08, 2015 at 11:02:23AM +0100, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar wrote: > > > > > > And the patch Dave bisected to is a relatively simple patch. Why > > > not simply revert it to see whether that cures much of the > > > problem? > > > > So the problem with that is that "pmd_set_numa()" and friends simply > > no longer exist. So we can't just revert that one patch, it's the > > whole series, and the whole point of the series. > > Yeah. > > > What confuses me is that the only real change that I can see in that > > patch is the change to "change_huge_pmd()". Everything else is > > pretty much a 100% equivalent transformation, afaik. Of course, I > > may be wrong about that, and missing something silly. > > Well, there's a difference in what we write to the pte: > > #define _PAGE_BIT_NUMA (_PAGE_BIT_GLOBAL+1) > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > > and our expectation was that the two should be equivalent methods from > the POV of the NUMA balancing code, right? > Functionally yes but performance-wise no. We are now using the global bit for NUMA faults at the very least. > > And the changes to "change_huge_pmd()" were basically re-done > > differently by subsequent patches anyway. > > > > The *only* change I see remaining is that change_huge_pmd() now does > > > >entry = pmdp_get_and_clear_notify(mm, addr, pmd); > >entry = pmd_modify(entry, newprot); > >set_pmd_at(mm, addr, pmd, entry); > > > > for all changes. It used to do that "pmdp_set_numa()" for the > > prot_numa case, which did just > > > >pmd_t pmd = *pmdp; > >pmd = pmd_mknuma(pmd); > >set_pmd_at(mm, addr, pmdp, pmd); > > > > instead. > > > > I don't like the old pmdp_set_numa() because it can drop dirty bits, > > so I think the old code was actively buggy. > > Could we, as a silly testing hack not to be applied, write a > hack-patch that re-introduces the racy way of setting the NUMA bit, to > confirm that it is indeed this difference that changes pte visibility > across CPUs enough to create so many more faults? > This was already done and tested by Dave but while it helped, it was not enough. As the approach was inherently unsafe it was dropped and the throttling approach taken. However, the fact it made little difference may indicate that this is somehow related to the global bit being used. > Because if the answer is 'yes', then we can safely say: 'we regressed > performance because correctness [not dropping dirty bits] comes before > performance'. > > If the answer is 'no', then we still have a mystery (and a regression) > to track down. > > As a second hack (not to be applied), could we change: > > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > > to: > > #define _PAGE_BIT_PROTNONE (_PAGE_BIT_GLOBAL+1) > In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it can be done. > to double check that the position of the bit does not matter? > It's worth checking in case it's a case of how the global bit is treated. However, note that Dave is currently travelling for LSF/MM in Boston and there is a chance he cannot test this week at all. I'm just after landing in the hotel myself. I'll try find time during during one of the breaks tomorrow but if the wireless is too crap then accessing the test machine remotely might be an issue. > I don't think we've exhaused all avenues of analysis here. > True. 
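For reference, the bit reuse being discussed is visible in how x86 decodes a PROT_NONE entry. Roughly, simplified from the x86 pgtable headers of this era:

/* A PROT_NONE/NUMA hinting entry has the (reused) global bit set while
 * _PAGE_PRESENT is clear, so it traps a fault but the rest of the PTE
 * contents are preserved. */
static inline int pte_protnone(pte_t pte)
{
	return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
		== _PAGE_PROTNONE;
}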
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 08, 2015 at 08:40:25PM +, Mel Gorman wrote: > > Because if the answer is 'yes', then we can safely say: 'we regressed > > performance because correctness [not dropping dirty bits] comes before > > performance'. > > > > If the answer is 'no', then we still have a mystery (and a regression) > > to track down. > > > > As a second hack (not to be applied), could we change: > > > > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > > > > to: > > > > #define _PAGE_BIT_PROTNONE (_PAGE_BIT_GLOBAL+1) > > > > In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating > as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it > can be done. > More importantily, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just updating the value should crash. For the purposes of testing the idea, I thought the straight-forward option was to break soft dirty page tracking and steal their bit for testing (patch below). Took most of the day to get access to the test machine so tests are not long running and only the autonuma one has completed; autonumabench 3.19.0 4.0.0-rc1 4.0.0-rc1 4.0.0-rc1 vanilla vanilla slowscan-v2r7protnone-v3 Time User-NUMA01 25695.96 ( 0.00%)32883.59 (-27.97%) 35288.00 (-37.33%)35236.21 (-37.13%) Time User-NUMA01_THEADLOCAL 17404.36 ( 0.00%)17453.20 ( -0.28%) 17765.79 ( -2.08%)17590.10 ( -1.07%) Time User-NUMA02 2037.65 ( 0.00%) 2063.70 ( -1.28%) 2063.22 ( -1.25%) 2072.95 ( -1.73%) Time User-NUMA02_SMT981.02 ( 0.00%) 983.70 ( -0.27%) 976.01 ( 0.51%) 983.42 ( -0.24%) Time System-NUMA01 194.70 ( 0.00%) 602.44 (-209.42%) 209.42 ( -7.56%) 737.36 (-278.72%) Time System-NUMA01_THEADLOCAL98.52 ( 0.00%) 78.10 ( 20.73%) 92.70 ( 5.91%) 80.69 ( 18.10%) Time System-NUMA029.28 ( 0.00%)6.47 ( 30.28%) 6.06 ( 34.70%)6.63 ( 28.56%) Time System-NUMA02_SMT3.79 ( 0.00%)5.06 (-33.51%) 3.39 ( 10.55%)3.60 ( 5.01%) Time Elapsed-NUMA01 558.84 ( 0.00%) 755.96 (-35.27%) 833.63 (-49.17%) 804.50 (-43.96%) Time Elapsed-NUMA01_THEADLOCAL 382.54 ( 0.00%) 382.22 ( 0.08%) 395.45 ( -3.37%) 388.12 ( -1.46%) Time Elapsed-NUMA02 49.83 ( 0.00%) 49.38 ( 0.90%) 50.21 ( -0.76%) 48.99 ( 1.69%) Time Elapsed-NUMA02_SMT 46.59 ( 0.00%) 47.70 ( -2.38%) 48.55 ( -4.21%) 49.50 ( -6.25%) Time CPU-NUMA014632.00 ( 0.00%) 4429.00 ( 4.38%) 4258.00 ( 8.07%) 4471.00 ( 3.48%) Time CPU-NUMA01_THEADLOCAL 4575.00 ( 0.00%) 4586.00 ( -0.24%) 4515.00 ( 1.31%) 4552.00 ( 0.50%) Time CPU-NUMA024107.00 ( 0.00%) 4191.00 ( -2.05%) 4120.00 ( -0.32%) 4244.00 ( -3.34%) Time CPU-NUMA02_SMT2113.00 ( 0.00%) 2072.00 ( 1.94%) 2017.00 ( 4.54%) 1993.00 ( 5.68%) 3.19.0 4.0.0-rc1 4.0.0-rc1 4.0.0-rc1 vanilla vanillaslowscan-v2r7protnone-v3 User46119.1253384.2956093.1155882.82 System306.41 692.14 311.64 828.36 Elapsed 1039.88 1236.87 1328.61 1292.92 So just using a different bit doesn't seem to be it either 3.19.0 4.0.0-rc1 4.0.0-rc1 4.0.0-rc1 vanilla vanillaslowscan-v2r7protnone-v3 NUMA alloc hit 1202922 1437560 1472578 1499274 NUMA alloc miss 0 0 0 0 NUMA interleave hit 0 0 0 0 NUMA alloc local 1200683 1436781 1472226 1498680 NUMA base PTE updates222840103 304513172 121532313 337431414 NUMA huge PMD updates 434894 594467 237170 658715 NUMA page range updates 445505831 608880276 242963353 674693494 NUMA hint faults601358 733491 334334 820793 NUMA hint local faults 371571 511530 227171 565003 NUMA hint local percent 61 69 67 68 NUMA pages migrated707317726366701 860708231288355 Patch to use a bit other than the global bit for prot none is below. 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 8c7c10802e9c..1f243323693c 100644 --- a/arch/x86/include/asm/pgtable_types.h
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Mon, Mar 09, 2015 at 09:02:19PM +, Mel Gorman wrote: > On Sun, Mar 08, 2015 at 08:40:25PM +0000, Mel Gorman wrote: > > > Because if the answer is 'yes', then we can safely say: 'we regressed > > > performance because correctness [not dropping dirty bits] comes before > > > performance'. > > > > > > If the answer is 'no', then we still have a mystery (and a regression) > > > to track down. > > > > > > As a second hack (not to be applied), could we change: > > > > > > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > > > > > > to: > > > > > > #define _PAGE_BIT_PROTNONE (_PAGE_BIT_GLOBAL+1) > > > > > > > In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating > > as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it > > can be done. > > > > More importantily, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just > updating the value should crash. For the purposes of testing the idea, I > thought the straight-forward option was to break soft dirty page tracking > and steal their bit for testing (patch below). Took most of the day to > get access to the test machine so tests are not long running and only > the autonuma one has completed; > And the xfsrepair workload also does not show any benefit from using a different bit either 3.19.0 4.0.0-rc1 4.0.0-rc1 4.0.0-rc1 vanilla vanilla slowscan-v2r7protnone-v3r17 Min real-fsmark1164.44 ( 0.00%) 1157.41 ( 0.60%) 1150.38 ( 1.21%) 1173.22 ( -0.75%) Min syst-fsmark4016.12 ( 0.00%) 3998.06 ( 0.45%) 3988.42 ( 0.69%) 4037.90 ( -0.54%) Min real-xfsrepair 442.64 ( 0.00%) 497.64 (-12.43%) 456.87 ( -3.21%) 489.60 (-10.61%) Min syst-xfsrepair 194.97 ( 0.00%) 500.61 (-156.76%) 263.41 (-35.10%) 544.56 (-179.30%) Ameanreal-fsmark1166.28 ( 0.00%) 1166.63 ( -0.03%) 1155.97 ( 0.88%) 1183.19 ( -1.45%) Ameansyst-fsmark4025.87 ( 0.00%) 4020.94 ( 0.12%) 4004.19 ( 0.54%) 4061.64 ( -0.89%) Ameanreal-xfsrepair 447.66 ( 0.00%) 507.85 (-13.45%) 459.58 ( -2.66%) 498.71 (-11.40%) Ameansyst-xfsrepair 202.93 ( 0.00%) 519.88 (-156.19%) 281.63 (-38.78%) 569.21 (-180.50%) Stddev real-fsmark 1.44 ( 0.00%)6.55 (-354.10%) 3.97 (-175.65%)9.20 (-537.90%) Stddev syst-fsmark 9.76 ( 0.00%) 16.22 (-66.27%) 15.09 (-54.69%) 17.47 (-79.13%) Stddev real-xfsrepair5.57 ( 0.00%) 11.17 (-100.68%) 3.41 ( 38.66%)6.77 (-21.63%) Stddev syst-xfsrepair5.69 ( 0.00%) 13.98 (-145.78%) 19.94 (-250.49%) 20.03 (-252.05%) CoeffVar real-fsmark 0.12 ( 0.00%)0.56 (-353.96%) 0.34 (-178.11%)0.78 (-528.79%) CoeffVar syst-fsmark 0.24 ( 0.00%)0.40 (-66.48%)0.38 (-55.53%)0.43 (-77.55%) CoeffVar real-xfsrepair1.24 ( 0.00%)2.20 (-76.89%)0.74 ( 40.25%)1.36 ( -9.17%) CoeffVar syst-xfsrepair2.80 ( 0.00%)2.69 ( 4.06%)7.08 (-152.54%)3.52 (-25.51%) Max real-fsmark1167.96 ( 0.00%) 1171.98 ( -0.34%) 1159.25 ( 0.75%) 1195.41 ( -2.35%) Max syst-fsmark4039.20 ( 0.00%) 4033.84 ( 0.13%) 4024.53 ( 0.36%) 4079.45 ( -1.00%) Max real-xfsrepair 455.42 ( 0.00%) 523.40 (-14.93%) 464.40 ( -1.97%) 505.82 (-11.07%) Max syst-xfsrepair 207.94 ( 0.00%) 533.37 (-156.50%) 309.38 (-48.78%) 593.62 (-185.48%) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 10, 2015 at 04:55:52PM -0700, Linus Torvalds wrote: > On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner wrote: > > On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote: > >> > >> What's your virtual environment setup? Kernel config, and > >> virtualization environment to actually get that odd fake NUMA thing > >> happening? > > > > I don't have the exact .config with me (test machines at home > > are shut down because I'm half a world away), but it's pretty much > > this (copied and munged from a similar test vm on my laptop): > > [ snip snip ] > > Ok, I hate debugging by symptoms anyway, so I didn't do any of this, > and went back to actually *thinking* about the code instead of trying > to reproduce this and figure things out by trial and error. > > And I think I figured it out. > I believe you're correct and it matches what was observed. I'm still travelling and wireless is dirt but managed to queue a test using pmd_dirty 3.19.0 4.0.0-rc1 4.0.0-rc1 vanilla vanilla ptewrite-v1r20 Time User-NUMA01 25695.96 ( 0.00%)32883.59 (-27.97%) 24012.80 ( 6.55%) Time User-NUMA01_THEADLOCAL 17404.36 ( 0.00%)17453.20 ( -0.28%) 17950.54 ( -3.14%) Time User-NUMA02 2037.65 ( 0.00%) 2063.70 ( -1.28%) 2046.88 ( -0.45%) Time User-NUMA02_SMT981.02 ( 0.00%) 983.70 ( -0.27%) 983.68 ( -0.27%) Time System-NUMA01 194.70 ( 0.00%) 602.44 (-209.42%) 158.90 ( 18.39%) Time System-NUMA01_THEADLOCAL98.52 ( 0.00%) 78.10 ( 20.73%) 107.66 ( -9.28%) Time System-NUMA029.28 ( 0.00%)6.47 ( 30.28%) 9.25 ( 0.32%) Time System-NUMA02_SMT3.79 ( 0.00%)5.06 (-33.51%) 3.92 ( -3.43%) Time Elapsed-NUMA01 558.84 ( 0.00%) 755.96 (-35.27%) 532.41 ( 4.73%) Time Elapsed-NUMA01_THEADLOCAL 382.54 ( 0.00%) 382.22 ( 0.08%) 390.48 ( -2.08%) Time Elapsed-NUMA02 49.83 ( 0.00%) 49.38 ( 0.90%) 49.79 ( 0.08%) Time Elapsed-NUMA02_SMT 46.59 ( 0.00%) 47.70 ( -2.38%) 47.77 ( -2.53%) Time CPU-NUMA014632.00 ( 0.00%) 4429.00 ( 4.38%) 4539.00 ( 2.01%) Time CPU-NUMA01_THEADLOCAL 4575.00 ( 0.00%) 4586.00 ( -0.24%) 4624.00 ( -1.07%) Time CPU-NUMA024107.00 ( 0.00%) 4191.00 ( -2.05%) 4129.00 ( -0.54%) Time CPU-NUMA02_SMT2113.00 ( 0.00%) 2072.00 ( 1.94%) 2067.00 ( 2.18%) 3.19.0 4.0.0-rc1 4.0.0-rc1 vanilla vanillaptewrite-v1r20 User46119.1253384.2944994.10 System306.41 692.14 279.78 Elapsed 1039.88 1236.87 1022.92 There are still some difference but it's much closer to what it was. The balancing stats are almost looking similar to 3.19 NUMA base PTE updates222840103 304513172 230724075 NUMA huge PMD updates 434894 594467 450274 NUMA page range updates 445505831 608880276 461264363 NUMA hint faults601358 733491 626176 NUMA hint local faults 371571 511530 359215 NUMA hint local percent 61 69 57 NUMA pages migrated707317726366701 6829196 XFS repair on the same machine is not fully restore either but a big enough move in the right direction to indicate this was the relevant change. xfsrepair 3.19.0 4.0.0-rc1 4.0.0-rc1 vanilla vanilla ptewrite-v1r20 Ameanreal-fsmark1166.28 ( 0.00%) 1166.63 ( -0.03%) 1184.97 ( -1.60%) Ameansyst-fsmark4025.87 ( 0.00%) 4020.94 ( 0.12%) 4071.10 ( -1.12%) Ameanreal-xfsrepair 447.66 ( 0.00%) 507.85 (-13.45%) 460.94 ( -2.97%) Ameansyst-xfsrepair 202.93 ( 0.00%) 519.88 (-156.19%) 282.45 (-39.19%) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote: > On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman wrote: > > > > I believe you're correct and it matches what was observed. I'm still > > travelling and wireless is dirt but managed to queue a test using pmd_dirty > > Ok, thanks. > > I'm not entirely happy with that change, and I suspect the whole > heuristic should be looked at much more (maybe it should also look at > whether it's executable, for example), but it's a step in the right > direction. > I can follow up when I'm back in work properly. As you have already pulled this in directly, can you also consider pulling in "mm: thp: return the correct value for change_huge_pmd" please? The other two patches were very minor can be resent through the normal paths later. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote: > > - something completely different that I am entirely missing > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. > Minimally, there is still the window where we clear the PTE to set the protections. During that window, a fault can occur. In the old code which was inherently racy and unsafe, the fault might still go ahead deferring a potential migration for a short period. In the current code, it'll stall on the lock, notice the PTE is changed and refault so the overhead is very different but functionally correct. In the old code, pte_write had complex interactions with background cleaning and sync in the case of file mappings (not applicable to Dave's case but still it's unpredictable behaviour). pte_dirty is close but there are interactions with the application as the timing of writes vs the PTE scanner matter. Even if we restored the original behaviour, it would still be very difficult to understand all the interactions between userspace and kernel. The patch below should be tested because it's clearer what the intent is. Using the VMA flags is coarse but it's not vulnerable to timing artifacts that behave differently depending on the machine. My preliminary testing shows it helps but not by much. It does not restore performance to where it was but it's easier to understand which is important if there are changes in the scheduler later. In combination, I also think that slowing PTE scanning when migration fails is the correct action even if it is unrelated to the patch Dave bisected to. It's stupid to increase scanning rates and incurs more faults when migrations are failing so I'll be testing that next. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 626e93db28ba..2f12e9fcf1a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, flags |= TNF_FAULT_LOCAL; } - /* -* Avoid grouping on DSO/COW pages in specific and RO pages -* in general, RO pages shouldn't hurt as much anyway since -* they can be in shared cache state. -* -* FIXME! This checks "pmd_dirty()" as an approximation of -* "is this a read-only page", since checking "pmd_write()" -* is even more broken. We haven't actually turned this into -* a writable page, so pmd_write() will always be false. -*/ - if (!pmd_dirty(pmd)) + /* See similar comment in do_numa_page for explanation */ + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* diff --git a/mm/memory.c b/mm/memory.c index 411144f977b1..20beb6647dba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* -* Avoid grouping on DSO/COW pages in specific and RO pages -* in general, RO pages shouldn't hurt as much anyway since -* they can be in shared cache state. +* Avoid grouping on RO pages in general. RO pages shouldn't hurt as +* much anyway since they can be in shared cache state. This misses +* the case where a mapping is writable but the process never writes +* to it but pte_write gets cleared during protection updates and +* pte_dirty has unpredictable behaviour between PTE scan updates, +* background writeback, dirty balancing and application behaviour. * -* FIXME! 
This checks "pmd_dirty()" as an approximation of -* "is this a read-only page", since checking "pmd_write()" -* is even more broken. We haven't actually turned this into -* a writable page, so pmd_write() will always be false. +* TODO: Note that the ideal here would be to avoid a situation where a +* NUMA fault is taken immediately followed by a write fault in +* some cases which would have lower overhead overall but would be +* invasive as the fault paths would need to be unified. */ - if (!pte_dirty(pte)) + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote: > On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner wrote: > > > > My recollection wasn't faulty - I pulled it from an earlier email. > > That said, the original measurement might have been faulty. I ran > > the numbers again on the 3.19 kernel I saved away from the original > > testing. That came up at 235k, which is pretty much the same as > > yesterday's test. The runtime,however, is unchanged from my original > > measurements of 4m54s (pte_hack came in at 5m20s). > > Ok. Good. So the "more than an order of magnitude difference" was > really about measurement differences, not quite as real. Looks like > more a "factor of two" than a factor of 20. > > Did you do the profiles the same way? Because that would explain the > differences in the TLB flush percentages too (the "1.4% from > tlb_invalidate_range()" vs "pretty much everything from migration"). > > The runtime variation does show that there's some *big* subtle > difference for the numa balancing in the exact TNF_NO_GROUP details. TNF_NO_GROUP affects whether the scheduler tries to group related processes together. Whether migration occurs depends on what node a process is scheduled on. If processes are aggressively grouped inappropriately then it is possible there is a bug that causes the load balancer to move processes off a node (possible migration) with NUMA balancing trying to pull it back (another possible migration). Small bugs there can result in excessive migration. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
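To make the effect of TNF_NO_GROUP concrete, this is roughly how task_numa_fault() consumes the flag (simplified from kernel/sched/fair.c):

	/* Faults only contribute to task grouping when the accessing pid
	 * has changed and the fault was not flagged TNF_NO_GROUP (e.g. a
	 * read-only mapping). */
	priv = cpupid_match_pid(p, last_cpupid);
	if (!priv && !(flags & TNF_NO_GROUP))
		task_numa_group(p, last_cpupid, flags, &priv);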
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).
>

As a side-note, I did test a patch that checked pte_write and preserved it across both faults and setting the protections. It did not alter migration activity much but there was a drop in minor faults - a 20% drop in autonumabench and a 58% drop in the xfsrepair workload. I'm assuming this is due to refaults to mark pages writable. The patch looks and is hacky so I won't post it to save people bleaching their eyes. I'll spend some time soon (hopefully today) looking at a smooth way of falling through to the WP checks after trapping a NUMA fault.

--
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Fri, Mar 20, 2015 at 10:02:23AM -0700, Linus Torvalds wrote: > On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner wrote: > > > > Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and > > 7m36s. IOWs's a bit better, but not significantly. page migrations > > are pretty much unchanged, too: > > > >558,632 migrate:mm_migrate_pages ( +- 6.38% ) > > Ok. That was kind of the expected thing. > > I don't really know the NUMA fault rate limiting code, but one thing > that strikes me is that if it tries to balance the NUMA faults against > the *regular* faults, then maybe just the fact that we end up taking > more COW faults after a NUMA fault then means that the NUMA rate > limiting code now gets over-eager (because it sees all those extra > non-numa faults). > > Mel, does that sound at all possible? I really have never looked at > the magic automatic rate handling.. > It should not be trying to balance against regular faults as it has no information on it. The trapping of additional faults to mark the PTE writable will alter timing so it indirectly affects how many migration faults there but this is only a side-effect IMO. There is more overhead now due to losing the writable information and that should be reduced so I tried a few approaches. Ultimately, the one that performed the best and was easiest to understand simply preserved the writable bit across the protection update and page fault. I'll post it later when I stick a changelog on it. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/3] mm: numa: Group related processes based on VMA flags instead of page table flags
Threads that share writable data within pages are grouped together as related tasks. This decision is based on whether the PTE is marked dirty which is subject to timing races between the PTE scanner update and when the application writes the page. If the page is file-backed, then background flushes and sync also affect placement. This is unpredictable behaviour which is impossible to reason about so this patch makes grouping decisions based on the VMA flags. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 13 ++--- mm/memory.c | 19 +++ 2 files changed, 13 insertions(+), 19 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 626e93db28ba..2f12e9fcf1a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, flags |= TNF_FAULT_LOCAL; } - /* -* Avoid grouping on DSO/COW pages in specific and RO pages -* in general, RO pages shouldn't hurt as much anyway since -* they can be in shared cache state. -* -* FIXME! This checks "pmd_dirty()" as an approximation of -* "is this a read-only page", since checking "pmd_write()" -* is even more broken. We haven't actually turned this into -* a writable page, so pmd_write() will always be false. -*/ - if (!pmd_dirty(pmd)) + /* See similar comment in do_numa_page for explanation */ + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* diff --git a/mm/memory.c b/mm/memory.c index 411144f977b1..20beb6647dba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* -* Avoid grouping on DSO/COW pages in specific and RO pages -* in general, RO pages shouldn't hurt as much anyway since -* they can be in shared cache state. +* Avoid grouping on RO pages in general. RO pages shouldn't hurt as +* much anyway since they can be in shared cache state. This misses +* the case where a mapping is writable but the process never writes +* to it but pte_write gets cleared during protection updates and +* pte_dirty has unpredictable behaviour between PTE scan updates, +* background writeback, dirty balancing and application behaviour. * -* FIXME! This checks "pmd_dirty()" as an approximation of -* "is this a read-only page", since checking "pmd_write()" -* is even more broken. We haven't actually turned this into -* a writable page, so pmd_write() will always be false. +* TODO: Note that the ideal here would be to avoid a situation where a +* NUMA fault is taken immediately followed by a write fault in +* some cases which would have lower overhead overall but would be +* invasive as the fault paths would need to be unified. */ - if (!pte_dirty(pte)) + if (!(vma->vm_flags & VM_WRITE)) flags |= TNF_NO_GROUP; /* -- 2.1.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/3] Reduce system overhead of automatic NUMA balancing
These are three follow-on patches based on the xfsrepair workload Dave Chinner reported was problematic in 4.0-rc1 due to changes in page table management -- https://lkml.org/lkml/2015/3/1/226. Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp: Return the correct value for change_huge_pmd"). It was known that the performance in 3.19 was still better even if is far less safe. This series aims to restore the performance without compromising on safety. Dave, you already tested patch 1 on its own but it would be nice to test patches 1+2 and 1+2+3 separately just to be certain. For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the three patches applied on top autonumabench 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8 Time System-NUMA01 124.00 ( 0.00%) 161.86 (-30.53%) 107.13 ( 13.60%) 103.13 ( 16.83%) 145.01 (-16.94%) Time System-NUMA01_THEADLOCAL 115.54 ( 0.00%) 107.64 ( 6.84%) 131.87 (-14.13%) 83.30 ( 27.90%) 92.35 ( 20.07%) Time System-NUMA029.35 ( 0.00%) 10.44 (-11.66%) 8.95 ( 4.28%) 10.72 (-14.65%)8.16 ( 12.73%) Time System-NUMA02_SMT3.87 ( 0.00%)4.63 (-19.64%) 4.57 (-18.09%)3.99 ( -3.10%)3.36 ( 13.18%) Time Elapsed-NUMA01 570.06 ( 0.00%) 567.82 ( 0.39%) 515.78 ( 9.52%) 517.26 ( 9.26%) 543.80 ( 4.61%) Time Elapsed-NUMA01_THEADLOCAL 393.69 ( 0.00%) 384.83 ( 2.25%) 384.10 ( 2.44%) 384.31 ( 2.38%) 380.73 ( 3.29%) Time Elapsed-NUMA02 49.09 ( 0.00%) 49.33 ( -0.49%) 48.86 ( 0.47%) 48.78 ( 0.63%) 50.94 ( -3.77%) Time Elapsed-NUMA02_SMT 47.51 ( 0.00%) 47.15 ( 0.76%) 47.98 ( -0.99%) 48.12 ( -1.28%) 49.56 ( -4.31%) 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8 User46334.6046391.9444383.9543971.8944372.12 System252.84 284.66 252.61 201.24 249.00 Elapsed 1062.14 1050.96 998.68 1000.94 1026.78 Overall the system CPU usage is comparable and the test is naturally a bit variable. The slowing of the scanner hurts numa01 but on this machine it is an adverse workload and patches that dramatically help it often hurt absolutely everything else. Due to patch 2, the fault activity is interesting 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8 Minor Faults 2097811 2656646 2597249 1981230 1636841 Major Faults 362 450 365 364 365 Note the impact preserving the write bit across protection updates and fault reduces faults. NUMA alloc hit 1229008 1217015 1191660 1178322 1199681 NUMA alloc miss 0 0 0 0 0 NUMA interleave hit 0 0 0 0 0 NUMA alloc local 1228514 1216317 1190871 1177448 1199021 NUMA base PTE updates245706197 240041607 238195516 244704842 115012800 NUMA huge PMD updates 479530 468448 464868 477573 224487 NUMA page range updates 491225557 479886983 476207932 48918 229950144 NUMA hint faults659753 656503 641678 656926 294842 NUMA hint local faults 381604 373963 360478 337585 186249 NUMA hint local percent 57 56 56 51 63 NUMA pages migrated5412140 6374899 6266530 5277468 5755096 AutoNUMA cost5121% 5083% 4994% 5097% 2388% Here the impact of slowing the PTE scanner on migratrion failures is obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are massively reduced even though the headline performance is very similar. As xfsrepair was the reported workload here is the impact of the series on it. 
xfsrepair 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8 Min real-fsmark1183.29 ( 0.00%) 1165.73 (
[PATCH 2/3] mm: numa: Preserve PTE write permissions across a NUMA hinting fault
Protecting a PTE to trap a NUMA hinting fault clears the writable bit and further faults are needed after trapping a NUMA hinting fault to set the writable bit again. This patch preserves the writable bit when trapping NUMA hinting faults. The impact is obvious from the number of minor faults trapped during the basis balancing benchmark and the system CPU usage; autonumabench 4.0.0-rc4 4.0.0-rc4 baseline preserve Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%) Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%) Time System-NUMA028.95 ( 0.00%) 10.72 (-19.78%) Time System-NUMA02_SMT4.57 ( 0.00%)3.99 ( 12.69%) Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%) Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%) Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%) Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%) 4.0.0-rc4 4.0.0-rc4 baselinepreserve User 44383.9543971.89 System 252.61 201.24 Elapsed 998.68 1000.94 Minor Faults 2597249 1981230 Major Faults 365 364 There is a similar drop in system CPU usage using Dave Chinner's xfsrepair workload 4.0.0-rc4 4.0.0-rc4 baseline preserve Ameanreal-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%) Ameansyst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%) The patch looks hacky but the alternatives looked worse. The tidest was to rewalk the page tables after a hinting fault but it was more complex than this approach and the performance was worse. It's not generally safe to just mark the page writable during the fault if it's a write fault as it may have been read-only for COW so that approach was discarded. Signed-off-by: Mel Gorman --- mm/huge_memory.c | 9 - mm/memory.c | 8 +++- mm/mprotect.c| 3 +++ 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2f12e9fcf1a2..0a42d1521aa4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1260,6 +1260,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, int target_nid, last_cpupid = -1; bool page_locked; bool migrated = false; + bool was_writable; int flags = 0; /* A PROT_NONE fault should not end up here */ @@ -1354,7 +1355,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); + was_writable = pmd_write(pmd); pmd = pmd_modify(pmd, vma->vm_page_prot); + if (was_writable) + pmd = pmd_mkwrite(pmd); set_pmd_at(mm, haddr, pmdp, pmd); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); @@ -1478,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + bool preserve_write = prot_numa && pmd_write(*pmd); ret = 1; /* @@ -1493,9 +1498,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (!prot_numa || !pmd_protnone(*pmd)) { entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); + if (preserve_write) + entry = pmd_mkwrite(entry); ret = HPAGE_PMD_NR; set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + BUG_ON(!preserve_write && pmd_write(entry)); } spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index 20beb6647dba..d20e12da3a3c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3035,6 +3035,7 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, int last_cpupid; int target_nid; bool migrated = false; + bool was_writable = pte_write(pte); int flags = 0; /* A PROT_NONE fault should not end up here */ @@ -3059,6 +3060,8 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma, /* Make it present again */ pte = pte_modify(pte, vma->vm_page_prot); pte = pte_mkyoung(pte); + if (was_writable) + pte = pte_mkwrite(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); @@ -3075,11 +3078,6 @@ static int do_numa_page(struct mm_struct *mm, struct
[PATCH 3/3] mm: numa: Slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226 Across the board the 4.0-rc1 numbers are much slower, and the degradation is far worse when using the large memory footprint configs. Perf points straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config: - 56.07%56.07% [kernel][k] default_send_IPI_mask_sequence_phys - default_send_IPI_mask_sequence_phys - 99.99% physflat_send_IPI_mask - 99.37% native_send_call_func_ipi smp_call_function_many - native_flush_tlb_others - 99.85% flush_tlb_page ptep_clear_flush try_to_unmap_one rmap_walk try_to_unmap migrate_pages migrate_misplaced_page - handle_mm_fault - 99.73% __do_page_fault trace_do_page_fault do_async_page_fault + async_page_fault 0.63% native_send_call_func_single_ipi generic_exec_single smp_call_function_single This is showing excessive migration activity even though excessive migrations are meant to get throttled. Normally, the scan rate is tuned on a per-task basis depending on the locality of faults. However, if migrations fail for any reason then the PTE scanner may scan faster if the faults continue to be remote. This means there is higher system CPU overhead and fault trapping at exactly the time we know that migrations cannot happen. This patch tracks when migration failures occur and slows the PTE scanner. Signed-off-by: Mel Gorman --- include/linux/sched.h | 9 + kernel/sched/fair.c | 8 ++-- mm/huge_memory.c | 3 ++- mm/memory.c | 3 ++- 4 files changed, 15 insertions(+), 8 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d77432e14ff..a419b65770d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1625,11 +1625,11 @@ struct task_struct { /* * numa_faults_locality tracks if faults recorded during the last -* scan window were remote/local. The task scan period is adapted -* based on the locality of the faults with different weights -* depending on whether they were shared or private faults +* scan window were remote/local or failed to migrate. The task scan +* period is adapted based on the locality of the faults with different +* weights depending on whether they were shared or private faults */ - unsigned long numa_faults_locality[2]; + unsigned long numa_faults_locality[3]; unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ @@ -1719,6 +1719,7 @@ struct task_struct { #define TNF_NO_GROUP 0x02 #define TNF_SHARED 0x04 #define TNF_FAULT_LOCAL0x08 +#define TNF_MIGRATE_FAIL 0x10 #ifdef CONFIG_NUMA_BALANCING extern void task_numa_fault(int last_node, int node, int pages, int flags); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7ce18f3c097a..bcfe32088b37 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct *p, /* * If there were no record hinting faults then either the task is * completely idle or all activity is areas that are not of interest -* to automatic numa balancing. Scan slower +* to automatic numa balancing. Related to that, if there were failed +* migration then it implies we are migrating too quickly or the local +* node is overloaded. 
In either case, scan slower */ - if (local + shared == 0) { + if (local + shared == 0 || p->numa_faults_locality[2]) { p->numa_scan_period = min(p->numa_scan_period_max, p->numa_scan_period << 1); @@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) if (migrated) p->numa_pages_migrated += pages; + if (flags & TNF_MIGRATE_FAIL) + p->numa_faults_locality[2] += pages; p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0a42d1521aa4..51b3e7c64622 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1350,7 +1350,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, if (migrated) { flags |= TNF_MIGRATED; page_nid = target_nid; - } + } else + flags |= TNF_MIGRATE_FAIL; goto out; clear_pmdnuma: diff --git a/mm/memory.c b/
Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 12:24:00PM +0000, Mel Gorman wrote:
> > These are three follow-on patches based on the xfsrepair workload Dave
> > Chinner reported was problematic in 4.0-rc1 due to changes in page table
> > management -- https://lkml.org/lkml/2015/3/1/226.
> >
> > Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> > read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> > Return the correct value for change_huge_pmd"). It was known that the
> > performance in 3.19 was still better even if it is far less safe. This
> > series aims to restore the performance without compromising on safety.
> >
> > Dave, you already tested patch 1 on its own but it would be nice to test
> > patches 1+2 and 1+2+3 separately just to be certain.
>
>                                3.19     4.0-rc4  +p1      +p2      +p3
> mm_migrate_pages               266,750  572,839  558,632  223,706  201,429
> run time                       4m54s    7m50s    7m20s    5m07s    4m31s
>

Excellent, this is in line with predictions and roughly matches what I was seeing on bare metal + real NUMA + spinning disk instead of KVM + fake NUMA + SSD. Editing slightly;

> numa stats from p1+p2:    numa_pte_updates 46109698
> numa stats from p1+p2+p3: numa_pte_updates 24460492

The big drop in PTE updates matches what I expected -- migration failures should not lead to increased scan rates, which is what patch 3 fixes. I'm also pleased that there was not a drop in performance.

> OK, the summary with all patches applied:
>
> config                        3.19   4.0-rc1  4.0-rc4  4.0-rc5+
> defaults                      8m08s  9m34s    9m14s    6m57s
> -o ag_stride=-1               4m04s  4m38s    4m11s    4m06s
> -o bhash=101073               6m04s  17m43s   7m35s    6m13s
> -o ag_stride=-1,bhash=101073  4m54s  9m58s    7m50s    4m31s
>
> So it looks like the patch set fixes the remaining regression and in
> 2 of the four cases actually improves performance
>

\o/

Linus, these three patches plus the small fixlet for pmd_mkyoung (to match pte_mkyoung) are already in Andrew's tree. I'm expecting they'll reach you before 4.0 assuming nothing else goes pear shaped.

> Thanks, Linus and Mel, for tracking this tricky problem down!
>

Thanks Dave for persisting with this and collecting the necessary data. FWIW, I've marked the xfsrepair test case as a "large memory test". It'll take time before the test machines have historical data for it but in theory if this regresses again then I should spot it eventually.

--
Mel Gorman
SUSE Labs
Re: [PATCH RFC 1/2] mm: meminit: initialise more memory for inode/dentry hash tables in early boot
On Thu, Mar 03, 2016 at 03:01:40PM +0800, Li Zhang wrote:
> From: Li Zhang
>
> This patch is based on Mel Gorman's old patch on the mailing list,
> https://lkml.org/lkml/2015/5/5/280, which was discussed but is fixed here
> with a completion to wait for all memory to be initialised in
> page_alloc_init_late(). It fixes the OOM problem on X86 with 24TB of
> memory which allocates memory in late initialisation. But for a Power
> platform with 32TB of memory, it causes a call trace in
> vfs_caches_init->inode_init() because the inode hash table needs more
> memory. So this patch allocates 1GB for 0.25TB/node for large systems,
> as mentioned in https://lkml.org/lkml/2015/5/1/627
>

Acked-by: Mel Gorman

--
Mel Gorman
SUSE Labs
Re: [PATCH RFC 2/2] powerpc/mm: Enable page parallel initialisation
On Thu, Mar 03, 2016 at 03:01:41PM +0800, Li Zhang wrote:
> From: Li Zhang
>
> Parallel initialisation has been enabled for X86 and boot time is
> improved greatly. On Power8 it is also improved greatly, even for small
> memory. Here are the results from my test on a Power8 platform:
>
> For 4GB memory:  57% improvement
> For 50GB memory: 22% improvement
>
> Signed-off-by: Li Zhang

Acked-by: Mel Gorman

--
Mel Gorman
SUSE Labs
Re: [RFC] mm: Fix memory corruption caused by deferred page initialization
On Thu, Mar 31, 2016 at 01:27:34PM +1100, Gavin Shan wrote: > >So the issue is only existing when CONFIG_NO_BOOTMEM=n. The alternative fix > >would > >be similar to what we have on !CONFIG_NO_BOOTMEM: In early stage, all page > >structs > >for bootmem reserved pages are initialized and mark them with PG_reserved. > >I'm > >not sure it's worthy to fix it as we won't support bootmem as Michael > >mentioned. > > > > Mel, could you please confirm if we need a fix on !CONFIG_NO_BOOTMEM? If we > need, > I'll respin and send a patch for review. > Given that CONFIG_NO_BOOTMEM is not supported and bootmem is meant to be slowly retiring, I would suggest instead making deferred memory init depend on NO_BOOTMEM. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations
On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote: > While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory > Overcommit) on powerpc, we tripped the following: > > kernel BUG at mm/bootmem.c:483! > cpu 0x0: Vector: 700 (Program Check) at [c0c03940] > pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c > lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c > sp: c0c03bc0 >msr: 80021032 > current = 0xc0b0cce0 > paca= 0xc1d8 > pid = 0, comm = swapper > kernel BUG at mm/bootmem.c:483! > enter ? for help > [c0c03c80] c0a64bcc > .sparse_early_usemaps_alloc_node+0x84/0x29c > [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c > [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294 > [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460 > [c0c03f90] c0009670 .start_here_common+0x1c/0x2c > > This is > > BUG_ON(limit && goal + size > limit); > > and after some debugging, it seems that > > goal = 0x700 > limit = 0x800 > > and sparse_early_usemaps_alloc_node -> > sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls > > return alloc_bootmem_section(usemap_size() * count, section_nr); > > This is on a system with 8TB available via the AMS pool, and as a quirk > of AMS in firmware, all of that memory shows up in node 0. So, we end up > with an allocation that will fail the goal/limit constraints. In theory, > we could "fall-back" to alloc_bootmem_node() in > sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE > defined, we'll BUG_ON() instead. A simple solution appears to be to > disable the limit check if the size of the allocation in > alloc_bootmem_secition exceeds the section size. > > Signed-off-by: Nishanth Aravamudan > Cc: Dave Hansen > Cc: Anton Blanchard > Cc: Paul Mackerras > Cc: Ben Herrenschmidt > Cc: Robert Jennings > Cc: linux...@kvack.org > Cc: linuxppc-dev@lists.ozlabs.org > --- > include/linux/mmzone.h |2 ++ > mm/bootmem.c |5 - > 2 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 650ba2f..4176834 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -967,6 +967,8 @@ static inline unsigned long early_pfn_to_nid(unsigned > long pfn) > * PA_SECTION_SHIFT physical address to/from section number > * PFN_SECTION_SHIFT pfn to/from section number > */ > +#define BYTES_PER_SECTION(1UL << SECTION_SIZE_BITS) > + > #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) > > #define PA_SECTION_SHIFT (SECTION_SIZE_BITS) > diff --git a/mm/bootmem.c b/mm/bootmem.c > index 668e94d..5cbbc76 100644 > --- a/mm/bootmem.c > +++ b/mm/bootmem.c > @@ -770,7 +770,10 @@ void * __init alloc_bootmem_section(unsigned long size, > > pfn = section_nr_to_pfn(section_nr); > goal = pfn << PAGE_SHIFT; > - limit = section_nr_to_pfn(section_nr + 1) << PAGE_SHIFT; > + if (size > BYTES_PER_SECTION) > + limit = 0; > + else > + limit = section_nr_to_pfn(section_nr + 1) << PAGE_SHIFT; As it's ok to spill the allocation over to an adjacent section, why not just make limit==0 unconditionally. That would avoid defining BYTES_PER_SECTION. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
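In other words, the suggestion is to drop the special case entirely; an untested sketch of the relevant lines in alloc_bootmem_section():

	pfn = section_nr_to_pfn(section_nr);
	goal = pfn << PAGE_SHIFT;
	/* The usemap may safely spill into an adjacent section, so do not
	 * constrain the upper bound at all rather than special-casing
	 * allocations larger than a section. */
	limit = 0;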
Re: [PATCH v2] bootmem/sparsemem: remove limit constraint in alloc_bootmem_section
On Wed, Feb 29, 2012 at 10:12:33AM -0800, Nishanth Aravamudan wrote: > > > Signed-off-by: Nishanth Aravamudan > Acked-by: Mel Gorman -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/1] page_alloc.c: remove argument to pageblock_default_order
On Thu, May 03, 2012 at 04:37:49PM -0700, Andrew Morton wrote: > On Thu, 3 May 2012 22:45:12 +0530 > rajman mekaco wrote: > > > When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not defined, then > > pageblock_default_order has an argument to it. > > > > However, free_area_init_core will call it without any argument > > anyway. > > > > Remove the argument to pageblock_default_order when > > CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not defined. > > > > Signed-off-by: rajman mekaco > > --- > > mm/page_alloc.c |2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index a712fb9..4b95412 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4274,7 +4274,7 @@ static inline void __init > > set_pageblock_order(unsigned int order) > > * at compile-time. See include/linux/pageblock-flags.h for the values of > > * pageblock_order based on the kernel config > > */ > > -static inline int pageblock_default_order(unsigned int order) > > +static inline int pageblock_default_order(void) > > { > > return MAX_ORDER-1; > > } > > Interesting. It has been that way since at least 3.1. > /me slaps self > It didn't break the build because pageblock_default_order() is only > ever invoked by set_pageblock_order(), with: > > set_pageblock_order(pageblock_default_order()); > > and set_pageblock_order() is a macro: > > #define set_pageblock_order(x)do {} while (0) > > There's yet another reason not to use macros, dammit - they hide bugs. > > > Mel, can you have a think about this please? Can we just kill off > pageblock_default_order() and fold its guts into > set_pageblock_order(void)? Only ia64 and powerpc can define > CONFIG_HUGETLB_PAGE_SIZE_VARIABLE. > This looks reasonable to me. -- Mel Gorman SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
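A rough sketch of the fold being asked for, with pageblock_default_order() gone and set_pageblock_order() taking no argument in both configurations; names follow mm/page_alloc.c of that period and the sketch is an outline of the suggestion, not necessarily the patch that was merged:

#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
/* pageblock_order is chosen at boot from the huge page size (ia64, powerpc). */
void __init set_pageblock_order(void)
{
	unsigned int order;

	/* Nothing to do if it was already set up */
	if (pageblock_order)
		return;

	if (HPAGE_SHIFT > PAGE_SHIFT)
		order = HUGETLB_PAGE_ORDER;
	else
		order = MAX_ORDER - 1;

	pageblock_order = order;
}
#else
/*
 * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, pageblock_order is a
 * compile-time constant (see include/linux/pageblock-flags.h), so there is
 * nothing to compute at boot and no argument to pass.
 */
void __init set_pageblock_order(void)
{
}
#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

Making both variants real functions also avoids the do {} while (0) macro that hid the stale prototype in the first place.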
Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55
On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: > (cc linuxppc-dev@lists.ozlabs.org) > > On Mon, 11 Oct 2010 15:30:22 +0100 > Mel Gorman wrote: > > > On Sat, Oct 09, 2010 at 04:57:18AM -0500, pac...@kosh.dhis.org wrote: > > > (What a big Cc: list... scripts/get_maintainer.pl made me do it.) > > > > > > This will be a long story with a weak conclusion, sorry about that, but > > > it's > > > been a long bug-hunt. > > > > > > With recent kernels I've seen a bug that appears to corrupt random 4-byte > > > chunks of memory. It's not easy to reproduce. It seems to happen only once > > > per boot, pretty quickly after userspace has gotten started, and > > > sometimes it > > > doesn't happen at all. > > > > > > > A corruption of 4 bytes could be consistent with a pointer value being > > written to an incorrect location. > > It's corruption of user memory, which is unusual. I'd be wondering if > there was a pre-existing bug which 6dda9d55bf545013597 has exposed - > previously the corruption was hitting something harmless. Something > like a missed CPU cache writeback or invalidate operation. > This seems somewhat plausible although it's hard to tell for sure. But lets say we had the following situation in memory [<MAX_ORDER_NR_PAGES>][<MAX_ORDER_NR_PAGES>] INITRDmemmap array initrd gets freed and someone else very early in boot gets allocated in there. Lets further guess that the struct pages in the memmap area are managing the page frame where the INITRD was because it makes the situation slightly easier to trigger. As pages get freed in the memmap array, we could reference memory where initrd used to be but the physical memory is mapped at two virtual addresses. CPU A CPU B Reads kernelspace virtual (gets cache line) Writes userspace virtual (gets different cache line) IO, writes buffer destined for userspace (via cache line) Cache line eviction, writeback to memory This is somewhat contrived but I can see how it might happen even on one CPU particularly if the L1 cache is virtual and is loose about checking physical tags. > How sensitive/vulnerable is PPC32 to such things? > I can not tell you specifically but if the above scenario is in any way plausible, I believe it would depend on what sort of L1 cache the CPU has. Maybe this particular version has a virtual cache with no physical tagging and is depending on the OS not to make virtual aliasing mistakes. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55
On Wed, Oct 13, 2010 at 12:52:05PM -0500, pac...@kosh.dhis.org wrote: > Mel Gorman writes: > > > > On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: > > > > > > It's corruption of user memory, which is unusual. I'd be wondering if > > > there was a pre-existing bug which 6dda9d55bf545013597 has exposed - > > > previously the corruption was hitting something harmless. Something > > > like a missed CPU cache writeback or invalidate operation. > > > > > > > This seems somewhat plausible although it's hard to tell for sure. But > > lets say we had the following situation in memory > > > > [<MAX_ORDER_NR_PAGES>][<MAX_ORDER_NR_PAGES>] > > INITRDmemmap array > > I don't use initrd, so this isn't exactly what happened here. But it could be > close. Let me throw out some more information and see if it triggers any > ideas. > Ok. > First, I tried a new test after seeing the corruption happen: > # md5sum /sbin/e2fsck ; echo 1 > /proc/sys/vm/drop_caches ; md5sum > /sbin/e2fsck > And got 2 different answers. The second answer was the correct one. > > Since applying the suggested patch which changed MAX_ORDER-1 to MAX_ORDER-2, > I've been trying to isolate exactly when the corruption happens. Since I > don't know much about kernel code, my main method is stuffing the area full > of printk's. > > First I duplicated the affected function __free_one_page, since it's inlined > at 2 different places, so I could apply the patch to just one of them. This > proved that the problem is happening when called from free_one_page. > > The patch which fixes (or at least covers up) the bug will only matter when > order==MAX_ORDER-2, otherwise everything is the same. So I added a lot of > printk's to show what's happening when order==MAX_ORDER-2. I found that, very > repeatably, 126 such instances occur during boot, and 61 of them pass the > page_is_buddy(higher_page, higher_buddy, order + 1) test, causing them to > call list_add_tail. > > Next, since the bug appears when this code decides to call list_add_tail, > I made my own wrapper for list_add_tail, which allowed me to force some of > the calls to do list_add instead. Eventually I found that of the 61 calls, > the last one makes the difference. Allowing the first 60 calls to go through > to list_add_tail, and switching the last one to list_add, the symptom goes > away. > > dump_stack() for that last call gave me a backtrace like this: > [c0303e80] [c0008124] show_stack+0x4c/0x144 (unreliable) > [c0303ec0] [c0068a84] free_one_page+0x28c/0x5b0 > [c0303f20] [c0069588] __free_pages_ok+0xf8/0x120 > [c0303f40] [c02d28c8] free_all_bootmem_core+0xf0/0x1f8 > [c0303f70] [c02d29fc] free_all_bootmem+0x2c/0x6c > [c0303f90] [c02cc7dc] mem_init+0x70/0x2ac > [c0303fc0] [c02c66a4] start_kernel+0x150/0x27c > [c0303ff0] [3438] 0x3438 > > And this might be interesting: the PFN of the page being added in that > critical 61st call is 130048, which exactly matches the number of available > pages: > > free_area_init_node: node 0, pgdat c02fee6c, node_mem_map c033 > DMA zone: 1024 pages used for memmap > DMA zone: 0 pages reserved > DMA zone: 130048 pages, LIFO batch:31 > Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048 > > Suspicious? > A bit but I still don't know why it would cause corruption. Maybe this is still a caching issue but the difference in timing between list_add and list_add_tail is enough to hide the bug. It's also possible there are some registers ioremapped after the memmap array and reading them is causing some problem. 
Andrew, what is the right thing to do here? We could flail around looking for explanations as to why the bug causes a user buffer corruption but never get an answer or do we go with this patch, preferably before 2.6.36 releases? CUT HERE mm, page-allocator: Do not check the state of a non-existant buddy during free There is a bug in commit [6dda9d55: page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists] that means a buddy at order MAX_ORDER is checked for merging. A page of this order never exists so at times, an effectively random piece of memory is being checked. Alan Curry has reported that this is causing memory corruption in userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32). It is not clear why this is happening. It could be a cache coherency problem where pages mapped in both user and kernel space are getting different cache lines due to the bad read from kernel space (http://lkml.org/lkml/2010
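For readers without the tree handy, the code in question is the tail of __free_one_page() as introduced by 6dda9d55, and the proposed fix simply stops the higher-order probe one level earlier. The sketch below shows the intended end state (MAX_ORDER-1 relaxed to MAX_ORDER-2); helper names are from the 2.6.36-era allocator as best recalled, so treat it as an illustration rather than the exact diff:

	/*
	 * If this is not the largest possible page, check if the buddy of
	 * the next-highest order is free. If so, pages are probably being
	 * freed that will coalesce soon, so queue this page at the tail of
	 * the list where it is more likely to be merged than reused.
	 * An order MAX_ORDER page can never exist, hence the check is
	 * against MAX_ORDER-2 so that the order+1 lookup stays meaningful.
	 */
	if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
		struct page *higher_page, *higher_buddy;

		combined_idx = __find_combined_index(page_idx, order);
		higher_page = page + combined_idx - page_idx;
		higher_buddy = __page_find_buddy(higher_page, combined_idx,
						 order + 1);
		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
			list_add_tail(&page->lru,
				&zone->free_area[order].free_list[migratetype]);
			goto out;
		}
	}

	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
	zone->free_area[order].nr_free++;

With the old MAX_ORDER-1 check, an order MAX_ORDER-2 free looked up the buddy of an order MAX_ORDER-1 page even though the merge it is probing for can never happen; that struct page may lie outside the memmap being managed, so page_is_buddy() inspects effectively random memory. Why that read would translate into userspace corruption on this PPC32 box is the part that remains unexplained.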
Re: powerpc: Move 64bit heap above 1TB on machines with 1TB segments
Anton Blanchard wrote on 22/09/2009 03:52:35: > If we are using 1TB segments and we are allowed to randomise the heap, we can > put it above 1TB so it is backed by a 1TB segment. Otherwise the heap will be > in the bottom 1TB which always uses 256MB segments and this may result in a > performance penalty. > > This functionality is disabled when heap randomisation is turned off: > > echo 1 > /proc/sys/kernel/randomize_va_space > > which may be useful when trying to allocate the maximum amount of 16M or 16G > pages. > > On a microbenchmark that repeatedly touches 32GB of memory with a stride of > 256MB + 4kB (designed to stress 256MB segments while still mapping nicely into > the L1 cache), we see the improvement: > > Force malloc to use heap all the time: > # export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 > > Disable heap randomization: > # echo 1 > /proc/sys/kernel/randomize_va_space > # time ./test > 12.51s > > Enable heap randomization: > # echo 2 > /proc/sys/kernel/randomize_va_space > # time ./test > 1.70s > > Signed-off-by: Anton Blanchard > --- > > I've cc-ed Mel on this one. As you can see it definitely helps the base > page size performance, but I'm a bit worried of the impact of taking away > another of our 1TB slices. > Unfortunately, I am not sensitive to issues surrounding 1TB segments or how they are currently being used. However, as this clearly helps performance for large amounts of memory, is it worth providing an option to libhugetlbfs to locate 16MB pages above 1TB when they are otherwise being unused? > Index: linux.trees.git/arch/powerpc/kernel/process.c > === > --- linux.trees.git.orig/arch/powerpc/kernel/process.c 2009-09-17 > 15:47:46.0 +1000 > +++ linux.trees.git/arch/powerpc/kernel/process.c 2009-09-17 15: > 49:11.0 +1000 > @@ -1165,7 +1165,22 @@ static inline unsigned long brk_rnd(void > > unsigned long arch_randomize_brk(struct mm_struct *mm) > { > - unsigned long ret = PAGE_ALIGN(mm->brk + brk_rnd()); > + unsigned long base = mm->brk; > + unsigned long ret; > + > +#ifdef CONFIG_PPC64 > + /* > +* If we are using 1TB segments and we are allowed to randomise > +* the heap, we can put it above 1TB so it is backed by a 1TB > +* segment. Otherwise the heap will be in the bottom 1TB > +* which always uses 256MB segments and this may result in a > +* performance penalty. > +*/ > + if (!is_32bit_task() && (mmu_highuser_ssize == MMU_SEGSIZE_1T)) > + base = max_t(unsigned long, mm->brk, 1UL << SID_SHIFT_1T); > +#endif > + > + ret = PAGE_ALIGN(base + brk_rnd()); > > if (ret < mm->brk) >return mm->brk; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
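The microbenchmark is easy to approximate from the description; the sketch below is a reconstruction (not Anton's actual test program) that touches a 32GB heap allocation with a 256MB + 4kB stride, and the pass count is arbitrary. Run it with MALLOC_MMAP_MAX_=0 and MALLOC_TRIM_THRESHOLD_=-1 so glibc serves the allocation from the heap rather than mmap():

#include <stdlib.h>

#define TOTAL_BYTES	(32UL << 30)		/* 32GB of heap */
#define STRIDE		((256UL << 20) + 4096)	/* 256MB + 4kB */
#define PASSES		100			/* arbitrary repeat count */

int main(void)
{
	char *p = malloc(TOTAL_BYTES);	/* comes from brk with MALLOC_MMAP_MAX_=0 */
	unsigned long off;
	int pass;

	if (!p)
		return 1;

	for (pass = 0; pass < PASSES; pass++)
		for (off = 0; off < TOTAL_BYTES; off += STRIDE)
			p[off]++;	/* one touch per 256MB segment, shifted 4kB each step */

	return 0;
}

Whether the 12.5s or the 1.7s figure is reproduced then depends on whether the heap lands below or above the 1TB boundary, i.e. on randomize_va_space and this patch.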
Re: powerpc: Move 64bit heap above 1TB on machines with 1TB segments
Benjamin Herrenschmidt wrote on 22/09/2009 22:08:22: > > Unfortunately, I am not sensitive to issues surrounding 1TB segments or how > > they are currently being used. However, as this clearly helps performance > > for large amounts of memory, is it worth providing an option to > > libhugetlbfs to locate 16MB pages above 1TB when they are otherwise being > > unused? > > AFAIK, that is already the case, at least the kernel will hand out pages > above 1T preferentially iirc. > > There were talks about making huge pages below 1T not even come up > until you ask for them with MAP_FIXED, dunno where that went. > Confirmed, huge pages are already being placed above the 1TB mark. I hadn't given thought previously to where hugepages were being placed except within segment boundaries. The patch works as advertised and doesn't appear to collide with huge pages in any obvious way as far as I can tell. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers
oid *v); > extern struct memory_block *find_memory_block(struct mem_section *); > #define CONFIG_MEM_BLOCK_SIZE(PAGES_PER_SECTION< enum mem_add_context { BOOT, HOTPLUG }; > Index: b/mm/page_alloc.c > === > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -48,6 +48,7 @@ > #include > #include > #include > +#include > #include > > #include > @@ -4985,23 +4986,55 @@ void set_pageblock_flags_group(struct pa > int set_migratetype_isolate(struct page *page) > { > struct zone *zone; > - unsigned long flags; > + unsigned long flags, pfn, iter; > + long immobile = 0; So, the count in the structure is unsigned long, but long here. Why the difference in types? > + struct memory_isolate_notify arg; > + int notifier_ret; > int ret = -EBUSY; > int zone_idx; > > zone = page_zone(page); > zone_idx = zone_idx(zone); > + > + pfn = page_to_pfn(page); > + arg.start_addr = (unsigned long)page_address(page); > + arg.nr_pages = pageblock_nr_pages; > + arg.pages_found = 0; > + > spin_lock_irqsave(&zone->lock, flags); > /* >* In future, more migrate types will be able to be isolation target. >*/ > - if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE && > - zone_idx != ZONE_MOVABLE) > - goto out; > - set_pageblock_migratetype(page, MIGRATE_ISOLATE); > - move_freepages_block(zone, page, MIGRATE_ISOLATE); > - ret = 0; > -out: > + do { > + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE && > + zone_idx == ZONE_MOVABLE) { So, this condition requires the zone be MOVABLE and the migrate type be movable. That prevents MIGRATE_RESERVE regions in ZONE_MOVABLE being off-lined even though they can likely be off-lined. It also prevents MIGRATE_MOVABLE sections in other zones being off-lined. Did you mean || here instead of && ? Might want to expand the comment explaining this condition instead of leaving it in the old location which is confusing. > + ret = 0; > + break; > + } Why do you wrap all this in a do {} while(0) instead of preserving the out: label and using goto? > + > + /* > + * If all of the pages in a zone are used by a balloon, > + * the range can be still be isolated. The balloon will > + * free these pages from the memory notifier chain. > + */ > + notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); > + notifier_ret = notifier_to_errno(ret); > + if (notifier_ret || !arg.pages_found) > + break; > + > + for (iter = pfn; iter < (pfn + pageblock_nr_pages); iter++) > + if (page_count(pfn_to_page(iter))) > + immobile++; > + > + if (arg.pages_found == immobile) and here you compare a signed with an unsigned type. Probably harmless but why do it? > + ret = 0; > + } while (0); > + So the out label would go here and you'd get rid of the do {} while(0) loop. > + if (!ret) { > + set_pageblock_migratetype(page, MIGRATE_ISOLATE); > + move_freepages_block(zone, page, MIGRATE_ISOLATE); > + } > + > spin_unlock_irqrestore(&zone->lock, flags); > if (!ret) > drain_all_pages(); > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
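Putting the structural comments together, the isolation path being described would look roughly like the sketch below: goto out instead of the do {} while (0) wrapper, the movable test written with || (taking the question above at face value), immobile carried as unsigned long to match pages_found, and the notifier return value fed to notifier_to_errno(). This is an outline of the review feedback, not the patch that was eventually applied:

int set_migratetype_isolate(struct page *page)
{
	struct zone *zone = page_zone(page);
	struct memory_isolate_notify arg;
	unsigned long flags, pfn, iter;
	unsigned long immobile = 0;	/* same type as arg.pages_found */
	int notifier_ret;
	int ret = -EBUSY;

	pfn = page_to_pfn(page);
	arg.start_addr = (unsigned long)page_address(page);
	arg.nr_pages = pageblock_nr_pages;
	arg.pages_found = 0;

	spin_lock_irqsave(&zone->lock, flags);

	/*
	 * A MIGRATE_MOVABLE pageblock can be isolated directly, and any
	 * block in ZONE_MOVABLE is a candidate regardless of migratetype.
	 */
	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
	    zone_idx(zone) == ZONE_MOVABLE) {
		ret = 0;
		goto out;
	}

	/*
	 * If all of the allocated pages in the block are owned by a balloon
	 * driver, the range can still be isolated; the balloon releases the
	 * pages from its memory notifier.
	 */
	notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
	notifier_ret = notifier_to_errno(notifier_ret);
	if (notifier_ret || !arg.pages_found)
		goto out;

	for (iter = pfn; iter < pfn + pageblock_nr_pages; iter++)
		if (page_count(pfn_to_page(iter)))
			immobile++;

	if (arg.pages_found == immobile)
		ret = 0;

out:
	if (!ret) {
		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
		move_freepages_block(zone, page, MIGRATE_ISOLATE);
	}

	spin_unlock_irqrestore(&zone->lock, flags);
	if (!ret)
		drain_all_pages();
	return ret;
}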