Re: mm: BUG_ON with NUMA_BALANCING (kernel BUG at include/linux/swapops.h:131!)

2015-05-14 Thread Mel Gorman
On Wed, May 13, 2015 at 01:17:54AM -0700, Haren Myneni wrote:
> Hi,
> 
>  I am getting BUG_ON in migration_entry_to_page() with 4.1.0-rc2
> kernel on powerpc system which has 512 CPUs (64 cores - 16 nodes) and
> 1.6 TB memory. We can easily recreate this issue with kernel compile
> (make -j500). But I could not reproduce with numa_balancing=disable.
> 

Is this patched in any way? I ask because line 134 on 4.1.0-rc2 does not
match up with a BUG_ON. It's close to a PageLocked check but I want to
be sure there are no other modifications.

Otherwise, when was the last time this worked? Was 4.0 ok? As it can be
easily reproduced, can the problem be bisected please?

-- 
Mel Gorman
SUSE Labs

Re: mm: BUG_ON with NUMA_BALANCING (kernel BUG at include/linux/swapops.h:131!)

2015-05-18 Thread Mel Gorman
On Mon, May 18, 2015 at 12:32:29AM -0700, Haren Myneni wrote:
> Mel,
> I am hitting this issue with 4.0 kernel and even with 3.19 and
> 3.17 kernels. I will also try with previous versions. Please let me
> know any suggestions on the debugging.
> 

Please keep going further back in time to see if there was a point where
this was ever working. It could be a ppc64-specific bug but right now,
I'm still drawing a blank.

-- 
Mel Gorman
SUSE Labs

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
> Mel Gorman  writes:
> 
> > This is follow up from the "pipe/page fault oddness" thread.
> >
> > Automatic NUMA balancing depends on being able to protect PTEs to trap a
> > fault and gather reference locality information. Very broadly speaking it
> > would mark PTEs as not present and use another bit to distinguish between
> > NUMA hinting faults and other types of faults. It was universally loved
> > by everybody and caused no problems whatsoever. That last sentence might
> > be a lie.
> >
> > This series is very heavily based on patches from Linus and Aneesh to
> > replace the existing PTE/PMD NUMA helper functions with normal change
> > protections. I did alter and add parts of it but I consider them relatively
> > minor contributions. Note that the signed-offs here need addressing. I
> > couldn't use "From" or Signed-off-by from the original authors as the
> > patches had to be broken up and they were never signed off. I expect the
> > two people involved will just stick their signed-off-by on it.
> 
> 
> How about the additional change listed below for ppc64 ? One part of the
> patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
> because we find the _PAGE_PRESENT bit set in case of numa fault. I
> ended up relaxing the check there.
> 

I folded the set_pte_at and set_pmd_at changes into the patch "mm: Convert
p[te|md]_numa users to p[te|md]_protnone_numa" with one change -- both
set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.
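
For reference, the relaxed check as folded in looks like the following on
the pte side (the pmd side uses the same pattern):

#ifdef CONFIG_DEBUG_VM
	/*
	 * A NUMA-hinting pte is _PAGE_PRESENT but never _PAGE_USER, so only
	 * complain if both bits are already set in the existing entry.
	 */
	WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
		(_PAGE_PRESENT | _PAGE_USER));
#endif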

> Second part of the change is to add a WARN_ON to make sure we are
> not depending on DSISR_PROTFAULT for anything else. We ideally should not
> get a DSISR_PROTFAULT for a PROT_NONE or NUMA fault. hash_page_mm does check
> whether the access is allowed by the pte before inserting a pte into the
> hash page table. Hence we will never find PROT_NONE or PROT_NONE_NUMA ptes
> in the hash page table. But is it good to run with VM_WARN_ON?
> 

Due to the nature of the check and when it is hit, I converted it to
a WARN_ON_ONCE. Given the exceptional circumstances, the overhead should
be non-existent and should not need to be hidden below VM_WARN_ON. I also
noted that with the patch the kernel potentially no longer recovers
from this exceptional circumstance and instead falls through. To avoid
this, I preserved the "goto out_unlock".

Is this still ok?

---8<---
ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

ppc64 should not be depending on DSISR_PROTFAULT and it is unexpected
if such faults are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Requires-signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Mel Gorman 
---
 arch/powerpc/mm/copro_fault.c |  7 ++-
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..46152aa 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5007497..9d6e0b3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Tue, Nov 18, 2014 at 10:03:30PM +0530, Aneesh Kumar K.V wrote:
> > diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
> > index 5a236f0..46152aa 100644
> > --- a/arch/powerpc/mm/copro_fault.c
> > +++ b/arch/powerpc/mm/copro_fault.c
> > @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
> > long ea,
> > if (!(vma->vm_flags & VM_WRITE))
> > goto out_unlock;
> > } else {
> > -   if (dsisr & DSISR_PROTFAULT)
> > +   /*
> > +* protfault should only happen due to us
> > +* mapping a region readonly temporarily. PROT_NONE
> > +* is also covered by the VMA check above.
> > +*/
> > +   if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
> > goto out_unlock;
> > if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> > goto out_unlock;
> 
> 
> we should do that DSISR_PROTFAULT check after vma->vm_flags. It is not
> that we will not hit DSISR_PROTFAULT; what we want to ensure here is that
> we get a prot fault only for cases covered by that vma check. So
> everything should be taking the if (!(vma->vm_flags & (VM_READ |
> VM_EXEC))) branch if it is a protfault. If not, we would like to know
> about that. And hence the idea of not using WARN_ON_ONCE. I was also not
> sure whether we want to enable that always. The reason for keeping that
> within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
> PROTFAULT outside the cases covered by the vma check. So the expectation
> is that developers working on the feature will run with DEBUG_VM enabled
> and find this warning. We don't expect to hit this otherwise.
> 

/me slaps self. It's clear now and updated accordingly. Thanks.
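
For the record, the copro_handle_mm_fault() branch ends up ordered as
follows in the next revision, with the VMA check first and the warning
after it:

	} else {
		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
			goto out_unlock;
		/*
		 * protfault should only happen due to us
		 * mapping a region readonly temporarily. PROT_NONE
		 * is also covered by the VMA check above.
		 */
		WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
	}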

-- 
Mel Gorman
SUSE Labs

[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault

2014-11-20 Thread Mel Gorman
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting fault.
This patch closes the race.

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h | 4 
 mm/huge_memory.c| 3 ++-
 mm/migrate.c| 6 --
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 01aad3e..a3edcdf 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct 
address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
-extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
@@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
-static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t 
*pmd)
-{
-}
 static inline int migrate_misplaced_page(struct page *page,
 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..38fa6cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1284,8 +1284,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * check_same as the page may no longer be mapped.
 */
if (unlikely(pmd_trans_migrating(*pmdp))) {
+   page = pmd_page(*pmdp);
spin_unlock(ptl);
-   wait_migrate_huge_page(vma->anon_vma, pmdp);
+   wait_on_page_locked(page);
goto out;
}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 0143995..baaf80e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd)
return PageLocked(page);
 }
 
-void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
-   struct page *page = pmd_page(*pmd);
-   wait_on_page_locked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
2.1.2


[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2

2014-11-20 Thread Mel Gorman
V1 failed while running under kvm-tools very quickly and a second report
indicated that it happens on bare metal as well. This version survived
an overnight run of trinity running under kvm-tools here but verification
from Sasha would be appreciated.

Changelog since V1
o ppc64 paranoia checks and clarifications  (aneesh)
o Fix trinity regression (hopefully)
o Reduce unnecessary TLB flushes            (mel)

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking it
would mark PTEs as not present and use another bit to distinguish between
NUMA hinting faults and other types of faults. It was universally loved
by everybody and caused no problems whatsoever. That last sentence might
be a lie.
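
To make the mechanism concrete, here is a minimal sketch (not code from
this series; the helper name fault_is_numa_hint is invented for
illustration, while pte_protnone_numa is the helper added in patch 2) of
how a fault can be classified once the PTE only carries PROT_NONE-style
protections:

static inline bool fault_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
{
	/* The PTE is still accessible so this is an ordinary fault */
	if (!pte_protnone_numa(pte))
		return false;

	/* PTE says "no access" but the VMA allows it: a NUMA hinting fault */
	return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
}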

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them relatively
minor contributions. At their suggestion, acked-bys are in there but I've
no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for
that. I tested trinity under kvm-tool and it passed, and I ran a few other basic
tests. In most cases I'm leaving out detail as it's not that interesting.

specjbb single JVM: There was negligible performance difference in the
benchmark itself for short and long runs. However, system activity
is higher and interrupts are much higher over time -- possibly
TLB flushes. Migrations are also higher. Overall, this is more
overhead but considering the problems faced with the old approach
I think we just have to suck it up and find another way of reducing
the overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.

autonumabench
                                      3.18.0-rc4            3.18.0-rc4
                                         vanilla         protnone-v2r5
User    NUMA01               32806.01 (  0.00%)    20250.67 ( 38.27%)
User    NUMA01_THEADLOCAL    23910.28 (  0.00%)    22734.37 (  4.92%)
User    NUMA02                3176.85 (  0.00%)     3082.68 (  2.96%)
User    NUMA02_SMT            1600.06 (  0.00%)     1547.08 (  3.31%)
System  NUMA01                 719.07 (  0.00%)     1344.39 (-86.96%)
System  NUMA01_THEADLOCAL      916.26 (  0.00%)      180.90 ( 80.26%)
System  NUMA02                  20.92 (  0.00%)       17.34 ( 17.11%)
System  NUMA02_SMT               8.76 (  0.00%)        7.24 ( 17.35%)
Elapsed NUMA01                 728.27 (  0.00%)      519.28 ( 28.70%)
Elapsed NUMA01_THEADLOCAL      589.15 (  0.00%)      554.73 (  5.84%)
Elapsed NUMA02                  81.20 (  0.00%)       81.72 ( -0.64%)
Elapsed NUMA02_SMT              80.49 (  0.00%)       79.58 (  1.13%)
CPU     NUMA01                4603.00 (  0.00%)     4158.00 (  9.67%)
CPU     NUMA01_THEADLOCAL     4213.00 (  0.00%)     4130.00 (  1.97%)
CPU     NUMA02                3937.00 (  0.00%)     3793.00 (  3.66%)
CPU     NUMA02_SMT            1998.00 (  0.00%)     1952.00 (  2.30%)


System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage
is great.  Overall time to complete the benchmark is comparable

               3.18.0-rc4    3.18.0-rc4
                  vanilla protnone-v2r5
User             61493.38      47615.01
System            1665.17       1550.07
Elapsed           1480.79       1236.74

NUMA alloc hit 4739774 5328362
NUMA alloc miss  0   0
NUMA interleave hit  0   0
NUMA alloc local   4664980 5328351
NUMA base PTE updates    556489407   444119981
NUMA huge PMD updates  1086000  866680
NUMA page range updates 1112521407   887860141
NUMA hint faults   1538964 1242142
NUMA hint local faults  835871  814313
NUMA hint local percent 54  65
NUMA pages migrated732921259883854

The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be simply
down to bad luck but any modifications that affect that workload would be
related to scan rates and migrations, not the protection mechanism. For
all other workloads, migration activity was comparable.

Overall, headline performance figures are co

[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

2014-11-20 Thread Mel Gorman
This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 11 +++
 arch/x86/include/asm/pgtable.h | 16 
 include/asm-generic/pgtable.h  | 19 +++
 3 files changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 316f9a5..452c3b4 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -39,6 +39,17 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
+static inline int pte_protnone_numa(pte_t pte)
+{
+   return (pte_val(pte) &
+   (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+   return pte_protnone_numa(pmd_pte(pmd));
+}
+
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_NUMA_MASK;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a07..613cd00 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -497,6 +497,22 @@ static inline int pmd_present(pmd_t pmd)
 _PAGE_NUMA);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone_numa(pte_t pte)
+{
+   return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+   return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline int pmd_none(pmd_t pmd)
 {
/* Only check low word on 32-bit platforms, since it might be
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 752e30d..7e74122 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -677,6 +677,25 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_NUMA_BALANCING
+/*
+ * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
+ * the only case the kernel cares is for NUMA balancing. By default,
+ * implement the helper as "always no". Note that this does not check VMA
+ * protection bits meaning that it is up to the caller to distinguish between
+ * PROT_NONE protections and NUMA hinting fault protections.
+ */
+static inline int pte_protnone_numa(pte_t pte)
+{
+   return 0;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+   return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
-- 
2.1.2


[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-11-20 Thread Mel Gorman
Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered. This patch layout is to make
review easier.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 arch/powerpc/mm/fault.c |  5 -
 arch/powerpc/mm/gup.c   |  4 ++--
 arch/powerpc/mm/pgtable.c   |  8 +++-
 arch/powerpc/mm/pgtable_64.c|  3 ++-
 arch/x86/mm/gup.c   |  4 ++--
 include/uapi/linux/mempolicy.h  |  2 +-
 mm/gup.c|  8 
 mm/huge_memory.c| 16 +++
 mm/memory.c |  4 ++--
 mm/mprotect.c   | 39 ++---
 mm/pgtable-generic.c|  2 +-
 12 files changed, 40 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54..bbd8499 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
  &pte_size);
-   if (pte_present(pte) && !pte_numa(pte)) {
+   if (pte_present(pte) && !pte_protnone_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 08d659a..5007497 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -405,8 +405,6 @@ good_area:
 * processors use the same I/D cache coherency mechanism
 * as embedded.
 */
-   if (error_code & DSISR_PROTFAULT)
-   goto bad_area;
 #endif /* CONFIG_PPC_STD_MMU */
 
/*
@@ -430,9 +428,6 @@ good_area:
flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
-   /* protection fault */
-   if (error_code & 0x0800)
-   goto bad_area;
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
}
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..d870d93 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -39,7 +39,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long 
addr,
/*
 * Similar to the PMD case, NUMA hinting must take slow path
 */
-   if (pte_numa(pte))
+   if (pte_protnone_numa(pte))
return 0;
 
if ((pte_val(pte) & mask) != result)
@@ -85,7 +85,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
 * slowpath for accounting purposes and so that they
 * can be serialised against THP migration.
 */
-   if (pmd_numa(pmd))
+   if (pmd_protnone_numa(pmd))
return 0;
 
if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602..b5d58d3 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -173,7 +173,13 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep,
pte_t pte)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
+   /*
+* When handling numa faults, we already have the pte marked
+* _PAGE_PRESENT, but we can be sure that it is not in hpte.
+* Hence we can use set_pte_at for them.
+*/
+   WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
 #endif
/* Note: mm->context.id might not yet have been assigned as
 * this context might not have been activated yet when this
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8d709a..c721c5e 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -710,7 +710,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
+   WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
assert_spin_locked(&mm->page_table_lock);
WARN_ON(!pmd_tra

[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

2014-11-20 Thread Mel Gorman
ppc64 should not be depending on DSISR_PROTFAULT and it is unexpected
if such faults are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Signed-off-by: Mel Gorman 
Acked-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/copro_fault.c |  8 ++--
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..0450d68 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
-   goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
}
 
ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5007497..9d6e0b3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
@@ -430,6 +427,7 @@ good_area:
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
}
 
/*
-- 
2.1.2


[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations

2014-11-20 Thread Mel Gorman
With PROT_NONE, the traditional page table manipulation functions are
sufficient.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
---
 include/linux/huge_mm.h |  3 +--
 mm/huge_memory.c| 33 +++--
 mm/memory.c | 10 ++
 mm/mempolicy.c  |  2 +-
 mm/migrate.c|  2 +-
 mm/mprotect.c   |  2 +-
 mm/pgtable-generic.c|  2 --
 7 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051b..554bbe3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot,
-   int prot_numa);
+   unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec0b424..668f1a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1367,9 +1367,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
-   pmd = pmd_mknonnuma(pmd);
+   pmd = pmd_modify(pmd, vma->vm_page_prot);
set_pmd_at(mm, haddr, pmdp, pmd);
-   VM_BUG_ON(pmd_protnone_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
 out_unlock:
@@ -1503,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot, int prot_numa)
+   unsigned long addr, pgprot_t newprot)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
ret = 1;
-   if (!prot_numa) {
-   entry = pmdp_get_and_clear(mm, addr, pmd);
-   if (pmd_protnone_numa(entry))
-   entry = pmd_mknonnuma(entry);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
-   } else {
-   struct page *page = pmd_page(*pmd);
-
-   /*
-* Do not trap faults against the zero page. The
-* read-only data is likely to be read-cached on the
-* local CPU cache and it is less useful to know about
-* local vs remote hits on the zero page.
-*/
-   if (!is_huge_zero_page(page) &&
-   !pmd_protnone_numa(*pmd)) {
-   pmdp_set_numa(mm, addr, pmd);
-   ret = HPAGE_PMD_NR;
-   }
-   }
+   entry = pmdp_get_and_clear(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
spin_unlock(ptl);
}
 
diff --git a/mm/memory.c b/mm/memory.c
index 96ceb0a..900127b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3120,9 +3120,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*
-   * ptep_modify_prot_start is not called as this is clearing
-   * the _PAGE_NUMA bit and it is not really expected that there
-   * would be concurrent hardware modifications to the PTE.
+   * We can safely just do a "set_pte_at()", because the old
+   * page table entry is not accessible, so there would be no
+   * concurrent hardware modifications to the PTE.
*/
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3131,7 +3131,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
}
 
-   pte = pte_mknonnuma(pte);
+   /* Make it present again */
+   pte = pte_modify(pte, vma->vm_page_prot);
+   pte = pte_mkyoung(pte);
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e58725a..9d61dce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy

[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page

2014-11-20 Thread Mel Gorman
Faults on the huge zero page are pointless and there is a BUG_ON
to catch them during fault time. This patch reintroduces a check
that avoids marking the zero page PAGE_NONE.

Signed-off-by: Mel Gorman 
---
 include/linux/huge_mm.h |  3 ++-
 mm/huge_memory.c| 13 -
 mm/memory.c |  1 -
 mm/mprotect.c   | 15 ++-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 554bbe3..ad9051b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot);
+   unsigned long addr, pgprot_t newprot,
+   int prot_numa);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 668f1a3..3013eb8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1502,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot)
+   unsigned long addr, pgprot_t newprot, int prot_numa)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+
+   /*
+* Avoid trapping faults against the zero page. The read-only
+* data is likely to be read-cached on the local CPU and
+* local/remote hits to the zero page are not interesting.
+*/
+   if (prot_numa && is_huge_zero_pmd(*pmd)) {
+   spin_unlock(ptl);
+   return 0;
+   }
+
ret = 1;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
diff --git a/mm/memory.c b/mm/memory.c
index 900127b..a725c08 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3142,7 +3142,6 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
-   BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
/*
 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/mprotect.c b/mm/mprotect.c
index dc65c0f..33dfafb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
+
+   /*
+* Avoid trapping faults against the zero or KSM
+* pages. See similar comment in change_huge_pmd.
+*/
+   if (prot_numa) {
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, oldpte);
+   if (!page || PageKsm(page))
+   continue;
+   }
+
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
 
@@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
-   newprot);
+   newprot, prot_numa);
 
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
-- 
2.1.2


[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers

2014-11-20 Thread Mel Gorman
This patch removes the NUMA PTE bits and associated helpers. As a side-effect
it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge has had problems in
the past due to a different migration mechanism and a dependence on the page
lock to serialise migrations, so it warrants a closer look.

task_work hinting update                parallel fault
------------------------                --------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                        pmd_none
                                          do_huge_pmd_anonymous_page
                                          read? pmd_lock blocks until hinting
                                                complete, fail !pmd_none test
                                          write? __do_huge_pmd_anonymous_page
                                                 acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update                parallel migration
------------------------                ------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                          do_huge_pmd_numa_page
                                            migrate_misplaced_transhuge_page
                                        pmd_lock waits for updates to complete,
                                        recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
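
For completeness, the recheck pattern the fault side relies on is the
usual pmd_lock/pmd_same sequence, roughly (simplified from
do_huge_pmd_numa_page):

	ptl = pmd_lock(mm, pmdp);
	if (unlikely(!pmd_same(pmd, *pmdp))) {
		/* A parallel protection update or migration changed the pmd;
		 * drop the lock and let the fault be retried on a fresh entry.
		 */
		spin_unlock(ptl);
		return 0;
	}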

Signed-off-by: Mel Gorman 
---
 arch/powerpc/include/asm/pgtable.h|  54 +---
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/x86/include/asm/pgtable.h|  22 +
 arch/x86/include/asm/pgtable_64.h |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +
 include/asm-generic/pgtable.h | 155 --
 include/linux/swapops.h   |   2 +-
 8 files changed, 7 insertions(+), 283 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 452c3b4..2e074e7 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -49,64 +49,12 @@ static inline int pmd_protnone_numa(pmd_t pmd)
 {
return pte_protnone_numa(pmd_pte(pmd));
 }
-
-static inline int pte_present(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t pte)
-{
-   return pte_val(pte) & (_PAGE_PRESENT);
-}
-
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-pte_t *ptep)
-{
-   if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
-   return;
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-pmd_t *pmdp)
-{
-   if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
-   return;
-}
-
-/*
- * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
- * which was inherited from x86. For the purposes of powerpc pte_basic_t and
- * pmd_t are equivalent
- */
-#define pteval_t pte_basic_t
-#define pmdval_t pmd_t
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
-   return pmd_val(pmd) & _PAGE_NUMA_MASK;
-}
-
-# else
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_PRESENT;
 }
-#endif /* CONFIG_NUMA_BALANCING */
 
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
diff --git a/arch/powerpc/include/asm/pte-common.h 
b/arch/powerpc/include/asm/pte-common.h
index e040c35..8d1569c 100644
-

[PATCH 08/10] x86: mm: Restore original pte_special check

2014-11-20 Thread Mel Gorman
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had SPECIAL
and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no
longer modifying _PAGE_PRESENT it should be safe to restore the original
pte_special behaviour.

Signed-off-by: Mel Gorman 
---
 arch/x86/include/asm/pgtable.h | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f8799e0..5241332 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -131,13 +131,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-   /*
-* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
-* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
-* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
-*/
-   return (pte_flags(pte) & _PAGE_SPECIAL) &&
-   (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+   return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
-- 
2.1.2


[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa

2014-11-20 Thread Mel Gorman
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it. BUG_ON looks like overkill
but if this is hit then it's a serious bug that could result in corruption
so do not even try recovering. It would have been more comprehensive to
check VMA flags in pte_protnone_numa but it would have made the API ugly
just for a debugging check.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 3 +++
 mm/memory.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3013eb8..6458229f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1274,6 +1274,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index a725c08..42d652d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3115,6 +3115,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
-- 
2.1.2


[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries

2014-11-20 Thread Mel Gorman
If a PTE or PMD is already marked NUMA when scanning to mark entries
for NUMA hinting then it is not necessary to update the entry and
incur a TLB flush penalty. Avoid the overhead where possible.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 14 --
 mm/mprotect.c|  4 
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6458229f..a7ea9b8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
return 0;
}
 
-   ret = 1;
-   entry = pmdp_get_and_clear(mm, addr, pmd);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
+   if (!prot_numa || !pmd_protnone_numa(*pmd)) {
+   ret = 1;
+   entry = pmdp_get_and_clear(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
+   }
spin_unlock(ptl);
}
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 33dfafb..eb890d0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;
+
+   /* Avoid TLB flush if possible */
+   if (pte_protnone_numa(oldpte))
+   continue;
}
 
ptent = ptep_modify_prot_start(mm, addr, pte);
-- 
2.1.2


Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-11-20 Thread Mel Gorman
On Thu, Nov 20, 2014 at 10:38:56AM +, David Laight wrote:
> From:  Mel Gorman
> > Convert existing users of pte_numa and friends to the new helper. Note
> > that the kernel is broken after this patch is applied until the other
> > page table modifiers are also altered. This patch layout is to make
> > review easier.
> 
> Doesn't that break bisection?
> 

Yes, for automatic NUMA balancing at least. The patch structure is to
make the reviewers' job easier and besides, bisecting within patches 2-6
is pointless. If desired, I can collapse patches 2-6 together for the
final submission.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2

2014-11-21 Thread Mel Gorman
On Thu, Nov 20, 2014 at 04:50:25PM -0500, Sasha Levin wrote:
> On 11/20/2014 05:19 AM, Mel Gorman wrote:
> > V1 failed while running under kvm-tools very quickly and a second report
> > indicated that it happens on bare metal as well. This version survived
> > an overnight run of trinity running under kvm-tools here but verification
> > from Sasha would be appreciated.
> 
> Hi Mel,
> 
> I tried giving it a spin, but it won't apply at all on the latest -mm
> tree:
> 
> $ git am -3 numa/*
> Applying: mm: numa: Do not dereference pmd outside of the lock during NUMA 
> hinting fault
> Applying: mm: Add p[te|md] protnone helpers for use by NUMA balancing
> Applying: mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
> fatal: sha1 information is lacking or useless (mm/huge_memory.c).
> Repository lacks necessary blobs to fall back on 3-way merge.
> Cannot fall back to three-way merge.
> 
> Did I miss a prerequisite?
> 

No. V2 was still against 3.18-rc4 as that was what I had vanilla kernel
test data for. V3 will be against latest mmotm.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

2014-11-21 Thread Mel Gorman
On Thu, Nov 20, 2014 at 11:54:06AM -0800, Linus Torvalds wrote:
> On Thu, Nov 20, 2014 at 2:19 AM, Mel Gorman  wrote:
> > This is a preparatory patch that introduces protnone helpers for automatic
> > NUMA balancing.
> 
> Oh, I hadn't noticed that you had renamed these things. It was
> probably already true in your V1 version.
> 
> I do *not* think that "pte_protnone_numa()" makes sense as a name. It
> only confuses people to think that there is still/again something
> NUMA-special about the PTE. The whole point of the protnone changes
> was to make it really very very clear that from a hardware standpoint,
> this is *exactly* about protnone, and nothing else.
> 
> The fact that we then use protnone PTE's for numa faults is a VM
> internal issue, it should *not* show up in the architecture page table
> helpers.
> 
> I'm not NAK'ing this name, but I really think it's a very important
> part of the whole patch series - to stop the stupid confusion about
> NUMA entries. As far as the page tables are concerned, this has
> absolutely _zero_ to do with NUMA.
> 
> We made that mistake once. We're fixing it. Let the naming *show* that
> it's fixed, and this is "pte_protnone()".
> 
> The places that use this for NUMA handling might have a comment or
> something. But they'll be in the VM where this matters, not in the
> architecture page table description files. The comment would be
> something like "if the vma is accessible, but the PTE is marked
> protnone, this is an autonuma entry".
> 

I feared that people would eventually make the mistake of thinking that
pte_protnone() would return true for PROT_NONE VMAs that do *not* have
the page table bit set. I'll use the old name as you suggest and expand
the comment. It'll be in v3.
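
At the call site it would read something like the following (sketch only,
exact wording and placement to be finalised in v3):

	/*
	 * The VMA is accessible but the pte is marked protnone, so this can
	 * only be a NUMA hinting entry, not a userspace PROT_NONE mapping.
	 */
	if (pte_protnone(entry))
		return do_numa_page(mm, vma, address, entry, pte, pmd);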

-- 
Mel Gorman
SUSE Labs

[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault

2014-11-21 Thread Mel Gorman
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting fault.
This patch closes the race.

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h | 4 
 mm/huge_memory.c| 3 ++-
 mm/migrate.c| 6 --
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 01aad3e..a3edcdf 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct 
address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
-extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
@@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
-static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t 
*pmd)
-{
-}
 static inline int migrate_misplaced_page(struct page *page,
 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817a875..a2cd021 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1283,8 +1283,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * check_same as the page may no longer be mapped.
 */
if (unlikely(pmd_trans_migrating(*pmdp))) {
+   page = pmd_page(*pmdp);
spin_unlock(ptl);
-   wait_migrate_huge_page(vma->anon_vma, pmdp);
+   wait_on_page_locked(page);
goto out;
}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 41945cb..11d86b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd)
return PageLocked(page);
 }
 
-void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
-   struct page *page = pmd_page(*pmd);
-   wait_on_page_locked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
2.1.2


[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v3

2014-11-21 Thread Mel Gorman
The main change here is to rebase on mmotm-20141119 as the series had
significant conflicts that were non-obvious to resolve. The main blockers
for merging are independent testing from Sasha (trinity), independent
testing from Aneesh (ppc64 support) and acks from Ben and Paul on the
powerpc patches.

Changelog since V2
o Rename *_protnone_numa to _protnone and extend docs   (linus)
o Rebase to mmotm-20141119 for pre-merge testing         (mel)
o Convert WARN_ON to VM_WARN_ON                          (aneesh)

Changelog since V1
o ppc64 paranoia checks and clarifications  (aneesh)
o Fix trinity regression (hopefully)
o Reduce unnecessary TLB flushes            (mel)

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking it
would mark PTEs as not present and use another bit to distinguish between
NUMA hinting faults and other types of faults. It was universally loved
by everybody and caused no problems whatsoever. That last sentence might
be a lie.
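
As a simplified sketch of the two halves after the conversion (not the
exact mm/ code; newprot stands for the PROT_NONE-style protection passed
down by the scanner):

	/* Scanner side: apply the restrictive protection to trap the next access */
	oldpte = ptep_modify_prot_start(mm, addr, ptep);
	oldpte = pte_modify(oldpte, newprot);
	ptep_modify_prot_commit(mm, addr, ptep, oldpte);

	/* Fault side: make the pte present again with the VMA's real protections */
	pte = pte_modify(pte, vma->vm_page_prot);
	pte = pte_mkyoung(pte);
	set_pte_at(mm, addr, ptep, pte);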

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them relatively
minor contributions. At their suggestion, acked-bys are in there but I've
no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for
that. I tested trinity under kvm-tool and it passed, and I ran a few other basic
tests. At the time of writing, only the short-lived tests have completed
but testing of V2 indicated that long-term testing had no surprises. In
most cases I'm leaving out detail as it's not that interesting.

specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.

autonumabench
                                      3.18.0-rc5            3.18.0-rc5
                                  mmotm-20141119         protnone-v3r3
User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage
is great.  Overall time to complete the benchmark is comparable

               3.18.0-rc5    3.18.0-rc5
           mmotm-20141119 protnone-v3r3
User             59612.50      48586.44
System             460.22       1537.45
Elapsed           1442.20       1304.29

NUMA alloc hit 5075182 5743353
NUMA alloc miss  0   0
NUMA interleave hit  0   0
NUMA alloc local   5075174 5743339
NUMA base PTE updates    637061448   443106883
NUMA huge PMD updates  1243434  864747
NUMA page range updates 1273699656   885857347
NUMA hint faults   1658116 1214277
NUMA hint local faults  959487  754113
NUMA hint local percent 57  62
NUMA pages migrated546705661676398

The NUMA pages migrated look terrible but when I looked at 

[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

2014-11-21 Thread Mel Gorman
This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 15 +++
 arch/x86/include/asm/pgtable.h | 16 
 include/asm-generic/pgtable.h  | 20 
 3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index a8805fe..490bd6d 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -39,6 +39,21 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return (pte_val(pte) &
+   (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return pte_protnone(pmd_pte(pmd));
+}
+
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_NUMA_MASK;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 081d6f4..2e25780 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -502,6 +502,22 @@ static inline int pmd_present(pmd_t pmd)
 _PAGE_NUMA);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline int pmd_none(pmd_t pmd)
 {
/* Only check low word on 32-bit platforms, since it might be
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 177d597..d497d08 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -688,6 +688,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_NUMA_BALANCING
+/*
+ * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
+ * the only case the kernel cares is for NUMA balancing and is only ever set
+ * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
+ * _PAGE_PROTNONE so by default, implement the helper as "always no". It
+ * is the responsibility of the caller to distinguish between PROT_NONE
+ * protections and NUMA hinting fault protections.
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return 0;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
-- 
2.1.2


[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-11-21 Thread Mel Gorman
Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered. This patch layout is to make
review easier.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 arch/powerpc/mm/fault.c |  5 -
 arch/powerpc/mm/pgtable.c   | 11 ---
 arch/powerpc/mm/pgtable_64.c|  3 ++-
 arch/x86/mm/gup.c   |  4 ++--
 include/uapi/linux/mempolicy.h  |  2 +-
 mm/gup.c| 10 +-
 mm/huge_memory.c| 16 +++
 mm/memory.c |  4 ++--
 mm/mprotect.c   | 39 ++---
 mm/pgtable-generic.c|  2 +-
 11 files changed, 40 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54..3e6ad3f 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
  &pte_size);
-   if (pte_present(pte) && !pte_numa(pte)) {
+   if (pte_present(pte) && !pte_protnone(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb79907..b434153 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -398,8 +398,6 @@ good_area:
 * processors use the same I/D cache coherency mechanism
 * as embedded.
 */
-   if (error_code & DSISR_PROTFAULT)
-   goto bad_area;
 #endif /* CONFIG_PPC_STD_MMU */
 
/*
@@ -423,9 +421,6 @@ good_area:
flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
-   /* protection fault */
-   if (error_code & 0x0800)
-   goto bad_area;
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
}
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602..83dfcb5 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct 
vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
 {
-#ifdef CONFIG_DEBUG_VM
-   WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
-#endif
+   /*
+* When handling numa faults, we already have the pte marked
+* _PAGE_PRESENT, but we can be sure that it is not in hpte.
+* Hence we can use set_pte_at for them.
+*/
+   VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
+
/* Note: mm->context.id might not yet have been assigned as
 * this context might not have been activated yet when this
 * is called.
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 87ff0c1..435ebf7 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
+   WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
assert_spin_locked(&mm->page_table_lock);
WARN_ON(!pmd_trans_huge(pmd));
 #endif
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..f32e12c 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long 
addr,
struct page *page;
 
/* Similar to the PMD case, NUMA hinting must take slow path */
-   if (pte_numa(pte)) {
+   if (pte_protnone(pte)) {
pte_unmap(ptep);
return 0;
}
@@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
 * slowpath for accounting purposes and so that they
 * can be serialised against THP migration.
 */
-   if (pmd_numa(pmd))
+   if (pmd_protnone(pmd))
return 0;
  

[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

2014-11-21 Thread Mel Gorman
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected
if they are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Signed-off-by: Mel Gorman 
Acked-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/copro_fault.c |  8 ++--
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..0450d68 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
-   goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
}
 
ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index b434153..1bcd378 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -389,17 +389,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -414,6 +403,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
@@ -423,6 +420,7 @@ good_area:
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
}
 
/*
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations

2014-11-21 Thread Mel Gorman
With PROT_NONE, the traditional page table manipulation functions are
sufficient.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
---
 include/linux/huge_mm.h |  3 +--
 mm/huge_memory.c| 33 +++--
 mm/memory.c | 10 ++
 mm/mempolicy.c  |  2 +-
 mm/migrate.c|  2 +-
 mm/mprotect.c   |  2 +-
 mm/pgtable-generic.c|  2 --
 7 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051b..554bbe3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot,
-   int prot_numa);
+   unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f81fddf..5618e22 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
-   pmd = pmd_mknonnuma(pmd);
+   pmd = pmd_modify(pmd, vma->vm_page_prot);
set_pmd_at(mm, haddr, pmdp, pmd);
-   VM_BUG_ON(pmd_protnone(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
 out_unlock:
@@ -1503,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot, int prot_numa)
+   unsigned long addr, pgprot_t newprot)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
ret = 1;
-   if (!prot_numa) {
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
-   if (pmd_protnone(entry))
-   entry = pmd_mknonnuma(entry);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
-   } else {
-   struct page *page = pmd_page(*pmd);
-
-   /*
-* Do not trap faults against the zero page. The
-* read-only data is likely to be read-cached on the
-* local CPU cache and it is less useful to know about
-* local vs remote hits on the zero page.
-*/
-   if (!is_huge_zero_page(page) &&
-   !pmd_protnone(*pmd)) {
-   pmdp_set_numa(mm, addr, pmd);
-   ret = HPAGE_PMD_NR;
-   }
-   }
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
spin_unlock(ptl);
}
 
diff --git a/mm/memory.c b/mm/memory.c
index eaa46f1..2100e0f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3114,9 +3114,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*
-   * ptep_modify_prot_start is not called as this is clearing
-   * the _PAGE_NUMA bit and it is not really expected that there
-   * would be concurrent hardware modifications to the PTE.
+   * We can safely just do a "set_pte_at()", because the old
+   * page table entry is not accessible, so there would be no
+   * concurrent hardware modifications to the PTE.
*/
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3125,7 +3125,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
}
 
-   pte = pte_mknonnuma(pte);
+   /* Make it present again */
+   pte = pte_modify(pte, vma->vm_page_prot);
+   pte = pte_mkyoung(pte);
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f22c559..530a0da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy

[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers

2014-11-21 Thread Mel Gorman
This patch removes the NUMA PTE bits and associated helpers. As a side-effect
it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge had problems in the
past due to a different migration mechanism and a dependence on the page lock
to serialise migrations, so it warrants a closer look.

task_work hinting update                parallel fault
------------------------                --------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                        pmd_none
                                          do_huge_pmd_anonymous_page
                                          read? pmd_lock blocks until hinting
                                                complete, fail !pmd_none test
                                          write? __do_huge_pmd_anonymous_page
                                                 acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update                parallel migration
------------------------                ------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                          do_huge_pmd_numa_page
                                            migrate_misplaced_transhuge_page
                                        pmd_lock waits for updates to
                                        complete, recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
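
As a minimal illustration of the fault-side recheck relied on above (the
helper name is hypothetical; the real logic is inlined in
do_huge_pmd_numa_page() and keeps holding the lock while it operates on
the entry):

	static bool hinting_fault_still_valid(struct mm_struct *mm,
					      pmd_t *pmdp, pmd_t orig_pmd)
	{
		spinlock_t *ptl = pmd_lock(mm, pmdp);
		bool same = pmd_same(orig_pmd, *pmdp);

		spin_unlock(ptl);
		/*
		 * If !same, the protection update or a migration won the
		 * race and the fault is simply retried.
		 */
		return same;
	}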

Signed-off-by: Mel Gorman 
---
 arch/powerpc/include/asm/pgtable.h|  54 +---
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/x86/include/asm/pgtable.h|  22 +
 arch/x86/include/asm/pgtable_64.h |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +
 include/asm-generic/pgtable.h | 155 --
 include/linux/swapops.h   |   2 +-
 8 files changed, 7 insertions(+), 283 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 490bd6d..7e4cb3d 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -53,64 +53,12 @@ static inline int pmd_protnone(pmd_t pmd)
 {
return pte_protnone(pmd_pte(pmd));
 }
-
-static inline int pte_present(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t pte)
-{
-   return pte_val(pte) & (_PAGE_PRESENT);
-}
-
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-pte_t *ptep)
-{
-   if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
-   return;
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-pmd_t *pmdp)
-{
-   if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
-   return;
-}
-
-/*
- * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
- * which was inherited from x86. For the purposes of powerpc pte_basic_t and
- * pmd_t are equivalent
- */
-#define pteval_t pte_basic_t
-#define pmdval_t pmd_t
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
-   return pmd_val(pmd) & _PAGE_NUMA_MASK;
-}
-
-# else
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_PRESENT;
 }
-#endif /* CONFIG_NUMA_BALANCING */
 
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
diff --git a/arch/powerpc/include/asm/pte-common.h 
b/arch/powerpc/include/asm/pte-common.h
index e040c35..8d1569c 100644
--- a/arch

[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page

2014-11-21 Thread Mel Gorman
Faults on the huge zero page are pointless and there is a BUG_ON
to catch them during fault time. This patch reintroduces a check
that avoids marking the zero page PAGE_NONE.

Signed-off-by: Mel Gorman 
---
 include/linux/huge_mm.h |  3 ++-
 mm/huge_memory.c| 13 -
 mm/memory.c |  1 -
 mm/mprotect.c   | 15 ++-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 554bbe3..ad9051b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot);
+   unsigned long addr, pgprot_t newprot,
+   int prot_numa);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5618e22..ad2a3ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1502,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot)
+   unsigned long addr, pgprot_t newprot, int prot_numa)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+
+   /*
+* Avoid trapping faults against the zero page. The read-only
+* data is likely to be read-cached on the local CPU and
+* local/remote hits to the zero page are not interesting.
+*/
+   if (prot_numa && is_huge_zero_pmd(*pmd)) {
+   spin_unlock(ptl);
+   return 0;
+   }
+
ret = 1;
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
diff --git a/mm/memory.c b/mm/memory.c
index 2100e0f..2ec07a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3136,7 +3136,6 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
-   BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
/*
 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/mprotect.c b/mm/mprotect.c
index dc65c0f..33dfafb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
+
+   /*
+* Avoid trapping faults against the zero or KSM
+* pages. See similar comment in change_huge_pmd.
+*/
+   if (prot_numa) {
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, oldpte);
+   if (!page || PageKsm(page))
+   continue;
+   }
+
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
 
@@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
-   newprot);
+   newprot, prot_numa);
 
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 08/10] x86: mm: Restore original pte_special check

2014-11-21 Thread Mel Gorman
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had SPECIAL
and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no
longer modifying _PAGE_PRESENT it should be safe to restore the original
pte_special behaviour.

Signed-off-by: Mel Gorman 
---
 arch/x86/include/asm/pgtable.h | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cf428a7..0dd5be3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -136,13 +136,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-   /*
-* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
-* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
-* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
-*/
-   return (pte_flags(pte) & _PAGE_SPECIAL) &&
-   (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+   return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa

2014-11-21 Thread Mel Gorman
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it. BUG_ON looks like overkill
but if this is hit then it's a serious bug that could result in corruption
so do not even try recovering. It would have been more comprehensive to
check VMA flags in pte_protnone_numa but it would have made the API ugly
just for a debugging check.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 3 +++
 mm/memory.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ad2a3ee..8546654 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1273,6 +1273,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index 2ec07a9..7d97af5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3109,6 +3109,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries

2014-11-21 Thread Mel Gorman
If a PTE or PMD is already marked NUMA when scanning to mark entries
for NUMA hinting then it is not necessary to update the entry and
incur a TLB flush penalty. Avoid the overhead where possible.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 14 --
 mm/mprotect.c|  4 
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8546654..f2bf521 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
return 0;
}
 
-   ret = 1;
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
+   if (!prot_numa || !pmd_protnone(*pmd)) {
+   ret = 1;
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
+   }
spin_unlock(ptl);
}
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 33dfafb..109e7aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;
+
+   /* Avoid TLB flush if possible */
+   if (pte_protnone(oldpte))
+   continue;
}
 
ptent = ptep_modify_prot_start(mm, addr, pte);
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

2014-12-02 Thread Mel Gorman
On Tue, Dec 02, 2014 at 09:38:39AM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-11-21 at 13:57 +0000, Mel Gorman wrote:
> 
> >  #ifdef CONFIG_NUMA_BALANCING
> > +/*
> > + * These work without NUMA balancing but the kernel does not care. See the
> > + * comment in include/asm-generic/pgtable.h
> > + */
> > +static inline int pte_protnone(pte_t pte)
> > +{
> > +   return (pte_val(pte) &
> > +   (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
> > +}
> 
> I would add a comment clarifying that this only works for user pages,
> ie, this accessor will always return "true" for a kernel page on ppc.
> 

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 490bd6d..7b889a3 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,7 +41,8 @@ static inline pgprot_t pte_pgprot(pte_t pte)  { return 
__pgprot(pte_val(pte) & PA
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * These work without NUMA balancing but the kernel does not care. See the
- * comment in include/asm-generic/pgtable.h
+ * comment in include/asm-generic/pgtable.h . On powerpc, this will only
+ * work for user pages and always return true for kernel pages.
  */
 static inline int pte_protnone(pte_t pte)
 {
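
As an aside, a concrete illustration of the caveat (the wrapper name is
hypothetical and not part of the patch): kernel mappings carry _PAGE_PRESENT
but never _PAGE_USER, so the test above reads as "protnone" for every kernel
PTE. The helper is only meaningful for user mappings, and the hinting fault
paths additionally filter out genuinely inaccessible VMAs:

	static inline bool pte_protnone_user(struct vm_area_struct *vma,
					     pte_t pte)
	{
		/* a real PROT_NONE VMA is not a NUMA hint */
		if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
			return false;
		return pte_protnone(pte);
	}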

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-12-03 Thread Mel Gorman
On Wed, Dec 03, 2014 at 08:53:37PM +0530, Aneesh Kumar K.V wrote:
> Benjamin Herrenschmidt  writes:
> 
> > On Tue, 2014-12-02 at 12:57 +0530, Aneesh Kumar K.V wrote:
> >> Now, hash_preload can possibly insert an hpte in hash page table even if
> >> the access is not allowed by the pte permissions. But i guess even that
> >> is ok. because we will fault again, end-up calling hash_page_mm where we
> >> handle that part correctly.
> >
> > I think we need a test case...
> >
> 
> I ran the subpageprot test that Paul had written. I modified it to ran
> with selftest. 
> 

It's implied but can I assume it passed? If so, Ben and Paul, can I
consider the series to be acked by you other than the minor comment
updates?

Thanks.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-12-04 Thread Mel Gorman
On Thu, Dec 04, 2014 at 08:01:57AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2014-12-03 at 15:52 +0000, Mel Gorman wrote:
> > 
> > It's implied but can I assume it passed? If so, Ben and Paul, can I
> > consider the series to be acked by you other than the minor comment
> > updates?
> 
> Yes. Assuming it passed :-)
> 
> Acked-by: Benjamin Herrenschmidt 
> 

Sweet, thanks.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-12-04 Thread Mel Gorman
On Wed, Dec 03, 2014 at 10:50:35PM +0530, Aneesh Kumar K.V wrote:
> Mel Gorman  writes:
> 
> > On Wed, Dec 03, 2014 at 08:53:37PM +0530, Aneesh Kumar K.V wrote:
> >> Benjamin Herrenschmidt  writes:
> >> 
> >> > On Tue, 2014-12-02 at 12:57 +0530, Aneesh Kumar K.V wrote:
> >> >> Now, hash_preload can possibly insert an hpte in hash page table even if
> >> >> the access is not allowed by the pte permissions. But i guess even that
> >> >> is ok. because we will fault again, end-up calling hash_page_mm where we
> >> >> handle that part correctly.
> >> >
> >> > I think we need a test case...
> >> >
> >> 
> >> I ran the subpageprot test that Paul had written. I modified it to ran
> >> with selftest. 
> >> 
> >
> > It's implied but can I assume it passed? 
> 
> Yes.
> 
> -bash-4.2# ./subpage_prot 
> test: subpage_prot
> tags: git_version:v3.17-rc3-13511-g0cd3756
> allocated malloc block of 0x400 bytes at 0x0x3fffb0d1
> testing malloc block...
> OK
> success: subpage_prot
> -bash-4.2# 
> 

Thanks for adding that and double checking. I won't pick up the patch as
part of this series because it's not directly related but I would strongly
suggest sending the patch separately.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v4

2014-12-04 Thread Mel Gorman
There are no functional changes here and I kept the mmotm-20141119 baseline
as that is what got tested but it rebases cleanly to current mmotm. The
series makes architectural changes but splitting this on a per-arch basis
would cause bisect-related brain damage. I'm hoping this can go through
Andrew without conflict. It's been tested by myself (standard tests),
Aneesh (ppc64) and Sasha (trinity) so there is some degree of confidence
that it's ok.

Changelog since V3
o Minor comment update  (benh)
o Add ack'ed bys

Changelog since V2
o Rename *_protnone_numa to _protnone and extend docs   (linus)
o Rebase to mmotm-20141119 for pre-merge testing (mel)
o Convert WARN_ON to VM_WARN_ON (aneesh)

Changelog since V1
o ppc64 paranoia checks and clarifications  (aneesh)
o Fix trinity regression (hopefully)
o Reduce unnecessary TLB flushes (mel)

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking it
would mark PTEs as not present and use another bit to distinguish between
NUMA hinting faults and other types of faults. It was universally loved
by everybody and caused no problems whatsoever. That last sentence might
be a lie.
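
For anyone reading the series cold, the core idea can be sketched in a few
lines of pseudo-kernel code. This is illustrative only and not taken from
the series; the hypothetical helpers numa_hint_protect() and
is_numa_hinting_fault() stand in for the real change_pte_range() and
do_numa_page() paths, which additionally handle locking, batching and the
zero/KSM page filters.

	/* Scan side: strip access permissions so the next touch traps */
	static void numa_hint_protect(struct mm_struct *mm,
				      unsigned long addr, pte_t *ptep)
	{
		pte_t pte = ptep_modify_prot_start(mm, addr, ptep);

		pte = pte_modify(pte, PAGE_NONE);
		ptep_modify_prot_commit(mm, addr, ptep, pte);
	}

	/* Fault side: a protnone PTE in an accessible VMA is a NUMA hint */
	static bool is_numa_hinting_fault(struct vm_area_struct *vma, pte_t pte)
	{
		return (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) &&
		       pte_protnone(pte);
	}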

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them relatively
minor contributions. At their suggestion, acked-bys are in there but I've
no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for
that. I tested trinity under kvm-tool, which passed, and ran a few other basic
tests. At the time of writing, only the short-lived tests have completed
but testing of V2 indicated that long-term testing had no surprises. In
most cases I'm leaving out detail as it's not that interesting.

specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.

autonumabench
                                      3.18.0-rc5             3.18.0-rc5
                                  mmotm-20141119          protnone-v3r3
User    NUMA01              32380.24 (  0.00%)     21642.92 ( 33.16%)
User    NUMA01_THEADLOCAL   22481.02 (  0.00%)     22283.22 (  0.88%)
User    NUMA02               3137.00 (  0.00%)      3116.54 (  0.65%)
User    NUMA02_SMT           1614.03 (  0.00%)      1543.53 (  4.37%)
System  NUMA01                322.97 (  0.00%)      1465.89 (-353.88%)
System  NUMA01_THEADLOCAL      91.87 (  0.00%)        49.32 ( 46.32%)
System  NUMA02                 37.83 (  0.00%)        14.61 ( 61.38%)
System  NUMA02_SMT              7.36 (  0.00%)         7.45 ( -1.22%)
Elapsed NUMA01                716.63 (  0.00%)       599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL     553.98 (  0.00%)       539.94 (  2.53%)
Elapsed NUMA02                 83.85 (  0.00%)        83.04 (  0.97%)
Elapsed NUMA02_SMT             86.57 (  0.00%)        79.15 (  8.57%)
CPU     NUMA01               4563.00 (  0.00%)      3855.00 ( 15.52%)
CPU     NUMA01_THEADLOCAL    4074.00 (  0.00%)      4136.00 ( -1.52%)
CPU     NUMA02               3785.00 (  0.00%)      3770.00 (  0.40%)
CPU     NUMA02_SMT           1872.00 (  0.00%)      1959.00 ( -4.65%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage
is great.  Overall time to complete the benchmark is comparable

                            3.18.0-rc5      3.18.0-rc5
                        mmotm-20141119   protnone-v3r3
User                          59612.50        48586.44
System                          460.22         1537.45
Elapsed                        1442.20         1304.29

NUMA alloc hit                 5075182         5743353
NUMA alloc miss                      0               0
NUMA interleave hit                  0               0
NUMA alloc local               5075174         5743339
NUMA base PTE updates        637061448       443106883
NUMA huge PMD updates          1243434          864747
NUMA page range updates     1273699656       885857347
NUMA h

[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault

2014-12-04 Thread Mel Gorman
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock and, while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting faults.
This patch closes the race.

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h | 4 
 mm/huge_memory.c| 3 ++-
 mm/migrate.c| 6 --
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 01aad3e..a3edcdf 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct 
address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
-extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
@@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
-static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t 
*pmd)
-{
-}
 static inline int migrate_misplaced_page(struct page *page,
 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817a875..a2cd021 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1283,8 +1283,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * check_same as the page may no longer be mapped.
 */
if (unlikely(pmd_trans_migrating(*pmdp))) {
+   page = pmd_page(*pmdp);
spin_unlock(ptl);
-   wait_migrate_huge_page(vma->anon_vma, pmdp);
+   wait_on_page_locked(page);
goto out;
}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 41945cb..11d86b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd)
return PageLocked(page);
 }
 
-void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
-   struct page *page = pmd_page(*pmd);
-   wait_on_page_locked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

2014-12-04 Thread Mel Gorman
This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar K.V 
Tested-by: Sasha Levin 
---
 arch/powerpc/include/asm/pgtable.h | 16 
 arch/x86/include/asm/pgtable.h | 16 
 include/asm-generic/pgtable.h  | 20 
 3 files changed, 52 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index a8805fe..7b889a3 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -39,6 +39,22 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h . On powerpc, this will only
+ * work for user pages and always return true for kernel pages.
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return (pte_val(pte) &
+   (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return pte_protnone(pmd_pte(pmd));
+}
+
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_NUMA_MASK;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 081d6f4..2e25780 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -502,6 +502,22 @@ static inline int pmd_present(pmd_t pmd)
 _PAGE_NUMA);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline int pmd_none(pmd_t pmd)
 {
/* Only check low word on 32-bit platforms, since it might be
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 177d597..d497d08 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -688,6 +688,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_NUMA_BALANCING
+/*
+ * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
+ * the only case the kernel cares is for NUMA balancing and is only ever set
+ * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
+ * _PAGE_PROTNONE so by default, implement the helper as "always no". It
+ * is the responsibility of the caller to distinguish between PROT_NONE
+ * protections and NUMA hinting fault protections.
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return 0;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2014-12-04 Thread Mel Gorman
Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered. This patch layout is to make
review easier.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
Acked-by: Benjamin Herrenschmidt 
Tested-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 arch/powerpc/mm/fault.c |  5 -
 arch/powerpc/mm/pgtable.c   | 11 ---
 arch/powerpc/mm/pgtable_64.c|  3 ++-
 arch/x86/mm/gup.c   |  4 ++--
 include/uapi/linux/mempolicy.h  |  2 +-
 mm/gup.c| 10 +-
 mm/huge_memory.c| 16 +++
 mm/memory.c |  4 ++--
 mm/mprotect.c   | 39 ++---
 mm/pgtable-generic.c|  2 +-
 11 files changed, 40 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54..3e6ad3f 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
  &pte_size);
-   if (pte_present(pte) && !pte_numa(pte)) {
+   if (pte_present(pte) && !pte_protnone(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb79907..b434153 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -398,8 +398,6 @@ good_area:
 * processors use the same I/D cache coherency mechanism
 * as embedded.
 */
-   if (error_code & DSISR_PROTFAULT)
-   goto bad_area;
 #endif /* CONFIG_PPC_STD_MMU */
 
/*
@@ -423,9 +421,6 @@ good_area:
flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
-   /* protection fault */
-   if (error_code & 0x0800)
-   goto bad_area;
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
}
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602..83dfcb5 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct 
vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
 {
-#ifdef CONFIG_DEBUG_VM
-   WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
-#endif
+   /*
+* When handling numa faults, we already have the pte marked
+* _PAGE_PRESENT, but we can be sure that it is not in hpte.
+* Hence we can use set_pte_at for them.
+*/
+   VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
+
/* Note: mm->context.id might not yet have been assigned as
 * this context might not have been activated yet when this
 * is called.
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 87ff0c1..435ebf7 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
+   WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
assert_spin_locked(&mm->page_table_lock);
WARN_ON(!pmd_trans_huge(pmd));
 #endif
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..f32e12c 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long 
addr,
struct page *page;
 
/* Similar to the PMD case, NUMA hinting must take slow path */
-   if (pte_numa(pte)) {
+   if (pte_protnone(pte)) {
pte_unmap(ptep);
return 0;
}
@@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
 * slowpath for accounting purposes and so that they
 * can be serialised against THP migration.
 */
-   if (pmd_numa(pmd))
+   if (pmd_protnone(pmd))

[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

2014-12-04 Thread Mel Gorman
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected
if they are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Signed-off-by: Mel Gorman 
Acked-by: Aneesh Kumar K.V 
Tested-by: Sasha Levin 
---
 arch/powerpc/mm/copro_fault.c |  8 ++--
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..0450d68 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
-   goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
}
 
ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index b434153..1bcd378 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -389,17 +389,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -414,6 +403,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
@@ -423,6 +420,7 @@ good_area:
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
}
 
/*
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations

2014-12-04 Thread Mel Gorman
With PROT_NONE, the traditional page table manipulation functions are
sufficient.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
Tested-by: Sasha Levin 
---
 include/linux/huge_mm.h |  3 +--
 mm/huge_memory.c| 33 +++--
 mm/memory.c | 10 ++
 mm/mempolicy.c  |  2 +-
 mm/migrate.c|  2 +-
 mm/mprotect.c   |  2 +-
 mm/pgtable-generic.c|  2 --
 7 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051b..554bbe3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot,
-   int prot_numa);
+   unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f81fddf..5618e22 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
-   pmd = pmd_mknonnuma(pmd);
+   pmd = pmd_modify(pmd, vma->vm_page_prot);
set_pmd_at(mm, haddr, pmdp, pmd);
-   VM_BUG_ON(pmd_protnone(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
 out_unlock:
@@ -1503,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot, int prot_numa)
+   unsigned long addr, pgprot_t newprot)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
ret = 1;
-   if (!prot_numa) {
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
-   if (pmd_protnone(entry))
-   entry = pmd_mknonnuma(entry);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
-   } else {
-   struct page *page = pmd_page(*pmd);
-
-   /*
-* Do not trap faults against the zero page. The
-* read-only data is likely to be read-cached on the
-* local CPU cache and it is less useful to know about
-* local vs remote hits on the zero page.
-*/
-   if (!is_huge_zero_page(page) &&
-   !pmd_protnone(*pmd)) {
-   pmdp_set_numa(mm, addr, pmd);
-   ret = HPAGE_PMD_NR;
-   }
-   }
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
spin_unlock(ptl);
}
 
diff --git a/mm/memory.c b/mm/memory.c
index eaa46f1..2100e0f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3114,9 +3114,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*
-   * ptep_modify_prot_start is not called as this is clearing
-   * the _PAGE_NUMA bit and it is not really expected that there
-   * would be concurrent hardware modifications to the PTE.
+   * We can safely just do a "set_pte_at()", because the old
+   * page table entry is not accessible, so there would be no
+   * concurrent hardware modifications to the PTE.
*/
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3125,7 +3125,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
}
 
-   pte = pte_mknonnuma(pte);
+   /* Make it present again */
+   pte = pte_modify(pte, vma->vm_page_prot);
+   pte = pte_mkyoung(pte);
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f22c559..530a0da 100644
--- a/mm/mempol

[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers

2014-12-04 Thread Mel Gorman
This patch removes the NUMA PTE bits and associated helpers. As a side-effect
it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge had problems in the
past due to a different migration mechanism and a dependence on the page lock
to serialise migrations, so it warrants a closer look.

task_work hinting update                parallel fault
------------------------                --------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                        pmd_none
                                          do_huge_pmd_anonymous_page
                                          read? pmd_lock blocks until hinting
                                                complete, fail !pmd_none test
                                          write? __do_huge_pmd_anonymous_page
                                                 acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update                parallel migration
------------------------                ------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                          do_huge_pmd_numa_page
                                            migrate_misplaced_transhuge_page
                                        pmd_lock waits for updates to
                                        complete, recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.

Signed-off-by: Mel Gorman 
Tested-by: Sasha Levin 
---
 arch/powerpc/include/asm/pgtable.h|  54 +---
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/x86/include/asm/pgtable.h|  22 +
 arch/x86/include/asm/pgtable_64.h |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +
 include/asm-generic/pgtable.h | 155 --
 include/linux/swapops.h   |   2 +-
 8 files changed, 7 insertions(+), 283 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 7b889a3..0a85e33 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -54,64 +54,12 @@ static inline int pmd_protnone(pmd_t pmd)
 {
return pte_protnone(pmd_pte(pmd));
 }
-
-static inline int pte_present(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t pte)
-{
-   return pte_val(pte) & (_PAGE_PRESENT);
-}
-
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-pte_t *ptep)
-{
-   if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
-   return;
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-pmd_t *pmdp)
-{
-   if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
-   return;
-}
-
-/*
- * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
- * which was inherited from x86. For the purposes of powerpc pte_basic_t and
- * pmd_t are equivalent
- */
-#define pteval_t pte_basic_t
-#define pmdval_t pmd_t
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
-   return pmd_val(pmd) & _PAGE_NUMA_MASK;
-}
-
-# else
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_PRESENT;
 }
-#endif /* CONFIG_NUMA_BALANCING */
 
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
diff --git a/arch/powerpc/include/asm/pte-common.h 
b/arch/powerpc/include/asm/pte-common.h
index e040c35..

[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page

2014-12-04 Thread Mel Gorman
Faults on the huge zero page are pointless and there is a BUG_ON
to catch them during fault time. This patch reintroduces a check
that avoids marking the zero page PAGE_NONE.

Signed-off-by: Mel Gorman 
---
 include/linux/huge_mm.h |  3 ++-
 mm/huge_memory.c| 13 -
 mm/memory.c |  1 -
 mm/mprotect.c   | 15 ++-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 554bbe3..ad9051b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot);
+   unsigned long addr, pgprot_t newprot,
+   int prot_numa);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5618e22..ad2a3ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1502,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot)
+   unsigned long addr, pgprot_t newprot, int prot_numa)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+
+   /*
+* Avoid trapping faults against the zero page. The read-only
+* data is likely to be read-cached on the local CPU and
+* local/remote hits to the zero page are not interesting.
+*/
+   if (prot_numa && is_huge_zero_pmd(*pmd)) {
+   spin_unlock(ptl);
+   return 0;
+   }
+
ret = 1;
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
diff --git a/mm/memory.c b/mm/memory.c
index 2100e0f..2ec07a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3136,7 +3136,6 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
-   BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
/*
 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/mprotect.c b/mm/mprotect.c
index dc65c0f..33dfafb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
+
+   /*
+* Avoid trapping faults against the zero or KSM
+* pages. See similar comment in change_huge_pmd.
+*/
+   if (prot_numa) {
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, oldpte);
+   if (!page || PageKsm(page))
+   continue;
+   }
+
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
 
@@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
-   newprot);
+   newprot, prot_numa);
 
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 08/10] x86: mm: Restore original pte_special check

2014-12-04 Thread Mel Gorman
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had SPECIAL
and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no
longer modifying _PAGE_PRESENT it should be safe to restore the original
pte_special behaviour.

Signed-off-by: Mel Gorman 
---
 arch/x86/include/asm/pgtable.h | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cf428a7..0dd5be3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -136,13 +136,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-   /*
-* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
-* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
-* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
-*/
-   return (pte_flags(pte) & _PAGE_SPECIAL) &&
-   (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+   return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa

2014-12-04 Thread Mel Gorman
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it. BUG_ON looks like overkill
but if this is hit then it's a serious bug that could result in corruption
so do not even try recovering. It would have been more comprehensive to
check VMA flags in pte_protnone_numa but it would have made the API ugly
just for a debugging check.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 3 +++
 mm/memory.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ad2a3ee..8546654 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1273,6 +1273,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index 2ec07a9..7d97af5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3109,6 +3109,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries

2014-12-04 Thread Mel Gorman
If a PTE or PMD is already marked NUMA when scanning to mark entries
for NUMA hinting then it is not necessary to update the entry and
incur a TLB flush penalty. Avoid the overhead where possible.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 14 --
 mm/mprotect.c|  4 
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8546654..f2bf521 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
return 0;
}
 
-   ret = 1;
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
+   if (!prot_numa || !pmd_protnone(*pmd)) {
+   ret = 1;
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
+   }
spin_unlock(ptl);
}
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 33dfafb..109e7aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;
+
+   /* Avoid TLB flush if possible */
+   if (pte_protnone(oldpte))
+   continue;
}
 
ptent = ptep_modify_prot_start(mm, addr, pte);
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault

2015-01-05 Thread Mel Gorman
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock and, while the race window is tiny,
it will be larger if the PMD is cleared while marking PMDs for hinting faults.
This patch closes the race.

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h | 4 
 mm/huge_memory.c| 3 ++-
 mm/migrate.c| 6 --
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index fab9b32..78baed5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -67,7 +67,6 @@ static inline int migrate_huge_page_move_mapping(struct 
address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
-extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
@@ -76,9 +75,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
-static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t 
*pmd)
-{
-}
 static inline int migrate_misplaced_page(struct page *page,
 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817a875..a2cd021 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1283,8 +1283,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * check_same as the page may no longer be mapped.
 */
if (unlikely(pmd_trans_migrating(*pmdp))) {
+   page = pmd_page(*pmdp);
spin_unlock(ptl);
-   wait_migrate_huge_page(vma->anon_vma, pmdp);
+   wait_on_page_locked(page);
goto out;
}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 344cdf6..e6a5ff1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1685,12 +1685,6 @@ bool pmd_trans_migrating(pmd_t pmd)
return PageLocked(page);
 }
 
-void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
-   struct page *page = pmd_page(*pmd);
-   wait_on_page_locked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

2015-01-05 Thread Mel Gorman
Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered. This patch layout is to make
review easier.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
Acked-by: Benjamin Herrenschmidt 
Tested-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 arch/powerpc/mm/fault.c |  5 -
 arch/powerpc/mm/pgtable.c   | 11 ---
 arch/powerpc/mm/pgtable_64.c|  3 ++-
 arch/x86/mm/gup.c   |  4 ++--
 include/uapi/linux/mempolicy.h  |  2 +-
 mm/gup.c| 10 +-
 mm/huge_memory.c| 16 +++
 mm/memory.c |  4 ++--
 mm/mprotect.c   | 39 ++---
 mm/pgtable-generic.c|  2 +-
 11 files changed, 40 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 510bdfb..625407e 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -212,7 +212,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
/* Look up the Linux PTE for the backing page */
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte) && !pte_numa(pte)) {
+   if (pte_present(pte) && !pte_protnone(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb79907..b434153 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -398,8 +398,6 @@ good_area:
 * processors use the same I/D cache coherency mechanism
 * as embedded.
 */
-   if (error_code & DSISR_PROTFAULT)
-   goto bad_area;
 #endif /* CONFIG_PPC_STD_MMU */
 
/*
@@ -423,9 +421,6 @@ good_area:
flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
-   /* protection fault */
-   if (error_code & 0x0800)
-   goto bad_area;
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
}
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602..83dfcb5 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct 
vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
 {
-#ifdef CONFIG_DEBUG_VM
-   WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
-#endif
+   /*
+* When handling numa faults, we already have the pte marked
+* _PAGE_PRESENT, but we can be sure that it is not in hpte.
+* Hence we can use set_pte_at for them.
+*/
+   VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
+
/* Note: mm->context.id might not yet have been assigned as
 * this context might not have been activated yet when this
 * is called.
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 4fe5f64..91bb883 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
+   WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) ==
+   (_PAGE_PRESENT | _PAGE_USER));
assert_spin_locked(&mm->page_table_lock);
WARN_ON(!pmd_trans_huge(pmd));
 #endif
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index d754782..3b1d819 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long 
addr,
struct page *page;
 
/* Similar to the PMD case, NUMA hinting must take slow path */
-   if (pte_numa(pte)) {
+   if (pte_protnone(pte)) {
pte_unmap(ptep);
return 0;
}
@@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
 * slowpath for accounting purposes and so that they
 * can be serialised against THP migration.
 */
-   if (pmd_numa(pmd))
+   if (pmd_protnone(pmd))
return 0;
  

[PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

2015-01-05 Thread Mel Gorman
This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar K.V 
Tested-by: Sasha Levin 
---
 arch/powerpc/include/asm/pgtable.h | 16 
 arch/x86/include/asm/pgtable.h | 16 
 include/asm-generic/pgtable.h  | 20 
 3 files changed, 52 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index a8805fe..7b889a3 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -39,6 +39,22 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h . On powerpc, this will only
+ * work for user pages and always return true for kernel pages.
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return (pte_val(pte) &
+   (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return pte_protnone(pmd_pte(pmd));
+}
+
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_NUMA_MASK;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e8a5454..8b92203 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -503,6 +503,22 @@ static inline int pmd_present(pmd_t pmd)
 _PAGE_NUMA);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline int pmd_none(pmd_t pmd)
 {
/* Only check low word on 32-bit platforms, since it might be
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 177d597..d497d08 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -688,6 +688,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_NUMA_BALANCING
+/*
+ * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
+ * the only case the kernel cares is for NUMA balancing and is only ever set
+ * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
+ * _PAGE_PROTNONE so implement the helper as "always no" by default. It is
+ * the responsibility of the caller to distinguish between PROT_NONE
+ * protections and NUMA hinting fault protections.
+ */
+static inline int pte_protnone(pte_t pte)
+{
+   return 0;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+   return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v5

2015-01-05 Thread Mel Gorman
Changelog since V4
o Rebase to 3.19-rc2(mel)

Changelog since V3
o Minor comment update  (benh)
o Add ack'ed bys

Changelog since V2
o Rename *_protnone_numa to _protnone and extend docs   (linus)
o Rebase to mmotm-20141119 for pre-merge testing(mel)
o Convert WARN_ON to VM_WARN_ON  (aneesh)

Changelog since V1
o ppc64 paranoia checks and clarifications  (aneesh)
o Fix trinity regression (hopefully)
o Reduce unnecessary TLB flushes(mel)

Automatic NUMA balancing depends on protecting PTEs to trap a fault and
gather reference locality information. Very broadly speaking it marks PTEs
as not present and uses another bit to distinguish between NUMA hinting
faults and other types of faults. This approach is not universally loved;
it ultimately resulted in swap space shrinking and has had a number of
problems with Xen support. This series is very heavily based on patches
from Linus and Aneesh to replace the existing PTE/PMD NUMA helper functions
with normal change protections that should be less problematic. This was
tested on a few different workloads that showed automatic NUMA balancing
was still active with mostly comparable results.

specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.

autonumabench
                                        3.19.0-rc2          3.19.0-rc2
                                           vanilla       protnone-v5r1
Time System-NUMA01  268.99 (  0.00%) 1350.70 (-402.14%)
Time System-NUMA01_THEADLOCAL   110.14 (  0.00%)   50.68 ( 53.99%)
Time System-NUMA02   20.14 (  0.00%)   31.12 (-54.52%)
Time System-NUMA02_SMT7.40 (  0.00%)6.57 ( 11.22%)
Time Elapsed-NUMA01 687.57 (  0.00%)  528.51 ( 23.13%)
Time Elapsed-NUMA01_THEADLOCAL  540.29 (  0.00%)  554.36 ( -2.60%)
Time Elapsed-NUMA02  84.98 (  0.00%)   78.87 (  7.19%)
Time Elapsed-NUMA02_SMT  77.32 (  0.00%)   87.07 (-12.61%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters.
Overall time to complete the benchmark is comparable

               3.19.0-rc2     3.19.0-rc2
                  vanilla  protnone-v5r1
User             58100.89       48351.17
System             407.74        1439.22
Elapsed           1411.44        1250.55


NUMA alloc hit 5398081 5536696
NUMA alloc miss  0   0
NUMA interleave hit  0   0
NUMA alloc local   5398073 5536668
NUMA base PTE updates62271   442576477
NUMA huge PMD updates  1215268  863690
NUMA page range updates 1244939437   884785757
NUMA hint faults   1696858 1221541
NUMA hint local faults 1046842  791219
NUMA hint local percent 61  64
NUMA pages migrated604443059291698

The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be simply
down to bad luck but any modifications that affect that workload would be
related to scan rates and migrations, not the protection mechanism. For
all other workloads, migration activity was comparable.

Overall, headline performance figures are comparable but the overhead
is higher, mostly in interrupts. To some extent, higher overhead from
this approach was anticipated but not to this degree. It's going to be
necessary to reduce this again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.
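
As a rough illustration of the mechanism (a sketch only, not code from the
series; the helper name below is made up and it assumes the pte_protnone()
helper introduced in patch 2), the fault path tells a NUMA hinting fault
apart from a genuine PROT_NONE protection fault by consulting the VMA:

/*
 * Hypothetical helper: a protnone pte in an otherwise accessible VMA can
 * only have been marked by the NUMA hinting scanner, because a real
 * PROT_NONE mapping has VM_READ/VM_WRITE/VM_EXEC cleared on the VMA.
 */
static bool fault_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
{
	if (!pte_protnone(pte))
		return false;
	return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
}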

 arch/powerpc/include/asm/pgtable.h|  54 ++--
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   2 +-
 arch/powerpc/mm/copro_fault.c |   8 +-
 arch/powerpc/mm/fault.c   |  

[PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

2015-01-05 Thread Mel Gorman
ppc64 should not be depending on DSISR_PROTFAULT and it is unexpected
if such faults are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Signed-off-by: Mel Gorman 
Acked-by: Aneesh Kumar K.V 
Tested-by: Sasha Levin 
---
 arch/powerpc/mm/copro_fault.c |  8 ++--
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..0450d68 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
-   goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
}
 
ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index b434153..1bcd378 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -389,17 +389,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -414,6 +403,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
@@ -423,6 +420,7 @@ good_area:
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
}
 
/*
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers

2015-01-05 Thread Mel Gorman
This patch removes the NUMA PTE bits and associated helpers. As a side-effect
it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge has had problems in the
past due to a different migration mechanism and a dependence on the page lock
to serialise migrations, so it warrants a closer look.

task_work hinting update                parallel fault
------------------------                --------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                        pmd_none
                                          do_huge_pmd_anonymous_page
                                          read? pmd_lock blocks until hinting
                                            complete, fail !pmd_none test
                                          write? __do_huge_pmd_anonymous_page
                                            acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update                parallel migration
------------------------                ------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
                                        __handle_mm_fault
                                          do_huge_pmd_numa_page
                                            migrate_misplaced_transhuge_page
                                            pmd_lock waits for updates to
                                              complete, recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
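
The serialisation relied on above can be condensed into the following sketch
(the helper name is hypothetical; the real code is the pmd_lock/pmd_same
pattern in do_huge_pmd_numa_page and change_huge_pmd): any path that sampled
a pmd earlier must re-validate it under the pmd lock before acting on it,
because the hinting update modifies the entry under that same lock.

static spinlock_t *pmd_lock_if_same(struct mm_struct *mm, pmd_t *pmdp,
				    pmd_t orig_pmd)
{
	spinlock_t *ptl = pmd_lock(mm, pmdp);

	if (pmd_same(orig_pmd, *pmdp))
		return ptl;	/* caller modifies the entry, then unlocks */
	spin_unlock(ptl);
	return NULL;		/* raced with a parallel update, back out */
}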

Signed-off-by: Mel Gorman 
Tested-by: Sasha Levin 
---
 arch/powerpc/include/asm/pgtable.h|  54 +---
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/x86/include/asm/pgtable.h|  22 +
 arch/x86/include/asm/pgtable_64.h |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +
 include/asm-generic/pgtable.h | 155 --
 include/linux/swapops.h   |   2 +-
 8 files changed, 7 insertions(+), 283 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 7b889a3..0a85e33 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -54,64 +54,12 @@ static inline int pmd_protnone(pmd_t pmd)
 {
return pte_protnone(pmd_pte(pmd));
 }
-
-static inline int pte_present(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t pte)
-{
-   return pte_val(pte) & (_PAGE_PRESENT);
-}
-
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-pte_t *ptep)
-{
-   if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
-   return;
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-pmd_t *pmdp)
-{
-   if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
-   VM_BUG_ON(1);
-
-   pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
-   return;
-}
-
-/*
- * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
- * which was inherited from x86. For the purposes of powerpc pte_basic_t and
- * pmd_t are equivalent
- */
-#define pteval_t pte_basic_t
-#define pmdval_t pmd_t
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
-   return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
-   return pmd_val(pmd) & _PAGE_NUMA_MASK;
-}
-
-# else
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline int pte_present(pte_t pte)
 {
return pte_val(pte) & _PAGE_PRESENT;
 }
-#endif /* CONFIG_NUMA_BALANCING */
 
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
diff --git a/arch/powerpc/include/asm/pte-common.h 
b/arch/powerpc/include/asm/pte-common.h
index e040c35..

[PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations

2015-01-05 Thread Mel Gorman
With PROT_NONE, the traditional page table manipulation functions are
sufficient.

Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
Tested-by: Sasha Levin 
---
 include/linux/huge_mm.h |  3 +--
 mm/huge_memory.c| 33 +++--
 mm/memory.c | 10 ++
 mm/mempolicy.c  |  2 +-
 mm/migrate.c|  2 +-
 mm/mprotect.c   |  2 +-
 mm/pgtable-generic.c|  2 --
 7 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051b..554bbe3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot,
-   int prot_numa);
+   unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f81fddf..5618e22 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
-   pmd = pmd_mknonnuma(pmd);
+   pmd = pmd_modify(pmd, vma->vm_page_prot);
set_pmd_at(mm, haddr, pmdp, pmd);
-   VM_BUG_ON(pmd_protnone(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
 out_unlock:
@@ -1503,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot, int prot_numa)
+   unsigned long addr, pgprot_t newprot)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
ret = 1;
-   if (!prot_numa) {
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
-   if (pmd_protnone(entry))
-   entry = pmd_mknonnuma(entry);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
-   } else {
-   struct page *page = pmd_page(*pmd);
-
-   /*
-* Do not trap faults against the zero page. The
-* read-only data is likely to be read-cached on the
-* local CPU cache and it is less useful to know about
-* local vs remote hits on the zero page.
-*/
-   if (!is_huge_zero_page(page) &&
-   !pmd_protnone(*pmd)) {
-   pmdp_set_numa(mm, addr, pmd);
-   ret = HPAGE_PMD_NR;
-   }
-   }
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
spin_unlock(ptl);
}
 
diff --git a/mm/memory.c b/mm/memory.c
index 47aa715..debe3f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3113,9 +3113,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*
-   * ptep_modify_prot_start is not called as this is clearing
-   * the _PAGE_NUMA bit and it is not really expected that there
-   * would be concurrent hardware modifications to the PTE.
+   * We can safely just do a "set_pte_at()", because the old
+   * page table entry is not accessible, so there would be no
+   * concurrent hardware modifications to the PTE.
*/
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3124,7 +3124,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
}
 
-   pte = pte_mknonnuma(pte);
+   /* Make it present again */
+   pte = pte_modify(pte, vma->vm_page_prot);
+   pte = pte_mkyoung(pte);
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e0961b..4fcbf12 100644
--- a/mm/mempol

[PATCH 07/10] mm: numa: Do not trap faults on the huge zero page

2015-01-05 Thread Mel Gorman
Faults on the huge zero page are pointless and there is a BUG_ON
to catch them during fault time. This patch reintroduces a check
that avoids marking the zero page PAGE_NONE.

Signed-off-by: Mel Gorman 
---
 include/linux/huge_mm.h |  3 ++-
 mm/huge_memory.c| 13 -
 mm/memory.c |  1 -
 mm/mprotect.c   | 15 ++-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 554bbe3..ad9051b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 unsigned long new_addr, unsigned long old_end,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot);
+   unsigned long addr, pgprot_t newprot,
+   int prot_numa);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5618e22..ad2a3ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1502,7 +1502,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot)
+   unsigned long addr, pgprot_t newprot, int prot_numa)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+
+   /*
+* Avoid trapping faults against the zero page. The read-only
+* data is likely to be read-cached on the local CPU and
+* local/remote hits to the zero page are not interesting.
+*/
+   if (prot_numa && is_huge_zero_pmd(*pmd)) {
+   spin_unlock(ptl);
+   return 0;
+   }
+
ret = 1;
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
diff --git a/mm/memory.c b/mm/memory.c
index debe3f4..3c50046 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3135,7 +3135,6 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
-   BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
/*
 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/mprotect.c b/mm/mprotect.c
index dc65c0f..33dfafb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
+
+   /*
+* Avoid trapping faults against the zero or KSM
+* pages. See similar comment in change_huge_pmd.
+*/
+   if (prot_numa) {
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, oldpte);
+   if (!page || PageKsm(page))
+   continue;
+   }
+
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
 
@@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
-   newprot);
+   newprot, prot_numa);
 
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 08/10] x86: mm: Restore original pte_special check

2015-01-05 Thread Mel Gorman
Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had SPECIAL
and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no
longer modifying _PAGE_PRESENT it should be safe to restore the original
pte_special behaviour.

Signed-off-by: Mel Gorman 
---
 arch/x86/include/asm/pgtable.h | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b9a13e9..4673d6e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -137,13 +137,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-   /*
-* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
-* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
-* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
-*/
-   return (pte_flags(pte) & _PAGE_SPECIAL) &&
-   (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+   return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries

2015-01-05 Thread Mel Gorman
If a PTE or PMD is already marked NUMA when scanning to mark entries
for NUMA hinting then it is not necessary to update the entry and
incur a TLB flush penalty. Avoid the overhead where possible.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 14 --
 mm/mprotect.c|  4 
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8546654..f2bf521 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
return 0;
}
 
-   ret = 1;
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
-   entry = pmd_modify(entry, newprot);
-   ret = HPAGE_PMD_NR;
-   set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
+   if (!prot_numa || !pmd_protnone(*pmd)) {
+   ret = 1;
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   entry = pmd_modify(entry, newprot);
+   ret = HPAGE_PMD_NR;
+   set_pmd_at(mm, addr, pmd, entry);
+   BUG_ON(pmd_write(entry));
+   }
spin_unlock(ptl);
}
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 33dfafb..109e7aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;
+
+   /* Avoid TLB flush if possible */
+   if (pte_protnone(oldpte))
+   continue;
}
 
ptent = ptep_modify_prot_start(mm, addr, pte);
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa

2015-01-05 Thread Mel Gorman
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it. BUG_ON looks like overkill
but if this is hit then it's a serious bug that could result in corruption
so do not even try recovering. It would have been more comprehensive to
check VMA flags in pte_protnone_numa but it would have made the API ugly
just for a debugging check.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 3 +++
 mm/memory.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ad2a3ee..8546654 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1273,6 +1273,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index 3c50046..9df2d09 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3108,6 +3108,9 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
bool migrated = false;
int flags = 0;
 
+   /* A PROT_NONE fault should not end up here */
+   BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

2014-07-10 Thread Mel Gorman
On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
> 
> On 09.07.14 00:59, Stewart Smith wrote:
> >Hi!
> >
> >Thanks for review, much appreciated!
> >
> >Alexander Graf  writes:
> >>On 08.07.14 07:06, Stewart Smith wrote:
> >>>@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
> >>>   int i, need_vpa_update;
> >>>   int srcu_idx;
> >>>   struct kvm_vcpu *vcpus_to_update[threads_per_core];
> >>>+  phys_addr_t phy_addr, tmp;
> >>Please put the variable declarations into the if () branch so that the
> >>compiler can catch potential leaks :)
> >ack. will fix.
> >
> >>>@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
> >>>   srcu_idx = srcu_read_lock(&vc->kvm->srcu);
> >>>+  /* If we have a saved list of L2/L3, restore it */
> >>>+  if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
> >>>+  phy_addr = virt_to_phys((void *)vc->mpp_buffer);
> >>>+#if defined(CONFIG_PPC_4K_PAGES)
> >>>+  phy_addr = (phy_addr + 8*4096) & ~(8*4096);
> >>get_free_pages() is automatically aligned to the order, no?
> >That's what Paul reckoned too, and then we've attempted to find anywhere
> >that documents that behaviour. Happen to be able to point to docs/source
> >that say this is part of API?
> 
> Phew - it's probably buried somewhere. I could only find this
> document saying that we always get order-aligned allocations:
> 
> http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
> 
> Mel, do you happen to have any pointer to something that explicitly
> (or even properly implicitly) says that get_free_pages() returns
> order-aligned memory?
> 

I did not read the whole thread so I lack context and will just answer
this part.

There is no guarantee that pages are returned in PFN order for multiple
requests to the page allocator. This is the relevant comment in
rmqueue_bulk

/*
 * Split buddy pages returned by expand() are received here
 * in physical page order. The page is added to the callers and
 * list and the list head then moves forward. From the callers
 * perspective, the linked list is ordered by page number in
 * some conditions. This is useful for IO devices that can
 * merge IO requests if the physical pages are ordered
 * properly.
 */

It will probably be true early in the lifetime of the system but the mileage
will vary on systems with a lot of uptime. If you depend on this behaviour
for correctness then you will have a bad day.

High-order page requests to the page allocator are guaranteed to be in physical
order. However, this does not apply to vmalloc() where allocations are
only guaranteed to be virtually contiguous.
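
To make the distinction concrete, a minimal kernel-style sketch (illustrative
only, with error handling trimmed): a single higher-order allocation is both
physically contiguous and aligned to its own size, so with 4K pages an
order-4 request comes back 64K aligned.

	unsigned long addr = __get_free_pages(GFP_KERNEL, 4);	/* 16 pages */

	if (addr) {
		/* the buddy allocator aligns an order-4 block to 64K */
		WARN_ON(addr & ((PAGE_SIZE << 4) - 1));
		free_pages(addr, 4);
	}

Multiple separate order-0 allocations carry no such guarantee.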

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

2014-07-10 Thread Mel Gorman
On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote:
> 
> On 10.07.14 15:07, Mel Gorman wrote:
> >On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
> >>On 09.07.14 00:59, Stewart Smith wrote:
> >>>Hi!
> >>>
> >>>Thanks for review, much appreciated!
> >>>
> >>>Alexander Graf  writes:
> >>>>On 08.07.14 07:06, Stewart Smith wrote:
> >>>>>@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore 
> >>>>>*vc)
> >>>>> int i, need_vpa_update;
> >>>>> int srcu_idx;
> >>>>> struct kvm_vcpu *vcpus_to_update[threads_per_core];
> >>>>>+phys_addr_t phy_addr, tmp;
> >>>>Please put the variable declarations into the if () branch so that the
> >>>>compiler can catch potential leaks :)
> >>>ack. will fix.
> >>>
> >>>>>@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore 
> >>>>>*vc)
> >>>>> srcu_idx = srcu_read_lock(&vc->kvm->srcu);
> >>>>>+/* If we have a saved list of L2/L3, restore it */
> >>>>>+if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
> >>>>>+phy_addr = virt_to_phys((void *)vc->mpp_buffer);
> >>>>>+#if defined(CONFIG_PPC_4K_PAGES)
> >>>>>+phy_addr = (phy_addr + 8*4096) & ~(8*4096);
> >>>>get_free_pages() is automatically aligned to the order, no?
> >>>That's what Paul reckoned too, and then we've attempted to find anywhere
> >>>that documents that behaviour. Happen to be able to point to docs/source
> >>>that say this is part of API?
> >>Phew - it's probably buried somewhere. I could only find this
> >>document saying that we always get order-aligned allocations:
> >>
> >>http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
> >>
> >>Mel, do you happen to have any pointer to something that explicitly
> >>(or even properly implicitly) says that get_free_pages() returns
> >>order-aligned memory?
> >>
> >I did not read the whole thread so I lack context and will just answer
> >this part.
> >
> >There is no guarantee that pages are returned in PFN order for multiple
> >requests to the page allocator. This is the relevant comment in
> >rmqueue_bulk
> >
> > /*
> >  * Split buddy pages returned by expand() are received here
> >  * in physical page order. The page is added to the callers 
> > and
> >  * list and the list head then moves forward. From the 
> > callers
> >  * perspective, the linked list is ordered by page number in
> >  * some conditions. This is useful for IO devices that can
> >  * merge IO requests if the physical pages are ordered
> >  * properly.
> >  */
> >
> >It will probably be true early in the lifetime of the system but the milage
> >will vary on systems with a lot of uptime. If you depend on this behaviour
> >for correctness then you will have a bad day.
> >
> >High-order page requests to the page allocator are guaranteed to be in 
> >physical
> >order. However, this does not apply to vmalloc() where allocations are
> >only guaranteed to be virtually contiguous.
> 
> Hrm, ok to be very concrete:
> 
>   Does __get_free_pages(..., 4); on a 4k page size system give me a
> 64k aligned pointer? :)
> 

Yes.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: mm: BUG in unmap_page_range

2014-08-06 Thread Mel Gorman
On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote:
> > -#define pmd_mknonnuma pmd_mknonnuma
> > -static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> > +/*
> > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
> > + * which was inherited from x86. For the purposes of powerpc pte_basic_t is
> > + * equivalent
> > + */
> > +#define pteval_t pte_basic_t
> > +#define pmdval_t pmd_t
> > +static inline pteval_t pte_flags(pte_t pte)
> >  {
> > -   return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
> > +   return pte_val(pte) & PAGE_PROT_BITS;
> 
> PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have
> to check further to find out why the mask doesn't include
> _PAGE_PRESENT. 
> 

Dumb of me, not sure how I managed that. For the purposes of what is required
it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask
that defines what bits are of interest to the generic helpers which is what
this version attempts to do. It's not tested on powerpc at all unfortunately.

---8<---
mm: Remove misleading ARCH_USES_NUMA_PROT_NONE

ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
_PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
fault scanner. This was found to be conceptually confusing with a lot of
implicit assumptions and it was asked that an alternative be found.

Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap
PTE bits and shrunk the maximum possible swap size but it did not go far
enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
but the relics still exist.

This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
duplication in powerpc vs the generic implementation by defining the types
the core NUMA helpers expected to exist from x86 with their ppc64 equivalent.
This necessitated that a PTE bit mask be created that identified the bits
that distinguish present from NUMA pte entries but it is expected this
will only differ between arches based on _PAGE_PROTNONE. The naming for
the generic helpers was taken from x86 originally but ppc64 has types that
are equivalent for the purposes of the helper so they are mapped instead
of duplicating code.

Signed-off-by: Mel Gorman 
---
 arch/powerpc/include/asm/pgtable.h| 57 ---
 arch/powerpc/include/asm/pte-common.h |  5 +++
 arch/x86/Kconfig  |  1 -
 arch/x86/include/asm/pgtable_types.h  |  7 +
 include/asm-generic/pgtable.h | 27 ++---
 init/Kconfig  | 11 ---
 6 files changed, 33 insertions(+), 75 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index d98c1ec..beeb09e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
-
 static inline int pte_present(pte_t pte)
 {
-   return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
+   return pte_val(pte) & _PAGE_NUMA_MASK;
 }
 
 #define pte_present_nonuma pte_present_nonuma
@@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte)
return pte_val(pte) & (_PAGE_PRESENT);
 }
 
-#define pte_numa pte_numa
-static inline int pte_numa(pte_t pte)
-{
-   return (pte_val(pte) &
-   (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-
-#define pte_mknonnuma pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-   pte_val(pte) &= ~_PAGE_NUMA;
-   pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
-   return pte;
-}
-
-#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-   /*
-* We should not set _PAGE_NUMA on non present ptes. Also clear the
-* present bit so that hash_page will return 1 and we collect this
-* as numa fault.
-*/
-   if (pte_present(pte)) {
-   pte_val(pte) |= _PAGE_NUMA;
-   pte_val(pte) &= ~_PAGE_PRESENT;
-   } else
-   VM_BUG_ON(1);
-   return pte;
-}
-
 #define ptep_set_numa ptep_set_numa
 static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep)
@@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, 
unsigned long addr,
return;
 }
 
-#define pmd_numa pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-   return pte_numa(pmd_pte(pmd));
-}
-
 #define pmdp_set_numa pmdp_set_numa

Re: [PATCH -V3] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE

2013-12-06 Thread Mel Gorman
On Fri, Dec 06, 2013 at 12:08:22AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
> On archs like ppc64 that don't use _PAGE_PROTNONE and also have
> a separate page table outside linux pagetable, we just need to
> make sure that when calling change_prot_numa we flush the
> hardware page table entry so that next page access  result in a numa
> fault.
> 
> We still need to make sure we use the numa faulting logic only
> when CONFIG_NUMA_BALANCING is set. This implies the migrate-on-fault
> (Lazy migration) via mbind will only work if CONFIG_NUMA_BALANCING
> is set.
> 
> Signed-off-by: Aneesh Kumar K.V 

You're right that there is no direct dependence between numa balancing and
the use of prot_none. The BUILD_BUG_ON was to flag very clearly that arches
wanting to support automatic NUMA balancing must ensure such things as

o _PAGE_NUMA is defined
o setting _PAGE_NUMA traps a fault and the fault can be uniquely
  identified as being a numa hinting fault
o that pte_present still returns true for pte_numa pages even though the
  underlying present bit may be cleared. Otherwise operations like
  following and copying ptes will get confused
o shortly, arches will also need to avoid taking references on pte_numa
  pages in get_user_pages to account for hinting faults properly

I guess the _PAGE_NUMA parts will already be caught by other checks and
the rest will fall out during testing so it's ok to remove.
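
To make the pte_present point concrete, the ppc64 definition elsewhere in
this thread is effectively the following (a sketch, modulo the exact mask
names used by the arch):

static inline int pte_present(pte_t pte)
{
	/*
	 * A NUMA hinting pte has _PAGE_PRESENT cleared but must still be
	 * treated as present by generic code that follows or copies ptes.
	 */
	return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
}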

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 2/3] mm: dirty accountable change only apply to non prot numa case

2014-02-11 Thread Mel Gorman
On Tue, Feb 11, 2014 at 04:04:54PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> So move it within the if loop
> 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/3] mm: Use ptep/pmdp_set_numa for updating _PAGE_NUMA bit

2014-02-11 Thread Mel Gorman
On Tue, Feb 11, 2014 at 04:04:55PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions. ppc64 also 
> doesn't implement
> flush_tlb_range. ppc64 require the tlb flushing to be batched within ptl 
> locks. The reason
> to do that is to ensure that the hash page table is in sync with linux page 
> table.
> We track the hpte index in linux pte and if we clear them without flushing 
> hash and drop the
> ptl lock, we can have another cpu update the pte and can end up with double 
> hash. We also want
> to keep set_pte_at simpler by not requiring them to do hash flush for 
> performance reason.
> Hence cannot use them while updating _PAGE_NUMA bit. Add new functions for 
> marking pte/pmd numa
> 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/3] powerpc: mm: Add new set flag argument to pte/pmd update function

2014-02-11 Thread Mel Gorman
On Tue, Feb 11, 2014 at 04:04:53PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> We will use this later to set the _PAGE_NUMA bit.
> 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap

2014-03-03 Thread Mel Gorman
On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote:
> 
> On a powerpc machine with CONFIG_NUMA_BALANCING=y and CONFIG_SPARSEMEM_VMEMMAP
> not enabled,  kernel panics.
> 

This?

---8<---
sched: numa: Do not group tasks if last cpu is not set

On configurations with vmemmap disabled, the following partial oops is observed

[  299.268623] CPU: 47 PID: 4366 Comm: numa01 Tainted: G  D  
3.14.0-rc5-vanilla #4
[  299.278295] Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 
04/26/2012
[  299.287452] task: 880c670bc110 ti: 880c66db6000 task.ti: 
880c66db6000
[  299.296642] RIP: 0010:[]  [] 
task_numa_fault+0x50f/0x8b0
[  299.306778] RSP: :880c66db7670  EFLAGS: 00010282
[  299.313769] RAX: 33ee RBX: 880c670bc110 RCX: 0001
[  299.322590] RDX: 0001 RSI: 0003 RDI: 
[  299.331394] RBP: 880c66db76c8 R08:  R09: 000166b0
[  299.340203] R10: 880c7ffecd80 R11:  R12: 01ff
[  299.348989] R13: 00ff R14:  R15: 0003
[  299.357763] FS:  7f5a60a3f700() GS:88106f2c() 
knlGS:
[  299.367510] CS:  0010 DS:  ES:  CR0: 80050033
[  299.374913] CR2: 37da CR3: 000868ed4000 CR4: 07e0
[  299.383726] Stack:
[  299.387414]  0003  00010003 
00010003
[  299.396564]  811888f4 880c66db7698 0003 
880c7f9b3ac0
[  299.405730]  880c66ccebd8  0003 
880c66db7718
[  299.414907] Call Trace:
[  299.419095]  [] ? migrate_misplaced_page+0xb4/0x140
[  299.427301]  [] do_numa_page+0x18c/0x1f0
[  299.434554]  [] handle_mm_fault+0x617/0xf70
[  ..]  SNIPPED

The oops occurs in task_numa_group looking up cpu_rq(LAST__CPU_MASK). The
bug exists for all configurations but will manifest differently. On vmemmap
configurations, it looks up garbage and on !vmemmap configurations it
will oops. This patch adds the necessary check and also fixes the type
for LAST__PID_MASK and LAST__CPU_MASK which are currently signed instead
of unsigned integers.

Signed-off-by: Mel Gorman 
Cc: sta...@vger.kernel.org

diff --git a/include/linux/page-flags-layout.h 
b/include/linux/page-flags-layout.h
index da52366..6f661d9 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -63,10 +63,10 @@
 
 #ifdef CONFIG_NUMA_BALANCING
 #define LAST__PID_SHIFT 8
-#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+#define LAST__PID_MASK  ((1UL << LAST__PID_SHIFT)-1)
 
 #define LAST__CPU_SHIFT NR_CPUS_BITS
-#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)
+#define LAST__CPU_MASK  ((1UL << LAST__CPU_SHIFT)-1)
 
 #define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
 #else
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7815709..b44a8b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1463,6 +1463,9 @@ static void task_numa_group(struct task_struct *p, int 
cpupid, int flags,
int cpu = cpupid_to_cpu(cpupid);
int i;
 
+   if (unlikely(cpu == LAST__CPU_MASK && !cpu_online(cpu)))
+   return;
+
if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
2*nr_node_ids*sizeof(unsigned long);
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Panic on ppc64 with numa_balancing and !sparsemem_vmemmap

2014-03-03 Thread Mel Gorman
On Tue, Mar 04, 2014 at 12:45:19AM +0530, Aneesh Kumar K.V wrote:
> Mel Gorman  writes:
> 
> > On Wed, Feb 19, 2014 at 11:32:00PM +0530, Srikar Dronamraju wrote:
> >> 
> >> On a powerpc machine with CONFIG_NUMA_BALANCING=y and 
> >> CONFIG_SPARSEMEM_VMEMMAP
> >> not enabled,  kernel panics.
> >> 
> >
> > This?
> 
> This one fixed that crash on ppc64
> 
> http://mid.gmane.org/1393578122-6500-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> 

Thanks.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: hugetlb: ensure hugepage access is denied if hugepages are not supported

2014-04-09 Thread Mel Gorman

On Thu, Apr 03, 2014 at 04:14:13PM -0700, Nishanth Aravamudan wrote:
> In KVM guests on Power, in a guest not backed by hugepages, we see the
> following:
> 
> AnonHugePages: 0 kB
> HugePages_Total:   0
> HugePages_Free:0
> HugePages_Rsvd:0
> HugePages_Surp:0
> Hugepagesize: 64 kB
> 
> HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
> are not supported at boot-time, but this is only checked in
> hugetlb_init(). Extract the check to a helper function, and use it in a
> few relevant places.
> 
> This does make hugetlbfs not supported (not registered at all) in this
> environment. I believe this is fine, as there are no valid hugepages and
> that won't change at runtime.
> 
> Signed-off-by: Nishanth Aravamudan 

Acked-by: Mel Gorman 

This patch looks ok but the changelog misses important information from
the original report which is probably why it fell through the cracks.
Add the fact that you encountered a problem during mount to the changelog
and resend it directly to Andrew. This part from your original report;

Currently, I am seeing the following when I `mount -t hugetlbfs
/none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I
think it's related to the fact that hugetlbfs is properly not
correctly setting itself up in this state?:

Unable to handle kernel paging request for data at address 0x0031
Faulting instruction address: 0xc0245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries


It probably slipped through the cracks because from the changelog this
looks like a minor formatting issue and not a functional fix.
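
For reference, the helper the changelog describes only needs to capture the
fact that the architecture reported no hugepage size at boot; a minimal
sketch (not necessarily the exact helper in the patch) would be:

static inline bool hugepages_supported(void)
{
	/* HPAGE_SHIFT stays 0 when no hugepage size was found at boot */
	return HPAGE_SHIFT != 0;
}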

> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d19b30a..cc8fcc7 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1017,6 +1017,11 @@ static int __init init_hugetlbfs_fs(void)
>   int error;
>   int i;
>  
> + if (!hugepages_supported()) {
> + printk(KERN_ERR "hugetlbfs: Disabling because there are no 
> supported hugepage sizes\n");
> + return -ENOTSUPP;
> + }
> +
>   error = bdi_init(&hugetlbfs_backing_dev_info);
>   if (error)
>   return error;

KERN_ERR feels like overkill for this type of issue. KERN_INFO?

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH 0/2] Automatic NUMA balancing and PROT_NONE handling followup

2015-03-05 Thread Mel Gorman
Dave Chinner reported a problem due to excessive NUMA balancing activity and
bisected it. These are two patches that address two major issues with that
series. The first patch is almost certainly unrelated to what he saw due
to the fact that his vmstats showed no huge page activity but the fix is important.
The second patch restores the performance of one benchmark to levels similar
to 3.19-vanilla but it still has to be tested on his workload. While I
have a test configuration for his workload, I don't have either the KVM
setup or suitable storage to test against. It also needs to be reviewed
and tested on ppc64.

 arch/powerpc/include/asm/pgtable-ppc64.h | 16 
 arch/x86/include/asm/pgtable.h   | 14 ++
 include/asm-generic/pgtable.h| 19 +++
 mm/huge_memory.c | 23 ++-
 mm/mprotect.c|  5 +
 5 files changed, 72 insertions(+), 5 deletions(-)

-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] mm: thp: Return the correct value for change_huge_pmd

2015-03-05 Thread Mel Gorman
The wrong value is being returned by change_huge_pmd since commit
10c1045f28e8 ("mm: numa: avoid unnecessary TLB flushes when setting
NUMA hinting entries") which allows a fallthrough that tries to adjust
non-existent PTEs. This patch corrects it.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8cb5a82..194c0f019774 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1482,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+   ret = 1;
 
/*
 * Avoid trapping faults against the zero page. The read-only
@@ -1490,11 +1491,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 */
if (prot_numa && is_huge_zero_pmd(*pmd)) {
spin_unlock(ptl);
-   return 0;
+   return ret;
}
 
if (!prot_numa || !pmd_protnone(*pmd)) {
-   ret = 1;
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
ret = HPAGE_PMD_NR;
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] mm: numa: Do not clear PTEs or PMDs for NUMA hinting faults

2015-03-05 Thread Mel Gorman
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

   Across the board the 4.0-rc1 numbers are much slower, and the
   degradation is far worse when using the large memory footprint
   configs. Perf points straight at the cause - this is from 4.0-rc1
   on the "-o bhash=101073" config:

   -   56.07%  56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
  - default_send_IPI_mask_sequence_phys
 - 99.99% physflat_send_IPI_mask
- 99.37% native_send_call_func_ipi
 smp_call_function_many
   - native_flush_tlb_others
  - 99.85% flush_tlb_page
   ptep_clear_flush
   try_to_unmap_one
   rmap_walk
   try_to_unmap
   migrate_pages
   migrate_misplaced_page
 - handle_mm_fault
- 99.73% __do_page_fault
 trace_do_page_fault
 do_async_page_fault
   + async_page_fault
  0.63% native_send_call_func_single_ipi
 generic_exec_single
 smp_call_function_single

This was bisected to commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma
and remaining page table manipulations") which clears PTEs and PMDs to make
them PROT_NONE. This is tidy but tests on some benchmarks indicate that
there are many more hinting faults trapped resulting in excessive migration.
This is the result for the old autonuma benchmark for example.

autonumabench
   4.0.0-rc1 4.0.0-rc1  
  3.19.0
 vanillanoclear-v1  
 vanilla
Time User-NUMA01  32883.59 (  0.00%)27401.21 ( 16.67%)
25695.96 ( 21.86%)
Time User-NUMA01_THEADLOCAL   17453.20 (  0.00%)17491.98 ( -0.22%)
17404.36 (  0.28%)
Time User-NUMA02   2063.70 (  0.00%) 2059.94 (  0.18%) 
2037.65 (  1.26%)
Time User-NUMA02_SMT983.70 (  0.00%)  967.95 (  1.60%)  
981.02 (  0.27%)
Time System-NUMA01  602.44 (  0.00%)  182.16 ( 69.76%)  
194.70 ( 67.68%)
Time System-NUMA01_THEADLOCAL78.10 (  0.00%)   84.84 ( -8.63%)  
 98.52 (-26.15%)
Time System-NUMA026.47 (  0.00%)9.74 (-50.54%)  
  9.28 (-43.43%)
Time System-NUMA02_SMT5.06 (  0.00%)3.97 ( 21.54%)  
  3.79 ( 25.10%)
Time Elapsed-NUMA01 755.96 (  0.00%)  602.20 ( 20.34%)  
558.84 ( 26.08%)
Time Elapsed-NUMA01_THEADLOCAL  382.22 (  0.00%)  384.98 ( -0.72%)  
382.54 ( -0.08%)
Time Elapsed-NUMA02  49.38 (  0.00%)   49.23 (  0.30%)  
 49.83 ( -0.91%)
Time Elapsed-NUMA02_SMT  47.70 (  0.00%)   46.82 (  1.84%)  
 46.59 (  2.33%)
Time CPU-NUMA014429.00 (  0.00%) 4580.00 ( -3.41%) 
4632.00 ( -4.58%)
Time CPU-NUMA01_THEADLOCAL 4586.00 (  0.00%) 4565.00 (  0.46%) 
4575.00 (  0.24%)
Time CPU-NUMA024191.00 (  0.00%) 4203.00 ( -0.29%) 
4107.00 (  2.00%)
Time CPU-NUMA02_SMT2072.00 (  0.00%) 2075.00 ( -0.14%) 
2113.00 ( -1.98%)

Note the system CPU usage with the patch applied and how it's similar to
3.19-vanilla. The NUMA hinting activity is also restored to similar levels.

 4.0.0-rc1   4.0.0-rc1  3.19.0
   vanillanoclear-v1r13 vanilla
NUMA alloc hit 1437560 1241466 1202922
NUMA alloc miss  0   0   0
NUMA interleave hit  0   0   0
NUMA alloc local   1436781 1240849 1200683
NUMA base PTE updates304513172   223926293   222840103
NUMA huge PMD updates   594467  437025  434894
NUMA page range updates  608880276   447683093   445505831
NUMA hint faults733491  598990  601358
NUMA hint local faults  511530  314936  371571
NUMA hint local percent 69  52  61
NUMA pages migrated   26366701 5424102 7073177

Signed-off-by: Mel Gorman 
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 16 
 arch/x86/include/asm/pgtable.h   | 14 ++
 include/asm-generic/pgtable.h| 19 +++
 mm/huge_memory.c | 19 ---
 mm/mprotect.c|  5 +
 5 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 43e6ad424c7f..bb654c192028 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/in

[PATCH 2/4] mm: numa: Remove migrate_ratelimited

2015-03-07 Thread Mel Gorman
This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE
scanning even if migrate rate limited") so remove it.

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h |  5 -
 mm/migrate.c| 20 
 2 files changed, 25 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78baed5f2952..cac1c0904d5f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -69,7 +69,6 @@ static inline int migrate_huge_page_move_mapping(struct 
address_space *mapping,
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node);
-extern bool migrate_ratelimited(int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
@@ -80,10 +79,6 @@ static inline int migrate_misplaced_page(struct page *page,
 {
return -EAGAIN; /* can't migrate now */
 }
-static inline bool migrate_ratelimited(int node)
-{
-   return false;
-}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..6aa9a4222ea9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1554,30 +1554,10 @@ static struct page *alloc_misplaced_dst_page(struct 
page *page,
  * page migration rate limiting control.
  * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
  * window of time. Default here says do not migrate more than 1280M per second.
- * If a node is rate-limited then PTE NUMA updates are also rate-limited. 
However
- * as it is faults that reset the window, pte updates will happen 
unconditionally
- * if there has not been a fault since @pteupdate_interval_millisecs after the
- * throttle window closed.
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
 static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
 
-/* Returns true if NUMA migration is currently rate limited */
-bool migrate_ratelimited(int node)
-{
-   pg_data_t *pgdat = NODE_DATA(node);
-
-   if (time_after(jiffies, pgdat->numabalancing_migrate_next_window +
-   msecs_to_jiffies(pteupdate_interval_millisecs)))
-   return false;
-
-   if (pgdat->numabalancing_migrate_nr_pages < ratelimit_pages)
-   return false;
-
-   return true;
-}
-
 /* Returns true if the node is migrate rate-limited after the update */
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
unsigned long nr_pages)
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-07 Thread Mel Gorman
-xfsrepair   13.98 (  0.00%)   19.94 (-42.60%)5.69 
( 59.31%)
CoeffVar real-fsmark   0.56 (  0.00%)0.34 ( 38.74%)0.12 
( 77.97%)
CoeffVar syst-fsmark   0.40 (  0.00%)0.38 (  6.57%)0.24 
( 39.93%)
CoeffVar real-xfsrepair2.20 (  0.00%)0.74 ( 66.22%)1.24 
( 43.47%)
CoeffVar syst-xfsrepair2.69 (  0.00%)7.08 (-163.23%)
2.80 ( -4.23%)
Max  real-fsmark1171.98 (  0.00%) 1159.25 (  1.09%) 1167.96 
(  0.34%)
Max  syst-fsmark4033.84 (  0.00%) 4024.53 (  0.23%) 4039.20 
( -0.13%)
Max  real-xfsrepair  523.40 (  0.00%)  464.40 ( 11.27%)  455.42 
( 12.99%)
Max  syst-xfsrepair  533.37 (  0.00%)  309.38 ( 42.00%)  207.94 
( 61.01%)

The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
is almost cut in half. It's still not as low as 3.19-vanilla but it's
much closer

 4.0.0-rc1   4.0.0-rc1  3.19.0
   vanilla  slowscan-v2 vanilla
NUMA alloc hit   146138883   121929782   104019526
NUMA alloc miss   1314632811456356 7806370
NUMA interleave hit  0   0   0
NUMA alloc local 146060848   121865921   103953085
NUMA base PTE updates242201535   117237258   216624143
NUMA huge PMD updates   113270   52121  127782
NUMA page range updates  300195775   143923210   282048527
NUMA hint faults 18038802587299060   147235021
NUMA hint local faults727845323293925861866265
NUMA hint local percent 40  37  42
NUMA pages migrated   711752624139530223237799

Note the big differences in faults trapped and pages migrated. 3.19-vanilla
still migrated fewer pages but if necessary the threshold at which we
start throttling migrations can be lowered.
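
For reference, the throttling threshold referred to here is controlled by the
rate limiting defaults in mm/migrate.c (visible in the migrate_ratelimited
removal patch elsewhere in this archive); lowering the threshold means
reducing ratelimit_pages:

	/*
	 * Rate limiting defaults from mm/migrate.c: do not migrate more
	 * than ~1280MB per second within each throttle window.
	 */
	static unsigned int migrate_interval_millisecs __read_mostly = 100;
	static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);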

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h | 9 +
 kernel/sched/fair.c   | 8 ++--
 mm/huge_memory.c  | 3 ++-
 mm/memory.c   | 3 ++-
 4 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..a419b65770d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1625,11 +1625,11 @@ struct task_struct {
 
/*
 * numa_faults_locality tracks if faults recorded during the last
-* scan window were remote/local. The task scan period is adapted
-* based on the locality of the faults with different weights
-* depending on whether they were shared or private faults
+* scan window were remote/local or failed to migrate. The task scan
+* period is adapted based on the locality of the faults with different
+* weights depending on whether they were shared or private faults
 */
-   unsigned long numa_faults_locality[2];
+   unsigned long numa_faults_locality[3];
 
unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1719,6 +1719,7 @@ struct task_struct {
 #define TNF_NO_GROUP   0x02
 #define TNF_SHARED 0x04
 #define TNF_FAULT_LOCAL0x08
+#define TNF_MIGRATE_FAIL 0x10
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..bcfe32088b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct 
*p,
/*
 * If there were no record hinting faults then either the task is
 * completely idle or all activity is areas that are not of interest
-* to automatic numa balancing. Scan slower
+* to automatic numa balancing. Related to that, if there were failed
+* migration then it implies we are migrating too quickly or the local
+* node is overloaded. In either case, scan slower
 */
-   if (local + shared == 0) {
+   if (local + shared == 0 || p->numa_faults_locality[2]) {
p->numa_scan_period = min(p->numa_scan_period_max,
p->numa_scan_period << 1);
 
@@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int 
pages, int flags)
 
if (migrated)
p->numa_pages_migrated += pages;
+   if (flags & TNF_MIGRATE_FAIL)
+   p->numa_faults_locality[2] += pages;
 
p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae13ad31e113..f508fda07d34 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1353,7 +1353,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (migrated) {
  

[PATCH 3/4] mm: numa: Mark huge PTEs young when clearing NUMA hinting faults

2015-03-07 Thread Mel Gorman
Base PTEs are marked young when the NUMA hinting information is cleared
but the same does not happen for huge pages which this patch addresses.
Note that migrated pages are not marked young as the base page migration
code does not assume that migrated pages have been referenced. This could
be addressed but is beyond the scope of this series, which is aimed at Dave
Chinner's shrink workload that is unlikely to be affected by this issue.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 194c0f019774..ae13ad31e113 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1359,6 +1359,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
pmd = pmd_modify(pmd, vma->vm_page_prot);
+   pmd = pmd_mkyoung(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH 0/4] Automatic NUMA balancing and PROT_NONE handling followup v2r8

2015-03-07 Thread Mel Gorman
Dave Chinner reported a problem due to excessive NUMA balancing activity
and bisected it. The first patch in this series corrects a major problem
that is unlikely to affect Dave but is still serious. Patch 2 is a minor
cleanup that was spotted while looking at scan rate control. Patch 3 is
minor and unlikely to make a difference but is still an inconsistency
between base and THP handling. Patch 4 is the important one, it slows
PTE scan updates if migrations are failing or throttled. Details of the
performance impact on local tests is included in the patch.

 include/linux/migrate.h |  5 -
 include/linux/sched.h   |  9 +
 kernel/sched/fair.c |  8 ++--
 mm/huge_memory.c|  8 +---
 mm/memory.c |  3 ++-
 mm/migrate.c| 20 
 6 files changed, 18 insertions(+), 35 deletions(-)

-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/4] mm: thp: Return the correct value for change_huge_pmd

2015-03-07 Thread Mel Gorman
The wrong value is being returned by change_huge_pmd since commit
10c1045f28e8 ("mm: numa: avoid unnecessary TLB flushes when setting
NUMA hinting entries") which allows a fallthrough that tries to adjust
non-existent PTEs. This patch corrects it.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8cb5a82..194c0f019774 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1482,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+   ret = 1;
 
/*
 * Avoid trapping faults against the zero page. The read-only
@@ -1490,11 +1491,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 */
if (prot_numa && is_huge_zero_pmd(*pmd)) {
spin_unlock(ptl);
-   return 0;
+   return ret;
}
 
if (!prot_numa || !pmd_protnone(*pmd)) {
-   ret = 1;
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
ret = HPAGE_PMD_NR;
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-07 Thread Mel Gorman
On Sat, Mar 07, 2015 at 05:36:58PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman  wrote:
> 
> > Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
> > 
> > Across the board the 4.0-rc1 numbers are much slower, and the 
> > degradation is far worse when using the large memory footprint 
> > configs. Perf points straight at the cause - this is from 4.0-rc1 on 
> > the "-o bhash=101073" config:
> > 
> > [...]
> 
> >4.0.0-rc1   4.0.0-rc1  3.19.0
> >  vanilla  slowscan-v2 vanilla
> > User53384.2956093.1146119.12
> > System692.14  311.64  306.41
> > Elapsed  1236.87 1328.61 1039.88
> > 
> > Note that the system CPU usage is now similar to 3.19-vanilla.
> 
> Similar, but still worse, and also the elapsed time is still much 
> worse. User time is much higher, although it's the same amount of work 
> done on every kernel, right?
> 

Elapsed time is primarily worse on one benchmark -- numa01 which is an
adverse workload. The user time differences are also dominated by that
benchmark

   4.0.0-rc1 4.0.0-rc1  
  3.19.0
 vanilla slowscan-v2r7  
 vanilla
Time User-NUMA01  32883.59 (  0.00%)35288.00 ( -7.31%)
25695.96 ( 21.86%)
Time User-NUMA01_THEADLOCAL   17453.20 (  0.00%)17765.79 ( -1.79%)
17404.36 (  0.28%)
Time User-NUMA02   2063.70 (  0.00%) 2063.22 (  0.02%) 
2037.65 (  1.26%)
Time User-NUMA02_SMT983.70 (  0.00%)  976.01 (  0.78%)  
981.02 (  0.27%)


> > I also tested with a workload very similar to Dave's. The machine 
> > configuration and storage is completely different so it's not an 
> > equivalent test unfortunately. It's reporting the elapsed time and 
> > CPU time while fsmark is running to create the inodes and when 
> > running xfsrepair afterwards
> > 
> > xfsrepair
> > 4.0.0-rc1 4.0.0-rc1 
> >3.19.0
> >   vanilla   slowscan-v2 
> >   vanilla
> > Min  real-fsmark1157.41 (  0.00%) 1150.38 (  0.61%) 
> > 1164.44 ( -0.61%)
> > Min  syst-fsmark3998.06 (  0.00%) 3988.42 (  0.24%) 
> > 4016.12 ( -0.45%)
> > Min  real-xfsrepair  497.64 (  0.00%)  456.87 (  8.19%)  
> > 442.64 ( 11.05%)
> > Min  syst-xfsrepair  500.61 (  0.00%)  263.41 ( 47.38%)  
> > 194.97 ( 61.05%)
> > Ameanreal-fsmark1166.63 (  0.00%) 1155.97 (  0.91%) 
> > 1166.28 (  0.03%)
> > Ameansyst-fsmark4020.94 (  0.00%) 4004.19 (  0.42%) 
> > 4025.87 ( -0.12%)
> > Ameanreal-xfsrepair  507.85 (  0.00%)  459.58 (  9.50%)  
> > 447.66 ( 11.85%)
> > Ameansyst-xfsrepair  519.88 (  0.00%)  281.63 ( 45.83%)  
> > 202.93 ( 60.97%)
> > Stddev   real-fsmark   6.55 (  0.00%)3.97 ( 39.30%)
> > 1.44 ( 77.98%)
> > Stddev   syst-fsmark  16.22 (  0.00%)   15.09 (  6.96%)
> > 9.76 ( 39.86%)
> > Stddev   real-xfsrepair   11.17 (  0.00%)3.41 ( 69.43%)
> > 5.57 ( 50.17%)
> > Stddev   syst-xfsrepair   13.98 (  0.00%)   19.94 (-42.60%)
> > 5.69 ( 59.31%)
> > CoeffVar real-fsmark   0.56 (  0.00%)0.34 ( 38.74%)
> > 0.12 ( 77.97%)
> > CoeffVar syst-fsmark   0.40 (  0.00%)0.38 (  6.57%)
> > 0.24 ( 39.93%)
> > CoeffVar real-xfsrepair2.20 (  0.00%)0.74 ( 66.22%)
> > 1.24 ( 43.47%)
> > CoeffVar syst-xfsrepair2.69 (  0.00%)7.08 (-163.23%)
> > 2.80 ( -4.23%)
> > Max  real-fsmark1171.98 (  0.00%) 1159.25 (  1.09%) 
> > 1167.96 (  0.34%)
> > Max  syst-fsmark4033.84 (  0.00%) 4024.53 (  0.23%) 
> > 4039.20 ( -0.13%)
> > Max  real-xfsrepair  523.40 (  0.00%)  464.40 ( 11.27%)  
> > 455.42 ( 12.99%)
> > Max  syst-xfsrepair  533.37 (  0.00%)  309.38 ( 42.00%)  
> > 207.94 ( 61.01%)
> > 
> > The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
> > is almost cut in half. It's still not as low as 3.19-vanilla but it's
> > much closer
> > 
> >  4.0.0-rc1   4.0.0-rc1  3.19.0
> >vanilla  slowscan

Re: [PATCH 1/4] mm: thp: Return the correct value for change_huge_pmd

2015-03-07 Thread Mel Gorman
On Sat, Mar 07, 2015 at 12:31:03PM -0800, Linus Torvalds wrote:
> On Sat, Mar 7, 2015 at 7:20 AM, Mel Gorman  wrote:
> >
> > if (!prot_numa || !pmd_protnone(*pmd)) {
> > -   ret = 1;
> > entry = pmdp_get_and_clear_notify(mm, addr, pmd);
> > entry = pmd_modify(entry, newprot);
> > ret = HPAGE_PMD_NR;
> 
> Hmm. I know I acked this already, but the return value - which correct
> - is still potentially something we could improve upon.
> 
> In particular, we don't need to flush the TLB's if the old entry was
> not present. Sadly, we don't have a helper function for that.
> 
> But the code *could* do something like
> 
> entry = pmdp_get_and_clear_notify(mm, addr, pmd);
> ret = pmd_tlb_cacheable(entry) ? HPAGE_PMD_NR : 1;
> entry = pmd_modify(entry, newprot);
> 
> where pmd_tlb_cacheable() on x86 would test if _PAGE_PRESENT (bit #0) is set.
> 

I agree with you in principle. pmd_tlb_cacheable looks and sounds very
similar to pte_accessible().
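
To make that concrete, a minimal x86 sketch of the helper described above
might look like the following. The name and the _PAGE_PRESENT check are
taken from the suggestion, so treat both as assumptions rather than existing
kernel code:

	/*
	 * Hypothetical helper, not in the tree: true if the old entry could
	 * be cached in the TLB, i.e. a flush is needed when changing it.
	 * On x86 this reduces to testing _PAGE_PRESENT, much as
	 * pte_accessible() does for PTEs.
	 */
	static inline bool pmd_tlb_cacheable(pmd_t pmd)
	{
		return pmd_flags(pmd) & _PAGE_PRESENT;
	}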

> In particular, that would mean that as we change *from* a protnone
> (whether NUMA or really protnone) we wouldn't need to flush the TLB.
> 
> In fact, we could make it even more aggressive: it's not just an old
> non-present TLB entry that doesn't need flushing - we can avoid the
> flushing whenever we strictly increase the access rigths. So we could
> have something that takes the old entry _and_ the new protections into
> account, and avoids the TLB flush if the new entry is strictly more
> permissive.
> 
> This doesn't explain the extra TLB flushes Dave sees, though, because
> the old code didn't make those kinds of optimizations either. But
> maybe something like this is worth doing.
> 

I think it is worth doing although it'll be after LSF/MM before I do it. I
severely doubt this is what Dave is seeing because the vmstats indicated
there was no THP activity but it's still a good idea.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Mel Gorman
On Sun, Mar 08, 2015 at 11:02:23AM +0100, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar  wrote:
> > >
> > > And the patch Dave bisected to is a relatively simple patch. Why 
> > > not simply revert it to see whether that cures much of the 
> > > problem?
> > 
> > So the problem with that is that "pmd_set_numa()" and friends simply 
> > no longer exist. So we can't just revert that one patch, it's the 
> > whole series, and the whole point of the series.
> 
> Yeah.
> 
> > What confuses me is that the only real change that I can see in that 
> > patch is the change to "change_huge_pmd()". Everything else is 
> > pretty much a 100% equivalent transformation, afaik. Of course, I 
> > may be wrong about that, and missing something silly.
> 
> Well, there's a difference in what we write to the pte:
> 
>  #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL+1)
>  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
> 
> and our expectation was that the two should be equivalent methods from 
> the POV of the NUMA balancing code, right?
> 

Functionally yes but performance-wise no. We are now using the global bit
for NUMA faults at the very least.

> > And the changes to "change_huge_pmd()" were basically re-done
> > differently by subsequent patches anyway.
> > 
> > The *only* change I see remaining is that change_huge_pmd() now does
> > 
> >entry = pmdp_get_and_clear_notify(mm, addr, pmd);
> >entry = pmd_modify(entry, newprot);
> >set_pmd_at(mm, addr, pmd, entry);
> > 
> > for all changes. It used to do that "pmdp_set_numa()" for the
> > prot_numa case, which did just
> > 
> >pmd_t pmd = *pmdp;
> >pmd = pmd_mknuma(pmd);
> >set_pmd_at(mm, addr, pmdp, pmd);
> > 
> > instead.
> > 
> > I don't like the old pmdp_set_numa() because it can drop dirty bits,
> > so I think the old code was actively buggy.
> 
> Could we, as a silly testing hack not to be applied, write a 
> hack-patch that re-introduces the racy way of setting the NUMA bit, to 
> confirm that it is indeed this difference that changes pte visibility 
> across CPUs enough to create so many more faults?
> 

This was already done and tested by Dave but while it helped, it was
not enough.  As the approach was inherently unsafe it was dropped and the
throttling approach taken. However, the fact it made little difference
may indicate that this is somehow related to the global bit being used.

> Because if the answer is 'yes', then we can safely say: 'we regressed 
> performance because correctness [not dropping dirty bits] comes before 
> performance'.
> 
> If the answer is 'no', then we still have a mystery (and a regression) 
> to track down.
> 
> As a second hack (not to be applied), could we change:
> 
>  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
> 
> to:
> 
>  #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
> 

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
can be done.

> to double check that the position of the bit does not matter?
> 

It's worth checking in case it's a case of how the global bit is
treated. However, note that Dave is currently travelling for LSF/MM in
Boston and there is a chance he cannot test this week at all. I'm just
after landing in the hotel myself. I'll try find time during during one
of the breaks tomorrow but if the wireless is too crap then accessing the
test machine remotely might be an issue.

> I don't think we've exhaused all avenues of analysis here.
> 

True.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-09 Thread Mel Gorman
On Sun, Mar 08, 2015 at 08:40:25PM +, Mel Gorman wrote:
> > Because if the answer is 'yes', then we can safely say: 'we regressed 
> > performance because correctness [not dropping dirty bits] comes before 
> > performance'.
> > 
> > If the answer is 'no', then we still have a mystery (and a regression) 
> > to track down.
> > 
> > As a second hack (not to be applied), could we change:
> > 
> >  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
> > 
> > to:
> > 
> >  #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
> > 
> 
> In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
> as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
> can be done.
> 

More importantly, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just
updating the value should crash. For the purposes of testing the idea, I
thought the straight-forward option was to break soft dirty page tracking
and steal their bit for testing (patch below). Took most of the day to
get access to the test machine so tests are not long running and only
the autonuma one has completed;

autonumabench
  3.19.0 4.0.0-rc1  
   4.0.0-rc1 4.0.0-rc1
 vanilla   vanilla  
   slowscan-v2r7protnone-v3
Time User-NUMA01  25695.96 (  0.00%)32883.59 (-27.97%)
35288.00 (-37.33%)35236.21 (-37.13%)
Time User-NUMA01_THEADLOCAL   17404.36 (  0.00%)17453.20 ( -0.28%)
17765.79 ( -2.08%)17590.10 ( -1.07%)
Time User-NUMA02   2037.65 (  0.00%) 2063.70 ( -1.28%) 
2063.22 ( -1.25%) 2072.95 ( -1.73%)
Time User-NUMA02_SMT981.02 (  0.00%)  983.70 ( -0.27%)  
976.01 (  0.51%)  983.42 ( -0.24%)
Time System-NUMA01  194.70 (  0.00%)  602.44 (-209.42%) 
 209.42 ( -7.56%)  737.36 (-278.72%)
Time System-NUMA01_THEADLOCAL98.52 (  0.00%)   78.10 ( 20.73%)  
 92.70 (  5.91%)   80.69 ( 18.10%)
Time System-NUMA029.28 (  0.00%)6.47 ( 30.28%)  
  6.06 ( 34.70%)6.63 ( 28.56%)
Time System-NUMA02_SMT3.79 (  0.00%)5.06 (-33.51%)  
  3.39 ( 10.55%)3.60 (  5.01%)
Time Elapsed-NUMA01 558.84 (  0.00%)  755.96 (-35.27%)  
833.63 (-49.17%)  804.50 (-43.96%)
Time Elapsed-NUMA01_THEADLOCAL  382.54 (  0.00%)  382.22 (  0.08%)  
395.45 ( -3.37%)  388.12 ( -1.46%)
Time Elapsed-NUMA02  49.83 (  0.00%)   49.38 (  0.90%)  
 50.21 ( -0.76%)   48.99 (  1.69%)
Time Elapsed-NUMA02_SMT  46.59 (  0.00%)   47.70 ( -2.38%)  
 48.55 ( -4.21%)   49.50 ( -6.25%)
Time CPU-NUMA014632.00 (  0.00%) 4429.00 (  4.38%) 
4258.00 (  8.07%) 4471.00 (  3.48%)
Time CPU-NUMA01_THEADLOCAL 4575.00 (  0.00%) 4586.00 ( -0.24%) 
4515.00 (  1.31%) 4552.00 (  0.50%)
Time CPU-NUMA024107.00 (  0.00%) 4191.00 ( -2.05%) 
4120.00 ( -0.32%) 4244.00 ( -3.34%)
Time CPU-NUMA02_SMT2113.00 (  0.00%) 2072.00 (  1.94%) 
2017.00 (  4.54%) 1993.00 (  5.68%)

  3.19.0   4.0.0-rc1   4.0.0-rc1   4.0.0-rc1
 vanilla vanillaslowscan-v2r7protnone-v3
User46119.1253384.2956093.1155882.82
System306.41  692.14  311.64  828.36
Elapsed  1039.88 1236.87 1328.61 1292.92

So just using a different bit doesn't seem to be it either

3.19.0   4.0.0-rc1   4.0.0-rc1   4.0.0-rc1
   vanilla vanillaslowscan-v2r7protnone-v3
NUMA alloc hit 1202922 1437560 1472578 1499274
NUMA alloc miss  0   0   0   0
NUMA interleave hit  0   0   0   0
NUMA alloc local   1200683 1436781 1472226 1498680
NUMA base PTE updates222840103   304513172   121532313   337431414
NUMA huge PMD updates   434894  594467  237170  658715
NUMA page range updates  445505831   608880276   242963353   674693494
NUMA hint faults601358  733491  334334  820793
NUMA hint local faults  371571  511530  227171  565003
NUMA hint local percent 61  69  67  68
NUMA pages migrated707317726366701 860708231288355

Patch to use a bit other than the global bit for prot none is below.

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 8c7c10802e9c..1f243323693c 100644
--- a/arch/x86/include/asm/pgtable_types.h

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-10 Thread Mel Gorman
On Mon, Mar 09, 2015 at 09:02:19PM +, Mel Gorman wrote:
> On Sun, Mar 08, 2015 at 08:40:25PM +0000, Mel Gorman wrote:
> > > Because if the answer is 'yes', then we can safely say: 'we regressed 
> > > performance because correctness [not dropping dirty bits] comes before 
> > > performance'.
> > > 
> > > If the answer is 'no', then we still have a mystery (and a regression) 
> > > to track down.
> > > 
> > > As a second hack (not to be applied), could we change:
> > > 
> > >  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
> > > 
> > > to:
> > > 
> > >  #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
> > > 
> > 
> > In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
> > as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
> > can be done.
> > 
> 
> More importantly, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just
> updating the value should crash. For the purposes of testing the idea, I
> thought the straight-forward option was to break soft dirty page tracking
> and steal their bit for testing (patch below). Took most of the day to
> get access to the test machine so tests are not long running and only
> the autonuma one has completed;
> 

And the xfsrepair workload also does not show any benefit from using a
different bit either

   3.19.0 4.0.0-rc1 
4.0.0-rc1 4.0.0-rc1
  vanilla   vanilla 
slowscan-v2r7protnone-v3r17
Min  real-fsmark1164.44 (  0.00%) 1157.41 (  0.60%) 1150.38 
(  1.21%) 1173.22 ( -0.75%)
Min  syst-fsmark4016.12 (  0.00%) 3998.06 (  0.45%) 3988.42 
(  0.69%) 4037.90 ( -0.54%)
Min  real-xfsrepair  442.64 (  0.00%)  497.64 (-12.43%)  456.87 
( -3.21%)  489.60 (-10.61%)
Min  syst-xfsrepair  194.97 (  0.00%)  500.61 (-156.76%)  
263.41 (-35.10%)  544.56 (-179.30%)
Ameanreal-fsmark1166.28 (  0.00%) 1166.63 ( -0.03%) 1155.97 
(  0.88%) 1183.19 ( -1.45%)
Ameansyst-fsmark4025.87 (  0.00%) 4020.94 (  0.12%) 4004.19 
(  0.54%) 4061.64 ( -0.89%)
Ameanreal-xfsrepair  447.66 (  0.00%)  507.85 (-13.45%)  459.58 
( -2.66%)  498.71 (-11.40%)
Ameansyst-xfsrepair  202.93 (  0.00%)  519.88 (-156.19%)  
281.63 (-38.78%)  569.21 (-180.50%)
Stddev   real-fsmark   1.44 (  0.00%)6.55 (-354.10%)
3.97 (-175.65%)9.20 (-537.90%)
Stddev   syst-fsmark   9.76 (  0.00%)   16.22 (-66.27%)   15.09 
(-54.69%)   17.47 (-79.13%)
Stddev   real-xfsrepair5.57 (  0.00%)   11.17 (-100.68%)
3.41 ( 38.66%)6.77 (-21.63%)
Stddev   syst-xfsrepair5.69 (  0.00%)   13.98 (-145.78%)   
19.94 (-250.49%)   20.03 (-252.05%)
CoeffVar real-fsmark   0.12 (  0.00%)0.56 (-353.96%)
0.34 (-178.11%)0.78 (-528.79%)
CoeffVar syst-fsmark   0.24 (  0.00%)0.40 (-66.48%)0.38 
(-55.53%)0.43 (-77.55%)
CoeffVar real-xfsrepair1.24 (  0.00%)2.20 (-76.89%)0.74 
( 40.25%)1.36 ( -9.17%)
CoeffVar syst-xfsrepair2.80 (  0.00%)2.69 (  4.06%)7.08 
(-152.54%)3.52 (-25.51%)
Max  real-fsmark1167.96 (  0.00%) 1171.98 ( -0.34%) 1159.25 
(  0.75%) 1195.41 ( -2.35%)
Max  syst-fsmark4039.20 (  0.00%) 4033.84 (  0.13%) 4024.53 
(  0.36%) 4079.45 ( -1.00%)
Max  real-xfsrepair  455.42 (  0.00%)  523.40 (-14.93%)  464.40 
( -1.97%)  505.82 (-11.07%)
Max  syst-xfsrepair  207.94 (  0.00%)  533.37 (-156.50%)  
309.38 (-48.78%)  593.62 (-185.48%)

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-12 Thread Mel Gorman
On Tue, Mar 10, 2015 at 04:55:52PM -0700, Linus Torvalds wrote:
> On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner  wrote:
> > On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> >>
> >> What's your virtual environment setup? Kernel config, and
> >> virtualization environment to actually get that odd fake NUMA thing
> >> happening?
> >
> > I don't have the exact .config with me (test machines at home
> > are shut down because I'm half a world away), but it's pretty much
> > this (copied and munged from a similar test vm on my laptop):
> 
> [ snip snip ]
> 
> Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
> and went back to actually *thinking* about the code instead of trying
> to reproduce this and figure things out by trial and error.
> 
> And I think I figured it out.
> 

I believe you're correct and it matches what was observed. I'm still
travelling and wireless is dirt but managed to queue a test using pmd_dirty

  3.19.0 4.0.0-rc1  
   4.0.0-rc1
 vanilla   vanilla  
  ptewrite-v1r20
Time User-NUMA01  25695.96 (  0.00%)32883.59 (-27.97%)
24012.80 (  6.55%)
Time User-NUMA01_THEADLOCAL   17404.36 (  0.00%)17453.20 ( -0.28%)
17950.54 ( -3.14%)
Time User-NUMA02   2037.65 (  0.00%) 2063.70 ( -1.28%) 
2046.88 ( -0.45%)
Time User-NUMA02_SMT981.02 (  0.00%)  983.70 ( -0.27%)  
983.68 ( -0.27%)
Time System-NUMA01  194.70 (  0.00%)  602.44 (-209.42%) 
 158.90 ( 18.39%)
Time System-NUMA01_THEADLOCAL98.52 (  0.00%)   78.10 ( 20.73%)  
107.66 ( -9.28%)
Time System-NUMA029.28 (  0.00%)6.47 ( 30.28%)  
  9.25 (  0.32%)
Time System-NUMA02_SMT3.79 (  0.00%)5.06 (-33.51%)  
  3.92 ( -3.43%)
Time Elapsed-NUMA01 558.84 (  0.00%)  755.96 (-35.27%)  
532.41 (  4.73%)
Time Elapsed-NUMA01_THEADLOCAL  382.54 (  0.00%)  382.22 (  0.08%)  
390.48 ( -2.08%)
Time Elapsed-NUMA02  49.83 (  0.00%)   49.38 (  0.90%)  
 49.79 (  0.08%)
Time Elapsed-NUMA02_SMT  46.59 (  0.00%)   47.70 ( -2.38%)  
 47.77 ( -2.53%)
Time CPU-NUMA014632.00 (  0.00%) 4429.00 (  4.38%) 
4539.00 (  2.01%)
Time CPU-NUMA01_THEADLOCAL 4575.00 (  0.00%) 4586.00 ( -0.24%) 
4624.00 ( -1.07%)
Time CPU-NUMA024107.00 (  0.00%) 4191.00 ( -2.05%) 
4129.00 ( -0.54%)
Time CPU-NUMA02_SMT2113.00 (  0.00%) 2072.00 (  1.94%) 
2067.00 (  2.18%)

  3.19.0   4.0.0-rc1   4.0.0-rc1
 vanilla vanillaptewrite-v1r20
User46119.1253384.2944994.10
System306.41  692.14  279.78
Elapsed  1039.88 1236.87 1022.92

There are still some differences but it's much closer to what it was.
The balancing stats are almost looking similar to 3.19

NUMA base PTE updates222840103   304513172   230724075
NUMA huge PMD updates   434894  594467  450274
NUMA page range updates  445505831   608880276   461264363
NUMA hint faults601358  733491  626176
NUMA hint local faults  371571  511530  359215
NUMA hint local percent 61  69  57
NUMA pages migrated707317726366701 6829196

XFS repair on the same machine is not fully restored either but it is a big
enough move in the right direction to indicate this was the relevant
change.

xfsrepair
   3.19.0 4.0.0-rc1 
4.0.0-rc1
  vanilla   vanilla
ptewrite-v1r20
Ameanreal-fsmark1166.28 (  0.00%) 1166.63 ( -0.03%) 1184.97 
( -1.60%)
Ameansyst-fsmark4025.87 (  0.00%) 4020.94 (  0.12%) 4071.10 
( -1.12%)
Ameanreal-xfsrepair  447.66 (  0.00%)  507.85 (-13.45%)  460.94 
( -2.97%)
Ameansyst-xfsrepair  202.93 (  0.00%)  519.88 (-156.19%)  
282.45 (-39.19%)
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-12 Thread Mel Gorman
On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote:
> On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman  wrote:
> >
> > I believe you're correct and it matches what was observed. I'm still
> > travelling and wireless is dirt but managed to queue a test using pmd_dirty
> 
> Ok, thanks.
> 
> I'm not entirely happy with that change, and I suspect the whole
> heuristic should be looked at much more (maybe it should also look at
> whether it's executable, for example), but it's a step in the right
> direction.
> 

I can follow up when I'm back in work properly. As you have already pulled
this in directly, can you also consider pulling in "mm: thp: return the
correct value for change_huge_pmd" please? The other two patches were very
minor and can be resent through the normal paths later.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Mel Gorman
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
> >  - something completely different that I am entirely missing
> 
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.
> 

Minimally, there is still the window where we clear the PTE to set the
protections. During that window, a fault can occur. In the old code which
was inherently racy and unsafe, the fault might still go ahead deferring
a potential migration for a short period. In the current code, it'll stall
on the lock, notice the PTE is changed and refault so the overhead is very
different but functionally correct.
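
The stall-and-refault behaviour comes from the recheck at the start of
do_numa_page(), roughly along these lines (paraphrased, not a verbatim quote
of the current tree):

	/*
	 * Retake the PTE lock and bail out if the entry changed underneath
	 * us, e.g. mid protection update; the task simply refaults.
	 */
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*ptep, pte))) {
		pte_unmap_unlock(ptep, ptl);
		goto out;
	}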

In the old code, pte_write had complex interactions with background
cleaning and sync in the case of file mappings (not applicable to Dave's
case but still it's unpredictable behaviour). pte_dirty is close but there
are interactions with the application as the timing of writes vs the PTE
scanner matter.

Even if we restored the original behaviour, it would still be very difficult
to understand all the interactions between userspace and kernel.  The patch
below should be tested because it's clearer what the intent is. Using
the VMA flags is coarse but it's not vulnerable to timing artifacts that
behave differently depending on the machine. My preliminary testing shows
it helps but not by much. It does not restore performance to where it was
but it's easier to understand, which is important if there are changes in
the scheduler later.

In combination, I also think that slowing PTE scanning when migration fails
is the correct action even if it is unrelated to the patch Dave bisected
to. It's stupid to increase scanning rates and incur more faults when
migrations are failing so I'll be testing that next.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
flags |= TNF_FAULT_LOCAL;
}
 
-   /*
-* Avoid grouping on DSO/COW pages in specific and RO pages
-* in general, RO pages shouldn't hurt as much anyway since
-* they can be in shared cache state.
-*
-* FIXME! This checks "pmd_dirty()" as an approximation of
-* "is this a read-only page", since checking "pmd_write()"
-* is even more broken. We haven't actually turned this into
-* a writable page, so pmd_write() will always be false.
-*/
-   if (!pmd_dirty(pmd))
+   /* See similar comment in do_numa_page for explanation */
+   if (!(vma->vm_flags & VM_WRITE))
flags |= TNF_NO_GROUP;
 
/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
}
 
/*
-* Avoid grouping on DSO/COW pages in specific and RO pages
-* in general, RO pages shouldn't hurt as much anyway since
-* they can be in shared cache state.
+* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+* much anyway since they can be in shared cache state. This misses
+* the case where a mapping is writable but the process never writes
+* to it but pte_write gets cleared during protection updates and
+* pte_dirty has unpredictable behaviour between PTE scan updates,
+* background writeback, dirty balancing and application behaviour.
 *
-* FIXME! This checks "pmd_dirty()" as an approximation of
-* "is this a read-only page", since checking "pmd_write()"
-* is even more broken. We haven't actually turned this into
-* a writable page, so pmd_write() will always be false.
+* TODO: Note that the ideal here would be to avoid a situation where a
+* NUMA fault is taken immediately followed by a write fault in
+* some cases which would have lower overhead overall but would be
+* invasive as the fault paths would need to be unified.
 */
-   if (!pte_dirty(pte))
+   if (!(vma->vm_flags & VM_WRITE))
flags |= TNF_NO_GROUP;
 
/*
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-20 Thread Mel Gorman
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner  wrote:
> >
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime,however, is unchanged from my original
> > measurements of 4m54s (pte_hack came in at 5m20s).
> 
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a "factor of two" than a factor of 20.
> 
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the "1.4% from
> tlb_invalidate_range()" vs "pretty much everything from migration").
> 
> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.

TNF_NO_GROUP affects whether the scheduler tries to group related processes
together. Whether migration occurs depends on what node a process is
scheduled on. If processes are aggressively grouped inappropriately then it
is possible there is a bug that causes the load balancer to move processes
off a node (possible migration) with NUMA balancing trying to pull it back
(another possible migration). Small bugs there can result in excessive
migration.
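
For context, the flag is consumed in task_numa_fault() roughly as follows
(paraphrased from kernel/sched/fair.c of that era; the surrounding code is
omitted and the exact form may differ):

	/*
	 * Grouping is only attempted for shared accesses that were not
	 * flagged TNF_NO_GROUP by the fault handler.
	 */
	if (!priv && !(flags & TNF_NO_GROUP))
		task_numa_group(p, last_cpupid, flags, &priv);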

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-20 Thread Mel Gorman
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).
> 

As a side-note, I did test a patch that checked pte_write and preserved
it across both faults and setting the protections. It did not alter
migration activity much but there was a  drop in minor faults - 20% drop in
autonumabench, 58% drop in xfsrepair workload. I'm assuming this is due to
refaults to mark pages writable.  The patch looks and is hacky so I won't
post it to save people bleaching their eyes. I'll spend some time soon
(hopefully today) at a smooth way of falling through to WP checks after
trapping a NUMA fault.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-23 Thread Mel Gorman
On Fri, Mar 20, 2015 at 10:02:23AM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner  wrote:
> >
> > Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> > 7m36s. IOWs's a bit better, but not significantly. page migrations
> > are pretty much unchanged, too:
> >
> >558,632  migrate:mm_migrate_pages ( +-  6.38% )
> 
> Ok. That was kind of the expected thing.
> 
> I don't really know the NUMA fault rate limiting code, but one thing
> that strikes me is that if it tries to balance the NUMA faults against
> the *regular* faults, then maybe just the fact that we end up taking
> more COW faults after a NUMA fault then means that the NUMA rate
> limiting code now gets over-eager (because it sees all those extra
> non-numa faults).
> 
> Mel, does that sound at all possible? I really have never looked at
> the magic automatic rate handling..
> 

It should not be trying to balance against regular faults as it has no
information on it. The trapping of additional faults to mark the PTE
writable will alter timing so it indirectly affects how many migration
faults there are but this is only a side-effect IMO.

There is more overhead now due to losing the writable information and
that should be reduced so I tried a few approaches.  Ultimately, the one
that performed the best and was easiest to understand simply preserved
the writable bit across the protection update and page fault. I'll post
it later when I stick a changelog on it.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/3] mm: numa: Group related processes based on VMA flags instead of page table flags

2015-03-23 Thread Mel Gorman
Threads that share writable data within pages are grouped together as
related tasks. This decision is based on whether the PTE is marked dirty
which is subject to timing races between the PTE scanner update and when the
application writes the page. If the page is file-backed, then background
flushes and sync also affect placement. This is unpredictable behaviour
which is impossible to reason about so this patch makes grouping decisions
based on the VMA flags.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 13 ++---
 mm/memory.c  | 19 +++
 2 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
flags |= TNF_FAULT_LOCAL;
}
 
-   /*
-* Avoid grouping on DSO/COW pages in specific and RO pages
-* in general, RO pages shouldn't hurt as much anyway since
-* they can be in shared cache state.
-*
-* FIXME! This checks "pmd_dirty()" as an approximation of
-* "is this a read-only page", since checking "pmd_write()"
-* is even more broken. We haven't actually turned this into
-* a writable page, so pmd_write() will always be false.
-*/
-   if (!pmd_dirty(pmd))
+   /* See similar comment in do_numa_page for explanation */
+   if (!(vma->vm_flags & VM_WRITE))
flags |= TNF_NO_GROUP;
 
/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
}
 
/*
-* Avoid grouping on DSO/COW pages in specific and RO pages
-* in general, RO pages shouldn't hurt as much anyway since
-* they can be in shared cache state.
+* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+* much anyway since they can be in shared cache state. This misses
+* the case where a mapping is writable but the process never writes
+* to it but pte_write gets cleared during protection updates and
+* pte_dirty has unpredictable behaviour between PTE scan updates,
+* background writeback, dirty balancing and application behaviour.
 *
-* FIXME! This checks "pmd_dirty()" as an approximation of
-* "is this a read-only page", since checking "pmd_write()"
-* is even more broken. We haven't actually turned this into
-* a writable page, so pmd_write() will always be false.
+* TODO: Note that the ideal here would be to avoid a situation where a
+* NUMA fault is taken immediately followed by a write fault in
+* some cases which would have lower overhead overall but would be
+* invasive as the fault paths would need to be unified.
 */
-   if (!pte_dirty(pte))
+   if (!(vma->vm_flags & VM_WRITE))
flags |= TNF_NO_GROUP;
 
/*
-- 
2.1.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-23 Thread Mel Gorman
These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd"). It was known that the
performance in 3.19 was still better even if it is far less safe. This series aims to
restore the performance without compromising on safety.

Dave, you already tested patch 1 on its own but it would be nice to test
patches 1+2 and 1+2+3 separately just to be certain.

For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
three patches applied on top

autonumabench
  3.19.0 4.0.0-rc4  
   4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
 vanilla   vanilla  
vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
Time System-NUMA01  124.00 (  0.00%)  161.86 (-30.53%)  
107.13 ( 13.60%)  103.13 ( 16.83%)  145.01 (-16.94%)
Time System-NUMA01_THEADLOCAL   115.54 (  0.00%)  107.64 (  6.84%)  
131.87 (-14.13%)   83.30 ( 27.90%)   92.35 ( 20.07%)
Time System-NUMA029.35 (  0.00%)   10.44 (-11.66%)  
  8.95 (  4.28%)   10.72 (-14.65%)8.16 ( 12.73%)
Time System-NUMA02_SMT3.87 (  0.00%)4.63 (-19.64%)  
  4.57 (-18.09%)3.99 ( -3.10%)3.36 ( 13.18%)
Time Elapsed-NUMA01 570.06 (  0.00%)  567.82 (  0.39%)  
515.78 (  9.52%)  517.26 (  9.26%)  543.80 (  4.61%)
Time Elapsed-NUMA01_THEADLOCAL  393.69 (  0.00%)  384.83 (  2.25%)  
384.10 (  2.44%)  384.31 (  2.38%)  380.73 (  3.29%)
Time Elapsed-NUMA02  49.09 (  0.00%)   49.33 ( -0.49%)  
 48.86 (  0.47%)   48.78 (  0.63%)   50.94 ( -3.77%)
Time Elapsed-NUMA02_SMT  47.51 (  0.00%)   47.15 (  0.76%)  
 47.98 ( -0.99%)   48.12 ( -1.28%)   49.56 ( -4.31%)

  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
User46334.6046391.9444383.9543971.8944372.12
System252.84  284.66  252.61  201.24  249.00
Elapsed  1062.14 1050.96  998.68 1000.94 1026.78

Overall the system CPU usage is comparable and the test is naturally a bit
variable. The slowing of the scanner hurts numa01 but on this machine it is
an adverse workload and patches that dramatically help it often hurt
absolutely everything else.

Due to patch 2, the fault activity is interesting

3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   
4.0.0-rc4
   vanilla 
vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
Minor Faults   2097811 2656646 2597249 1981230 
1636841
Major Faults   362 450 365 364  
   365

Note how preserving the write bit across protection updates and faults
reduces the number of minor faults.

NUMA alloc hit 1229008 1217015 1191660 1178322 
1199681
NUMA alloc miss  0   0   0   0  
 0
NUMA interleave hit  0   0   0   0  
 0
NUMA alloc local   1228514 1216317 1190871 1177448 
1199021
NUMA base PTE updates245706197   240041607   238195516   244704842   
115012800
NUMA huge PMD updates   479530  468448  464868  477573  
224487
NUMA page range updates  491225557   479886983   476207932   48918   
229950144
NUMA hint faults659753  656503  641678  656926  
294842
NUMA hint local faults  381604  373963  360478  337585  
186249
NUMA hint local percent 57  56  56  51  
63
NUMA pages migrated5412140 6374899 6266530 5277468 
5755096
AutoNUMA cost5121%   5083%   4994%   5097%  
 2388%

Here the impact of slowing the PTE scanner on migration failures is obvious
as "NUMA base PTE updates" and "NUMA huge PMD updates" are massively reduced
even though the headline performance is very similar.

As xfsrepair was the reported workload here is the impact of the series on it.

xfsrepair
   3.19.0 4.0.0-rc4 
4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
  vanilla   vanilla  
vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
Min  real-fsmark1183.29 (  0.00%) 1165.73 (  

[PATCH 2/3] mm: numa: Preserve PTE write permissions across a NUMA hinting fault

2015-03-23 Thread Mel Gorman
Protecting a PTE to trap a NUMA hinting fault clears the writable bit
and further faults are needed after trapping a NUMA hinting fault to
set the writable bit again. This patch preserves the writable bit when
trapping NUMA hinting faults. The impact is obvious from the number
of minor faults trapped during the basis balancing benchmark and the
system CPU usage;

autonumabench
   4.0.0-rc4 4.0.0-rc4
baseline  preserve
Time System-NUMA01  107.13 (  0.00%)  103.13 (  3.73%)
Time System-NUMA01_THEADLOCAL   131.87 (  0.00%)   83.30 ( 36.83%)
Time System-NUMA028.95 (  0.00%)   10.72 (-19.78%)
Time System-NUMA02_SMT4.57 (  0.00%)3.99 ( 12.69%)
Time Elapsed-NUMA01 515.78 (  0.00%)  517.26 ( -0.29%)
Time Elapsed-NUMA01_THEADLOCAL  384.10 (  0.00%)  384.31 ( -0.05%)
Time Elapsed-NUMA02  48.86 (  0.00%)   48.78 (  0.16%)
Time Elapsed-NUMA02_SMT  47.98 (  0.00%)   48.12 ( -0.29%)

 4.0.0-rc4   4.0.0-rc4
  baselinepreserve
User  44383.9543971.89
System  252.61  201.24
Elapsed 998.68 1000.94

Minor Faults   2597249 1981230
Major Faults   365 364

There is a similar drop in system CPU usage using Dave Chinner's xfsrepair 
workload

4.0.0-rc4 4.0.0-rc4
 baseline  preserve
Ameanreal-xfsrepair  454.14 (  0.00%)  442.36 (  2.60%)
Ameansyst-xfsrepair  277.20 (  0.00%)  204.68 ( 26.16%)

The patch looks hacky but the alternatives looked worse. The tidiest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse. It's not generally safe
to just mark the page writable during the fault if it's a write fault as
it may have been read-only for COW so that approach was discarded.

Signed-off-by: Mel Gorman 
---
 mm/huge_memory.c | 9 -
 mm/memory.c  | 8 +++-
 mm/mprotect.c| 3 +++
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f12e9fcf1a2..0a42d1521aa4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1260,6 +1260,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
int target_nid, last_cpupid = -1;
bool page_locked;
bool migrated = false;
+   bool was_writable;
int flags = 0;
 
/* A PROT_NONE fault should not end up here */
@@ -1354,7 +1355,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
goto out;
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
+   was_writable = pmd_write(pmd);
pmd = pmd_modify(pmd, vma->vm_page_prot);
+   if (was_writable)
+   pmd = pmd_mkwrite(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
@@ -1478,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+   bool preserve_write = prot_numa && pmd_write(*pmd);
ret = 1;
 
/*
@@ -1493,9 +1498,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (!prot_numa || !pmd_protnone(*pmd)) {
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+   if (preserve_write)
+   entry = pmd_mkwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
-   BUG_ON(pmd_write(entry));
+   BUG_ON(!preserve_write && pmd_write(entry));
}
spin_unlock(ptl);
}
diff --git a/mm/memory.c b/mm/memory.c
index 20beb6647dba..d20e12da3a3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3035,6 +3035,7 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
int last_cpupid;
int target_nid;
bool migrated = false;
+   bool was_writable = pte_write(pte);
int flags = 0;
 
/* A PROT_NONE fault should not end up here */
@@ -3059,6 +3060,8 @@ static int do_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
/* Make it present again */
pte = pte_modify(pte, vma->vm_page_prot);
pte = pte_mkyoung(pte);
+   if (was_writable)
+   pte = pte_mkwrite(pte);
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);
 
@@ -3075,11 +3078,6 @@ static int do_numa_page(struct mm_struct *mm, struct 

[PATCH 3/3] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-23 Thread Mel Gorman
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%  56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
  - default_send_IPI_mask_sequence_phys
 - 99.99% physflat_send_IPI_mask
- 99.37% native_send_call_func_ipi
 smp_call_function_many
   - native_flush_tlb_others
  - 99.85% flush_tlb_page
   ptep_clear_flush
   try_to_unmap_one
   rmap_walk
   try_to_unmap
   migrate_pages
   migrate_misplaced_page
 - handle_mm_fault
- 99.73% __do_page_fault
 trace_do_page_fault
 do_async_page_fault
   + async_page_fault
  0.63% native_send_call_func_single_ipi
 generic_exec_single
 smp_call_function_single

This is showing excessive migration activity even though excessive migrations
are meant to get throttled. Normally, the scan rate is tuned on a per-task
basis depending on the locality of faults.  However, if migrations fail
for any reason then the PTE scanner may scan faster if the faults continue
to be remote. This means there is higher system CPU overhead and fault
trapping at exactly the time we know that migrations cannot happen. This
patch tracks when migration failures occur and slows the PTE scanner.

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h | 9 +
 kernel/sched/fair.c   | 8 ++--
 mm/huge_memory.c  | 3 ++-
 mm/memory.c   | 3 ++-
 4 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..a419b65770d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1625,11 +1625,11 @@ struct task_struct {
 
/*
 * numa_faults_locality tracks if faults recorded during the last
-* scan window were remote/local. The task scan period is adapted
-* based on the locality of the faults with different weights
-* depending on whether they were shared or private faults
+* scan window were remote/local or failed to migrate. The task scan
+* period is adapted based on the locality of the faults with different
+* weights depending on whether they were shared or private faults
 */
-   unsigned long numa_faults_locality[2];
+   unsigned long numa_faults_locality[3];
 
unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1719,6 +1719,7 @@ struct task_struct {
 #define TNF_NO_GROUP   0x02
 #define TNF_SHARED 0x04
 #define TNF_FAULT_LOCAL0x08
+#define TNF_MIGRATE_FAIL 0x10
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..bcfe32088b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct *p,
/*
 * If there were no record hinting faults then either the task is
 * completely idle or all activity is areas that are not of interest
-* to automatic numa balancing. Scan slower
+* to automatic numa balancing. Related to that, if there were failed
+* migration then it implies we are migrating too quickly or the local
+* node is overloaded. In either case, scan slower
 */
-   if (local + shared == 0) {
+   if (local + shared == 0 || p->numa_faults_locality[2]) {
p->numa_scan_period = min(p->numa_scan_period_max,
p->numa_scan_period << 1);
 
@@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 
if (migrated)
p->numa_pages_migrated += pages;
+   if (flags & TNF_MIGRATE_FAIL)
+   p->numa_faults_locality[2] += pages;
 
p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a42d1521aa4..51b3e7c64622 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1350,7 +1350,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (migrated) {
flags |= TNF_MIGRATED;
page_nid = target_nid;
-   }
+   } else
+   flags |= TNF_MIGRATE_FAIL;
 
goto out;
 clear_pmdnuma:
diff --git a/mm/memory.c b/

Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-24 Thread Mel Gorman
On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 12:24:00PM +0000, Mel Gorman wrote:
> > These are three follow-on patches based on the xfsrepair workload Dave
> > Chinner reported was problematic in 4.0-rc1 due to changes in page table
> > management -- https://lkml.org/lkml/2015/3/1/226.
> > 
> > Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> > read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> > Return the correct value for change_huge_pmd"). It was known that the 
> > performance
> > in 3.19 was still better even if is far less safe. This series aims to
> > restore the performance without compromising on safety.
> > 
> > Dave, you already tested patch 1 on its own but it would be nice to test
> > patches 1+2 and 1+2+3 separately just to be certain.
> 
>                       3.19  4.0-rc4      +p1      +p2      +p3
> mm_migrate_pages   266,750  572,839  558,632  223,706  201,429
> run time             4m54s    7m50s    7m20s    5m07s    4m31s
> 

Excellent, this is in line with predictions and roughly matches what I
was seeing on bare metal + real NUMA + spinning disk instead of KVM +
fake NUMA + SSD.

Editing slightly;

> numa stats from p1+p2:    numa_pte_updates 46109698
> numa stats from p1+p2+p3: numa_pte_updates 24460492

The big drop in PTE updates matches what I expected -- migration
failures should not lead to increased scan rates which is what patch 3
fixes. I'm also pleased that there was not a drop in performance.

> 
> OK, the summary with all patches applied:
> 
> config                          3.19   4.0-rc1  4.0-rc4  4.0-rc5+
> defaults                       8m08s     9m34s    9m14s     6m57s
> -o ag_stride=-1                4m04s     4m38s    4m11s     4m06s
> -o bhash=101073                6m04s    17m43s    7m35s     6m13s
> -o ag_stride=-1,bhash=101073   4m54s     9m58s    7m50s     4m31s
> 
> So it looks like the patch set fixes the remaining regression and in
> 2 of the four cases actually improves performance
> 

\o/

Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
pte_mkyoung) are already in Andrew's tree. I'm expecting they'll reach
you before 4.0, assuming nothing else goes pear-shaped.

> Thanks, Linus and Mel, for tracking this tricky problem down! 
> 

Thanks Dave for persisting with this and collecting the necessary data.
FWIW, I've marked the xfsrepair test case as a "large memory test".
It'll take time before the test machines have historical data for it but
in theory if this regresses again then I should spot it eventually.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 1/2] mm: meminit: initialise more memory for inode/dentry hash tables in early boot

2016-03-03 Thread Mel Gorman
On Thu, Mar 03, 2016 at 03:01:40PM +0800, Li Zhang wrote:
> From: Li Zhang 
> 
> This patch is based on Mel Gorman's old patch in the mailing list,
> https://lkml.org/lkml/2015/5/5/280 which is discussed but it is
> fixed with a completion to wait for all memory initialised in
> page_alloc_init_late(). It is to fix the oom problem on X86
> with 24TB memory which allocates memory in late initialisation.
> But for Power platform with 32TB memory, it causes a call trace
> in vfs_caches_init->inode_init() and inode hash table needs more
> memory.
> So this patch allocates 1GB for 0.25TB/node for large system
> as it is mentioned in https://lkml.org/lkml/2015/5/1/627
> 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC 2/2] powerpc/mm: Enable page parallel initialisation

2016-03-03 Thread Mel Gorman
On Thu, Mar 03, 2016 at 03:01:41PM +0800, Li Zhang wrote:
> From: Li Zhang 
> 
> Parallel initialisation has been enabled for X86,
> boot time is improved greatly.
> On Power8, for small memory, it is improved greatly.
> Here is the result from my test on Power8 platform:
> 
> For 4GB memory: 57% is improved
> For 50GB memory: 22% is improved
> 
> Signed-off-by: Li Zhang 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC] mm: Fix memory corruption caused by deferred page initialization

2016-04-04 Thread Mel Gorman
On Thu, Mar 31, 2016 at 01:27:34PM +1100, Gavin Shan wrote:
> >So the issue is only existing when CONFIG_NO_BOOTMEM=n. The alternative fix 
> >would
> >be similar to what we have on !CONFIG_NO_BOOTMEM: In early stage, all page 
> >structs
> >for bootmem reserved pages are initialized and mark them with PG_reserved. 
> >I'm
> >not sure it's worthy to fix it as we won't support bootmem as Michael 
> >mentioned.
> >
> 
> Mel, could you please confirm if we need a fix on !CONFIG_NO_BOOTMEM? If we 
> need,
> I'll respin and send a patch for review.
> 

Given that CONFIG_NO_BOOTMEM=n is not supported and bootmem is meant to be
slowly retiring, I would suggest instead making deferred memory init
depend on NO_BOOTMEM.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations

2012-02-28 Thread Mel Gorman
On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
> While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
> Overcommit) on powerpc, we tripped the following:
> 
> kernel BUG at mm/bootmem.c:483!
> cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
> pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
> lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
> sp: c0c03bc0
>msr: 80021032
>   current = 0xc0b0cce0
>   paca= 0xc1d8
> pid   = 0, comm = swapper
> kernel BUG at mm/bootmem.c:483!
> enter ? for help
> [c0c03c80] c0a64bcc
> .sparse_early_usemaps_alloc_node+0x84/0x29c
> [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
> [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
> [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
> [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
> 
> This is
> 
> BUG_ON(limit && goal + size > limit);
> 
> and after some debugging, it seems that
> 
>   goal = 0x700
>   limit = 0x800
> 
> and sparse_early_usemaps_alloc_node ->
> sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls
> 
>   return alloc_bootmem_section(usemap_size() * count, section_nr);
> 
> This is on a system with 8TB available via the AMS pool, and as a quirk
> of AMS in firmware, all of that memory shows up in node 0. So, we end up
> with an allocation that will fail the goal/limit constraints. In theory,
> we could "fall-back" to alloc_bootmem_node() in
> sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
> defined, we'll BUG_ON() instead. A simple solution appears to be to
> disable the limit check if the size of the allocation in
> alloc_bootmem_section exceeds the section size.
> 
> Signed-off-by: Nishanth Aravamudan 
> Cc: Dave Hansen 
> Cc: Anton Blanchard 
> Cc: Paul Mackerras 
> Cc: Ben Herrenschmidt 
> Cc: Robert Jennings 
> Cc: linux...@kvack.org
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  include/linux/mmzone.h |2 ++
>  mm/bootmem.c   |5 -
>  2 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 650ba2f..4176834 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -967,6 +967,8 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
>   * PA_SECTION_SHIFT  physical address to/from section number
>   * PFN_SECTION_SHIFT pfn to/from section number
>   */
> +#define BYTES_PER_SECTION	(1UL << SECTION_SIZE_BITS)
> +
>  #define SECTIONS_SHIFT   (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
>  
>  #define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index 668e94d..5cbbc76 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -770,7 +770,10 @@ void * __init alloc_bootmem_section(unsigned long size,
>  
>   pfn = section_nr_to_pfn(section_nr);
>   goal = pfn << PAGE_SHIFT;
> - limit = section_nr_to_pfn(section_nr + 1) << PAGE_SHIFT;
> + if (size > BYTES_PER_SECTION)
> + limit = 0;
> + else
> + limit = section_nr_to_pfn(section_nr + 1) << PAGE_SHIFT;

As it's ok to spill the allocation over to an adjacent section, why not
just make limit==0 unconditionally? That would avoid defining
BYTES_PER_SECTION.
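
For illustration, the function would then boil down to something like this
(a sketch based on the hunk above; the bdata lookup and the call into the
bootmem allocator are assumed to be unchanged):

	pfn = section_nr_to_pfn(section_nr);
	goal = pfn << PAGE_SHIFT;
	/*
	 * No upper limit: it is acceptable for the allocation to spill
	 * into an adjacent section, so BYTES_PER_SECTION is not needed.
	 */
	limit = 0;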

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v2] bootmem/sparsemem: remove limit constraint in alloc_bootmem_section

2012-03-01 Thread Mel Gorman
On Wed, Feb 29, 2012 at 10:12:33AM -0800, Nishanth Aravamudan wrote:
> 
>
> Signed-off-by: Nishanth Aravamudan 
> 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/1] page_alloc.c: remove argument to pageblock_default_order

2012-05-10 Thread Mel Gorman
On Thu, May 03, 2012 at 04:37:49PM -0700, Andrew Morton wrote:
> On Thu,  3 May 2012 22:45:12 +0530
> rajman mekaco  wrote:
> 
> > When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not defined, then
> > pageblock_default_order has an argument to it.
> > 
> > However, free_area_init_core will call it without any argument
> > anyway.
> > 
> > Remove the argument to pageblock_default_order when
> > CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not defined.
> > 
> > Signed-off-by: rajman mekaco 
> > ---
> >  mm/page_alloc.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a712fb9..4b95412 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4274,7 +4274,7 @@ static inline void __init set_pageblock_order(unsigned int order)
> >   * at compile-time. See include/linux/pageblock-flags.h for the values of
> >   * pageblock_order based on the kernel config
> >   */
> > -static inline int pageblock_default_order(unsigned int order)
> > +static inline int pageblock_default_order(void)
> >  {
> > return MAX_ORDER-1;
> >  }
> 
> Interesting.  It has been that way since at least 3.1.
> 

/me slaps self

> It didn't break the build because pageblock_default_order() is only
> ever invoked by set_pageblock_order(), with:
> 
>   set_pageblock_order(pageblock_default_order());
> 
> and set_pageblock_order() is a macro:
> 
> #define set_pageblock_order(x)	do {} while (0)
> 
> There's yet another reason not to use macros, dammit - they hide bugs.
> 
> 
> Mel, can you have a think about this please?  Can we just kill off
> pageblock_default_order() and fold its guts into
> set_pageblock_order(void)?  Only ia64 and powerpc can define
> CONFIG_HUGETLB_PAGE_SIZE_VARIABLE.
> 

This looks reasonable to me.
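
For the !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE case, the fold would end up as
something like the following (a sketch only; whether it stays inline or
picks up the annotations of the variable-size variant is a detail to sort
out when doing the fold):

static inline void set_pageblock_order(void)
{
	/*
	 * pageblock_order is a compile-time constant here, see
	 * include/linux/pageblock-flags.h, so there is nothing to
	 * compute at runtime.
	 */
}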

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-13 Thread Mel Gorman
On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote:
> (cc linuxppc-dev@lists.ozlabs.org)
> 
> On Mon, 11 Oct 2010 15:30:22 +0100
> Mel Gorman  wrote:
> 
> > On Sat, Oct 09, 2010 at 04:57:18AM -0500, pac...@kosh.dhis.org wrote:
> > > (What a big Cc: list... scripts/get_maintainer.pl made me do it.)
> > > 
> > > This will be a long story with a weak conclusion, sorry about that, but 
> > > it's
> > > been a long bug-hunt.
> > > 
> > > With recent kernels I've seen a bug that appears to corrupt random 4-byte
> > > chunks of memory. It's not easy to reproduce. It seems to happen only once
> > > per boot, pretty quickly after userspace has gotten started, and 
> > > sometimes it
> > > doesn't happen at all.
> > > 
> > 
> > A corruption of 4 bytes could be consistent with a pointer value being
> > written to an incorrect location.
> 
> It's corruption of user memory, which is unusual.  I'd be wondering if
> there was a pre-existing bug which 6dda9d55bf545013597 has exposed -
> previously the corruption was hitting something harmless.  Something
> like a missed CPU cache writeback or invalidate operation.
> 

This seems somewhat plausible although it's hard to tell for sure. But
let's say we had the following situation in memory:

[ <MAX_ORDER_NR_PAGES> ][ <MAX_ORDER_NR_PAGES> ]
         INITRD               memmap array

initrd gets freed and someone else very early in boot gets allocated in
there. Let's further guess that the struct pages in the memmap area are
managing the page frame where the INITRD was because it makes the situation
slightly easier to trigger. As pages get freed in the memmap array, we could
reference memory where initrd used to be but the physical memory is mapped
at two virtual addresses.

CPU A                                     CPU B
Reads kernelspace virtual
(gets cache line)
                                          Writes userspace virtual
                                          (gets different cache line)
IO, writes buffer destined
for userspace (via cache line)
                                          Cache line eviction, writeback to memory

This is somewhat contrived but I can see how it might happen even on one
CPU particularly if the L1 cache is virtual and is loose about checking
physical tags.

> How sensitive/vulnerable is PPC32 to such things?
> 

I can not tell you specifically but if the above scenario is in any way
plausible, I believe it would depend on what sort of L1 cache the CPU
has. Maybe this particular version has a virtual cache with no physical
tagging and is depending on the OS not to make virtual aliasing mistakes.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Mel Gorman
On Wed, Oct 13, 2010 at 12:52:05PM -0500, pac...@kosh.dhis.org wrote:
> Mel Gorman writes:
> > 
> > On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote:
> > > 
> > > It's corruption of user memory, which is unusual.  I'd be wondering if
> > > there was a pre-existing bug which 6dda9d55bf545013597 has exposed -
> > > previously the corruption was hitting something harmless.  Something
> > > like a missed CPU cache writeback or invalidate operation.
> > > 
> > 
> > This seems somewhat plausible although it's hard to tell for sure. But
> > lets say we had the following situation in memory
> > 
> > [ <MAX_ORDER_NR_PAGES> ][ <MAX_ORDER_NR_PAGES> ]
> >          INITRD               memmap array
> 
> I don't use initrd, so this isn't exactly what happened here. But it could be
> close. Let me throw out some more information and see if it triggers any
> ideas.
> 

Ok.

> First, I tried a new test after seeing the corruption happen:
> # md5sum /sbin/e2fsck ; echo 1 > /proc/sys/vm/drop_caches ; md5sum 
> /sbin/e2fsck
> And got 2 different answers. The second answer was the correct one.
> 
> Since applying the suggested patch which changed MAX_ORDER-1 to MAX_ORDER-2,
> I've been trying to isolate exactly when the corruption happens. Since I
> don't know much about kernel code, my main method is stuffing the area full
> of printk's.
> 
> First I duplicated the affected function __free_one_page, since it's inlined
> at 2 different places, so I could apply the patch to just one of them. This
> proved that the problem is happening when called from free_one_page.
> 
> The patch which fixes (or at least covers up) the bug will only matter when
> order==MAX_ORDER-2, otherwise everything is the same. So I added a lot of
> printk's to show what's happening when order==MAX_ORDER-2. I found that, very
> repeatably, 126 such instances occur during boot, and 61 of them pass the
> page_is_buddy(higher_page, higher_buddy, order + 1) test, causing them to
> call list_add_tail.
> 
> Next, since the bug appears when this code decides to call list_add_tail,
> I made my own wrapper for list_add_tail, which allowed me to force some of
> the calls to do list_add instead. Eventually I found that of the 61 calls,
> the last one makes the difference. Allowing the first 60 calls to go through
> to list_add_tail, and switching the last one to list_add, the symptom goes
> away.
> 
> dump_stack() for that last call gave me a backtrace like this:
> [c0303e80] [c0008124] show_stack+0x4c/0x144 (unreliable)
> [c0303ec0] [c0068a84] free_one_page+0x28c/0x5b0
> [c0303f20] [c0069588] __free_pages_ok+0xf8/0x120
> [c0303f40] [c02d28c8] free_all_bootmem_core+0xf0/0x1f8
> [c0303f70] [c02d29fc] free_all_bootmem+0x2c/0x6c
> [c0303f90] [c02cc7dc] mem_init+0x70/0x2ac
> [c0303fc0] [c02c66a4] start_kernel+0x150/0x27c
> [c0303ff0] [3438] 0x3438
> 
> And this might be interesting: the PFN of the page being added in that
> critical 61st call is 130048, which exactly matches the number of available
> pages:
> 
>   free_area_init_node: node 0, pgdat c02fee6c, node_mem_map c033
> DMA zone: 1024 pages used for memmap
> DMA zone: 0 pages reserved
> DMA zone: 130048 pages, LIFO batch:31
>   Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 130048
> 
> Suspicious?
> 

A bit but I still don't know why it would cause corruption. Maybe this is still
a caching issue but the difference in timing between list_add and list_add_tail
is enough to hide the bug. It's also possible there are some registers
ioremapped after the memmap array and reading them is causing some
problem.

Andrew, what is the right thing to do here? We could flail around looking
for explanations as to why the bug causes a user buffer corruption but never
get an answer, or do we go with this patch, preferably before 2.6.36 releases?

 CUT HERE 
mm, page-allocator: Do not check the state of a non-existant buddy during free

There is a bug in commit [6dda9d55: page allocator: reduce fragmentation
in buddy allocator by adding buddies that are merging to the tail of the
free lists] that means a buddy at order MAX_ORDER is checked for
merging. A page of this order never exists so at times, an effectively
random piece of memory is being checked.
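
For clarity, the check in question in __free_one_page() has roughly this
shape once guarded (a sketch; the calculation of higher_page/higher_buddy
from the merged page is elided and the surrounding code is assumed):

	/*
	 * A page of order MAX_ORDER-1 has no buddy at order MAX_ORDER, so
	 * only probe the higher-order buddy if one can actually exist.
	 */
	if (order < MAX_ORDER - 2 &&
	    page_is_buddy(higher_page, higher_buddy, order + 1)) {
		list_add_tail(&page->lru,
			      &zone->free_area[order].free_list[migratetype]);
		goto out;
	}

	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
	zone->free_area[order].nr_free++;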

Alan Curry has reported that this is causing memory corruption in userspace
data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32). It is not clear
why this is happening. It could be a cache coherency problem where pages
mapped in both user and kernel space are getting different cache lines due
to the bad read from kernel space (http://lkml.org/lkml/2010

Re: powerpc: Move 64bit heap above 1TB on machines with 1TB segments

2009-09-22 Thread Mel Gorman
Anton Blanchard  wrote on 22/09/2009 03:52:35:

> If we are using 1TB segments and we are allowed to randomise the heap, we can
> put it above 1TB so it is backed by a 1TB segment. Otherwise the heap will be
> in the bottom 1TB which always uses 256MB segments and this may result in a
> performance penalty.
>
> This functionality is disabled when heap randomisation is turned off:
>
> echo 1 > /proc/sys/kernel/randomize_va_space
>
> which may be useful when trying to allocate the maximum amount of 16M or 16G
> pages.
>
> On a microbenchmark that repeatedly touches 32GB of memory with a stride of
> 256MB + 4kB (designed to stress 256MB segments while still mapping nicely into
> the L1 cache), we see the improvement:
>
> Force malloc to use heap all the time:
> # export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1
>
> Disable heap randomization:
> # echo 1 > /proc/sys/kernel/randomize_va_space
> # time ./test
> 12.51s
>
> Enable heap randomization:
> # echo 2 > /proc/sys/kernel/randomize_va_space
> # time ./test
> 1.70s
>
> Signed-off-by: Anton Blanchard 
> ---
>
> I've cc-ed Mel on this one. As you can see it definitely helps the base
> page size performance, but I'm a bit worried of the impact of taking away
> another of our 1TB slices.
>

Unfortunately, I am not sensitive to issues surrounding 1TB segments or how
they are currently being used. However, as this clearly helps performance
for large amounts of memory, is it worth providing an option to
libhugetlbfs to locate 16MB pages above 1TB when they are otherwise being
unused?
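
As an aside, the access pattern described in the changelog (32GB touched
with a 256MB + 4kB stride, with the heap forced via the MALLOC_* settings
above) can be approximated with a userspace loop along these lines; the
sizes come from the changelog, the pass count is an arbitrary assumption:

#include <stdlib.h>

int main(void)
{
	size_t size = 32UL << 30;		/* 32GB */
	size_t stride = (256UL << 20) + 4096;	/* 256MB + 4kB */
	volatile char *p = malloc(size);
	size_t off;
	int pass;

	if (!p)
		return 1;
	for (pass = 0; pass < 64; pass++)
		for (off = 0; off < size; off += stride)
			p[off]++;	/* touch roughly one byte per 256MB segment */
	return 0;
}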

> Index: linux.trees.git/arch/powerpc/kernel/process.c
> ===
> --- linux.trees.git.orig/arch/powerpc/kernel/process.c   2009-09-17
> 15:47:46.0 +1000
> +++ linux.trees.git/arch/powerpc/kernel/process.c   2009-09-17 15:
> 49:11.0 +1000
> @@ -1165,7 +1165,22 @@ static inline unsigned long brk_rnd(void
>
>  unsigned long arch_randomize_brk(struct mm_struct *mm)
>  {
> -   unsigned long ret = PAGE_ALIGN(mm->brk + brk_rnd());
> +   unsigned long base = mm->brk;
> +   unsigned long ret;
> +
> +#ifdef CONFIG_PPC64
> +   /*
> +* If we are using 1TB segments and we are allowed to randomise
> +* the heap, we can put it above 1TB so it is backed by a 1TB
> +* segment. Otherwise the heap will be in the bottom 1TB
> +* which always uses 256MB segments and this may result in a
> +* performance penalty.
> +*/
> +   if (!is_32bit_task() && (mmu_highuser_ssize == MMU_SEGSIZE_1T))
> +  base = max_t(unsigned long, mm->brk, 1UL << SID_SHIFT_1T);
> +#endif
> +
> +   ret = PAGE_ALIGN(base + brk_rnd());
>
> if (ret < mm->brk)
>return mm->brk;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: powerpc: Move 64bit heap above 1TB on machines with 1TB segments

2009-09-23 Thread Mel Gorman
Benjamin Herrenschmidt  wrote on 22/09/2009 22:08:22:

> > Unfortunately, I am not sensitive to issues surrounding 1TB segments or how
> > they are currently being used. However, as this clearly helps performance
> > for large amounts of memory, is it worth providing an option to
> > libhugetlbfs to locate 16MB pages above 1TB when they are otherwise being
> > unused?
>
> AFAIK, that is already the case, at least the kernel will hand out pages
> above 1T preferentially iirc.
>
> There were talks about making huge pages below 1T not even come up
> untily you ask for them with MAP_FIXED, dunno where that went.
>

Confirmed, huge pages are already being placed above the 1TB mark. I hadn't
given thought previously to where hugepages were being placed except within
segment boundaries. The patch works as advertised and doesn't appear to
collide with huge pages in any obvious way as far as I can tell.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers

2009-10-02 Thread Mel Gorman
oid *v);
>  extern struct memory_block *find_memory_block(struct mem_section *);
>  #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>  enum mem_add_context { BOOT, HOTPLUG };
> Index: b/mm/page_alloc.c
> ===
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -48,6 +48,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -4985,23 +4986,55 @@ void set_pageblock_flags_group(struct pa
>  int set_migratetype_isolate(struct page *page)
>  {
>   struct zone *zone;
> - unsigned long flags;
> + unsigned long flags, pfn, iter;
> + long immobile = 0;

So, the count in the structure is unsigned long, but long here. Why the
difference in types?

> + struct memory_isolate_notify arg;
> + int notifier_ret;
>   int ret = -EBUSY;
>   int zone_idx;
>  
>   zone = page_zone(page);
>   zone_idx = zone_idx(zone);
> +
> + pfn = page_to_pfn(page);
> + arg.start_addr = (unsigned long)page_address(page);
> + arg.nr_pages = pageblock_nr_pages;
> + arg.pages_found = 0;
> +
>   spin_lock_irqsave(&zone->lock, flags);
>   /*
>* In future, more migrate types will be able to be isolation target.
>*/
> - if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE &&
> - zone_idx != ZONE_MOVABLE)
> - goto out;
> - set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> - move_freepages_block(zone, page, MIGRATE_ISOLATE);
> - ret = 0;
> -out:
> + do {
> + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE &&
> + zone_idx == ZONE_MOVABLE) {

So, this condition requires the zone be MOVABLE and the migrate type
be movable. That prevents MIGRATE_RESERVE regions in ZONE_MOVABLE being
off-lined even though they can likely be off-lined. It also prevents
MIGRATE_MOVABLE sections in other zones being off-lined.

Did you mean || here instead of && ?

Might want to expand the comment explaining this condition instead of
leaving it in the old location which is confusing.
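
In other words, something like this (a sketch only, reusing the names from
the hunk above):

	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
	    zone_idx == ZONE_MOVABLE) {
		ret = 0;
		break;
	}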

> + ret = 0;
> + break;
> + }

Why do you wrap all this in a do {} while(0)  instead of preserving the
out: label and using goto?

> +
> + /*
> +  * If all of the pages in a zone are used by a balloon,
> +  * the range can be still be isolated.  The balloon will
> +  * free these pages from the memory notifier chain.
> +  */
> + notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
> + notifier_ret = notifier_to_errno(ret);
> + if (notifier_ret || !arg.pages_found)
> + break;
> +
> + for (iter = pfn; iter < (pfn + pageblock_nr_pages); iter++)
> + if (page_count(pfn_to_page(iter)))
> +         immobile++;
> +
> + if (arg.pages_found == immobile)

and here you compare a signed with an unsigned type. Probably harmless
but why do it?

> + ret = 0;
> + } while (0);
> +

So the out label would go here and you'd get rid of the do {} while(0)
loop.

> + if (!ret) {
> + set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> + move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + }
> +
>   spin_unlock_irqrestore(&zone->lock, flags);
>   if (!ret)
>   drain_all_pages();
> 

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

