Re: [PATCH 1/8] THP: Use real address for NUMA policy

2013-08-27 Thread Alex Thorlton
> Here's more up-to-date version: https://lkml.org/lkml/2012/8/20/337

These don't seem to give us a noticeable performance change either:

With THP:

real    22m34.279s
user    10797m35.984s
sys     39m18.188s

Without THP:

real    4m48.957s
user    2118m23.208s
sys     113m12.740s

Looks like we got a few minutes faster in the THP case, but it's still
significantly slower, and that could just be a fluke result; we're
still seeing roughly a 5x performance degradation.

I talked with one of our performance/benchmarking experts last week and
he's done a bit more research into the actual problem here, so I've got
a bit more information:

The real performance hit, based on our testing, seems to be coming from
the increased latency that comes into play on large NUMA systems when a
process has to go off-node to read from/write to memory.

To give an extreme example, say we have a 16-node system with 8 cores
per node.  If we have a job that shares a 2MB data structure between 128
threads, with THP on, the first thread to touch the structure will
allocate all 2MB of space for that structure in a single 2MB page, local
to its node.  This means that all the memory accesses from the 120
threads on the other 15 nodes will be remote accesses.  With THP off,
each thread could locally allocate a number of 4K pages sufficient to
hold the chunk of the structure on which it needs to work, significantly
reducing the number of remote accesses that each thread will need to
perform.

So, with that in mind, do we agree that a per-process tunable (or
something similar) to control THP seems like a reasonable method to
handle this issue?

Just want to confirm that everyone likes this approach before moving
forward with another revision of the patch.  I'm currently in favor of
moving this to a per-mm tunable, since that seems to make more sense
when it comes to threaded jobs. Also, a decent chunk of the code I've
already written can be reused with this approach, and prctl will still
be an appropriate place from which to control the behavior. Andrew
Morton suggested possibly controlling this through the ELF header, but
I'm going to lean towards the per-mm route unless anyone has a major
objection to it.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 00/10] split page table lock for PMD tables

2013-10-04 Thread Alex Thorlton
Kirill,

I've pasted in my results for 512 cores below.  Things are looking 
really good here.  I don't have a test for HUGETLBFS, but if you want to
pass me the one you used, I can run that too.  I suppose I could write
one, but why reinvent the wheel? :)

Sorry for the delay on these results.  I hit some strange issues with
running thp_memscale on systems with either of the following
combinations of configuration options set:

[thp off]
HUGETLBFS=y
HUGETLB_PAGE=y
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

[thp on or off]
HUGETLBFS=n
HUGETLB_PAGE=n
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

I'm getting segfaults intermittently, as well as some weird RCU sched
errors.  This happens in vanilla 3.12-rc2, so it doesn't have anything
to do with your patches, but I thought I'd let you know.  There didn't
used to be any issues with this test, so I think there's a subtle kernel
bug here.  That's, of course, an entirely separate issue though.

As far as these patches go, I think everything looks good (save for the
bit of discussion you were having with Andrew earlier, which I think
you've worked out).  My testing shows that the page fault rates are
actually better on this threaded test than in the non-threaded case!

- Alex

On Fri, Sep 27, 2013 at 04:16:17PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massively threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
> 
> This patchset is based on work by Naoya Horiguchi.
> 
> Please review and consider applying.
> 
> Changes:
>  v4:
>   - convert hugetlb to new locking;
>  v3:
>   - fix USE_SPLIT_PMD_PTLOCKS;
>   - fix warning in fs/proc/task_mmu.c;
>  v2:
>   - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>   - s/huge_pmd_lock/pmd_lock/g;
>   - assume pgtable_pmd_page_ctor() can fail;
>   - fix format line in task_mem() for VmPTE;
> 
> THP off, v3.12-rc2:
> ---
> 
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
> 
>     1037072.835207 task-clock                #   57.426 CPUs utilized          ( +-  3.59% )
>             95,093 context-switches          #    0.092 K/sec                  ( +-  3.93% )
>                140 cpu-migrations            #    0.000 K/sec                  ( +-  5.28% )
>         10,000,550 page-faults               #    0.010 M/sec                  ( +-  0.00% )
>  2,455,210,400,261 cycles                    #    2.367 GHz                    ( +-  3.62% ) [83.33%]
>  2,429,281,882,056 stalled-cycles-frontend   #   98.94% frontend cycles idle   ( +-  3.67% ) [83.33%]
>  1,975,960,019,659 stalled-cycles-backend    #   80.48% backend  cycles idle   ( +-  3.88% ) [66.68%]
>     46,503,296,013 instructions              #    0.02  insns per cycle
>                                              #   52.24  stalled cycles per insn  ( +-  3.21% ) [83.34%]
>      9,278,997,542 branches                  #    8.947 M/sec                  ( +-  4.00% ) [83.34%]
>         89,881,640 branch-misses             #    0.97% of all branches        ( +-  1.17% ) [83.33%]
> 
>       18.059261877 seconds time elapsed                                        ( +-  2.65% )
> 
> THP on, v3.12-rc2:
> --
> 
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
> 
>     3114745.395974 task-clock                #   73.875 CPUs utilized          ( +-  1.84% )
>            267,356 context-switches          #    0.086 K/sec                  ( +-  1.84% )
>                 99 cpu-migrations            #    0.000 K/sec                  ( +-  1.40% )
>             58,313 page-faults               #    0.019 K/sec                  ( +-  0.28% )
>  7,416,635,817,510 cycles                    #    2.381 GHz                    ( +-  1.83% ) [83.33%]
>  7,342,619,196,993 stalled-cycles-frontend   #   99.00% frontend cycles idle   ( +-  1.88% ) [83.33%]
>  6,267,671,641,967 stalled-cycles-backend    #   84.51% backend  cycles idle   ( +-  2.03% ) [66.67%]
>    117,819,935,165 instructions              #    0.02  insns per cycle
>                                              #   62.32  stalled cycles per insn  ( +-  4.39% ) [83.34%]
>     28,899,314,777 branches                  #    9.278 M/sec                  ( +-  4.48% ) [83.34%]
>         71,787,032 branch-misses             #    0.25% of all branches        ( +-  1.03% ) [83.33%]
> 
>      42.162306788 seconds time elapsed                                         ( +-  1.73% )


Re: [PATCHv4 00/10] split page table lock for PMD tables

2013-10-04 Thread Alex Thorlton
On Fri, Oct 04, 2013 at 11:26:02PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > Kirill,
> > 
> > I've pasted in my results for 512 cores below.  Things are looking 
> > really good here.  I don't have a test for HUGETLBFS, but if you want to
> > pass me the one you used, I can run that too.  I suppose I could write
> > one, but why reinvent the wheel? :)
> 
> Patch below.

Good deal, thanks.  I'll get some test results put up soon.

> 
> > Sorry for the delay on these results.  I hit some strange issues with
> > running thp_memscale on systems with either of the following
> > combinations of configuration options set:
> > 
> > [thp off]
> > HUGETLBFS=y
> > HUGETLB_PAGE=y
> > NUMA_BALANCING=y
> > NUMA_BALANCING_DEFAULT_ENABLED=y
> > 
> > [thp on or off]
> > HUGETLBFS=n
> > HUGETLB_PAGE=n
> > NUMA_BALANCING=y
> > NUMA_BALANCING_DEFAULT_ENABLED=y
> > 
> > I'm getting segfaults intermittently, as well as some weird RCU sched
> > errors.  This happens in vanilla 3.12-rc2, so it doesn't have anything
> > to do with your patches, but I thought I'd let you know.  There didn't
> > used to be any issues with this test, so I think there's a subtle kernel
> > bug here.  That's, of course, an entirely separate issue though.
> 
> I'll take a look next week, if nobody does it before.

I'm starting a bisect now.  Not sure how long it'll take, but I'll keep
you posted.

> 
> > 
> > As far as these patches go, I think everything looks good (save for the
> > bit of discussion you were having with Andrew earlier, which I think
> > you've worked out).  My testing shows that the page fault rates are
> > actually better on this threaded test than in the non-threaded case!
> > 
> > - Alex
> > 
> > THP on, v3.12-rc2:
> > --
> > 
> >  Performance counter stats for './thp_memscale -C 0 -m 0 -c 512 -b 512m' (5 runs):
> > 
> >      568668865.944994 task-clock               #  528.547 CPUs utilized          ( +-  0.21% ) [100.00%]
> >             1,491,589 context-switches         #    0.000 M/sec                  ( +-  0.25% ) [100.00%]
> >                 1,085 CPU-migrations           #    0.000 M/sec                  ( +-  1.80% ) [100.00%]
> >               400,822 page-faults              #    0.000 M/sec                  ( +-  0.41% )
> > 1,306,612,476,049,478 cycles                   #    2.298 GHz                    ( +-  0.23% ) [100.00%]
> > 1,277,211,694,318,724 stalled-cycles-frontend  #   97.75% frontend cycles idle   ( +-  0.21% ) [100.00%]
> > 1,163,736,844,232,064 stalled-cycles-backend   #   89.07% backend  cycles idle   ( +-  0.20% ) [100.00%]
> >    53,855,178,678,230 instructions             #    0.04  insns per cycle
> >                                                #   23.72  stalled cycles per insn  ( +-  1.15% ) [100.00%]
> >    21,041,661,816,782 branches                 #   37.002 M/sec                  ( +-  0.64% ) [100.00%]
> >           606,665,092 branch-misses            #    0.00% of all branches        ( +-  0.63% )
> > 
> >        1075.909782795 seconds time elapsed                                       ( +-  0.21% )
> >
> > THP on, patched:
> > 
> > 
> >  Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
> > 
> >    15836198.490485 task-clock                #  533.304 CPUs utilized          ( +-  0.95% ) [100.00%]
> >            127,507 context-switches          #    0.000 M/sec                  ( +-  1.65% ) [100.00%]
> >              1,223 CPU-migrations            #    0.000 M/sec                  ( +-  3.23% ) [100.00%]
> >            302,080 page-faults               #    0.000 M/sec                  ( +-  6.88% )
> > 18,925,875,973,975 cycles                    #    1.195 GHz                    ( +-  0.43% ) [100.00%]
> > 18,325,469,464,007 stalled-cycles-frontend   #   96.83% frontend cycles idle   ( +-  0.44% ) [100.00%]
> > 17,522,272,147,141 stalled-cycles-backend    #   92.58% backend  cycles idle   ( +-  0.49% ) [100.00%]
> >  2,686,490,067,197 instructions              #    0.14  insns per cycle
> >                                              #    6.82  stalled cycles per insn  ( +-  2.16% ) [100.00%]
> >    944,712,646,402 branche

Re: [PATCHv2 0/9] split page table lock for PMD tables

2013-09-19 Thread Alex Thorlton
On Mon, Sep 16, 2013 at 02:25:31PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massively threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
> 
> This patchset is based on work by Naoya Horiguchi.
> 
> Changes:
>  v2:
>   - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>   - s/huge_pmd_lock/pmd_lock/g;
>   - assume pgtable_pmd_page_ctor() can fail;
>   - fix format line in task_mem() for VmPTE;
> 
> Benchmark (from Alex): 
> ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
> Run on 4 socket Westmere with 128 GiB of RAM.

Kirill,

I'm hitting some performance issues with these patches on our larger
machines (>=128 cores/256 threads).  I've managed to livelock larger
systems with one of our tests (I'll publish this one soon), and I'm
actually seeing a performance hit on some of the smaller ones.  I'm
currently collecting some results to show the problems I'm hitting,
and trying to research what's causing the livelock.  For now I just
wanted to let you know that I'm seeing some issues.  I'll be in touch
with more details.

Sorry for the delayed response, I wanted to be really sure that I was
seeing problems before I put up a red flag.

- Alex


Re: [PATCH 1/8] THP: Use real address for NUMA policy

2013-09-04 Thread Alex Thorlton
On Tue, Aug 27, 2013 at 12:01:01PM -0500, Robin Holt wrote:
> Alex,
> 
> Although the explanation seems plausible, have you verified this is
> actually possible?  You could make a simple pthread test case which
> allocates a getpagesize() *  area, prints its
> address and then each thread migrate and reference their page.  Have
> the task then sleep() before exit.  Look at the physical
> address space with dlook for those virtual addresses in both the THP
> and non-THP cases.
> 
> Thanks,
> Robin

Robin,

I tweaked one of our other tests to behave pretty much exactly as I
described, and I can see a very significant increase in performance with
THP turned off.  The test behaves as follows:

- malloc a large array
- Spawn a specified number of threads
- Have each thread touch small, evenly spaced chunks of the array (e.g.
  for 128 threads, the array is divided into 128 chunks, and each thread
  touches 1/128th of each chunk, dividing the array into 16,384 pieces)

With THP off, the majority of each thread's pages are node local.  With
THP on, most of the pages end up as THPs on the first thread's nodes,
since it is touching chunks that are close enough together to be
collapsed into THPs which will, of course, remain on the first node for
the duration of the test.

Here are some timings for 128 threads, allocating a total of 64gb:

THP on:

real    1m6.394s
user    16m1.160s
sys     75m25.232s

THP off:

real    0m35.251s
user    26m37.316s
sys     3m28.472s

The performance hit here isn't as severe as shown with the SPEC workload
that we originally used, but it still appears to consistently take about
twice as long with THP enabled.


Re: [PATCH 1/8] THP: Use real address for NUMA policy

2013-09-04 Thread Alex Thorlton
> Robin,
> 
> I tweaked one of our other tests to behave pretty much exactly as I
> - malloc a large array
> - Spawn a specified number of threads
> - Have each thread touch small, evenly spaced chunks of the array (e.g.
>   for 128 threads, the array is divided into 128 chunks, and each thread
>   touches 1/128th of each chunk, dividing the array into 16,384 pieces)

Forgot to mention that the threads don't touch their chunks of memory
concurrently, i.e. thread 2 has to wait for thread 1 to finish first.
This is important to note, since the pages won't all get stuck on the
first node without this behavior.


Re: [PATCH 2/2] thp: support split page table lock

2013-09-06 Thread Alex Thorlton
On Thu, Sep 05, 2013 at 05:27:46PM -0400, Naoya Horiguchi wrote:
> Thp related code also uses per process mm->page_table_lock now.
> So making it fine-grained can provide better performance.
> 
> This patch makes thp support split page table lock by using page->ptl
> of the pages storing "pmd_trans_huge" pmds.
> 
> Some functions like pmd_trans_huge_lock() and page_check_address_pmd()
> are expected by their caller to pass back the pointer of ptl, so this
> patch adds to those functions new arguments for that. Rather than that,
> this patch gives only straightforward replacement.
> 
> ChangeLog v3:
>  - fixed argument of huge_pmd_lockptr() in copy_huge_pmd()
>  - added missing declaration of ptl in do_huge_pmd_anonymous_page()

I've applied these and tested them using the same tests program that I
used when I was working on the same issue, and I'm running into some
bugs.  Here's a stack trace:

general protection fault:  [#1] SMP 
Modules linked in:
CPU: 268 PID: 32381 Comm: memscale Not tainted
3.11.0-medusa-03121-g757f8ca #184
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS
01/15/2013
task: 880fbdd82180 ti: 880fc0c5a000 task.ti: 880fc0c5a000
RIP: 0010:[]  []
pgtable_trans_huge_withdraw+0x38/0x60
RSP: 0018:880fc0c5bc88  EFLAGS: 00010297
RAX: ea17cebe8838 RBX: 0015309bd000 RCX: ea01f623b028
RDX: dead00100100 RSI: 8dcf77d84c30 RDI: 880fbda67580
RBP: 880fc0c5bc88 R08: 0013 R09: 00014da0
R10: 880fc0c5bc88 R11: 888f7efda000 R12: 8dcf77d84c30
R13: 880fc0c5bdf8 R14: 85cf401ff067 R15: 8b4de5fabff8
FS:  () GS:880fffd8()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7768b0b8 CR3: 01a0b000 CR4: 000407e0
Stack:
 880fc0c5bcc8 810f7643 880fc0c5bcc8 810d8297
 ea1456237510 7fc7b0e0  7fc7b0c0
 880fc0c5bda8 810d85ba 880fc0c5bd48 880fc0c5bd68
Call Trace:
 [] zap_huge_pmd+0x4c/0x101
 [] ? tlb_flush_mmu+0x58/0x75
 [] unmap_single_vma+0x306/0x7d6
 [] unmap_vmas+0x4f/0x82
 [] exit_mmap+0x8b/0x113
 [] ? __delayacct_add_tsk+0x170/0x182
 [] mmput+0x3e/0xc4
 [] do_exit+0x380/0x907
 [] ? vfs_write+0x149/0x1a3
 [] do_group_exit+0x72/0x9b
 [] SyS_exit_group+0x12/0x16
 [] system_call_fastpath+0x16/0x1b
Code: 51 20 48 8d 41 20 48 39 c2 75 0d 48 c7 87 28 03 00 00 00 00 00 00
eb 36 48 8d 42 e0 48 89 87 28 03 00 00 48 8b 51 20 48 8b 41 28 <48> 89
42 08 48 89 10 48 ba 00 01 10 00 00 00 ad de 48 b8 00 02 
RIP  [] pgtable_trans_huge_withdraw+0x38/0x60
 RSP 
---[ end trace e5413b388b6ea448 ]---
Fixing recursive fault but reboot is needed!
general protection fault:  [#2] SMP 
Modules linked in:
CPU: 268 PID: 1722 Comm: kworker/268:1 Tainted: G  D
3.11.0-medusa-03121-g757f8ca #184
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS
01/15/2013
Workqueue: events vmstat_update
task: 880fc1a74280 ti: 880fc1a76000 task.ti: 880fc1a76000
RIP: 0010:[]  []
free_pcppages_bulk+0x97/0x329
RSP: 0018:880fc1a77c98  EFLAGS: 00010082
RAX: 880fffd94d68 RBX: dead002001e0 RCX: 880fffd94d50
RDX: 880fffd94d68 RSI: 001f RDI: 888f7efdac68
RBP: 880fc1a77cf8 R08: 0400 R09: 81a8bf00
R10: 884f7efdac00 R11: 81009bae R12: dead00200200
R13: 888f7efdac00 R14: 001f R15: 
FS:  () GS:880fffd8()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7768b0b8 CR3: 01a0b000 CR4: 000407e0
Stack:
 880fc1a77ce8 880fffd94d68 0010 880fffd94d50
 001ff9276a68 880fffd94d60  001f
 880fffd94d50 0292 880fc1a77d38 880fffd95d05
Call Trace:
 [] drain_zone_pages+0x33/0x42
 [] refresh_cpu_vm_stats+0xcc/0x11e
 [] vmstat_update+0x11/0x43
 [] process_one_work+0x260/0x389
 [] worker_thread+0x1e2/0x332
 [] ? process_one_work+0x389/0x389
 [] kthread+0xb3/0xbd
 [] ? process_one_work+0x389/0x389
 [] ? kthread_freezable_should_stop+0x5b/0x5b
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_freezable_should_stop+0x5b/0x5b
Code: 48 89 55 c8 48 39 14 08 74 ce 41 83 fe 03 44 0f 44 75 c4 48 83 c2
08 48 89 45 b0 48 89 55 a8 48 8b 45 a8 4c 8b 20 49 8d 5c 24 e0 <48> 8b
53 20 48 8b 43 28 48 89 42 08 48 89 10 48 ba 00 01 10 00 
RIP  [] free_pcppages_bulk+0x97/0x329
 RSP 
---[ end trace e5413b388b6ea449 ]---
BUG: unable to handle kernel paging request at ffd8
IP: [] kthread_data+0xb/0x11
PGD 1a0c067 PUD 1a0e067 PMD 0 
Oops:  [#3] SMP 
Modules linked in:
CPU: 268 PID: 1722 Comm: kworker/268:1 Tainted: G  D
3.11.0-medusa-03121-g757f8ca #184
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS
01/15/2013
task: 880fc1a74280 ti: 880fc1a76000 task.ti: 880fc1a76000
RIP: 0010:[]  []
kthread_data+0xb/0x11
RSP: 0

Re: [PATCH 2/2] thp: support split page table lock

2013-09-09 Thread Alex Thorlton
On Mon, Sep 09, 2013 at 10:34:45AM +0800, Daniel J Blueman wrote:
> On Saturday, 7 September 2013 02:10:02 UTC+8, Naoya Horiguchi  wrote:
> >Hi Alex,
> >
> >On Fri, Sep 06, 2013 at 11:04:23AM -0500, Alex Thorlton wrote:
> >> On Thu, Sep 05, 2013 at 05:27:46PM -0400, Naoya Horiguchi wrote:
> >> > Thp related code also uses per process mm->page_table_lock now.
> >> > So making it fine-grained can provide better performance.
> >> >
> >> > This patch makes thp support split page table lock by using page->ptl
> >> > of the pages storing "pmd_trans_huge" pmds.
> >> >
> >> > Some functions like pmd_trans_huge_lock() and
> page_check_address_pmd()
> >> > are expected by their caller to pass back the pointer of ptl, so this
> >> > patch adds to those functions new arguments for that. Rather than
> that,
> >> > this patch gives only straightforward replacement.
> >> >
> >> > ChangeLog v3:
> >> >  - fixed argument of huge_pmd_lockptr() in copy_huge_pmd()
> >> >  - added missing declaration of ptl in do_huge_pmd_anonymous_page()
> >>
> >> I've applied these and tested them using the same tests program that I
> >> used when I was working on the same issue, and I'm running into some
> >> bugs.  Here's a stack trace:
> >
> >Thank you for helping testing. This bug is new to me.
> 
> With 3.11, this patch series and CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
> I consistently hit the same failure when exiting one of my
> stress-testers [1] when using eg 24 cores.
> 
> Doesn't happen with 8 cores, so likely needs enough virtual memory
> to use multiple split locks. Otherwise, this is very promising work!

Daniel,

Hillf Danton (dhi...@gmail.com) suggested putting the big
page_table_lock back into the two functions seen below.  I re-tested
with this change and it seems to resolve the issue.  

- Alex

--- a/mm/pgtable-generic.c  Sat Sep  7 15:17:52 2013
+++ b/mm/pgtable-generic.c  Sat Sep  7 15:20:28 2013
@@ -127,12 +127,14 @@ void pmdp_splitting_flush(struct vm_area
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
 {
+   spin_lock(&mm->page_table_lock);
/* FIFO */
if (!mm->pmd_huge_pte)
INIT_LIST_HEAD(&pgtable->lru);
else
list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
mm->pmd_huge_pte = pgtable;
+   spin_unlock(&mm->page_table_lock);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
@@ -144,6 +146,7 @@ pgtable_t pgtable_trans_huge_withdraw(st
 {
pgtable_t pgtable;

+   spin_lock(&mm->page_table_lock);
/* FIFO */
pgtable = mm->pmd_huge_pte;
if (list_empty(&pgtable->lru))
@@ -153,6 +156,7 @@ pgtable_t pgtable_trans_huge_withdraw(st
  struct page, lru);
list_del(&pgtable->lru);
}
+   spin_unlock(&mm->page_table_lock);
return pgtable;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--


Re: [PATCH 1/8] THP: Use real address for NUMA policy

2013-09-09 Thread Alex Thorlton
On Thu, Sep 05, 2013 at 01:15:10PM +0200, Ingo Molnar wrote:
> 
> * Alex Thorlton  wrote:
> 
> > > Robin,
> > > 
> > > I tweaked one of our other tests to behave pretty much exactly as I
> > > - malloc a large array
> > > - Spawn a specified number of threads
> > > - Have each thread touch small, evenly spaced chunks of the array (e.g.
> > >   for 128 threads, the array is divided into 128 chunks, and each thread
> > >   touches 1/128th of each chunk, dividing the array into 16,384 pieces)
> > 
> > Forgot to mention that the threads don't touch their chunks of memory
> > concurrently, i.e. thread 2 has to wait for thread 1 to finish first.
> > This is important to note, since the pages won't all get stuck on the
> > first node without this behavior.
> 
> Could you post the testcase please?
> 
> Thanks,
> 
>   Ingo

Sorry for the delay here, had to make sure that everything in my tests
was okay to push out to the public.  Here's a pointer to the test I
wrote:

ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz

Everything to compile the test should be there (just run make in the
thp_pthread directory).  To run the test use something like:

time ./thp_pthread -C 0 -m 0 -c  -b 

I ran:

time ./thp_pthread -C 0 -m 0 -c 128 -b 128g

On a 256 core machine, with ~500gb of memory and got these results:

THP off:

real    0m57.797s
user    46m22.156s
sys     6m14.220s

THP on:

real    1m36.906s
user    0m2.612s
sys     143m13.764s

I snagged some code from another test we use, so I can't vouch for the
usefulness/accuracy of all the output (actually, I know some of it is
wrong).  I've mainly been looking at the total run time.

Don't want to bloat this e-mail up with too many test results, but I
found this one really interesting.  Same machine, using all the cores,
with the same amount of memory.  This means that each cpu is actually
doing *less* work, since the chunk we reserve gets divided up evenly
amongst the cpus:

time ./thp_pthread -C 0 -m 0 -c 256 -b 128g

THP off:

real    1m1.028s
user    104m58.448s
sys     8m52.908s

THP on:

real    2m26.072s
user    60m39.404s
sys     337m10.072s

Seems that the test scales really well in the THP off case, but, once
again, with THP on, we really see the performance start to degrade.

I'm planning to start investigating possible ways to split up THPs, if
we detect that that majority of the references to a THP are off-node.
I've heard some horror stories about migrating pages in this situation
(i.e., process switches cpu and then all the pages follow it), but I
think we might be able to get some better results if we can cleverly
determine an appropriate time to split up pages.  I've heard a bit of
talk about doing something similar to this from a few people, but
haven't seen any code/test results.  If anybody has any input on that
topic, it would be greatly appreciated.

- Alex


Re: [PATCHv4 00/10] split page table lock for PMD tables

2013-10-08 Thread Alex Thorlton
On Mon, Oct 07, 2013 at 12:48:20PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > > Sorry for the delay on these results.  I hit some strange issues with
> > > > running thp_memscale on systems with either of the following
> > > > combinations of configuration options set:
> > > > 
> > > > [thp off]
> > > > HUGETLBFS=y
> > > > HUGETLB_PAGE=y
> > > > NUMA_BALANCING=y
> > > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > > > 
> > > > [thp on or off]
> > > > HUGETLBFS=n
> > > > HUGETLB_PAGE=n
> > > > NUMA_BALANCING=y
> > > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > > > 
> > > > I'm getting segfaults intermittently, as well as some weird RCU sched
> > > > errors.  This happens in vanilla 3.12-rc2, so it doesn't have anything
> > > > to do with your patches, but I thought I'd let you know.  There didn't
> > > > used to be any issues with this test, so I think there's a subtle kernel
> > > > bug here.  That's, of course, an entirely separate issue though.
> > > 
> > > I'll take a look next week, if nobody does it before.
> > 
> > I'm starting a bisect now.  Not sure how long it'll take, but I'll keep
> > you posted.
> 
> I don't see the issue. Could you share your kernel config?

I put my kernel config up on ftp at:

ftp://shell.sgi.com/collect/atconfig/config_bug

I've been investigating the issue today and the smallest run I've seen
the problem on was with 128 threads, so this might not be something that
most people will hit.

With the config I've shared here the problem appears to only be
intermittent at 128 threads.  It happened on every run of the test when
I ran it with 512 threads.

Just for something to compare to, here's a config that seems to behave
just fine for any number of threads:

ftp://shell.sgi.com/collect/atconfig/config_good

It looks like this is a problem all the way back to the current 3.8
stable tree.  I'm still working on tracing back to a kernel where this
problem doesn't show up.

- Alex


[RFC PATCH] Change THP code to use pud_page(pud)->ptl lock page_table_lock

2013-08-30 Thread Alex Thorlton
This patch changes out the page_table_lock for the pud_page ptl in the
THP fault path; pretty self-explanatory.  I got lazy and commented out
the spinlock assertion in follow_trans_huge_pmd instead of digging up
the pud_page ptl in this function.  This is just a proof of concept, so
I didn't feel that it was too important to keep around for now. 

---
 mm/huge_memory.c | 4 ++--
 mm/memory.c  | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..d3b34e2f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1240,10 +1240,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
   pmd_t *pmd,
   unsigned int flags)
 {
-   struct mm_struct *mm = vma->vm_mm;
+// struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
 
-   assert_spin_locked(&mm->page_table_lock);
+// assert_spin_locked(&mm->page_table_lock);
 
if (flags & FOLL_WRITE && !pmd_write(*pmd))
goto out;
diff --git a/mm/memory.c b/mm/memory.c
index af84bc0..5b4e910 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1527,15 +1527,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
-   spin_lock(&mm->page_table_lock);
+   spin_lock(&pud_page(*pud)->ptl);
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
-   spin_unlock(&mm->page_table_lock);
+   spin_unlock(&pud_page(*pud)->ptl);
wait_split_huge_page(vma->anon_vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
 pmd, flags);
-   spin_unlock(&mm->page_table_lock);
+   spin_unlock(&pud_page(*pud)->ptl);
*page_mask = HPAGE_PMD_NR - 1;
goto out;
}
-- 
1.7.12.4



[RFC PATCH] Increase locking granularity in THP page fault code

2013-08-30 Thread Alex Thorlton
We have a threaded page fault scaling test that is performing very
poorly due to the use of the page_table_lock in the THP fault code
path. By replacing the page_table_lock with the ptl on the pud_page
(need CONFIG_SPLIT_PTLOCKS for this to work), we're able to increase
the granularity of the locking here by a factor of 512, i.e. instead
of locking all 256TB of addressable memory, we only lock the 512GB that
is handled by a single pud_page.

The test I'm running creates 512 threads, pins each thread to a cpu, has
the threads allocate 512mb of memory each and then touch the first byte
of each 4k chunk of the allocated memory.  Here are the timings from
this test on 3.11-rc7, clean, THP on:

real    22m50.904s
user    15m26.072s
sys     11430m19.120s

And here are the timings with my modified kernel, THP on:

real    0m37.018s
user    21m39.164s
sys     155m9.132s

As you can see, we get a huge performance boost by locking a more
targeted chunk of memory instead of locking the whole page table.  At
this point, I'm comfortable saying that there are obvious benefits to
increasing the granularity of the locking, but I'm not sure that I've
done this in the best way possible.  Mainly, I'm not positive that using
the pud_page lock actually protects everything that we're concerned
about locking here.  I'm hoping that everyone can provide some input
on whether or not this seems like a reasonable move to make and, if so,
confirm that I've locked things appropriately.

As a side note, we still have some pretty significant scaling issues
with this test, both with THP on, and off.  I'm cleaning up the locking
here first as this is causing the biggest performance hit.

Alex Thorlton (1):
  Change THP code to use pud_page(pud)->ptl lock instead of
page_table_lock

 mm/huge_memory.c | 4 ++--
 mm/memory.c  | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

-- 
1.7.12.4



Re: [PATCHv2 0/9] split page table lock for PMD tables

2013-09-26 Thread Alex Thorlton
> Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> 
> HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> HUGETLBFS is enabled.

Ah, that's got it!  I double checked my config a million times to
make sure that I wasn't going crazy, but I must've missed that. With
that fixed, it's performing exactly how I thought it should.  Looking
great to me!

Sorry for the confusion!

- Alex


Re: [PATCHv2 0/9] split page table lock for PMD tables

2013-09-26 Thread Alex Thorlton
On Fri, Sep 27, 2013 at 12:38:07AM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> > > 
> > > HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> > > HUGETLBFS is enabled.
> > 
> > Ah, that's got it!  I double checked my config a million times to
> > make sure that I wasn't going crazy, but I must've missed that. With
> > that fixed, it's performing exactly how I thought it should.  Looking
> > great to me!
> 
> Can I use your Reviewed-by?

Sure, no problem.

- Alex


Re: [PATCHv2 0/9] split page table lock for PMD tables

2013-09-26 Thread Alex Thorlton
On Fri, Sep 27, 2013 at 12:42:46AM +0300, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > Alex Thorlton wrote:
> > > > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> > > > 
> > > > HUGETLBFS hasn't converted to new locking and we disable split pmd lock 
> > > > if
> > > > HUGETLBFS is enabled.
> > > 
> > > Ah, that's got it!  I double checked my config a million times to
> > > make sure that I wasn't going crazy, but I must've missed that. With
> > > that fixed, it's performing exactly how I thought it should.  Looking
> > > great to me!
> > 
> > Can I use your Reviewed-by?
> 
> s/Reviewed-by/Tested-by/

That too.  Whatever you need, haha.

- Alex


Re: [PATCHv2 0/9] split page table lock for PMD tables

2013-09-24 Thread Alex Thorlton
Kirill,

I've merged a couple of our e-mails together for this reply:

On Fri, Sep 20, 2013 at 03:31:37PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massively threaded workloads work poorly
> if THP is enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
> 
> This patchset is based on work by Naoya Horiguchi.
> 
> Changes:
>  v2:
>   - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>   - s/huge_pmd_lock/pmd_lock/g;
>   - assume pgtable_pmd_page_ctor() can fail;
>   - fix format line in task_mem() for VmPTE;
> 
> Benchmark (from Alex): 
> ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
> Run on 4 socket Westmere with 128 GiB of RAM.

First off, this test, although still somewhat relevant here, wasn't
exactly crafted to show the particular page fault scaling problem that
I was looking to address by changing to the split PTLs.  I wrote this
one to show what happens when an application allocates memory in such
a way that all of its pages get stuck on one node due to the use of
THP.  This problem is much more evident on large machines, although
it appears to still be a significant issue even on this 4-socket system.
 
> THP off:
> 
> 
>  Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):
> 
>     1738259.808012 task-clock                #  47.571 CPUs utilized          ( +-  9.49% )
>            147,359 context-switches          #   0.085 K/sec                  ( +-  9.67% )
>                 14 cpu-migrations            #   0.000 K/sec                  ( +- 13.25% )
>         24,410,139 page-faults               #   0.014 M/sec                  ( +-  0.00% )
>  4,149,037,526,252 cycles                    #   2.387 GHz                    ( +-  9.50% )
>  3,649,839,735,027 stalled-cycles-frontend   #  87.97% frontend cycles idle   ( +-  6.60% )
>  2,455,558,969,567 stalled-cycles-backend    #  59.18% backend  cycles idle   ( +- 22.92% )
>  1,434,961,518,604 instructions              #    0.35  insns per cycle
>                                              #    2.54  stalled cycles per insn  ( +- 92.86% )
>    241,472,020,951 branches                  # 138.916 M/sec                  ( +- 91.72% )
>         84,022,172 branch-misses             #    0.03% of all branches       ( +-  3.16% )
> 
>       36.540185552 seconds time elapsed                                       ( +- 18.36% )

I'm assuming this was THP off, no patchset, correct?
 
> THP on, no patchset:
> 
>  Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):
> 
>     2528378.966949 task-clock                #  50.715 CPUs utilized          ( +- 11.86% )
>            214,063 context-switches          #   0.085 K/sec                  ( +- 11.94% )
>                 19 cpu-migrations            #   0.000 K/sec                  ( +- 22.72% )
>             49,226 page-faults               #   0.019 K/sec                  ( +-  0.33% )
>  6,034,640,598,498 cycles                    #   2.387 GHz                    ( +- 11.91% )
>  5,685,933,794,081 stalled-cycles-frontend   #  94.22% frontend cycles idle   ( +-  7.67% )
>  4,414,381,393,353 stalled-cycles-backend    #  73.15% backend  cycles idle   ( +-  2.09% )
>    952,086,804,776 instructions              #    0.16  insns per cycle
>                                              #    5.97  stalled cycles per insn  ( +- 89.59% )
>    166,191,211,974 branches                  #  65.730 M/sec                  ( +- 85.52% )
>         33,341,022 branch-misses             #    0.02% of all branches       ( +-  3.90% )
> 
>       49.854741504 seconds time elapsed                                       ( +- 14.76% )
> 
> THP on, with patchset:
> --
> 
> echo always > /sys/kernel/mm/transparent_hugepage/enabled
>  Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):
> 
>     1538763.343568 task-clock                #  45.386 CPUs utilized          ( +-  7.21% )
>            130,469 context-switches          #   0.085 K/sec                  ( +-  7.32% )
>                 14 cpu-migrations            #   0.000 K/sec                  ( +- 23.58% )
>             49,299 page-faults               #   0.032 K/sec                  ( +-  0.15% )
>  3,666,748,502,650 cycles                    #   2.383 GHz                    ( +-  7.25% )
>  3,330,488,035,212 stalled-cycles-frontend   #  90.83% frontend cycles

Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-18 Thread Alex Thorlton
On Tue, Jun 11, 2013 at 03:20:09PM -0700, David Rientjes wrote:
> On Tue, 11 Jun 2013, Alex Thorlton wrote:
> 
> > This patch adds the ability to control THPs on a per cpuset basis.
> > Please see
> > the additions to Documentation/cgroups/cpusets.txt for more
> > information.
> > 
> 
> What's missing from both this changelog and the documentation you point
> to is why this change is needed.
> 
> I can understand how you would want a subset of processes to not use thp
> when it is enabled.  This is typically where MADV_NOHUGEPAGE is used with
> some type of malloc hook.
> 
> I don't think we need to do this on a cpuset level, so unfortunately I
> think this needs to be reworked.  Would it make sense to add a per-process
> tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() and
> mmap() calls?  Perhaps, but then you would need to justify why it can't be
> done with a malloc hook in userspace.
> 
> This seems to just be working around a userspace issue or for a matter of
> convenience, right?

David,

Thanks for your input; however, I believe the method of using a malloc
hook falls apart when it comes to static binaries, since we won't have
any shared libraries to hook into.  Although using a malloc hook is a
perfectly suitable solution for most cases, we're looking to implement a
solution that can be used in all situations.

Aside from that particular shortcoming of the malloc hook solution,
there are other situations where having a cpuset-based option is a
much simpler and more efficient solution than the alternatives.  One
such situation that comes to mind would be an environment where a batch
scheduler is in use to ration system resources.  If an administrator
determines that a user's jobs run more efficiently with thp always on,
the administrator can simply set the user's jobs to always run with that
setting, instead of having to coordinate with that user to get them to
run their jobs in a different way.  I feel that, for cases such as this,
this additional flag is in line with the other capabilities that
cgroups and cpusets provide.

While there are likely numerous other situations where having a flag to
control thp on the cpuset level makes things a bit easier to manage, the
one real show-stopper here is that we really have no other options when
it comes to static binaries.

- Alex Thorlton


[PATCH] Make transparent hugepages cpuset aware

2013-06-10 Thread Alex Thorlton
This patch adds the ability to control THPs on a per cpuset basis.  Please see
the additions to Documentation/cgroups/cpusets.txt for more information.

Signed-off-by: Alex Thorlton 
Reviewed-by: Robin Holt 
Cc: Li Zefan 
Cc: Rob Landley 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Johannes Weiner 
Cc: Xiao Guangrong 
Cc: David Rientjes 
Cc: linux-...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroups/cpusets.txt |  50 ++-
 include/linux/cpuset.h|   5 ++
 include/linux/huge_mm.h   |  25 +-
 kernel/cpuset.c   | 181 ++
 mm/huge_memory.c  |   3 +
 5 files changed, 261 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt 
b/Documentation/cgroups/cpusets.txt
index 12e01d4..b7b2c83 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -22,12 +22,14 @@ CONTENTS:
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
   1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.9 What is thp_enabled ?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
   2.3 Setting flags
   2.4 Attaching processes
+  2.5 Setting thp_enabled flags
 3. Questions
 4. Contact
 
@@ -581,7 +583,34 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
+1.9 What is thp_enabled ?
+---
+
+The thp_enabled file contained within each cpuset controls how transparent
+hugepages are handled within that cpuset.
+
+The root cpuset's thp_enabled flags mirror the flags set in
+/sys/kernel/mm/transparent_hugepage/enabled.  The flags in the root cpuset can
+only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. The
+thp_enabled file for the root cpuset is read only.  These flags cause the
+root cpuset to behave as one might expect:
+
+- When set to always, THPs are used whenever practical
+- When set to madvise, THPs are used only on chunks of memory that have the
+  MADV_HUGEPAGE flag set
+- When set to never, THPs are never allowed for tasks in this cpuset
+
+The behavior of thp_enabled for children of the root cpuset is where things
+become a bit more interesting.  The child cpusets accept the same flags as the
+root, but also have a default flag, which, when set, causes a cpuset to use the
+behavior of its parent.  When a child cpuset is created, its default flag is
+always initially set.
+
+Since the flags on child cpusets are allowed to differ from the flags on their
+parents, we are able to enable THPs for tasks in specific cpusets, and disable
+them in others.
+
+1.10 How do I use cpusets ?
 --
 
 In order to minimize the impact of cpusets on critical kernel
@@ -733,6 +762,7 @@ cpuset.cpuscpuset.sched_load_balance
 cpuset.mem_exclusive   cpuset.sched_relax_domain_level
 cpuset.mem_hardwallnotify_on_release
 cpuset.memory_migrate  tasks
+thp_enabled
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one 
after another:
...
 # /bin/echo PIDn > tasks
 
+2.5 Setting thp_enabled flags
+-
+
+The syntax for setting these flags is similar to setting thp flags in
+/sys/kernel/mm/transparent_hugepage/enabled.  In order to change the flags you
+simply echo the name of the flag you want to set to the thp_enabled file of the
+desired cpuset:
+
+# /bin/echo always > thp_enabled   -> always use THPs when practical
+# /bin/echo madvise > thp_enabled  -> only use THPs in madvise sections
+# /bin/echo never > thp_enabled-> never use THPs
+# /bin/echo default > thp_enabled  -> use parent cpuset's THP flags
+
+Note that the flags on the root cpuset cannot be changed in /dev/cpuset.  These
+flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can 
only
+be modified there.
 
 3. Questions
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index cc1b01c..624aafd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -19,9 +19,12 @@ extern int number_of_cpusets;/* How many cpusets are 
defined in system? */
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
+extern void cpuset_update_top_thp_flags(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern int cpuset_thp_always(struct task_struct *p);
+extern int cpuset_thp_madvise(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p)

Re: [PATCH] Make transparent hugepages cpuset aware

2013-06-11 Thread Alex Thorlton
On Tue, Jun 11, 2013 at 09:55:18AM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > This patch adds the ability to control THPs on a per cpuset basis.  Please 
> > see
> > the additions to Documentation/cgroups/cpusets.txt for more information.
> > 
> > Signed-off-by: Alex Thorlton 
> > Reviewed-by: Robin Holt 
> > Cc: Li Zefan 
> > Cc: Rob Landley 
> > Cc: Andrew Morton 
> > Cc: Mel Gorman 
> > Cc: Rik van Riel 
> > Cc: Kirill A. Shutemov 
> > Cc: Johannes Weiner 
> > Cc: Xiao Guangrong 
> > Cc: David Rientjes 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux...@kvack.org
> > ---
> >  Documentation/cgroups/cpusets.txt |  50 ++-
> >  include/linux/cpuset.h|   5 ++
> >  include/linux/huge_mm.h   |  25 +-
> >  kernel/cpuset.c   | 181 
> > ++
> >  mm/huge_memory.c  |   3 +
> >  5 files changed, 261 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/cpusets.txt 
> > b/Documentation/cgroups/cpusets.txt
> > index 12e01d4..b7b2c83 100644
> > --- a/Documentation/cgroups/cpusets.txt
> > +++ b/Documentation/cgroups/cpusets.txt
> > @@ -22,12 +22,14 @@ CONTENTS:
> >1.6 What is memory spread ?
> >1.7 What is sched_load_balance ?
> >1.8 What is sched_relax_domain_level ?
> > -  1.9 How do I use cpusets ?
> > +  1.9 What is thp_enabled ?
> > +  1.10 How do I use cpusets ?
> >  2. Usage Examples and Syntax
> >2.1 Basic Usage
> >2.2 Adding/removing cpus
> >2.3 Setting flags
> >2.4 Attaching processes
> > +  2.5 Setting thp_enabled flags
> >  3. Questions
> >  4. Contact
> >  
> > @@ -581,7 +583,34 @@ If your situation is:
> >  then increasing 'sched_relax_domain_level' would benefit you.
> >  
> >  
> > -1.9 How do I use cpusets ?
> > +1.9 What is thp_enabled ?
> > +---
> > +
> > +The thp_enabled file contained within each cpuset controls how transparent
> > +hugepages are handled within that cpuset.
> > +
> > +The root cpuset's thp_enabled flags mirror the flags set in
> > +/sys/kernel/mm/transparent_hugepage/enabled.  The flags in the root cpuset 
> > can
> > +only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. 
> > The
> > +thp_enabled file for the root cpuset is read only.  These flags cause the
> > +root cpuset to behave as one might expect:
> > +
> > +- When set to always, THPs are used whenever practical
> > +- When set to madvise, THPs are used only on chunks of memory that have the
> > +  MADV_HUGEPAGE flag set
> > +- When set to never, THPs are never allowed for tasks in this cpuset
> > +
> > +The behavior of thp_enabled for children of the root cpuset is where things
> > +become a bit more interesting.  The child cpusets accept the same flags as 
> > the
> > +root, but also have a default flag, which, when set, causes a cpuset to 
> > use the
> > +behavior of its parent.  When a child cpuset is created, its default flag 
> > is
> > +always initially set.
> > +
> > +Since the flags on child cpusets are allowed to differ from the flags on 
> > their
> > +parents, we are able to enable THPs for tasks in specific cpusets, and 
> > disable
> > +them in others.
> 
> Should we have a way for the parent cgroup to enforce child behaviour?
> Like a mask of allowed thp_enabled values children can choose.
> 

We don't have a use case for that particular scenario, so we didn't
include any such functionality.  Our main goal here was to allow 
cpusets to override the /sys/kernel/mm/transparent_hugepage/enabled 
setting.  If you have a use case for that scenario, then I think it
would be more suitable to add that functionality in a separate patch.

> > @@ -177,6 +177,29 @@ static inline struct page *compound_trans_head(struct 
> > page *page)
> > return page;
> >  }
> >  
> > +#ifdef CONFIG_CPUSETS
> > +extern int cpuset_thp_always(struct task_struct *p);
> > +extern int cpuset_thp_madvise(struct task_struct *p);
> > +
> > +static inline int transparent_hugepage_enabled(struct vm_area_struct *vma)
> > +{
> > +   if (cpuset_thp_always(current))
> > +   return 1;
> 
> Why do you ignore VM_NOHUGEPAGE?
> And !is_vma_temporary_stack(__vma) is still relevant.
> 

That was an oversight, on my part.  I've fixed it and will submit the
corrected patch shortly.  Thanks for pointing that out.

[PATCH v2] Make transparent hugepages cpuset aware

2013-06-11 Thread Alex Thorlton
This patch adds the ability to control THPs on a per cpuset basis.  Please see
the additions to Documentation/cgroups/cpusets.txt for more information.

Signed-off-by: Alex Thorlton 
Reviewed-by: Robin Holt 
Cc: Li Zefan 
Cc: Rob Landley 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Johannes Weiner 
Cc: Xiao Guangrong 
Cc: David Rientjes 
Cc: linux-...@vger.kernel.org
Cc: linux...@kvack.org
---
Changes since last patch version:
- Modified transparent_hugepage_enabled to always check the vma for the
  VM_NOHUGEPAGE flag and to always check is_vma_temporary_stack.
- Moved cpuset_update_child_thp_flags above cpuset_update_top_thp_flags

 Documentation/cgroups/cpusets.txt |  50 ++-
 include/linux/cpuset.h|   5 ++
 include/linux/huge_mm.h   |  27 +-
 kernel/cpuset.c   | 181 ++
 mm/huge_memory.c  |   3 +
 5 files changed, 263 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt 
b/Documentation/cgroups/cpusets.txt
index 12e01d4..b7b2c83 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -22,12 +22,14 @@ CONTENTS:
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
   1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.9 What is thp_enabled ?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
   2.3 Setting flags
   2.4 Attaching processes
+  2.5 Setting thp_enabled flags
 3. Questions
 4. Contact
 
@@ -581,7 +583,34 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
+1.9 What is thp_enabled ?
+---
+
+The thp_enabled file contained within each cpuset controls how transparent
+hugepages are handled within that cpuset.
+
+The root cpuset's thp_enabled flags mirror the flags set in
+/sys/kernel/mm/transparent_hugepage/enabled.  The flags in the root cpuset can
+only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. The
+thp_enabled file for the root cpuset is read only.  These flags cause the
+root cpuset to behave as one might expect:
+
+- When set to always, THPs are used whenever practical
+- When set to madvise, THPs are used only on chunks of memory that have the
+  MADV_HUGEPAGE flag set
+- When set to never, THPs are never allowed for tasks in this cpuset
+
+The behavior of thp_enabled for children of the root cpuset is where things
+become a bit more interesting.  The child cpusets accept the same flags as the
+root, but also have a default flag, which, when set, causes a cpuset to use the
+behavior of its parent.  When a child cpuset is created, its default flag is
+always initially set.
+
+Since the flags on child cpusets are allowed to differ from the flags on their
+parents, we are able to enable THPs for tasks in specific cpusets, and disable
+them in others.
+
+1.10 How do I use cpusets ?
 --
 
 In order to minimize the impact of cpusets on critical kernel
@@ -733,6 +762,7 @@ cpuset.cpuscpuset.sched_load_balance
 cpuset.mem_exclusive   cpuset.sched_relax_domain_level
 cpuset.mem_hardwallnotify_on_release
 cpuset.memory_migrate  tasks
+thp_enabled
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one 
after another:
...
 # /bin/echo PIDn > tasks
 
+2.5 Setting thp_enabled flags
+-
+
+The syntax for setting these flags is similar to setting thp flags in
+/sys/kernel/mm/transparent_hugepage/enabled.  In order to change the flags you
+simply echo the name of the flag you want to set to the thp_enabled file of the
+desired cpuset:
+
+# /bin/echo always > thp_enabled   -> always use THPs when practical
+# /bin/echo madvise > thp_enabled  -> only use THPs in madvise sections
+# /bin/echo never > thp_enabled-> never use THPs
+# /bin/echo default > thp_enabled  -> use parent cpuset's THP flags
+
+Note that the flags on the root cpuset cannot be changed in /dev/cpuset.  These
+flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can 
only
+be modified there.
 
 3. Questions
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index cc1b01c..624aafd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -19,9 +19,12 @@ extern int number_of_cpusets;/* How many cpusets are 
defined in system? */
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
+extern void cpuset_update_top_thp_flags(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
 extern void cpuset_cpus_allowed(struct task

[PATCH 0/8] Re: [PATCH] Add per-process flag to control thp

2013-08-16 Thread Alex Thorlton
Here are the results from one of the benchmarks that performs
particularly poorly when thp is enabled.  Unfortunately the vclear
patches don't seem to provide a performance boost.  I've attached
the patches that include the changes I had to make to get the vclear
patches applied to the latest kernel.

This first set of tests was run on the latest community kernel, with the
vclear patches:

Kernel string: Kernel 3.11.0-rc5-medusa-00021-g1a15a96-dirty
harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# 
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# 
time ./run.sh
...
Done. Terminating the simulation.

real    25m34.052s
user    10769m7.948s
sys     37m46.524s

harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# 
echo never > /sys/kernel/mm/transparent_hugepage/enabled
harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# 
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# 
time ./run.sh
...
Done. Terminating the simulation.

real    5m0.377s
user    2202m0.684s
sys     108m31.816s

Here are the same tests on the clean kernel:

Kernel string: Kernel 3.11.0-rc5-medusa-00013-g584d88b
athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l>
 cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l>
 time ./run.sh
...
Done. Terminating the simulation.

real    21m44.052s
user    10809m55.356s
sys     39m58.300s


harp31-sys:~ # echo never > /sys/kernel/mm/transparent_hugepage/enabled
athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l>
 cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l>
 time ./run.sh
...
Done. Terminating the simulation.

real    4m52.502s
user    2127m18.548s
sys     104m50.828s

Working on getting some more information about the root of the
performance issues now...

Alex Thorlton (8):
  THP: Use real address for NUMA policy
  mm: make clear_huge_page tolerate non aligned address
  THP: Pass real, not rounded, address to clear_huge_page
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2
  remove KM_USER0 from kmap_atomic call
  fix up references to kernel_fpu_begin/end

 arch/x86/include/asm/page.h  |  2 +
 arch/x86/include/asm/string_32.h |  5 ++
 arch/x86/include/asm/string_64.h |  5 ++
 arch/x86/lib/Makefile|  1 +
 arch/x86/lib/clear_page_nocache_32.S | 30 
 arch/x86/lib/clear_page_nocache_64.S | 92 
 arch/x86/mm/fault.c  |  7 +++
 mm/huge_memory.c | 17 +++
 mm/memory.c  | 31 ++--
 9 files changed, 179 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

-- 
1.7.12.4



[PATCH 3/8] THP: Pass real, not rounded, address to clear_huge_page

2013-08-16 Thread Alex Thorlton
Conflicts:
mm/huge_memory.c
---
 mm/huge_memory.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55ec681..2e6f106 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -701,7 +701,8 @@ static inline pmd_t mk_huge_pmd(struct page *page, struct 
vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
-   unsigned long haddr, pmd_t *pmd,
+   unsigned long haddr,
+   unsigned long address, pmd_t *pmd,
struct page *page)
 {
pgtable_t pgtable;
@@ -711,12 +712,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct 
*mm,
if (unlikely(!pgtable))
return VM_FAULT_OOM;
 
-   clear_huge_page(page, haddr, HPAGE_PMD_NR);
/*
 * The memory barrier inside __SetPageUptodate makes sure that
 * clear_huge_page writes become visible before the set_pmd_at()
 * write.
 */
+   clear_huge_page(page, address, HPAGE_PMD_NR);
__SetPageUptodate(page);
 
spin_lock(&mm->page_table_lock);
@@ -825,8 +826,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
put_page(page);
goto out;
}
-   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
+   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+   address, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
goto out;
-- 
1.7.12.4



[PATCH 7/8] remove KM_USER0 from kmap_atomic call

2013-08-16 Thread Alex Thorlton
---
 arch/x86/mm/fault.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a088ed7..d3f6dc7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1233,7 +1233,7 @@ do_page_fault(struct pt_regs *regs, unsigned long 
error_code)
 
 void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
 {
-   void *p = kmap_atomic(page, KM_USER0);
+   void *p = kmap_atomic(page);
clear_page_nocache(p);
kunmap_atomic(p);
 }
-- 
1.7.12.4



[PATCH 8/8] fix up references to kernel_fpu_begin/end

2013-08-16 Thread Alex Thorlton
---
 arch/x86/lib/clear_page_nocache_64.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/lib/clear_page_nocache_64.S 
b/arch/x86/lib/clear_page_nocache_64.S
index a6d938c..d03408e 100644
--- a/arch/x86/lib/clear_page_nocache_64.S
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -23,7 +23,7 @@
 ENTRY(clear_page_nocache)
CFI_STARTPROC
push   %rdi
-   call   kernel_fpu_begin
+   call   __kernel_fpu_begin
pop%rdi
sub$16,%rsp
CFI_ADJUST_CFA_OFFSET 16
@@ -43,7 +43,7 @@ ENTRY(clear_page_nocache)
movdqu (%rsp),%xmm0
addq   $16,%rsp
CFI_ADJUST_CFA_OFFSET -16
-   jmp   kernel_fpu_end
+   jmp   __kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache)
 
@@ -65,7 +65,7 @@ ENDPROC(clear_page_nocache)
 ENTRY(clear_page_nocache_avx)
CFI_STARTPROC
push   %rdi
-   call   kernel_fpu_begin
+   call   __kernel_fpu_begin
pop%rdi
sub$32,%rsp
CFI_ADJUST_CFA_OFFSET 32
@@ -85,7 +85,7 @@ ENTRY(clear_page_nocache_avx)
vmovdqu (%rsp),%ymm0
addq   $32,%rsp
CFI_ADJUST_CFA_OFFSET -32
-   jmp   kernel_fpu_end
+   jmp   __kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache_avx)
 
-- 
1.7.12.4



[PATCH 6/8] x86: switch the 64bit uncached page clear to SSE/AVX v2

2013-08-16 Thread Alex Thorlton
---
 arch/x86/lib/clear_page_nocache_64.S | 91 ++--
 1 file changed, 77 insertions(+), 14 deletions(-)

diff --git a/arch/x86/lib/clear_page_nocache_64.S 
b/arch/x86/lib/clear_page_nocache_64.S
index ee16d15..a6d938c 100644
--- a/arch/x86/lib/clear_page_nocache_64.S
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -1,29 +1,92 @@
+/*
+ * Clear pages with cache bypass.
+ *
+ * Copyright (C) 2011, 2012 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ */
+
 #include 
+#include 
+#include 
 #include 
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi page
  */
 ENTRY(clear_page_nocache)
CFI_STARTPROC
-   xorl   %eax,%eax
-   movl   $4096/64,%ecx
+   push   %rdi
+   call   kernel_fpu_begin
+   pop%rdi
+   sub$16,%rsp
+   CFI_ADJUST_CFA_OFFSET 16
+   movdqu %xmm0,(%rsp)
+   xorpd  %xmm0,%xmm0
+   movl   $4096/SSE_UNROLL,%ecx
.p2align 4
 .Lloop:
decl%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-   movnti %rax,(%rdi)
-   PUT(1)
-   PUT(2)
-   PUT(3)
-   PUT(4)
-   PUT(5)
-   PUT(6)
-   PUT(7)
-   leaq64(%rdi),%rdi
+   .set x,0
+   .rept SSE_UNROLL/16
+   movntdq %xmm0,x(%rdi)
+   .set x,x+16
+   .endr
+   leaqSSE_UNROLL(%rdi),%rdi
jnz .Lloop
-   nop
-   ret
+   movdqu (%rsp),%xmm0
+   addq   $16,%rsp
+   CFI_ADJUST_CFA_OFFSET -16
+   jmp   kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+   .section .altinstr_replacement,"ax"
+1: .byte 0xeb  /* jmp  */
+   .byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+16, 2b-1b
+   .previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+   CFI_STARTPROC
+   push   %rdi
+   call   kernel_fpu_begin
+   pop%rdi
+   sub$32,%rsp
+   CFI_ADJUST_CFA_OFFSET 32
+   vmovdqu %ymm0,(%rsp)
+   vxorpd  %ymm0,%ymm0,%ymm0
+   movl   $4096/AVX_UNROLL,%ecx
+   .p2align 4
+.Lloop_avx:
+   decl%ecx
+   .set x,0
+   .rept AVX_UNROLL/32
+   vmovntdq %ymm0,x(%rdi)
+   .set x,x+32
+   .endr
+   leaqAVX_UNROLL(%rdi),%rdi
+   jnz .Lloop_avx
+   vmovdqu (%rsp),%ymm0
+   addq   $32,%rsp
+   CFI_ADJUST_CFA_OFFSET -32
+   jmp   kernel_fpu_end
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.12.4



[PATCH 2/8] mm: make clear_huge_page tolerate non aligned address

2013-08-16 Thread Alex Thorlton
---
 mm/memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1ce2e2a..69a5a38 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4249,16 +4249,17 @@ void clear_huge_page(struct page *page,
 unsigned long addr, unsigned int pages_per_huge_page)
 {
int i;
+   unsigned long haddr = addr & HPAGE_PMD_MASK;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, addr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, pages_per_huge_page);
return;
}
 
might_sleep();
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
}
 }
 
-- 
1.7.12.4



[PATCH 1/8] THP: Use real address for NUMA policy

2013-08-16 Thread Alex Thorlton
---
 mm/huge_memory.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..55ec681 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -746,11 +746,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, 
gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
  struct vm_area_struct *vma,
- unsigned long haddr, int nd,
+ unsigned long address, int nd,
  gfp_t extra_gfp)
 {
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-  HPAGE_PMD_ORDER, vma, haddr, nd);
+  HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -815,7 +815,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
return 0;
}
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1160,7 +1160,7 @@ alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
else
new_page = NULL;
 
-- 
1.7.12.4



[PATCH 4/8] x86: Add clear_page_nocache

2013-08-16 Thread Alex Thorlton
Conflicts:
arch/x86/mm/fault.c
---
 arch/x86/include/asm/page.h  |  2 ++
 arch/x86/include/asm/string_32.h |  5 +
 arch/x86/include/asm/string_64.h |  5 +
 arch/x86/lib/Makefile|  1 +
 arch/x86/lib/clear_page_nocache_32.S | 30 ++
 arch/x86/lib/clear_page_nocache_64.S | 29 +
 arch/x86/mm/fault.c  |  7 +++
 7 files changed, 79 insertions(+)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index c878924..ccec9c3 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -33,6 +33,8 @@ static inline void copy_user_page(void *to, void *from, 
unsigned long vaddr,
copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/linkage.h>
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long 
pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/linkage.h>
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 96b2c66..e48776c 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_nocache_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
diff --git a/arch/x86/lib/clear_page_nocache_32.S 
b/arch/x86/lib/clear_page_nocache_32.S
new file mode 100644
index 000..2394e0c
--- /dev/null
+++ b/arch/x86/lib/clear_page_nocache_32.S
@@ -0,0 +1,30 @@
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/*
+ * Zero a page avoiding the caches
+ * rdi page
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   mov%eax,%edi
+   xorl   %eax,%eax
+   movl   $4096/64,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
+   PUT(0)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+   lea 64(%edi),%edi
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/lib/clear_page_nocache_64.S 
b/arch/x86/lib/clear_page_nocache_64.S
new file mode 100644
index 000..ee16d15
--- /dev/null
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -0,0 +1,29 @@
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/*
+ * Zero a page avoiding the caches
+ * rdi page
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   xorl   %eax,%eax
+   movl   $4096/64,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) movnti %rax,x*8(%rdi)
+   movnti %rax,(%rdi)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+   leaq64(%rdi),%rdi
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 654be4a..a088ed7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1230,3 +1230,10 @@ do_page_fault(struct pt_regs *regs, unsigned long 
error_code)
__do_page_fault(regs, error_code);
exception_exit(prev_state);
 }
+
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
+{
+   void *p = kmap_atomic(page, KM_USER0);
+   clear_page_nocache(p);
+   kunmap_atomic(p);
+}
-- 
1.7.12.4


[PATCH 5/8] mm: make clear_huge_page cache clear only around the fault address

2013-08-16 Thread Alex Thorlton
---
 mm/memory.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69a5a38..17d61f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4231,18 +4231,35 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page)
 {
int i;
struct page *p = page;
+   unsigned long vaddr;
+   unsigned long haddr = addr & HPAGE_PMD_MASK;
+   int target = (addr - haddr) >> PAGE_SHIFT;
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page;
 i++, p = mem_map_next(p, page, i)) {
cond_resched();
-   clear_user_highpage(p, addr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE  || i == target)
+   clear_user_highpage(p, vaddr);
+   else
+   clear_user_highpage_nocache(p, vaddr);
}
 }
 void clear_huge_page(struct page *page,
@@ -4250,6 +4267,8 @@ void clear_huge_page(struct page *page,
 {
int i;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+   unsigned long vaddr;
+   int target = (addr - haddr) >> PAGE_SHIFT;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
clear_gigantic_page(page, haddr, pages_per_huge_page);
@@ -4257,9 +4276,14 @@ void clear_huge_page(struct page *page,
}
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE || i == target)
+   clear_user_highpage(page + i, vaddr);
+   else
+   clear_user_highpage_nocache(page + i, vaddr);
}
 }
 
-- 
1.7.12.4



Re: [PATCH 0/8] Re: [PATCH] Add per-process flag to control thp

2013-08-16 Thread Alex Thorlton
> This 321.equake_l thing is not public, right? Do you have anything
> that
> is public that shows the same problem?

The quake test comes from the SPEC OMP benchmarks.  While I believe this
suite is available to anybody, I don't think it's free.  I was given
the test by our benchmarking team, I'll have to ask them if this is
publicly available/if they know of any publicly available tests that can
reproduce the issue.

> Do you have any idea where all the extra time is spend? 

I'm working on that as we speak.  I was trying to eliminate the
possibility that the vclear patches would help solve the problem,
before digging any deeper.  I should have some more information on
exactly where it's getting hung up in a few hours.

- Alex


Re: [PATCH 1/8] THP: Use real address for NUMA policy

2013-08-16 Thread Alex Thorlton
> Could you add some actual descriptions to these patches that say why you
> are doing this, and why this particular patch is needed and implemented
> this way?
> 
> You mention that THP is slow for you, then go on to implement some
> non-cached page zero'ing, but you never quite connect the dots.

I actually didn't write these patches (made a few tweaks to get them
running on the latest kernel though).  They were submitted last July by 
Peter Zijlstra.  Andi Kleen suggested that I re-run some of my tests
using these patches to see whether they solved my issue.  I just
included my updated patches so that people could confirm that I'd pulled
them forward properly.

The messages from the original submission can be found here:
https://lkml.org/lkml/2012/7/20/165

I did write the patch that these were submitted in response to, to
control THP on a per-process basis, but it looks like we're probably
going to end up taking this in a different direction, pending some more
test results.

- Alex


Re: [PATCH 1/8] THP: Use real address for NUMA policy

2013-08-16 Thread Alex Thorlton
On Fri, Aug 16, 2013 at 09:46:35PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 16, 2013 at 01:17:28PM -0500, Alex Thorlton wrote:
> > I actually didn't write these patches (made a few tweaks to get them
> > running on the latest kernel though).  They were submitted last July by 
> > Peter Zijlstra. 
> 
> By Kirill, I don't think I've ever touched them.
> 
> > The messages from the original submission can be found here:
> > https://lkml.org/lkml/2012/7/20/165
> 
> As can be seen by this posting you reference ;-)

Sorry about that!  Not sure where I got the idea that you wrote them,
must've gotten some names mixed up somewhere.


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-07-29 Thread Alex Thorlton
We've re-evaluated the need for a patch to support some sort of finer
grained control over thp, and, based on some tests performed by our 
benchmarking team, we're seeing that we'd definitely still like to
implement some method to support this.  Here's an e-mail from John 
Baron (jba...@sgi.com), on our benchmarking team, containing some data
which shows a decrease in performance for some SPEC OMP benchmarks when
thp is enabled:

> Here are results for SPEC OMP benchmarks on UV2 using 512 threads / 64
> sockets.  These show the performance ratio for jobs run with THP
> disabled versus THP enabled (so > 1.0 means THP disabled is faster).  
> One possible reason for lower performance with THP enabled is that the 
> larger page granularity can result in more remote data accesses.
> 
> 
> SPEC OMP2012:
> 
> 350.md  1.0
> 351.bwaves  1.3
> 352.nab 1.0
> 357.bt331   0.9
> 358.botsalgn1.0
> 359.botsspar1.1
> 360.ilbdc   1.8
> 362.fma3d   1.0
> 363.swim1.4
> 367.imagick 0.9
> 370.mgrid3311.1
> 371.applu3310.9
> 372.smithwa 1.0
> 376.kdtree  1.0
> 
> SPEC OMPL2001:
> 
> 311.wupwise_l   1.1
> 313.swim_l  1.5
> 315.mgrid_l 1.0
> 317.applu_l 1.1
> 321.equake_l5.8
> 325.apsi_l  1.5
> 327.gafort_l1.0
> 329.fma3d_l 1.0
> 331.art_l   0.8
> 
> One could argue that real-world applications could be modified to avoid
> these kinds of effects, but (a) it is not always possible to modify code
> (e.g. in benchmark situations) and (b) even if it is possible to do so,
> it is not necessarily easy to do so (e.g. for customers with large
> legacy Fortran codes).
> 
> We have also observed on Intel Sandy Bridge processors that, as
> counter-intuitive as it may seem, local memory bandwidth is actually
> slightly lower with THP enabled (1-2%), even with unit stride data
> accesses.  This may not seem like much of a performance hit but
> it is important for HPC workloads.  No code modification will help here.

In light of the previous issues discussed in this thread, and some
suggestions from David Rientjes:

> why not make it per-process so users don't have to configure
> cpusets to control it?

Robin and I have come up with a proposal for a way to replicate behavior
similar to what this patch introduced, only on a per-process level
instead of at the cpuset level.

Our idea would be to add a flag somewhere in the task_struct to keep
track of whether or not thp is enabled for each task.  The flag would be
controlled by an additional option included in prctl(), allowing
programmers to set/unset this flag via the prctl() syscall.  We would 
also introduce some code into the clone() syscall to ensure that this 
flag is copied down from parent to child tasks when necessary.  The flag
would be checked in the same place that the per-cpuset flag was checked
in my original patch, thereby allowing the same behavior to be
replicated on a per-process level.

In this way, we will also be able to get static binaries to behave
appropriately by setting this flag in a userland program, and then
having that program exec the static binary for which we need to disable
thp.

This solution allows us to incorporate the behavior that we're looking
for into the kernel, without abusing cpusets for the purpose of
containerization.

Please let me know if anyone has any objections to this approach, or if
you have any suggestions as to how we could improve upon this idea.

Thanks!

- Alex


[PATCH] Add per-process flag to control thp

2013-08-02 Thread Alex Thorlton
This patch implements functionality to allow processes to disable the use of
transparent hugepages through the prctl syscall.

We've determined that some jobs perform significantly better with thp disabled,
and we need a way to control thp on a per-process basis, without relying on
madvise.

---
 include/linux/huge_mm.h| 14 +-
 include/linux/init_task.h  |  8 
 include/linux/sched.h  |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c  |  4 
 kernel/sys.c   | 31 +++
 6 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b60de92..53af3ca 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_HUGE_MM_H
 #define _LINUX_HUGE_MM_H
 
+#include <linux/sched.h>
+
 extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  struct vm_area_struct *vma,
  unsigned long address, pmd_t *pmd,
@@ -66,7 +68,7 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
-#define transparent_hugepage_enabled(__vma)\
+#define _transparent_hugepage_enabled(__vma)   \
((transparent_hugepage_flags &  \
  (1<<TRANSPARENT_HUGEPAGE_FLAG) ||
[... diff truncated in archive; the remaining huge_mm.h, init_task.h, sched.h, and prctl.h hunks and the start of kernel/fork.c were lost ...]
p->sequential_io_avg= 0;
 #endif
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   p->thp_disabled = current->thp_disabled;
+#endif
+
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 771129b..416c8a6 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1836,6 +1836,31 @@ static int prctl_get_tid_address(struct task_struct *me, 
int __user **tid_addr)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int prctl_set_thp_disabled(struct task_struct *me)
+{
+   me->thp_disabled = 1;
+   return 0;
+}
+
+static int prctl_get_thp_disabled(struct task_struct *me,
+ int __user *thp_disabled)
+{
+   return put_user(me->thp_disabled, thp_disabled);
+}
+#else
+static int prctl_set_thp_disabled(struct task_struct *me)
+{
+   return -EINVAL;
+}
+
+static int prctl_get_thp_disabled(struct task_struct *me,
+ int __user *thp_disabled)
+{
+   return -EINVAL;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
 {
@@ -1999,6 +2024,12 

Re: [PATCH] Add per-process flag to control thp

2013-08-02 Thread Alex Thorlton
> Instead of blowing a whole int on this bool, we could add it
> to the bitfield a few pages up where we have other prctl things like..
> 
>   unsigned no_new_privs:1;

Definitely a better way to go.  I'll tweak that and float another
version of the patch.  Thanks for the input, Dave!


Re: [PATCH] Add per-process flag to control thp

2013-08-02 Thread Alex Thorlton
> What kind of workloads are you talking about?

Our benchmarking team has a list of several of the SPEC OMP benchmarks
that perform significantly better when THP is disabled. I tried to get
the list but one of our servers is acting up and I can't get to it
right now :/

> What's wrong with madvise? Could you elaborate?

The main issue with using madvise is that it's not an option with static
binaries, but there are also some users who have legacy Fortran code
that they're not willing/able to change.

> And I think thp_disabled should be reset to 0 on exec.

The main purpose for this getting carried down from the parent process
is that we'd like to be able to have a userland program set this flag on
itself, and then spawn off children who will also carry the flag.
This allows us to set the flag for programs where we're unable to modify
the code, thus resolving the issues detailed above.


[PATCHv2] Add per-process flag to control thp

2013-08-02 Thread Alex Thorlton
This patch implements functionality to allow processes to disable the use of
transparent hugepages through the prctl syscall.

We've determined that some jobs perform significantly better with thp disabled,
and we needed a way to control thp on a per-process basis, without relying on
madvise.

v2 - tweaked thp_disabled flag to be a single bit instead of an int

---
 include/linux/huge_mm.h| 14 +-
 include/linux/init_task.h  |  8 
 include/linux/sched.h  |  4 
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c  |  4 
 kernel/sys.c   | 31 +++
 6 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b60de92..53af3ca 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_HUGE_MM_H
 #define _LINUX_HUGE_MM_H
 
+#include <linux/sched.h>
+
 extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  struct vm_area_struct *vma,
  unsigned long address, pmd_t *pmd,
@@ -66,7 +68,7 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
-#define transparent_hugepage_enabled(__vma)\
+#define _transparent_hugepage_enabled(__vma)   \
((transparent_hugepage_flags &  \
  (1<<TRANSPARENT_HUGEPAGE_FLAG) ||
[... diff truncated in archive; the remaining huge_mm.h, init_task.h, sched.h, and prctl.h hunks and the start of kernel/fork.c were lost ...]
p->sequential_io_avg= 0;
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   p->thp_disabled = current->thp_disabled;
+#endif
+
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 771129b..416c8a6 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1836,6 +1836,31 @@ static int prctl_get_tid_address(struct task_struct *me, 
int __user **tid_addr)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int prctl_set_thp_disabled(struct task_struct *me)
+{
+   me->thp_disabled = 1;
+   return 0;
+}
+
+static int prctl_get_thp_disabled(struct task_struct *me,
+ int __user *thp_disabled)
+{
+   return put_user(me->thp_disabled, thp_disabled);
+}
+#else
+static int prctl_set_thp_disabled(struct task_struct *me)
+{
+   return -EINVAL;
+}
+
+static int prctl_get_thp_disabled(struct task_struct *me,
+ int __user *thp_disabled)
+{
+   return -EINVAL;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)

Re: [PATCHv2] Add per-process flag to control thp

2013-08-05 Thread Alex Thorlton
> I think the changelog should explain why madvise() is bad.

No problem.  I wanted to keep it simple for the original submission, but
that's probably something that should be included.

> But I simply can't understand why this flag is per-thread. It should
> be
> mm flag, no?

This is something that we (Robin and I) had discussed a while back, and,
upon review, I'm beginning to agree that this might be the more sensible
route to take.

I'm going to try and gather a bit more data to see if we can get some
more exact answers as to why THP is performing so poorly under certain
conditions before trying to push this particular patch any further.

Thanks for the input!

- Alex


Re: [PATCH] Add per-process flag to control thp

2013-08-05 Thread Alex Thorlton
> Please try it with the vclear patches.

Thanks, Andi.  I'll give that a shot and see if it makes any difference.

- Alex


Re: [PATCHv2] Add per-process flag to control thp

2013-08-05 Thread Alex Thorlton
> It would be good to describe which jobs and why thp cannot be fixed for
> these.

I'm going to see if I can get the benchmarks that are causing trouble up
and running on my own to get some more information (problem was
originally identified by our benchmarking team).

Thanks for the input!

- Alex


Re: [PATCHv2] Add per-process flag to control thp

2013-08-06 Thread Alex Thorlton
> What everyone else said, plus...
> 
> I worry about the inherit-across-fork thing.  A scenario we should
> think about is where the user doesn't build his own executables.  So he
> writes a little wrapper which runs prctl(PR_SET_THP_DISABLED, 0) then
> execs the third-party-app.  But what happens if that app execs children
> who *do* want THP?  He has to hunt down and wrap those as well?
> 
> It all seems a bit unwieldy.  I wonder if controlling it via the ELF
> header would be more practical.

Thanks, Andrew.  I'm doing some more testing and looking into using a
different method for controlling this.  At this point, I think it's fair
to say that we don't want to control this using the method that I've
proposed here, no matter how we look at it.

I've gotten my hands on some of the benchmarks/code that were used to
originally uncover the performance issues we're seeing.  I'm currently
trying to separate out the performance issues that are being caused by
the kernel code from issues involving hardware - the cost of remote
memory accesses is a bit higher on our systems with node controllers vs.
glueless QPI/Hypertransport-based systems.  

At this point, it's difficult to say whether or not the issue can be
solved by "fixing the performance issues with thp," as several have
suggested.  Don't get me wrong, I like the idea of that solution as well;
we're just not sure, right now, if that's going to solve all of our
problems.

I'll be back when I've dug up a bit more info on the issue.  Thanks for
the input, everyone!


Re: [PATCH] Skip UV runtime services mapping in the efi_runtime_disabled case

2016-08-01 Thread Alex Thorlton
On Mon, Aug 01, 2016 at 02:49:57PM +0100, Matt Fleming wrote:
> On Tue, 26 Jul, at 05:38:33PM, Alex Thorlton wrote:
> > This problem has actually been in the UV code for a while, but we didn't
> > catch it until recently, because we had been relying on EFI_OLD_MEMMAP
> > to allow our systems to boot for a period of time.  We noticed the issue
> > when trying to kexec a recent community kernel, where we hit this NULL
> > pointer dereference in efi_sync_low_kernel_mappings:
> > 
> > [0.337515] BUG: unable to handle kernel NULL pointer dereference at 
> > 0880
> > [0.346276] IP: [] 
> > efi_sync_low_kernel_mappings+0x5d/0x1b0
> > 
> > The problem doesn't show up with EFI_OLD_MEMMAP because we skip the
> > chunk of setup_efi_state that sets the efi_loader_signature for the
> > kexec'd kernel.  When the kexec'd kernel boots, it won't set EFI_BOOT in
> > setup_arch, so we completely avoid the bug.
> > 
> > We always kexec with noefi on the command line, so this shouldn't be an
> > issue, but since we're not actually checking for efi_runtime_disabled in
> > uv_bios_init, we end up trying to do EFI runtime callbacks when we
> > shouldn't be. This patch just adds a check for efi_runtime_disabled in
> > uv_bios_init so that we don't map in uv_systab when runtime_disabled ==
> > true.
> > 
> > Signed-off-by: Alex Thorlton 
> > Cc: Russ Anderson 
> > Cc: Mike Travis 
> > Cc: Matt Fleming 
> > Cc: Borislav Petkov 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin" 
> > Cc: x...@kernel.org
> > ---
> >  arch/x86/platform/uv/bios_uv.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c
> > index 66b2166..0df8a03 100644
> > --- a/arch/x86/platform/uv/bios_uv.c
> > +++ b/arch/x86/platform/uv/bios_uv.c
> > @@ -187,7 +187,8 @@ EXPORT_SYMBOL_GPL(uv_bios_set_legacy_vga_target);
> >  void uv_bios_init(void)
> >  {
> > uv_systab = NULL;
> > -   if ((efi.uv_systab == EFI_INVALID_TABLE_ADDR) || !efi.uv_systab) {
> > +   if ((efi.uv_systab == EFI_INVALID_TABLE_ADDR) ||
> > +   !efi.uv_systab || efi_runtime_disabled()) {
> > pr_crit("UV: UVsystab: missing\n");
> > return;
> > }
> 
> The fix looks fine, but I'm losing track of which kernels this patch
> should be applied to. Does it just need to be applied for v4.8 or
> earlier kernels too?

Well, we *have* to boot v4.6 and v4.7 with efi=old_map, which will avoid
our kexec problem entirely, so while the patch would apply just fine on
those kernels, and achieve the desired effect, we wouldn't really get
any benefit from it.

So, it definitely needs to go in for v4.8, but it's kind of a toss-up
for the older kernels.  I'll discuss this with the other guys around
here to see what they think, and get back to you a bit later, if that's
alright?

- Alex


Re: [RFC PATCH] Fix EFI callbacks on UV during kexec

2016-08-01 Thread Alex Thorlton
On Mon, Aug 01, 2016 at 02:39:26PM +0100, Matt Fleming wrote:
> On Tue, 26 Jul, at 05:38:32PM, Alex Thorlton wrote:
> > 
> > After investigating the problem here and figuring out the proper way to
> > get the noefi parameter working again, I noticed that there appears to
> > be support for EFI runtime callbacks in a kexec'd kernel now...  I
> > think we need some more cleanup here to get that all working entirely.
> > Without noefi, we hit a bad paging request when we try to do EFI 
> > callbacks:
>  
> [...]
> 
> > [0.341531] BUG: unable to handle kernel paging request at 
> > 6a1ab938
> > [0.349319] IP: [<6a1ab938>] 0x6a1ab938
> > [0.354386] PGD 354e0063 PUD 0
> > [0.357910] Oops: 0010 [#1] SMP
> > [0.361414] Modules linked in:
> > [0.364833] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > 4.7.0-runtime-check+ #713
> 
> [...]
> 
> > This is due to the fact that the efi_map_region_fixed calls in
> > kexec_enter_virtual_mode, which map in the EFI runtime memory
> > descriptors, only map the virtual address of the descriptor.
> > Unfortunately, since we're still relying on the physical address of our
> > EFI runtime code being mapped in, we don't have access to that code in
> > the kexec scenario.
> > 
> > A potential fix for this would be to map in the physical addresses of
> > the descriptors as well as the virtual addresses in
> > efi_map_region_fixed, but the more "correct" fix would be to update
> > our system table pointer to its new virtual address during
> > SetVirtualAddressMap.  We intend to get that piece fixed up relatively
> > soon, but haven't quite gotten around to it yet.
> 
> I don't think it would be so bad if we did the 1:1 mappings in the
> kexec kernel too, we've got our own page tables after all and the VA
> space is available. It would be required if people ever want to use
> kexec with mixed mode kernels too.

Hmm...   That's a good point.  It certainly would be nice for us to have
those mappings in the kexec kernel, at least for the time being.  If
you're not opposed to it, I can write up the patch.  Pretty sure it's a
one-liner.

- Alex


Re: [RFC PATCH] Fix EFI callbacks on UV during kexec

2016-08-04 Thread Alex Thorlton
On Thu, Aug 04, 2016 at 10:25:32AM +0100, Matt Fleming wrote:
> On Mon, 01 Aug, at 09:34:10AM, Alex Thorlton wrote:
> > 
> > Hmm...   That's a good point.  It certainly would be nice for us to have
> > those mappings in the kexec kernel, at least for the time being.  If
> > you're not opposed to it, I can write up the patch.  Pretty sure it's a
> > one-liner.
> 
> If it's that trivial, sure, please go ahead and submit.

Sure thing.  I played around with this a bit before I sent this patch
up.  I'm pretty sure it all worked as expected, but I'll need to double
check everything.  I'll try and get it out in the next day or two.

Thanks, Matt!

- Alex


[PATCH] Map in physical addresses in efi_map_region_fixed

2016-08-06 Thread Alex Thorlton
This is a simple change to add in the physical mappings as well as the
virtual mappings in efi_map_region_fixed.  The motivation here is to
get access to EFI runtime code that is only available via the 1:1
mappings on a kexec'd kernel.

The added call is essentially the kexec analog of the first __map_region
that Boris put in efi_map_region in commit d2f7cbe7b26a ("x86/efi:
Runtime services virtual mapping").

Signed-off-by: Alex Thorlton 
Cc: Russ Anderson 
Cc: Dimitri Sivanich 
Cc: Mike Travis 
Cc: Matt Fleming 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: linux-...@vger.kernel.org
---
 arch/x86/platform/efi/efi_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 459bcbb..b206126 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -363,6 +363,7 @@ void __init efi_map_region(efi_memory_desc_t *md)
  */
 void __init efi_map_region_fixed(efi_memory_desc_t *md)
 {
+   __map_region(md, md->phys_addr);
__map_region(md, md->virt_addr);
 }
 
-- 
1.8.5.6



[BUG] Boot hangs at clocksource_done_booting on large configs

2015-08-31 Thread Alex Thorlton
Hey guys,

We've been hitting issues with machines with 4096 or more logical cores
hanging up at clocksource_done_booting, seemingly forever.  The problem
occurs somewhat intermittently, though it hits more consistently as you
increase the core count.  We've observed this problem on kernels
starting at 3.10, all the way up through 4.2.

We initially thought there was some issue with the stop_machine code,
but Thomas G. cleared that up for us with a thorough explanation of the
stop_machine mechanism.  After ruling out issues in stop_machine, we
started investigating possible hardware related scaling issues.

I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
out any scaling issues related to multiple concurrent reads to our
RTC's MMR.  We start getting hung task timeouts during
clocksource_done_booting:

8<
 calling  clocksource_done_booting+0x0/0x42 @ 1
 INFO: task swapper/0:1 blocked for more than 480 seconds.
   Not tainted 4.2.0-rc1-nortc+ #384
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 swapper/0   D 881efa933b48 0 1  0 0x
  881efa933b48 881ef45e b5de7a13 00fdfa933b38
  881efa934000 7fff 881efa933ca0 b5de7a13
   881efa933b68 8158eae7 881efa933b68
 Call Trace:
  [] schedule+0x37/0x80
  [] schedule_timeout+0x1e7/0x270
  [] ? _find_next_bit.part.0+0x19/0x70
  [] wait_for_completion+0xa0/0x100
  [] ? wake_up_q+0x70/0x70
  [] ? cpu_stop_should_run+0x50/0x50
  [] __stop_cpus+0x50/0x70
  [] ? console_unlock+0x2dd/0x4a0
  [] ? cpu_stop_should_run+0x50/0x50
  [] stop_cpus+0x35/0x50
  [] ? tk_setup_internals.constprop.10+0x140/0x140
  [] __stop_machine+0xb2/0xe0
  [] ? tk_setup_internals.constprop.10+0x140/0x140
  [] ? tk_setup_internals.constprop.10+0x140/0x140
  [] stop_machine+0x2e/0x50
  [] timekeeping_notify+0x2d/0x50
  [] __clocksource_select+0xcb/0x120
  [] ? boot_override_clock+0x4d/0x4d
  [] clocksource_done_booting+0x32/0x42
  [] do_one_initcall+0x169/0x1d0
  [] kernel_init_freeable+0x1bb/0x248
  [] ? rest_init+0x80/0x80
  [] kernel_init+0xe/0xe0
  [] ret_from_fork+0x3f/0x70
  [] ? rest_init+0x80/0x80
>8

After this I NMId the machine to get backtraces for the other CPUs.
Most are stuck around here:

8<
 CPU: 4867 PID: 19730 Comm: migration/4867 Tainted: G  D 
4.2.0-rc1-nortc+ #384
 Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
 task: ad5e798cc500 ti: ad5e798d4000 task.ti: ad5e798d4000
 RIP: 0010:[]  [] uv_handle_nmi+0x19/0x790
 RSP: :c90020156da8  EFLAGS: 00010092
 RAX: 02fe65b95384 RBX: 002d RCX: c90020162a98
 RDX: 02fe65b95384 RSI: c90020156ef8 RDI: 0001
 RBP: c90020156e18 R08: 0001 R09: 
 R10: 0003 R11: 0020 R12: 02fe65b95384
 R13:  R14: 81a38880 R15: c90020156ef8
 FS:  () GS:c90020151000() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: da80 CR3: 01a0b000 CR4: 001406e0
 Stack:
     
     
   002d 02fe65b95384 
 Call Trace:
  
  [] ? cpu_stop_should_run+0x50/0x50
  [] nmi_handle+0x79/0x100
  [] ? nmi_handle+0x83/0x100
  [] ? cpu_stop_should_run+0x50/0x50
  [] unknown_nmi_error+0x1c/0x90
  [] default_do_nmi+0xd8/0xf0
  [] do_nmi+0x83/0xb0
  [] end_repeat_nmi+0x1e/0x2e
  [] ? cpu_stop_should_run+0x50/0x50
  [] ? read_hpet+0x16/0x20
  [] ? read_hpet+0x16/0x20
  [] ? read_hpet+0x16/0x20
  <>
  
  [] ktime_get+0x37/0xb0
  [] clockevents_program_event+0x4b/0x130
  [] tick_handle_periodic+0x57/0x70
  [] local_apic_timer_interrupt+0x3c/0x70
  [] smp_apic_timer_interrupt+0x41/0x60
  [] apic_timer_interrupt+0x6b/0x70
  
  [] ? update_blocked_averages+0x381/0x860
  [] ? multi_cpu_stop+0x98/0xf0
  [] ? __switch_to+0x26a/0x590
  [] ? cpu_stop_should_run+0x50/0x50
  [] cpu_stopper_thread+0x88/0x110
  [] ? __schedule+0x338/0x900
  [] smpboot_thread_fn+0x135/0x190
  [] ? sort_range+0x30/0x30
  [] kthread+0xdb/0x100
  [] ? kthread_create_on_node+0x180/0x180
  [] ret_from_fork+0x3f/0x70
  [] ? kthread_create_on_node+0x180/0x180
 Code: 31 c0 e8 a0 c2 52 00 e9 39 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 
48 89 e5 41 57 41 56 41 55 41 54 49 89 f7 53 48 83 ec 48 <48> 8b 1d 00 f2 fa 7e 
65 44 8b 2d 70 b8 fa 7e 9c 58 0f 1f 44 00
>8

I can provide full logs if desired.

Anyways, based on my limited understanding of what's going on here, I
tried disabling interrupts a bit further out in the cpu_stopper_thread
code, to try and avoid that apic_timer_interrupt during the clocksource
change.  change_clocksource already disables interrupts, but, from the
backtraces, it appears that we're taking a

Re: [BUG] Boot hangs at clocksource_done_booting on large configs

2015-08-31 Thread Alex Thorlton
On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote:
> I can provide full logs if desired.

Full log is here:

http://oss.sgi.com/projects/athorlton/harp50-6144-nortc

I'd use wget to download it.  Probably don't want to try to open the 15M
file in your browser :)

Also, I've filed a kernel BugZilla for the same issue, in case anyone
would rather use BugZilla:

https://bugzilla.kernel.org/show_bug.cgi?id=103821

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] Boot hangs at clocksource_done_booting on large configs

2015-08-31 Thread Alex Thorlton
On Mon, Aug 31, 2015 at 10:32:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index fd643d8..8502521 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
> >  {
> > struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
> > struct cpu_stop_work *work;
> > +   unsigned long flags;
> > int ret;
> > 
> > +   local_irq_save(flags);
> > +
> >  repeat:
> > work = NULL;
> > spin_lock_irq(&stopper->lock);
> > @@ -452,6 +455,8 @@ repeat:
> > cpu_stop_signal_done(done, true);
> > goto repeat;
> > }
> > +
> > +   local_irq_restore(flags);
> >  }
> > 
> 
> So I should probably just go sleep and not say anything.. _but_
> *confused*.
> 
> That local_irq_save() will disable IRQs over:
> 
>   work = NULL;
> 
> But that is _all_. The spin_unlock_irq() will re-enable IRQs, after
> which things will run as usual.
> 
> That local_irq_restore() is a total NOP, IRQs are guaranteed enabled at
> the irq_local_save() (otherwise lockdep would've complained about
> spin_unlock_irq() unconditionally enabling them) and by the time we get
> to the restore that same unlock_irq will have enabled them already.

Ahh, right.  Well that code is worthless then :)

Either way though, I guess that means the slight change just fudged the
timing enough in that area to avoid the lockup we're seeing.  Ignoring
my useless code change, does anything else jump out at you as
interesting, here?


Re: [BUG] Boot hangs at clocksource_done_booting on large configs

2015-09-01 Thread Alex Thorlton
On Mon, Aug 31, 2015 at 11:12:12PM +0200, Thomas Gleixner wrote:
> On Mon, 31 Aug 2015, Alex Thorlton wrote:
> > I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
> > out any scaling issues related to multiple concurrent reads to our
> > RTC's MMR.
> 
> And to rule out scaling issues you replaced the RTC MMR with HPET. Not
> a very good choice:
> 
> HPET does not scale either. It's uncached memory mapped I/O. See
> below.

Ahhh.  I wondered if there might be an issue there.  I can say that I've
seen it hit a very similar issue with some other possible solutions that
we've tried, but this was the most recent, fairly unmodified kernel that
I've played with, so I went with this.

> > I'm hoping to get some input from the experts in this area, first of
> > all, on whether the problem I'm seeing is actually what I think it is,
> > and, if so, if I've solved it in the correct way.
> 
> I fear both the analysis and the solution is wrong.
> 
> Up to the point where the actual clocksource change happens there is
> no reason why timer interrupts should not happen. And the code which
> actually changes the clocksource is definitely called with interrupts
> disabled. When that function returns the new clocksource is fully
> functional and interrupts can happen again.
> 
> Now looking at your backtraces. Most CPUs are in the migration thread
> and a few (3073,3078,3079,3082) are in the idle task.
> 
> From the trace artifacts (? read_hpet) it looks like the clock source
> change has been done and the cpus are on the way back from stop
> machine.

That's good to note.  I'll keep that in mind as I continue gathering
info.

> But they are obviously held off by something. And that something looks
> like the timekeeper sequence lock. 

Ok.  IIRC, that lock prevents reads to the clocksource until all
necessary write operations have finished, correct?  i.e. tk_core.seq?

I looked at that as being a possible place to get stuck in ktime_get:

do {
seq = read_seqcount_begin(&tk_core.seq);
base = tk->tkr_mono.base;
nsecs = timekeeping_get_ns(&tk->tkr_mono);

} while (read_seqcount_retry(&tk_core.seq, seq));

But I don't have enough evidence to point there for sure.

> Too bad, that we don't have a backtrace for CPU0 in the log.

I think that may have gotten swallowed up with all the lost printks that
you'll notice in there.  Does the hung task backtrace not cover that
though?

> I really wonder how a machine that large works with HPET as
> clocksource at all. hpet_read() is uncached memory mapped IO which
> takes thousands of CPU cycles. Last time I looked it was around
> 1us. Let's take that number to do some math.
> 
> If all CPUs do that access at the same time, then it takes NCPUS
> microseconds to complete if the memory mapped I/O scheduling is
> completely fair, which I doubt. So with 4k CPUs thats whopping 4.096ms
> and it gets worse if you go larger. That's more than a tick with
> HZ=250.
> 
> I'm quite sure that you are staring at the HPET scalability bottleneck
> and not at some actual kernel bug.

That could well be the case.

> Your patch shifts some timing around so the issue does not happen, but
> that's certainly not a solution.

I was fairly sure of that, but it's good to know that you agree.

I'll try and get a backtrace from a completely unmodified kernel here in
the next few days.  Unfortunately, as mentioned, we only hit the issue
intermittently.  I was able to reproduce pretty regularly on kernels
from 4.0 and before, but it seems to be less common on newer kernels,
though we definitely still manage to hit it from time to time.

Thanks for all the help!

- Alex


Re: [BUG] kzalloc overflow in lpfc driver on 6k core system

2014-12-08 Thread Alex Thorlton
On Sat, Dec 06, 2014 at 03:14:45PM -0500, James Smart wrote:
> Myself and several others here at Emulex maintain the code. The
> recommendations look fine.   Feel free to post something if you beat
> us to the work.

Great!  Thanks for letting me know, James.  I will probably have some
time to look at this later this week.  Please let me know if somebody
over there gets around to looking at it - no need to have two groups
chasing the same issue :)

Thanks again!

- Alex


Re: [PATCH] Fix KMALLOC_MAX_SIZE overflow during cpumask allocation

2014-12-08 Thread Alex Thorlton
On Mon, Dec 08, 2014 at 11:42:14AM +0100, Ingo Molnar wrote:
> This patch fails to build with certain configs:
> 
> kernel/sched/core.c:7130:33: error: incompatible types when assigning to type 
> ‘cpumask_var_t’ from type ‘void *’

Thanks for letting us know, Ingo.  I believe George has something in
mind to fix this.  We'll take a look and get another version out ASAP!

- Alex


Re: [BUG] Possible locking issues in stop_machine code on 6k core machine

2014-12-04 Thread Alex Thorlton
On Thu, Dec 04, 2014 at 12:34:04AM +0100, Thomas Gleixner wrote:
> first of all, could you please format your mail so it is actually
> readable and can be replied to?

My fault there - stupid mistake.

> If you would provide real data instead of half-baked assumptions
> we might actually be able to help you.
> 
> There might be some hidden issue in the stomp machine crap, but surely
> not the one you described.

I'll return with some actual data so that others can get a better
picture of what we're seeing.  I'll also see if I can come up with a
way to reproduce the issue on a smaller configuration; might not be
possible, but attempting to do so might provide some more insight into
our problem.

Thanks for your detailed description of the code path!  I definitely
understand that our assumption can't be the actual problem here.  I'll
dig back into the issue and try to get some better information.

Thanks again!

- Alex


[PATCH 2/3] mm, thp: Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

2014-02-25 Thread Alex Thorlton
This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags
for VMAs.  It also adds a prctl control which allows us to set the THP
disable bit in mm->def_flags so that VMAs will pick up the setting as
they are created.

Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 include/linux/mm.h |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c  | 11 ---
 kernel/sys.c   | 15 +++
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..0314e4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -177,6 +177,9 @@ extern unsigned int kobjsize(const void *objp);
  */
 #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
 
+/* This mask defines which mm->def_flags a process can inherit from its parent */
+#define VM_INIT_DEF_MASK   VM_NOHUGEPAGE
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..58afc04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,7 @@
 
 #define PR_GET_TID_ADDRESS 40
 
+#define PR_SET_THP_DISABLE 41
+#define PR_GET_THP_DISABLE 42
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..9fc0a30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p)
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
-   mm->flags = (current->mm) ?
-   (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
atomic_long_set(&mm->nr_ptes, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
@@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p)
mm_init_owner(mm, p);
clear_tlb_flush_pending(mm);
 
-   if (likely(!mm_alloc_pgd(mm))) {
+   if (current->mm) {
+   mm->flags = current->mm->flags & MMF_INIT_MASK;
+   mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
+   } else {
+   mm->flags = default_dump_filter;
mm->def_flags = 0;
+   }
+
+   if (likely(!mm_alloc_pgd(mm))) {
mmu_notifier_mm_init(mm);
return mm;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index c0a58be..aa8d3a0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1996,6 +1996,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, 
unsigned long, arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
return current->no_new_privs ? 1 : 0;
+   case PR_GET_THP_DISABLE:
+   if (arg2 || arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+   break;
+   case PR_SET_THP_DISABLE:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   down_write(&me->mm->mmap_sem);
+   if (arg2)
+   me->mm->def_flags |= VM_NOHUGEPAGE;
+   else
+   me->mm->def_flags &= ~VM_NOHUGEPAGE;
+   up_write(&me->mm->mmap_sem);
+   break;
default:
error = -EINVAL;
break;
-- 
1.7.12.4



[PATCH 1/3] Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

2014-02-25 Thread Alex Thorlton
This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85c, and adds
in code to fix up any issues caused by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose
of this code, and providing justification for the revert:


The intent of 8e72033f2a48 was to guard against any future programming
errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings,
which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f2a48
was to be reverted, because that check will also prevent a kernel crash
in the case described above, it will now send a SIGSEGV instead.

This would now also allow to do the madvise on other parts, if needed,
so it is a more flexible approach. One could also say that it would have
been better to do it this way right from the beginning... 


Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 arch/s390/mm/pgtable.c | 3 +++
 mm/huge_memory.c   | 4 
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3584ed9..a87cdb4 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, 
unsigned long segment,
if (!pmd_present(*pmd) &&
__pte_alloc(mm, vma, pmd, vmaddr))
return -ENOMEM;
+   /* large pmds cannot yet be handled */
+   if (pmd_large(*pmd))
+   return -EFAULT;
/* pmd now points to a valid segment table entry. */
rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
if (!rmap)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..a4310a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 unsigned long *vm_flags, int advice)
 {
-   struct mm_struct *mm = vma->vm_mm;
-
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1977,8 +1975,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 */
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
-   if (mm->def_flags & VM_NOHUGEPAGE)
-   return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
-- 
1.7.12.4



[PATCH 3/3] exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

2014-02-25 Thread Alex Thorlton
load_elf_binary() sets current->mm->def_flags = def_flags, and
def_flags is always zero.  Not only does this look strange, it is also
unnecessary, because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 fs/binfmt_elf.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..d09bd9c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -579,7 +579,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
unsigned long start_code, end_code, start_data, end_data;
unsigned long reloc_func_desc __maybe_unused = 0;
int executable_stack = EXSTACK_DEFAULT;
-   unsigned long def_flags = 0;
struct pt_regs *regs = current_pt_regs();
struct {
struct elfhdr elf_ex;
@@ -719,9 +718,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval)
goto out_free_dentry;
 
-   /* OK, This is the point of no return */
-   current->mm->def_flags = def_flags;
-
/* Do this immediately, since STACK_TOP as used in setup_arg_pages
   may depend on the personality.  */
SET_PERSONALITY(loc->elf_ex);
-- 
1.7.12.4



[PATCHv4 0/3] [RESEND] mm, thp: Add mm flag to control THP

2014-02-25 Thread Alex Thorlton
(perf stat output, truncated in this archive; only the tail survives)

 ...  194.536 M/sec                     [100.00%]
  714,066,351 branch-misses             #  0.00% of all branches

  216.196778807 seconds time elapsed

As with previous versions of the patch, we're getting about a 2x
performance increase here.  Here's a link to the test case I used, along
with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

Let me know if anybody has any further suggestions here.  Thanks!

(Sorry for the big cc list!)

Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

Alex Thorlton (3):
  Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
  Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
  exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

 arch/s390/mm/pgtable.c |  3 +++
 fs/binfmt_elf.c|  4 
 include/linux/mm.h |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c  | 11 ---
 kernel/sys.c   | 15 +++
 mm/huge_memory.c   |  4 
 7 files changed, 32 insertions(+), 11 deletions(-)

-- 
1.7.12.4



Re: [PATCHv4 0/3] [RESEND] mm, thp: Add mm flag to control THP

2014-02-25 Thread Alex Thorlton
On Tue, Feb 25, 2014 at 02:54:03PM -0600, Alex Thorlton wrote:
> (First send had too big of a cc list to make it into all the mailing lists.)

Sorry for the double-send to a couple of lists/people.  I got bumped
from linux-api for having too long of a cc list, and I figured it had
gotten bumped elsewhere as well.

- Alex


Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree

2014-02-26 Thread Alex Thorlton
On Wed, Feb 26, 2014 at 05:57:59PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> Do we want a comment here, explaining why s390 is special again?

Here's what I've got, with everybody's suggestions spun together:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..7f01491 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 unsigned long *vm_flags, int advice)
 {
-   struct mm_struct *mm = vma->vm_mm;
-
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_struct *vma,
 */
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
-   if (mm->def_flags & VM_NOHUGEPAGE)
+
+/*
+ * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
+ * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
+ */
+#ifdef CONFIG_S390
+   if (mm_has_pgste(vma->vm_mm))
return -EINVAL;
+#endif
+
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*

I've compiled and tested and it works ok on my machines (I don't have an
s390 to test on though :).  Is everybody okay with this solution?

BTW, Kirill, I looked at using IS_ENABLED to clean up the ifdef, but it
won't work here, since mm_has_pgste isn't defined unless CONFIG_S390
is defined.  i.e.: This line:

if (IS_ENABLED(CONFIG_S390) && mm_has_pgste(vma->vm_mm))

won't compile.


Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree

2014-02-26 Thread Alex Thorlton
On Wed, Feb 26, 2014 at 08:27:36PM +0100, Christian Borntraeger wrote:
> On 26/02/14 19:06, Oleg Nesterov wrote:
> > On 02/26, Alex Thorlton wrote:
> >>
> >> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> >> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> >> + */
> >> +#ifdef CONFIG_S390
> >> +   if (mm_has_pgste(vma->vm_mm))
> >> return -EINVAL;
> >> +#endif
> > 
> > The comment is not really right...
> > 
> > And personally I think that
> > 
> > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long 
> > address, unsigned long segment,
> > if (!pmd_present(*pmd) &&
> > __pte_alloc(mm, vma, pmd, vmaddr))
> > return -ENOMEM;
> > +   /* large pmds cannot yet be handled */
> > +   if (pmd_large(*pmd))
> > +   return -EFAULT;
> > 
> > change still makes sense, so that we can simply revert this s390-
> > specific hack in hugepage_madvise().
> 
> Yes, it still makes sense to cover existing THPs here.

Yes, it does.  I just snipped the chunk of the original patch that I
actually changed in my last e-mail.  I didn't intend to actually remove
the above portion in the final patch - sorry for the confusion!

> 
> 
> > I'd suggest the patch below on top of your changes, but I won't argue.

I like this approach, and your updated comment as well.  This should
probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
with this new patch included in the series, or will Andrew pull this in?

> > 
> > It would be nice to also change thp_split_mm() to not play with
> > mm->def_flags, but I am not sure if we can do this.
> > 
> > Oleg.
> > ---
> > 
> > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
> > 
> > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
> > 
> > Reported-by: Christian Borntraeger 
> > Suggested-by: Christian Borntraeger 
> > Signed-off-by: Oleg Nesterov 
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a4310a5..0e08d92 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >  {
> > switch (advice) {
> > case MADV_HUGEPAGE:
> > +#ifdef CONFIG_S390
> > +   /*
> > +* MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> > +* blindly does madvise(MADV_HUGEPAGE) for all kvm pages
> > +* and expects it must fail on s390. Avoid a possible SIGSEGV
> > +* until qemu is changed.
> 
> I prefer:
>* until kvm/s390 can handle large pages in the host.
> 
> Otherwise qemu has to be changed again, if we get THP working for kvm.
> 
> > +*/
> > +   if (mm_has_pgste(vma->vm_mm))
> > +   return -EINVAL;
> > +#endif
> > /*
> >  * Be somewhat over-protective like KSM for now!
> >  */
> > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > return -EINVAL;
> > +
> 
> Unrelated white space?
> > *vm_flags &= ~VM_NOHUGEPAGE;
> > *vm_flags |= VM_HUGEPAGE;
> > /*
> > 
> 
> 
> 
> With the comment and white space change:
> 
> Tested-by: Christian Borntraeger 
> Reviewed-by: Christian Borntraeger 
> 
> Thanks for the quick patch

Yes, thank you all for the quick responses, and for looking over these
patches!


Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-04 Thread Alex Thorlton
On Mon, Nov 04, 2013 at 02:58:28PM +, Mel Gorman wrote:
> On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > Hi guys,
> > 
> > I ran into a bug a week or so ago, that I believe has something to do
> > with NUMA balancing, but I'm having a tough time tracking down exactly
> > what is causing it.  When running with the following configuration
> > options set:
> > 
> 
> Can you test with patches
> cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> applied? They fix some known memory corruption problems, were merged for
> 3.12 (so alternatively just test 3.12) and have been tagged for -stable.

I just finished testing with 3.12, and I'm still seeing the same issue.
This is actually a bit strange to me, because, when I tested with
3.12-rc5 a while back, everything seemed to be ok (see previous e-mail
in this thread, to Bob Liu).  I guess, embarrassingly enough, I must have
been playing with a screwed up config that day, and managed to somehow
avoid the problem...  Either way, it appears that we still have a
problem here.

I'll poke around a bit more on this in the next few days and see if I
can come up with any more information.  In the meantime, let me know if
you have any other suggestions.

Thanks,

- Alex

> 
> Thanks.
> 
> -- 
> Mel Gorman
> SUSE Labs


Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree

2014-02-26 Thread Alex Thorlton
On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton  wrote:
> 
> > > 
> > > 
> > > > I'd suggest the patch below on top of your changes, but I won't argue.
> > 
> > I like this approach, and your updated comment as well.  This should
> > probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
> > with this new patch included in the series, or will Andrew pull this in?
> 
> Paolo's comments on Oleg's patch remain unaddressed, so please take a
> look at that and send out the final version?

Got it.  I'll get that to you tonight/tomorrow morning.

> An incremental patch would be preferred, please.  I'll later fold that
> into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to
> avoid any bisection holes.

Understood.  I'll keep them separate.

Thanks, Andrew!

- Alex


[PATCH 1/4] mm, s390: Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

2014-02-27 Thread Alex Thorlton
This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85c, and adds
in code to fix up any issues caused by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose
of this code, and providing justification for the revert:


The intent of 8e72033f2a48 was to guard against any future programming
errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings,
which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f2a48
was to be reverted, because that check will also prevent a kernel crash
in the case described above, it will now send a SIGSEGV instead.

This would now also allow to do the madvise on other parts, if needed,
so it is a more flexible approach. One could also say that it would have
been better to do it this way right from the beginning... 


Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 arch/s390/mm/pgtable.c | 3 +++
 mm/huge_memory.c   | 4 
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3584ed9..a87cdb4 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
if (!pmd_present(*pmd) &&
__pte_alloc(mm, vma, pmd, vmaddr))
return -ENOMEM;
+   /* large pmds cannot yet be handled */
+   if (pmd_large(*pmd))
+   return -EFAULT;
/* pmd now points to a valid segment table entry. */
rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
if (!rmap)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..a4310a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 unsigned long *vm_flags, int advice)
 {
-   struct mm_struct *mm = vma->vm_mm;
-
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1977,8 +1975,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 */
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
-   if (mm->def_flags & VM_NOHUGEPAGE)
-   return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
-- 
1.7.12.4



[PATCH 2/4] mm, s390: Ignore MADV_HUGEPAGE on s390 to prevent SIGSEGV in qemu

2014-02-27 Thread Alex Thorlton
As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
check for mm->def_flags"' breaks qemu: qemu does QEMU_MADV_HUGEPAGE for
all kvm pages, but this doesn't work after s390_enable_sie/thp_split_mm.

Paolo suggested that instead of failing on the call to madvise, we
simply ignore the call (return 0).

Reported-by: Christian Borntraeger 
Suggested-by: Paolo Bonzini 
Suggested-by: Oleg Nesterov 
Signed-off-by: Alex Thorlton 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 mm/huge_memory.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4310a5..61d234d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1970,6 +1970,15 @@ int hugepage_madvise(struct vm_area_struct *vma,
 {
switch (advice) {
case MADV_HUGEPAGE:
+#ifdef CONFIG_S390
+   /*
+* qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+* can't handle this properly after s390_enable_sie, so we simply
+* ignore the madvise to prevent qemu from causing a SIGSEGV.
+*/
+   if (mm_has_pgste(vma->vm_mm))
+   return 0;
+#endif
/*
 * Be somewhat over-protective like KSM for now!
 */
-- 
1.7.12.4



Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree

2014-02-27 Thread Alex Thorlton
On Wed, Feb 26, 2014 at 06:01:53PM -0600, Alex Thorlton wrote:
> On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote:
> > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton  wrote:
> > 
> > > > 
> > > > 
> > > > > I'd suggest the patch below on top of your changes, but I won't argue.
> > > 
> > > I like this approach, and your updated comment as well.  This should
> > > probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
> > > with this new patch included in the series, or will Andrew pull this in?
> > 
> > Paolo's comments on Oleg's patch remain unaddressed, so please take a
> > look at that and send out the final version?
> 
> Got it.  I'll get that to you tonight/tomorrow morning.

Just kicked out another version of the patch that should cover all
comments/suggestions.  Let me know if you need anything else from me!

- Alex


[PATCHv5 0/4] mm, thp: Add mm flag to control THP

2014-02-27 Thread Alex Thorlton
 Here's a link to the test case I used, along
with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

Let me know if anybody has any further suggestions here.  Thanks!

Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

Alex Thorlton (4):
  Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
  Ignore MADV_HUGEPAGE on s390 to prevent SIGSEGV in qemu
  Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
  exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

 arch/s390/mm/pgtable.c |  3 +++
 fs/binfmt_elf.c|  4 
 include/linux/mm.h |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c  | 11 ---
 kernel/sys.c   | 15 +++
 mm/huge_memory.c   | 13 +
 7 files changed, 41 insertions(+), 11 deletions(-)

-- 
1.7.12.4



[PATCH 3/4] mm, thp: Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE

2014-02-27 Thread Alex Thorlton
This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags
for VMAs.  It also adds a prctl control which allows us to set the THP
disable bit in mm->def_flags so that VMAs will pick up the setting as
they are created.

Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 include/linux/mm.h |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c  | 11 ---
 kernel/sys.c   | 15 +++
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..0314e4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -177,6 +177,9 @@ extern unsigned int kobjsize(const void *objp);
  */
 #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
 
+/* This mask defines which mm->def_flags a process can inherit from its parent */
+#define VM_INIT_DEF_MASK   VM_NOHUGEPAGE
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..58afc04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,7 @@
 
 #define PR_GET_TID_ADDRESS 40
 
+#define PR_SET_THP_DISABLE 41
+#define PR_GET_THP_DISABLE 42
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..9fc0a30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
-   mm->flags = (current->mm) ?
-   (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
atomic_long_set(&mm->nr_ptes, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
@@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_owner(mm, p);
clear_tlb_flush_pending(mm);
 
-   if (likely(!mm_alloc_pgd(mm))) {
+   if (current->mm) {
+   mm->flags = current->mm->flags & MMF_INIT_MASK;
+   mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
+   } else {
+   mm->flags = default_dump_filter;
mm->def_flags = 0;
+   }
+
+   if (likely(!mm_alloc_pgd(mm))) {
mmu_notifier_mm_init(mm);
return mm;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index c0a58be..aa8d3a0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1996,6 +1996,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
return current->no_new_privs ? 1 : 0;
+   case PR_GET_THP_DISABLE:
+   if (arg2 || arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+   break;
+   case PR_SET_THP_DISABLE:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   down_write(&me->mm->mmap_sem);
+   if (arg2)
+   me->mm->def_flags |= VM_NOHUGEPAGE;
+   else
+   me->mm->def_flags &= ~VM_NOHUGEPAGE;
+   up_write(&me->mm->mmap_sem);
+   break;
default:
error = -EINVAL;
break;
-- 
1.7.12.4



[PATCH 4/4] exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

2014-02-27 Thread Alex Thorlton
load_elf_binary() sets current->mm->def_flags = def_flags, where
def_flags is always zero.  Not only does this look strange, it is
unnecessary, because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton 
Suggested-by: Oleg Nesterov 
Cc: Gerald Schaefer 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Christian Borntraeger 
Cc: Andrew Morton 
Cc: Paolo Bonzini 
Cc: "Kirill A. Shutemov" 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexander Viro 
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

---
 fs/binfmt_elf.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..d09bd9c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -579,7 +579,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
unsigned long start_code, end_code, start_data, end_data;
unsigned long reloc_func_desc __maybe_unused = 0;
int executable_stack = EXSTACK_DEFAULT;
-   unsigned long def_flags = 0;
struct pt_regs *regs = current_pt_regs();
struct {
struct elfhdr elf_ex;
@@ -719,9 +718,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval)
goto out_free_dentry;
 
-   /* OK, This is the point of no return */
-   current->mm->def_flags = def_flags;
-
/* Do this immediately, since STACK_TOP as used in setup_arg_pages
   may depend on the personality.  */
SET_PERSONALITY(loc->elf_ex);
-- 
1.7.12.4



Re: [BUG] Linux 3.14 fails to boot with new EFI changes

2014-02-11 Thread Alex Thorlton
On Thu, Feb 06, 2014 at 12:15:40AM +0100, Borislav Petkov wrote:
> On Wed, Feb 05, 2014 at 03:45:36PM -0600, Alex Thorlton wrote:
> > While working on an answer to this question, I ran across another issue
> > on some newer hardware, that looks like it's definitely related to this
> > problem, and might be the root cause.
> > 
> > When booting on a UV2 we die in efi_enter_virtual_mode:
> > 
> > BUG: unable to handle kernel paging request at 008f7e848020
> > IP: [<7dadb6a9>] 0x7dadb6a8
> > PGD 0
> 
> That looks very much like this other issue we're debugging right now:

Have there been any developments on this since last week, Boris?  Just
trying to make sure that we stay in the loop on this issue.

Let me know if there's anything else we can do from our end to help
expedite the process.  I'm always available to test new ideas.

Thanks!

- Alex


[BUG] mm: thp: hugepage_vma_check has a blind spot

2014-01-21 Thread Alex Thorlton
hugepage_vma_check is called during khugepaged_scan_mm_slot to ensure
that khugepaged doesn't try to allocate THPs in vmas where they are
disallowed, either due to THPs being disabled system-wide, or through
MADV_NOHUGEPAGE.

The logic that hugepage_vma_check uses doesn't seem to cover all cases,
in my opinion.  Looking at the original code:

   if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
   (vma->vm_flags & VM_NOHUGEPAGE))

We can see that it's possible to have THP disabled system-wide, but still
receive THPs in this vma.  It seems that it's assumed that just because
khugepaged_always == false, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG must be
set, which is not the case.  We could have VM_HUGEPAGE set, but have THP
set to "never" system-wide, in which case, the condition presented in the
if will evaluate to false, and (provided the other checks pass) we can
end up giving out a THP even though the behavior is set to "never."

While we do properly check these flags in khugepaged_has_work, it looks
like it's possible to sleep after we check khugepaged_has_work, but
before hugepage_vma_check, during which time, hugepages could have been
disabled system-wide, in which case, we could hand out THPs when we
shouldn't be.

This small fix makes hugepage_vma_check work more like
transparent_hugepage_enabled, checking if THPs are set to "always"
system-wide, then checking if THPs are set to "madvise," as well as
making sure that VM_HUGEPAGE is set for this vma.

Signed-off-by: Alex Thorlton 
Reported-by: Alex Thorlton 
Cc: Andrew Morton  
Cc: Mel Gorman  
Cc: Rik van Riel  
Cc: Ingo Molnar  
Cc: Peter Zijlstra  
Cc: linux...@kvack.org 
Cc: linux-kernel@vger.kernel.org 

---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 95d1acb..f62fba9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2394,7 +2394,8 @@ static struct page
 
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
-   if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
+   if ((!khugepaged_always() ||
+(!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_req_madv())) ||
(vma->vm_flags & VM_NOHUGEPAGE))
return false;
 
-- 
1.7.12.4



Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-22 Thread Alex Thorlton
On Wed, Jan 22, 2014 at 10:26:21AM +, Mel Gorman wrote:
> On Tue, Jan 14, 2014 at 01:38:01PM -0600, Alex Thorlton wrote:
> > On Tue, Jan 14, 2014 at 03:44:57PM +, Mel Gorman wrote:
> > > On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote:
> > > > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> > > > > We already have the information to determine if a page is shared 
> > > > > across
> > > > > nodes, Mel even had some prototype code to do splits under those
> > > > > conditions.
> > > > 
> > > > I'm aware that we can determine if pages are shared across nodes, but I
> > > > thought that Mel's code to split pages under these conditions had some
> > > > performance issues.  I know I've seen the code that Mel wrote to do
> > > > this, but I can't seem to dig it up right now.  Could you point me to
> > > > it?
> > > > 
> > > 
> > > It was a lot of revisions ago! The git branches no longer exist but the
> > > diff from the monolithic patches is below. The baseline was v3.10 and
> > > this will no longer apply but you'll see the two places where I added a
> > > split_huge_page and prevented khugepaged collapsing them again. 
> > 
> > Thanks, Mel.  I remember seeing this code a while back when we were
> > discussing THP/locality issues.
> > 
> > > At the time, the performance with it applied was much worse but it was a
> > > 10 minute patch as a distraction. There was a range of basic problems that
> > > had to be tackled before there was any point looking at splitting THP due
> > > to locality. I did not pursue it further and have not revisited it since.
> > 
> > So, in your opinion, is this something we should look into further
> > before moving towards the per-mm switch that I propose here? 
> 
> No because they have different purposes. Any potential split of THP from
> automatic NUMA balancing context is due to it detecting that threads running
> on CPUs on different nodes are accessing a THP. You are proposing to have
> a per-mm flag that prevents THP being allocated in the first place. They
> are two separate problems with decisions that are made at completely
> different times.

Makes sense.  That discussion doesn't really fit here.

> > I
> > personally think that it will be tough to get this to perform as well as
> > a method that totally disables THP when requested, or a method that
> > tries to prevent THPs from being handed out in certain situations, since
> > we'll be doing the work of both making and splitting a THP in the case
> > where remote accesses are made to the page.
> > 
> 
> I would expect that the alternative solution to a per-mm switch is to
> reserve the naturally aligned pages for a THP promotion. Have a threshold
> of pages pages that must be faulted before the full THP's worth of pages
> is allocated, zero'd and a huge pmd established. That would defer the
> THP setup costs until it was detected that it was necessary.

I have some half-finished patches that I was working on a month or so
ago, to do exactly this (I think you were involved with some of the
discussion, maybe?  I'd have to dig up the e-mails to be sure).  After
cycling through numerous other methods of handling this problem, I still
like that idea, but I think it's going to require a decent amount of
effort to get finished.  

> The per-mm THP switch is a massive hammer but not necessarily a bad one.

I agree that it's a big hammer, but I suppose it's a bit more of a
surgical strike than using the system-wide switch, and can be especially
useful in some cases I've discussed, where madvise isn't really an
option.

> > I also think there could be some issues with over-zealously splitting
> > pages, since it sounds like we can only determine if an access is from a
> > remote node.  We don't have a good way of determining how many accesses
> > are remote vs. local, or how many separate nodes are accessing a page.
> > 
> 
> Indeed not, but it's a different problem. We also do not know if the
> remote accesses are to a single page in which case splitting it would
> have zero benefit anyway.

Agreed.  There are several possible problems with that approach, but, as
you've stated, that's a different problem, so no reason to discuss
here :)

Thanks for the input, Mel!

- Alex


Re: [BUG] mm: thp: hugepage_vma_check has a blind spot

2014-01-22 Thread Alex Thorlton
On Tue, Jan 21, 2014 at 03:24:08PM -0800, David Rientjes wrote:
> On Tue, 21 Jan 2014, Alex Thorlton wrote:
> 
> > hugepage_vma_check is called during khugepaged_scan_mm_slot to ensure
> > that khugepaged doesn't try to allocate THPs in vmas where they are
> > disallowed, either due to THPs being disabled system-wide, or through
> > MADV_NOHUGEPAGE.
> > 
> > The logic that hugepage_vma_check uses doesn't seem to cover all cases,
> > in my opinion.  Looking at the original code:
> > 
> >if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
> >(vma->vm_flags & VM_NOHUGEPAGE))
> > 
> > We can see that it's possible to have THP disabled system-wide, but still
> > receive THPs in this vma.  It seems that it's assumed that just because
> > khugepaged_always == false, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG must be
> > set, which is not the case.  We could have VM_HUGEPAGE set, but have THP
> > set to "never" system-wide, in which case, the condition presented in the
> > if will evaluate to false, and (provided the other checks pass) we can
> > end up giving out a THP even though the behavior is set to "never."
> > 
> 
> You should be able to add a
> 
>   BUG_ON(current != khugepaged_thread);
> 
> here since khugepaged is supposed to be the only caller to the function.
> 
> > While we do properly check these flags in khugepaged_has_work, it looks
> > like it's possible to sleep after we check khugepaged_has_work, but
> > before hugepage_vma_check, during which time, hugepages could have been
> > disabled system-wide, in which case, we could hand out THPs when we
> > shouldn't be.
> > 
> 
> You're talking about when thp is set to "never" and before khugepaged has 
> stopped, correct?

Yes, that's correct.

> That doesn't seem like a bug to me or anything that needs to be fixed, the 
> sysfs knob could be switched even after hugepage_vma_check() is called and 
> before a hugepage is actually collapsed so you have the same race.
> 
> The only thing that's guaranteed is that, upon writing "never" to 
> /sys/kernel/mm/transparent_hugepage/enabled, no more thp memory will be 
> collapsed after khugepaged has stopped.

That makes sense, I wasn't aware that that's the expected behavior here.
I suppose this isn't something that needs to be changed, in that case.
I needed the logic broken out a bit more explicitly (madvise/never case
need to be handled separately) for a patch that I'm working on - that's
when this caught my attention.  Good to know that a change to the
system-wide switch shouldn't affect khugepaged if it's already running.
I would've screwed up that behavior with my patch :)

Thanks, David!

- Alex


Re: [PATCH 0/2] mm->def_flags cleanups (Was: Change khugepaged to respect MMF_THP_DISABLE flag)

2014-01-22 Thread Alex Thorlton
On Wed, Jan 22, 2014 at 06:45:53PM +0100, Oleg Nesterov wrote:
> Alex, Andrew, I think this simple series makes sense in any case,
> but _perhaps_ it can also help THP_DISABLE.
> 
> On 01/20, Alex Thorlton wrote:
> >
> > On Mon, Jan 20, 2014 at 09:15:25PM +0100, Oleg Nesterov wrote:
> > >
> > > Although I got lost a bit, and probably misunderstood... but it
> > > seems to me that whatever you do this patch should not touch
> > > khugepaged_scan_mm_slot.
> >
> > Maybe I've gotten myself confused as well :)  After looking through the
> > code some more, my understanding is that khugepaged_test_exit is used to
> > make sure that __khugepaged_exit isn't running from underneath at certain
> > times, so to have khugepaged_test_exit return true when __khugepaged_exit
> > is not necessarily running, seems incorrect to me.
> 
> Still can't understand... probably I need to see v3.

I think the appropriate place to check this is actually in
hugepage_vma_check, so you're correct that we don't need to directly
touch khugepaged_scan_mm_slot, we just need to change a different
function underneath it.

We'll table that for now, though.  I'll put out a v3 later today, so you
can see what I'm talking about, but I think your idea looks like it will
be better in the end.

> But you know, I have another idea. Not sure you will like it, and probably
> I missed something.
> 
> Can't we simply add VM_NOHUGEPAGE into ->def_flags? See the (untested)
> patch below, on top of this series.
> 
> What do you think?

At a glance, without testing, it looks like a good idea to me.  By
using def_flags, we leverage functionality that's already in place to
achieve the same result.  We don't need to add any new checks into the
fault path or into khugepaged, since we're just leveraging the
VM_HUGEPAGE/VM_NOHUGEPAGE flags, which we already check for.  We also get
the behavior that you suggested (madvise is still respected, even with
the new THP disable prctl set) for free with this method.

I like the idea, but I think that it should probably be a separate
change from the other few cleanups that you proposed along with it, since
they're somewhat unrelated to this particular issue.  Do you agree?

Also, one small note on the code:

> diff --git a/kernel/sys.c b/kernel/sys.c
> index ac1842e..eb8b0fc 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2029,6 +2029,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> arg2, unsigned long, arg3,
>   if (arg2 || arg3 || arg4 || arg5)
>   return -EINVAL;
>   return current->no_new_privs ? 1 : 0;
> + case PR_SET_THP_DISABLE:
> + case PR_GET_THP_DISABLE:
> + down_write(&me->mm->mmap_sem);
> + if (option == PR_SET_THP_DISABLE) {
> + if (arg2)
> + me->mm->def_flags |= VM_NOHUGEPAGE;
> + else
> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> + } else {
> + error = !!(me->mm->flags && VM_NOHUGEPAGE);

Should be:

error = !!(me->mm->def_flags && VM_NOHUGEPAGE);

Correct?

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] mm->def_flags cleanups (Was: Change khugepaged to respect MMF_THP_DISABLE flag)

2014-01-22 Thread Alex Thorlton
On Wed, Jan 22, 2014 at 08:43:27PM +0100, Oleg Nesterov wrote:
> On 01/22, Oleg Nesterov wrote:
> >
> > On 01/22, Alex Thorlton wrote:
> > > > +   case PR_SET_THP_DISABLE:
> > > > +   case PR_GET_THP_DISABLE:
> > > > +   down_write(&me->mm->mmap_sem);
> > > > +   if (option == PR_SET_THP_DISABLE) {
> > > > +   if (arg2)
> > > > +   me->mm->def_flags |= VM_NOHUGEPAGE;
> > > > +   else
> > > > +   me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > > +   } else {
> > > > +   error = !!(me->mm->flags && VM_NOHUGEPAGE);
> > >
> > > Should be:
> > >
> > > error = !!(me->mm->def_flags && VM_NOHUGEPAGE);
> >
> > No, we need to return 1 if this bit is set ;)
> 
> Damn, you are right of course, we need "&". I didn't notice "&&"
> in the patch I sent and misunderstood your "&&" above ;) Sorry.

Actually, I didn't catch that either!  Looking at it, though, we
definitely do want bitwise AND here, not logical.

However, what I was originally referring to is:  Shouldn't we be
checking mm->***def_flags*** for the VM_NOHUGEPAGE bit, as opposed
to mm->flags?  i.e. I think we want this:

error = !!(me->mm->def_flags & VM_NOHUGEPAGE);

As opposed to:

error = !!(me->mm->flags && VM_NOHUGEPAGE);

The way I understand it, the VM_NOHUGEPAGE bit is defined for
mm->vma->flags, but not for mm->flags.  Am I wrong here?

- Alex


[BUG] Linux 3.14 fails to boot with new EFI changes

2014-01-23 Thread Alex Thorlton
We've been hitting the following bug in the latest kernel, during boot:

kernel BUG at arch/x86/mm/init_64.c:351!
invalid opcode:  [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-medusa-04156-g90804ed-dirty 
#750
Hardware name: Intel Corp. Stoutland Platform, BIOS 2.0 UEFI2.10 PI1.0 X64 
2013-09-16
task: 88107c96c010 ti: 88107c96e000 task.ti: 88107c96e000
RIP: 0010:[]  [] 
__init_extra_mapping+0x111/0x143
RSP: :88107c96fd18  EFLAGS: 00010206
RAX: 0e00 RBX: 880001c4a018 RCX: 0002
RDX: 88107fcd1e00 RSI: 0400 RDI: f800
RBP: 88107c96fd48 R08:  R09: 
R10: 007e3b3c R11:  R12: f800
R13: 0400 R14: 8800f800 R15: 8000
FS:  () GS:88007320() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 8827fefff000 CR3: 017d7000 CR4: 06f0
Stack:
 81fb  0040 b018
 b010 b008 88107c96fd58 818aa71d
 88107c96fe28 818a5e20 000cb748 0282
Call Trace:
 [] init_extra_mapping_uc+0x13/0x15
 [] uv_system_init+0x22b/0x124b
 [] ? clockevents_register_device+0x138/0x13d
 [] ? setup_APIC_timer+0xc5/0xc7
 [] ? clockevent_delta2ns+0xb/0xd
 [] ? setup_boot_APIC_clock+0x4a8/0x4b7
 [] ? printk+0x72/0x74
 [] native_smp_prepare_cpus+0x389/0x3d6
 [] kernel_init_freeable+0xb7/0x1fb
 [] ? rest_init+0x74/0x74
 [] kernel_init+0x9/0xff
 [] ret_from_fork+0x7c/0xb0
 [] ? rest_init+0x74/0x74
Code: ff ff ff 3f 00 00 48 23 13 48 b8 00 00 00 00 00 88 ff ff 48 01 c2 4c 89 
e0 48 c1 e8 12 25 f8 0f 00 00 48 01 c2 48 83 3a 00 74 04 <0f> 0b eb fe 48 8b 45 
d0 49 81 ed 00 00 20 00 4c 09 e0 49 81 c4
RIP  [] __init_extra_mapping+0x111/0x143
 RSP 

I've bisected the issue down to this commit:

commit d2f7cbe7b26a74dbbbf8f325b2a6fd01bc34032c
Author: Borislav Petkov
Date:   Thu Oct 31 17:25:08 2013 +0100

    x86/efi: Runtime services virtual mapping

    We map the EFI regions needed for runtime services non-contiguously,
    with preserved alignment on virtual addresses starting from -4G down
    for a total max space of 64G. This way, we provide for stable runtime
    services addresses across kernels so that a kexec'd kernel can still use

[RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-10 Thread Alex Thorlton
This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
hugepages using prctl.  It is based on my original patch to add a
per-task_struct flag to disable THP:

v1 - https://lkml.org/lkml/2013/8/2/671
v2 - https://lkml.org/lkml/2013/8/2/703

After looking at alternate methods of modifying how THPs are handed out,
it sounds like people might be more in favor of this type of approach,
so I'm re-introducing the patch.

It seemed that everyone was in favor of moving this control over to the
mm_struct, if it is to be implemented.  That's the only major change
here, aside from the added ability to both set and clear the flag from
prctl.

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified and using a malloc hook with
madvise is not an option (i.e. statically allocated data).  This patch
allows us to do just that, without affecting other jobs running on the
system.

Here are some results showing the improvement that my test case gets
when the MMF_THP_DISABLE flag is clear vs. set:

MMF_THP_DISABLE clear:

# perf stat -a -r 3 ./prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g

 Performance counter stats for './prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):

   267694862.049279 task-clock                  #  641.100 CPUs utilized            ( +-  0.23% ) [100.00%]
            908,846 context-switches            #    0.000 M/sec                    ( +-  0.23% ) [100.00%]
                874 CPU-migrations              #    0.000 M/sec                    ( +-  4.01% ) [100.00%]
            131,966 page-faults                 #    0.000 M/sec                    ( +-  2.75% )
 351,127,909,744,906 cycles                     #    1.312 GHz                      ( +-  0.27% ) [100.00%]
 523,537,415,562,692 stalled-cycles-frontend    #  149.10% frontend cycles idle     ( +-  0.26% ) [100.00%]
 392,400,753,609,156 stalled-cycles-backend     #  111.75% backend  cycles idle     ( +-  0.29% ) [100.00%]
 147,467,956,557,895 instructions               #    0.42  insns per cycle
                                                #    3.55  stalled cycles per insn  ( +-  0.09% ) [100.00%]
  26,922,737,309,021 branches                   #  100.572 M/sec                    ( +-  0.24% ) [100.00%]
       1,308,714,545 branch-misses              #    0.00% of all branches          ( +-  0.18% )

       417.555688399 seconds time elapsed                                           ( +-  0.23% )


MMF_THP_DISABLE set:

# perf stat -a -r 3 ./prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g

 Performance counter stats for './prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):

   141674994.160138 task-clock                  #  642.107 CPUs utilized            ( +-  0.23% ) [100.00%]
          1,190,415 context-switches            #    0.000 M/sec                    ( +- 42.87% ) [100.00%]
                688 CPU-migrations              #    0.000 M/sec                    ( +-  2.47% ) [100.00%]
         62,394,646 page-faults                 #    0.000 M/sec                    ( +-  0.00% )
 156,748,225,096,919 cycles                     #    1.106 GHz                      ( +-  0.20% ) [100.00%]
 211,440,354,290,433 stalled-cycles-frontend    #  134.89% frontend cycles idle     ( +-  0.40% ) [100.00%]
 114,304,536,881,102 stalled-cycles-backend     #   72.92% backend  cycles idle     ( +-  0.88% ) [100.00%]
 179,939,084,230,732 instructions               #    1.15  insns per cycle
                                                #    1.18  stalled cycles per insn  ( +-  0.26% ) [100.00%]
  26,659,099,949,509 branches                   #  188.171 M/sec                    ( +-  0.72% ) [100.00%]
         762,772,361 branch-misses              #    0.00% of all branches          ( +-  0.97% )

       220.640905073 seconds time elapsed                                           ( +-  0.23% )

As you can see, this particular test gets about a 2x performance boost
when THP is turned off.  Here's a link to the test, along with the
wrapper that I used:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctl.tar.gz

There are still a few things that might need to be tweaked here, but I wanted
to get the patch out there to get a discussion started.  Two things I
noted from the old patch discussion that will likely need to be
addressed are:

* Patch doesn't currently account for get_user_pages or khugepaged
  allocations.  Easy enough to fix, but it's not there yet.
* Current behavior is to have fork()/exec()'d processes inherit the
  flag.  Andrew Morton pointed out some possible issues with this, so we
  may need to rethink this behavior.
  - If parent process has THP disabled, and forks off a child, the child
will also have THP disabled.  This may not be the desired behavior.

Signed-off-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin He

Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-10 Thread Alex Thorlton
On Fri, Jan 10, 2014 at 10:23:10PM +0200, Kirill A. Shutemov wrote:
> Do you know what cause the difference? I prefer to fix THP instead of
> adding new knob to disable it.

The issue is that when you touch 1 byte of an untouched, contiguous 2MB
chunk, a THP will be handed out, and the THP will be stuck on whatever
node the chunk was originally referenced from.  If many remote nodes
need to do work on that same chunk, they'll be making remote accesses.
With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.  I give a bit better description here:

https://lkml.org/lkml/2013/8/27/397

I had been looking into better ways to handle this issue, but after
spinning through a few other ideas:

- Per cpuset flag to control THP:
https://lkml.org/lkml/2013/6/10/331

- Threshold to determine when to hand out THPs:
https://lkml.org/lkml/2013/12/12/394

We've arrived back here.  Andrea seemed to think that this is an
acceptable approach to solve the problem, at least as a starting point:

https://lkml.org/lkml/2013/12/17/397

I agree that we should, ideally, come up with a way to appropriately
handle this problem in the kernel, but as of right now, it appears that
that might be a rather large undertaking.

- Alex


Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-10 Thread Alex Thorlton
On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> We already have the information to determine if a page is shared across
> nodes, Mel even had some prototype code to do splits under those
> conditions.

I'm aware that we can determine if pages are shared across nodes, but I
thought that Mel's code to split pages under these conditions had some
performance issues.  I know I've seen the code that Mel wrote to do
this, but I can't seem to dig it up right now.  Could you point me to
it?

- Alex


Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-11 Thread Alex Thorlton
On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote:
> On 01/10, Alex Thorlton wrote:
> >
> > This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
> > hugepages using prctl.  It is based on my original patch to add a
> > per-task_struct flag to disable THP:
> 
> I leave the "whether we need this feature" to other reviewers, although
> personally I think it probably makes sense anyway.
> 
> But the patch doesn't look nice imho.
> 
> > @@ -373,7 +373,15 @@ extern int get_dumpable(struct mm_struct *mm);
> >  #define MMF_HAS_UPROBES19  /* has uprobes */
> >  #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define MMF_THP_DISABLE21  /* disable THP for this mm */
> > +#define MMF_THP_DISABLE_MASK   (1 << MMF_THP_DISABLE)
> > +
> > +#define MMF_INIT_MASK  (MMF_DUMPABLE_MASK | 
> > MMF_DUMP_FILTER_MASK | MMF_THP_DISABLE_MASK)
> > +#else
> >  #define MMF_INIT_MASK  (MMF_DUMPABLE_MASK | 
> > MMF_DUMP_FILTER_MASK)
> > +#endif
> 
> It would be nice to lessen the number of ifdef's. Why we can't define
> MMF_THP_DISABLE unconditionally and include it into MMF_INIT_MASK?
> Or define it == 0 if !CONFIG_THP. But this is minor.

That's a good idea.  I guess I was thinking that we don't want to define
any THP specific stuff when THP isn't configured, but I guess it doesn't
make much of a difference since the flag will never be set if THP isn't
configured.

> > +#define PR_SET_THP_DISABLE 41
> > +#define PR_CLEAR_THP_DISABLE   42
> > +#define PR_GET_THP_DISABLE 43
> 
> Why we can't add 2 PR_'s, set and get?

See response below.

> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -818,6 +818,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> >  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> > mm->pmd_huge_pte = NULL;
> >  #endif
> > +
> > if (!mm_init(mm, tsk))
> > goto fail_nomem;
> 
> Why? looks like the accidental change.

Ah, yes.  Didn't catch that when I looked over the patch.  I'll fix
that.

> 
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1835,6 +1835,42 @@ static int prctl_get_tid_address(struct task_struct 
> > *me, int __user **tid_addr)
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static int prctl_set_thp_disable(struct task_struct *me)
> > +{
> > +   set_bit(MMF_THP_DISABLE, &me->mm->flags);
> > +   return 0;
> > +}
> > +
> > +static int prctl_clear_thp_disable(struct task_struct *me)
> > +{
> > +   clear_bit(MMF_THP_DISABLE, &me->mm->flags);
> > +   return 0;
> > +}
> > +
> > +static int prctl_get_thp_disable(struct task_struct *me,
> > + int __user *thp_disabled)
> > +{
> > +   return put_user(test_bit(MMF_THP_DISABLE, &me->mm->flags), 
> > thp_disabled);
> > +}
> > +#else
> > +static int prctl_set_thp_disable(struct task_struct *me)
> > +{
> > +   return -EINVAL;
> > +}
> > +
> > +static int prctl_clear_thp_disable(struct task_struct *me)
> > +{
> > +   return -EINVAL;
> > +}
> > +
> > +static int prctl_get_thp_disable(struct task_struct *me,
> > + int __user *thp_disabled)
> > +{
> > +   return -EINVAL;
> > +}
> > +#endif
> > +
> >  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, 
> > arg3,
> > unsigned long, arg4, unsigned long, arg5)
> >  {
> > @@ -1998,6 +2034,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> > arg2, unsigned long, arg3,
> > if (arg2 || arg3 || arg4 || arg5)
> > return -EINVAL;
> > return current->no_new_privs ? 1 : 0;
> > +   case PR_SET_THP_DISABLE:
> > +   error = prctl_set_thp_disable(me);
> > +   break;
> > +   case PR_CLEAR_THP_DISABLE:
> > +   error = prctl_clear_thp_disable(me);
> > +   break;
> > +   case PR_GET_THP_DISABLE:
> > +   error = prctl_get_thp_disable(me, (int __user *) arg2);
> > +   break;
> > default:
> > error = -EINVAL;
> > break;
> 
> I simply can't understand, this all looks like overkill. Can't you simply add
> 
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE

Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-27 Thread Alex Thorlton
On Sat, Nov 23, 2013 at 12:09:24AM +, Mel Gorman wrote:
> On Fri, Nov 22, 2013 at 11:05:24PM +, Mel Gorman wrote:
> > On Fri, Nov 22, 2013 at 03:28:07PM -0600, Alex Thorlton wrote:
> > > > If the warning added by that patch does *not* trigger than can you also
> > > > test this patch? It removes the barriers which should not be necessary
> > > > and takes a reference tot he page before waiting on the lock. The
> > > > previous version did not take the reference because otherwise the
> > > > WARN_ON could not distinguish between a migration waiter and a surprise
> > > > gup.
> > > 
> > > Sorry for the delay; been a bit busy.  I tested both of these patches on
> > > top of this one (separately, of course):
> > > 
> > > http://www.spinics.net/lists/linux-mm/msg63919.html
> > > 
> > > I think that's the one you were referring to, if not send me a pointer
> > > to the correct one and I'll give it another shot.  Both patches still
> > > segfaulted, so it doesn't appear that either of these solved the
> > > problem. 
> > 
> > I see. Does THP have to be enabled or does it segfault even with THP
> > disabled?

It occurs with both THP on, and off.  I get RCU stalls with THP on
though.  That's probably related to not having Kirill/Naoya's patches
applied though.

> > 
> 
> On a semi-related note, is the large machine doing anything with xpmem
> or anything that depends on MMU notifiers to work properly? I noted
> while looking at this that THP migration is not invalidating pages which
> might be confusing a driver depending on it.

I'm not using xpmem on any of the machines that I've been testing on,
and I don't think that anything should be using MMU notifiers.

- Alex


Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-13 Thread Alex Thorlton
On Sun, Jan 12, 2014 at 02:56:00PM +0100, Oleg Nesterov wrote:
> On 01/11, Alex Thorlton wrote:
> >
> > On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote:
> >
> > > I simply can't understand, this all looks like overkill. Can't you simply 
> > > add
> > >
> > >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >   case GET:
> > >   error = test_bit(MMF_THP_DISABLE);
> > >   break;
> > >   case PUT:
> > >   if (arg2)
> > >   set_bit();
> > >   else
> > >   clear_bit();
> > >   break;
> > >   #endif
> > >
> > > into sys_prctl() ?
> >
> > That's probably a better solution.  I wasn't sure whether or not it was
> > better to have two functions to handle this, or to have one function
> > handle both.  If you think it's better to just handle both with one,
> > that's easy enough to change.
> 
> Personally I think sys_prctl() can handle this itself, without a helper.
> But of course I won't argue, this is up to you.
> 
> My only point is, the kernel is already huge ;) Imho it makes sense to
> try to lessen the code size, when the logic is simple.

I agree with you here as well.  There was a mixed bag of PRCTLs using
helpers vs. ones that put the code right into sys_prctl.  I just
arbitrarily chose to use a helper here.  I'll switch that over for v2.

- Alex


Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP

2014-01-14 Thread Alex Thorlton
On Tue, Jan 14, 2014 at 03:44:57PM +, Mel Gorman wrote:
> On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote:
> > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> > > We already have the information to determine if a page is shared across
> > > nodes, Mel even had some prototype code to do splits under those
> > > conditions.
> > 
> > I'm aware that we can determine if pages are shared across nodes, but I
> > thought that Mel's code to split pages under these conditions had some
> > performance issues.  I know I've seen the code that Mel wrote to do
> > this, but I can't seem to dig it up right now.  Could you point me to
> > it?
> > 
> 
> It was a lot of revisions ago! The git branches no longer exist but the
> diff from the monolithic patches is below. The baseline was v3.10 and
> this will no longer apply but you'll see the two places where I added a
> split_huge_page and prevented khugepaged from collapsing them again.

Thanks, Mel.  I remember seeing this code a while back when we were
discussing THP/locality issues.

> At the time, the performance with it applied was much worse but it was a
> 10 minute patch as a distraction. There was a range of basic problems that
> had to be tackled before there was any point looking at splitting THP due
> to locality. I did not pursue it further and have not revisited it since.

So, in your opinion, is this something we should look into further
before moving towards the per-mm switch that I propose here?  I
personally think that it will be tough to get this to perform as well as
a method that totally disables THP when requested, or a method that
tries to prevent THPs from being handed out in certain situations, since
we'll be doing the work of both making and splitting a THP in the case
where remote accesses are made to the page.

I also think there could be some issues with over-zealously splitting
pages, since it sounds like we can only determine if an access is from a
remote node.  We don't have a good way of determining how many accesses
are remote vs. local, or how many separate nodes are accessing a page.

For example, I can see this being a problem if we have a large
multi-node system, where only two nodes are accessing a THP.  We might
end up splitting that THP, but if relatively few remote nodes are
accessing it, it may not be worth the time.  The split only seems
worthwhile to me if the majority of accesses are remote, which sounds
like it would be hard to determine.

One thing we could possibly do would be to add some structures to do a
bit of accounting work into the mm_struct or some other appropriate
location, then we could keep track of how many distinct remote nodes are
accessing a THP and decide to split based on that.  However, there's
still the overhead to creating/splitting the THP, and the extra
space/time needed to do the proper accounting work may be
counterproductive (if this is even possible, I'm just thinking out loud
here).

- Alex


Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-07 Thread Alex Thorlton
On Wed, Nov 06, 2013 at 01:10:48PM +, Mel Gorman wrote:
> On Mon, Nov 04, 2013 at 02:03:46PM -0600, Alex Thorlton wrote:
> > On Mon, Nov 04, 2013 at 02:58:28PM +, Mel Gorman wrote:
> > > On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > > > Hi guys,
> > > > 
> > > > I ran into a bug a week or so ago, that I believe has something to do
> > > > with NUMA balancing, but I'm having a tough time tracking down exactly
> > > > what is causing it.  When running with the following configuration
> > > > options set:
> > > > 
> > > 
> > > Can you test with patches
> > > cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> > > applied? They fix some known memory corruption problems, were merged for
> > > 3.12 (so alternatively just test 3.12) and have been tagged for -stable.
> > 
> > I just finished testing with 3.12, and I'm still seeing the same issue.
> 
> Ok, I plugged your test into mmtests and ran it a few times but was not
> able to reproduce the same issue. It's a much smaller machine which
> might be a factor.
> 
> > I'll poke around a bit more on this in the next few days and see if I
> > can come up with any more information.  In the meantime, let me know if
> > you have any other suggestions.
> > 
> 
> Try the following patch on top of 3.12. It's a patch that is expected to
> be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> -stable but corruption trumps performance and the full series is not
> going to be considered acceptable for -stable

I gave this patch a shot, and it didn't seem to solve the problem.
Actually I'm running into what appear to be *worse* problems on the 3.12
kernel.  Here're a couple stack traces of what I get when I run the test
on 3.12, 512 cores:

(These are just two of the CPUs, obviously, but most of the memscale
processes appeared to be in one of these two spots)

Nov  7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6
Nov  7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not tainted 
3.12.0-rc7-medusa-6-g0255d49 #381
Nov  7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, 
BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
Nov  7 13:54:39 uvpsw1 kernel: task: 8810647e0300 ti: 88106413e000 
task.ti: 88106413e000
Nov  7 13:54:39 uvpsw1 kernel: RIP: 0010:[]  
[] _raw_spin_lock+0x1a/0x25
Nov  7 13:54:39 uvpsw1 kernel: RSP: 0018:88106413fd38  EFLAGS: 0283
Nov  7 13:54:39 uvpsw1 kernel: RAX: a1a9a0fe RBX: 0206 RCX: 
8800
Nov  7 13:54:41 uvpsw1 kernel: RDX: a1a9 RSI: 3000 RDI: 
8907ded35494
Nov  7 13:54:41 uvpsw1 kernel: RBP: 88106413fd38 R08: 0006 R09: 
0002
Nov  7 13:54:41 uvpsw1 kernel: R10: 0007 R11: 88106413ff40 R12: 
8907ded35494
Nov  7 13:54:42 uvpsw1 kernel: R13: 88106413fe1c R14: 8810637a05f0 R15: 
0206
Nov  7 13:54:42 uvpsw1 kernel: FS:  7fffd5def700() 
GS:88107d98() knlGS:
Nov  7 13:54:42 uvpsw1 kernel: CS:  0010 DS:  ES:  CR0: 8005003b
Nov  7 13:54:42 uvpsw1 kernel: CR2: 7fffd5ded000 CR3: 0107dfbcf000 CR4: 
07e0
Nov  7 13:54:42 uvpsw1 kernel: Stack:
Nov  7 13:54:42 uvpsw1 kernel:  88106413fda8 810d670a 
0002 0006
Nov  7 13:54:42 uvpsw1 kernel:  7fff57dde000 8810640e1cc0 
02006413fe10 8907ded35440
Nov  7 13:54:45 uvpsw1 kernel:  88106413fda8 0206 
0002 
Nov  7 13:54:45 uvpsw1 kernel: Call Trace:
Nov  7 13:54:45 uvpsw1 kernel:  [] 
follow_page_mask+0x123/0x3f1
Nov  7 13:54:45 uvpsw1 kernel:  [] 
__get_user_pages+0x3e3/0x488
Nov  7 13:54:45 uvpsw1 kernel:  [] get_user_pages+0x4d/0x4f
Nov  7 13:54:45 uvpsw1 kernel:  [] 
SyS_get_mempolicy+0x1a9/0x3e0
Nov  7 13:54:45 uvpsw1 kernel:  [] 
system_call_fastpath+0x16/0x1b
Nov  7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 
d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 
66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00

Nov  7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8
Nov  7 13:55:59 uvpsw1 kernel: INFO: NMI handler 
(arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs
Nov  7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not tainted 
3.12.0-rc7-medusa-6-g0255d49 #381
Nov  7 13:56:04 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, 
BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
Nov  7 13:56:04 uvpsw1 kernel: task: 881063c56380 ti: 8810621b8000 
task.ti: 8810621b8000
Nov  7 13:56:04 uvpsw1 kernel: RIP: 0010

Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-07 Thread Alex Thorlton
On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> Hi guys,
> 
> I ran into a bug a week or so ago, that I believe has something to do
> with NUMA balancing, but I'm having a tough time tracking down exactly
> what is causing it.  When running with the following configuration
> options set:
> 
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> CONFIG_NUMA_BALANCING=y
> # CONFIG_HUGETLBFS is not set
> # CONFIG_HUGETLB_PAGE is not set
> 
> I get intermittent segfaults when running the memscale test that we've
> been using to test some of the THP changes.  Here's a link to the test:
> 
> ftp://shell.sgi.com/collect/memscale/

For anyone who's interested, this test has been moved to:

http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz

It should remain there permanently.

> 
> I typically run the test with a line similar to this:
> 
> ./thp_memscale -C 0 -m 0 -c  -b 
> 
> Where  is the number of cores to spawn threads on, and 
> is the amount of memory to reserve from each core.  The  field
> can accept values like 512m or 1g, etc.  I typically run 256 cores and
> 512m, though I think the problem should be reproducable on anything with
> 128+ cores.
> 
> The test never seems to have any problems when running with hugetlbfs
> on and NUMA balancing off, but it segfaults every once in a while with
> the config options above.  It seems to occur more frequently, the more
> cores you run on.  It segfaults on about 50% of the runs at 256 cores,
> and on almost every run at 512 cores.  The fewest number of cores I've
> seen a segfault on has been 128, though it seems to be rare on this many
> cores.
> 
> At this point, I'm not familiar enough with NUMA balancing code to know
> what could be causing this, and we don't typically run with NUMA
> balancing on, so I don't see this in my everyday testing, but I felt
> that it was definitely worth bringing up.
> 
> If anybody has any ideas of where I could poke around to find a
> solution, please let me know.
> 
> - Alex


Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-08 Thread Alex Thorlton
On Fri, Nov 08, 2013 at 11:20:54AM +, Mel Gorman wrote:
> On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > -stable but corruption trumps performance and the full series is not
> > > going to be considered acceptable for -stable
> > 
> > I gave this patch a shot, and it didn't seem to solve the problem.
> > Actually I'm running into what appear to be *worse* problems on the 3.12
> > kernel.  Here're a couple stack traces of what I get when I run the test
> > on 3.12, 512 cores:
> > 
> 
> Ok, so there are two issues at least. Whatever is causing your
> corruption (which I still cannot reproduce) and the fact that 3.12 is
> worse. The largest machine I've tested with is 40 cores. I'm trying to
> get time on a 60 core machine to see if has a better chance. I will not
> be able to get access to anything resembling 512 cores.

At this point, the smallest machine I've been able to recreate this
issue on has been a 128 core, but it's rare on a machine that small.
I'll kick off a really long run on a 64 core over the weekend and see if
I can hit it on there at all, but I haven't been able to previously.

> 
> > (These are just two of the CPUs, obviously, but most of the memscale
> > processes appeared to be in one of these two spots)
> > 
> > Nov  7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6
> > Nov  7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not 
> > tainted 3.12.0-rc7-medusa-6-g0255d49 #381
> > Nov  7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland 
> > Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
> > Nov  7 13:54:39 uvpsw1 kernel: task: 8810647e0300 ti: 88106413e000 
> > task.ti: 88106413e000
> > Nov  7 13:54:39 uvpsw1 kernel: RIP: 0010:[]  
> > [] _raw_spin_lock+0x1a/0x25
> > Nov  7 13:54:39 uvpsw1 kernel: RSP: 0018:88106413fd38  EFLAGS: 0283
> > Nov  7 13:54:39 uvpsw1 kernel: RAX: a1a9a0fe RBX: 0206 
> > RCX: 8800
> > Nov  7 13:54:41 uvpsw1 kernel: RDX: a1a9 RSI: 3000 
> > RDI: 8907ded35494
> > Nov  7 13:54:41 uvpsw1 kernel: RBP: 88106413fd38 R08: 0006 
> > R09: 0002
> > Nov  7 13:54:41 uvpsw1 kernel: R10: 0007 R11: 88106413ff40 
> > R12: 8907ded35494
> > Nov  7 13:54:42 uvpsw1 kernel: R13: 88106413fe1c R14: 8810637a05f0 
> > R15: 0206
> > Nov  7 13:54:42 uvpsw1 kernel: FS:  7fffd5def700() 
> > GS:88107d98() knlGS:
> > Nov  7 13:54:42 uvpsw1 kernel: CS:  0010 DS:  ES:  CR0: 
> > 8005003b
> > Nov  7 13:54:42 uvpsw1 kernel: CR2: 7fffd5ded000 CR3: 0107dfbcf000 
> > CR4: 07e0
> > Nov  7 13:54:42 uvpsw1 kernel: Stack:
> > Nov  7 13:54:42 uvpsw1 kernel:  88106413fda8 810d670a 
> > 0002 0006
> > Nov  7 13:54:42 uvpsw1 kernel:  7fff57dde000 8810640e1cc0 
> > 02006413fe10 8907ded35440
> > Nov  7 13:54:45 uvpsw1 kernel:  88106413fda8 0206 
> > 0002 
> > Nov  7 13:54:45 uvpsw1 kernel: Call Trace:
> > Nov  7 13:54:45 uvpsw1 kernel:  [] 
> > follow_page_mask+0x123/0x3f1
> > Nov  7 13:54:45 uvpsw1 kernel:  [] 
> > __get_user_pages+0x3e3/0x488
> > Nov  7 13:54:45 uvpsw1 kernel:  [] 
> > get_user_pages+0x4d/0x4f
> > Nov  7 13:54:45 uvpsw1 kernel:  [] 
> > SyS_get_mempolicy+0x1a9/0x3e0
> > Nov  7 13:54:45 uvpsw1 kernel:  [] 
> > system_call_fastpath+0x16/0x1b
> > Nov  7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 
> > 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 
> > 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 
> > 00 00
> > 
> 
> That is probably the mm->page_table_lock being contended on. Kirill has
> patches that split this which will help the scalability when THP is
> enabled. They should be merged for 3.13-rc1. In itself it should not
> cause bugs other than maybe soft lockups.
> 
> > Nov  7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8
> > Nov  7 13:55:59 uvpsw1 kernel: INFO: NMI handler 
> > (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs
> > Nov  7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not 
> > tainted 3.12.0-rc7-medusa-6-g0255d49 #381

Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-12 Thread Alex Thorlton
On Fri, Nov 08, 2013 at 04:13:29PM -0600, Alex Thorlton wrote:
> On Fri, Nov 08, 2013 at 11:20:54AM +, Mel Gorman wrote:
> > On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > > -stable but corruption trumps performance and the full series is not
> > > > going to be considered acceptable for -stable
> > > 
> > > I gave this patch a shot, and it didn't seem to solve the problem.
> > > Actually I'm running into what appear to be *worse* problems on the 3.12
> > > kernel.  Here're a couple stack traces of what I get when I run the test
> > > on 3.12, 512 cores:
> > > 
> > 
> > Ok, so there are two issues at least. Whatever is causing your
> > corruption (which I still cannot reproduce) and the fact that 3.12 is
> > worse. The largest machine I've tested with is 40 cores. I'm trying to
> > get time on a 60 core machine to see if it has a better chance. I will not
> > be able to get access to anything resembling 512 cores.
> 
> At this point, the smallest machine I've been able to recreate this
> issue on has been a 128 core, but it's rare on a machine that small.
> I'll kick off a really long run on a 64 core over the weekend and see if
> I can hit it on there at all, but I haven't been able to previously.

Just a quick update, I ran this test 500 times on 64 cores, allocating
512m per core, and every single test completed successfully.  At this
point, it looks like you definitely need at least 128 cores to reproduce
the issue.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-10-16 Thread Alex Thorlton
Hi guys,

I ran into a bug a week or so ago, that I believe has something to do
with NUMA balancing, but I'm having a tough time tracking down exactly
what is causing it.  When running with the following configuration
options set:

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA_BALANCING=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set

I get intermittent segfaults when running the memscale test that we've
been using to test some of the THP changes.  Here's a link to the test:

ftp://shell.sgi.com/collect/memscale/

I typically run the test with a line similar to this:

./thp_memscale -C 0 -m 0 -c <cores> -b <memory>

Where <cores> is the number of cores to spawn threads on, and <memory>
is the amount of memory to reserve from each core.  The <memory> field
can accept values like 512m or 1g, etc.  I typically run 256 cores and
512m, though I think the problem should be reproducible on anything with
128+ cores.

The test never seems to have any problems when running with hugetlbfs
on and NUMA balancing off, but it segfaults every once in a while with
the config options above.  It seems to occur more frequently, the more
cores you run on.  It segfaults on about 50% of the runs at 256 cores,
and on almost every run at 512 cores.  The fewest number of cores I've
seen a segfault on has been 128, though it seems to be rare on this many
cores.

At this point, I'm not familiar enough with NUMA balancing code to know
what could be causing this, and we don't typically run with NUMA
balancing on, so I don't see this in my everyday testing, but I felt
that it was definitely worth bringing up.

If anybody has any ideas of where I could poke around to find a
solution, please let me know.

- Alex


Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-10-17 Thread Alex Thorlton
On Thu, Oct 17, 2013 at 07:30:58PM +0800, Bob Liu wrote:
> Hi Alex,
> 
> On Wed, Oct 16, 2013 at 11:54 PM, Alex Thorlton  wrote:
> > Hi guys,
> >
> > I ran into a bug a week or so ago, that I believe has something to do
> > with NUMA balancing, but I'm having a tough time tracking down exactly
> > what is causing it.  When running with the following configuration
> > options set:
> >
> > CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> > CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> > CONFIG_NUMA_BALANCING=y
> > # CONFIG_HUGETLBFS is not set
> > # CONFIG_HUGETLB_PAGE is not set
> >
> 
> What's your kernel version?
> And did you enable CONFIG_TRANSPARENT_HUGEPAGE ?

Ah, two important things that I forgot to include!  The kernel I
originally spotted the problem on was 3.11 and it continued to be an
issue up through 3.12-rc4, but after running a 30-trial run of the
test today, it appears that the issue must have cleared up after the
3.12-rc5 release on Monday.  I'll still include the requested
information, but I guess this is no longer an issue.

I rolled all the way back to 3.7 while researching the issue, and that
appears to be the last kernel where the problem didn't show up.  3.8
is the first kernel where the bug appears; I believe this is also the
kernel where NUMA balancing was officially introduced.

Here are my settings related to THP:

CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set

I did most of my testing with THP set to "always", although the problem
still occurs with THP set to "never".  

> 
> > I get intermittent segfaults when running the memscale test that we've
> > been using to test some of the THP changes.  Here's a link to the test:
> >
> > ftp://shell.sgi.com/collect/memscale/
> >
> > I typically run the test with a line similar to this:
> >
> > ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
> >
> > Where <cores> is the number of cores to spawn threads on, and <memory>
> > is the amount of memory to reserve from each core.  The <memory> field
> > can accept values like 512m or 1g, etc.  I typically run 256 cores and
> > 512m, though I think the problem should be reproducible on anything with
> > 128+ cores.
> >
> > The test never seems to have any problems when running with hugetlbfs
> > on and NUMA balancing off, but it segfaults every once in a while with
> > the config options above.  It seems to occur more frequently, the more
> > cores you run on.  It segfaults on about 50% of the runs at 256 cores,
> > and on almost every run at 512 cores.  The fewest number of cores I've
> > seen a segfault on has been 128, though it seems to be rare on this many
> > cores.
> >
> 
> Could you please attach some logs?

Here are the relevant chunks from the syslog for a 10-shot run at 256
cores, each chunk is from a separate run.  4 out of 10 failed with
segfaults:

Oct 17 11:36:41 harp83-sys kernel: thp_memscale[21566]: segfault at 0 ip
   (null) sp 7ff8531fcdc0 error 14 in thp_memscale[40+5000]
Oct 17 11:36:41 harp83-sys kernel: thp_memscale[21565]: segfault at 0 ip
   (null) sp 7ff8539fddc0 error 14 in thp_memscale[40+5000]
---
Oct 17 12:08:14 harp83-sys kernel: thp_memscale[22893]: segfault at 0 ip
   (null) sp 7ff69cffddc0 error 14 in thp_memscale[40+5000]
---
Oct 17 12:26:30 harp83-sys kernel: thp_memscale[23995]: segfault at 0 ip
   (null) sp 7fe7af1fcdc0 error 14 in thp_memscale[40+5000]
Oct 17 12:26:30 harp83-sys kernel: thp_memscale[23994]: segfault at 0 ip
   (null) sp 7fe7af9fddc0 error 14 in thp_memscale[40+5000]
---
Oct 17 12:32:29 harp83-sys kernel: thp_memscale[24116]: segfault at 0 ip
   (null) sp 7ff77a9fcdc0 error 14 in thp_memscale[40+5000]

Since this has cleared up in the latest release, I won't be pursuing the
issue any further (though I'll keep an eye out to make sure that it
doesn't show back up).  I am, however, still curious as to what the
cause of the problem was...

> 
> > At this point, I'm not familiar enough with NUMA balancing code to know
> > what could be causing this, and we don't typically run with NUMA
> > balancing on, so I don't see this in my everyday testing, but I felt
> > that it was definitely worth bringing up.
> >
> > If anybody has any ideas of where I could poke around to find a
> > solution, please let me know.
> >
> 
> -- 
> Regards,
> --Bob


Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on

2013-11-22 Thread Alex Thorlton
> If the warning added by that patch does *not* trigger than can you also
> test this patch? It removes the barriers which should not be necessary
> and takes a reference tot he page before waiting on the lock. The
> previous version did not take the reference because otherwise the
> WARN_ON could not distinguish between a migration waiter and a surprise
> gup.

Sorry for the delay; been a bit busy.  I tested both of these patches on
top of this one (separately, of course):

http://www.spinics.net/lists/linux-mm/msg63919.html

I think that's the one you were referring to, if not send me a pointer
to the correct one and I'll give it another shot.  Both patches still
segfaulted, so it doesn't appear that either of these solved the
problem.  If you have anything else you'd like for me to try let me
know.

- Alex


[RFC PATCH 2/3] Add tunable to control THP behavior

2013-12-12 Thread Alex Thorlton
This part of the patch adds a tunable to
/sys/kernel/mm/transparent_hugepage called threshold.  This threshold
determines how many pages a user must fault in from a single node before
a temporary compound page is turned into a THP.
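
Once the series is applied, the tunable would be exercised from userspace
along these lines (a hedged sketch: the `threshold` file only exists with
this RFC applied, so the helper below falls back gracefully on stock
kernels, and the helper name is invented for illustration):

```shell
# Print the THP collapse threshold, or "absent" on kernels without the
# RFC's /sys/kernel/mm/transparent_hugepage/threshold file.
thp_threshold() {
	f=/sys/kernel/mm/transparent_hugepage/threshold
	if [ -r "$f" ]; then
		cat "$f"
	else
		echo "absent"
	fi
}

# With the patches applied, root could raise the threshold to 16 pages:
# echo 16 > /sys/kernel/mm/transparent_hugepage/threshold
```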

Signed-off-by: Alex Thorlton 
Cc: Andrew Morton 
Cc: Nate Zimmer 
Cc: Cliff Wickman 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Cc: Rik van Riel 
Cc: Wanpeng Li 
Cc: Mel Gorman 
Cc: Michel Lespinasse 
Cc: Benjamin LaHaise 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: David Rientjes 
Cc: Zhang Yanfei 
Cc: Peter Zijlstra 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Jiang Liu 
Cc: Cody P Schafer 
Cc: Glauber Costa 
Cc: Kamezawa Hiroyuki 
Cc: David Rientjes 
Cc: Naoya Horiguchi 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org

---
 include/linux/huge_mm.h  |  2 ++
 include/linux/mm_types.h |  1 +
 mm/huge_memory.c | 30 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..0943b1b6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -177,6 +177,8 @@ static inline struct page *compound_trans_head(struct page *page)
 extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
			unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
+extern int thp_threshold_check(void);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b5efa23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -408,6 +408,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+   int thp_threshold;
 #endif
 #ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..5d388e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+
+/* default to 1 page threshold for handing out thps; maintains old behavior */
+static int transparent_hugepage_threshold = 1;
[...]
+	if (value < 1 || value > HPAGE_PMD_NR)
+   return -EINVAL;
+
+   transparent_hugepage_threshold = value;
+
+   return count;
+}
+static struct kobj_attribute threshold_attr =
+   __ATTR(threshold, 0644, threshold_show, threshold_store);
 #ifdef CONFIG_DEBUG_VM
 static ssize_t debug_cow_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
@@ -397,6 +426,7 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
+   &threshold_attr.attr,
&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
-- 
1.7.12.4




[RFC PATCH 0/3] Change how we determine when to hand out THPs

2013-12-12 Thread Alex Thorlton
 to be cleaned up a bit

Just to let people know, I've been doing most of my testing with the
memscale test:

http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz

The pthread test hits the first bug I mentioned here much more often,
but the patch seems to be more stable when tested with memscale.  I
typically run something like this to test:

# ./thp_memscale -C 0 -m 0 -c 32 -b 16m

As you increase the amount of memory/number of cores, you become more
likely to run into issues.

Although there's still work to be done here, I wanted to get an early
version of the patch out so that everyone could give their
opinions/suggestions.  The patch should apply cleanly to the 3.12
kernel.  I'll rebase it as soon as some of the remaining issues have
been sorted out, this will also mean changing over to the split PTL
where appropriate.

Signed-off-by: Alex Thorlton 
Cc: Andrew Morton 
Cc: Nate Zimmer 
Cc: Cliff Wickman 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Cc: Rik van Riel 
Cc: Wanpeng Li 
Cc: Mel Gorman 
Cc: Michel Lespinasse 
Cc: Benjamin LaHaise 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: David Rientjes 
Cc: Zhang Yanfei 
Cc: Peter Zijlstra 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Jiang Liu 
Cc: Cody P Schafer 
Cc: Glauber Costa 
Cc: Kamezawa Hiroyuki 
Cc: David Rientjes 
Cc: Naoya Horiguchi 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org

Alex Thorlton (3):
  Add flags for temporary compound pages
  Add tunable to control THP behavior
  Change THP behavior

 include/linux/gfp.h  |   5 +
 include/linux/huge_mm.h  |   8 ++
 include/linux/mm_types.h |  14 +++
 kernel/fork.c|   1 +
 mm/huge_memory.c | 313 +++
 mm/internal.h|   1 +
 mm/memory.c  |  29 -
 mm/page_alloc.c  |  66 +-
 8 files changed, 430 insertions(+), 7 deletions(-)

-- 
1.7.12.4



[RFC PATCH 3/3] Change THP behavior

2013-12-12 Thread Alex Thorlton
This patch implements the functionality we're really going for here.
It adds the decision making behavior to determine when to grab a
temporary compound page, and whether or not to fault in single pages or
to turn the temporary page into a THP.  This one is rather large; I might
split it up a bit more for later versions.

I've left most of my comments in here just to provide people with some
insight into what I may have been thinking when I chose to do something
in a certain way.  These will probably be trimmed down in later versions
of the patch.

Signed-off-by: Alex Thorlton 
Cc: Andrew Morton 
Cc: Nate Zimmer 
Cc: Cliff Wickman 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Cc: Rik van Riel 
Cc: Wanpeng Li 
Cc: Mel Gorman 
Cc: Michel Lespinasse 
Cc: Benjamin LaHaise 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: David Rientjes 
Cc: Zhang Yanfei 
Cc: Peter Zijlstra 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Jiang Liu 
Cc: Cody P Schafer 
Cc: Glauber Costa 
Cc: Kamezawa Hiroyuki 
Cc: David Rientjes 
Cc: Naoya Horiguchi 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org

---
 include/linux/huge_mm.h  |   6 +
 include/linux/mm_types.h |  13 +++
 kernel/fork.c|   1 +
 mm/huge_memory.c | 283 +++
 mm/internal.h|   1 +
 mm/memory.c  |  29 -
 mm/page_alloc.c  |  66 ++-
 7 files changed, 392 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0943b1b6..c1e407d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  struct vm_area_struct *vma,
  unsigned long address, pmd_t *pmd,
  unsigned int flags);
+extern struct temp_hugepage *find_pmd_mm_freelist(struct mm_struct *mm,
+pmd_t *pmd);
+extern int do_huge_pmd_temp_page(struct mm_struct *mm,
+struct vm_area_struct *vma,
+unsigned long address, pmd_t *pmd,
+unsigned int flags);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 struct vm_area_struct *vma);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b5efa23..d48c6ab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -322,6 +322,17 @@ struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct temp_hugepage {
+   pmd_t *pmd;
+   struct page *page;
+   spinlock_t temp_hugepage_lock;
+	int node;		/* node id of the first page in the chunk */
+	int ref_count;		/* number of pages faulted in from the chunk */
+   struct list_head list;  /* pointers to next/prev chunks */
+};
+#endif
+
 struct kioctx_table;
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
@@ -408,7 +419,9 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+   spinlock_t thp_list_lock; /* lock to protect thp_temp_list */
int thp_threshold;
+   struct list_head thp_temp_list; /* list of 512 page chunks for THPs */
 #endif
 #ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..a3ccf857 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -816,6 +816,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
+   INIT_LIST_HEAD(&mm->thp_temp_list);
 #endif
 #ifdef CONFIG_NUMA_BALANCING
mm->first_nid = NUMA_PTE_SCAN_INIT;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d388e4..43ea095 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -788,6 +788,20 @@ static inline struct page *alloc_hugepage_vma(int defrag,
   HPAGE_PMD_ORDER, vma, haddr, nd);
 }
 
+static inline gfp_t alloc_temp_hugepage_gfpmask(gfp_t extra_gfp)
+{
+   return GFP_TEMP_TRANSHUGE | extra_gfp;
+}
+
+static inline struct page *alloc_temp_hugepage_vma(int defrag,
+ struct vm_area_struct *vma,
+ unsigned long haddr, int nd,
+ gfp_t extra_gfp)
+{
+   return alloc_pages_vma(alloc_temp_hugepage_gfpmask(extra_gfp),
+  HPAGE_PMD_ORDER, vma, haddr, nd);
+}
+
 #ifndef CONFIG_NUMA
 static inline struct page *alloc_hugepa

[RFC PATCH 1/3] Add flags for temporary compound pages

2013-12-12 Thread Alex Thorlton
This chunk of the patch just adds the necessary page flags to support
what I call a temporary compound page.  Temporary compound pages provide
us with the ability to turn a chunk of pages into a proper THP at some
point after pulling them off the free list, but also allow us to fault
in single 4K pages from the chunk.

Signed-off-by: Alex Thorlton 
Cc: Andrew Morton 
Cc: Nate Zimmer 
Cc: Cliff Wickman 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Cc: Rik van Riel 
Cc: Wanpeng Li 
Cc: Mel Gorman 
Cc: Michel Lespinasse 
Cc: Benjamin LaHaise 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: David Rientjes 
Cc: Zhang Yanfei 
Cc: Peter Zijlstra 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Jiang Liu 
Cc: Cody P Schafer 
Cc: Glauber Costa 
Cc: Kamezawa Hiroyuki 
Cc: David Rientjes 
Cc: Naoya Horiguchi 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org

---
 include/linux/gfp.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..6ec63dd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD   0x40u
 #define ___GFP_OTHER_NODE  0x80u
 #define ___GFP_WRITE   0x100u
+#define ___GFP_COMP_TEMP   0x200u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -77,6 +78,7 @@ struct vm_area_struct;
 #define __GFP_NORETRY  ((__force gfp_t)___GFP_NORETRY) /* See above */
 #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
+#define __GFP_COMP_TEMP	((__force gfp_t)___GFP_COMP_TEMP)	/* Treat as temp compound page */
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
					 * This takes precedence over the
@@ -121,6 +123,9 @@ struct vm_area_struct;
 #define GFP_TRANSHUGE  (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
 __GFP_NO_KSWAPD)
+#define GFP_TEMP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP_TEMP | \
+__GFP_NOMEMALLOC | __GFP_NORETRY | \
+__GFP_NOWARN | __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE   (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
-- 
1.7.12.4




Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs

2013-12-12 Thread Alex Thorlton
Ugggh.  Looks like mutt clobbered the message ID on the cover letter,
screwing up the reply chain.  Sorry about that :/

- Alex


Re: [RFC PATCH 2/3] Add tunable to control THP behavior

2013-12-12 Thread Alex Thorlton
> Is there a setting that will turn off the must-be-the-same-node
> behavior?  There are workloads where TLB matters more than cross-node
> traffic (or where all the pages are hopelessly shared between nodes,
> but hugepages are still useful).

That's pretty much how THPs already behave in the kernel, so if you want
to allow THPs to be handed out to one node, but referenced from many
others, you'd just set the threshold to 1, and let the existing code
take over.

As for the must-be-the-same-node behavior:  I'd actually say it's more
like a "must have so much on one node" behavior, in that, if you set the
threshold to 16, for example, 16 4K pages must be faulted in on the same
node, in the same contiguous 2M chunk, before a THP will be created.
What happens after that THP is created is out of our control, it could
be referenced from anywhere.

The idea here is that we can tune things so that jobs that behave poorly
with THP on will not be given THPs, but the jobs that like THPs can
still get them.  Granted, there are still issues with this approach, but
I think it's a bit better than just handing out a THP because we touched
one byte in a 2M chunk.
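
The give-up-or-collapse rule described above can be sketched in plain
userspace C.  This is only an illustration of the policy as described in
this thread, not the kernel code; `temp_chunk`, `chunk_fault`, and the
state names are made up for the sketch:

```c
#include <assert.h>

#define HPAGE_PMD_NR 512	/* 4K pages per 2M chunk */

enum chunk_state { CHUNK_FAULTING, CHUNK_GAVE_UP, CHUNK_THP };

struct temp_chunk {
	int node;		/* node of the first fault, -1 until touched */
	int ref_count;		/* same-node faults seen so far */
	enum chunk_state state;
};

/*
 * One 4K fault against a reserved 2M chunk: collapse to a THP once
 * `threshold` faults land from the first-touch node; give up on ever
 * collapsing if any other node touches the chunk first.
 */
static void chunk_fault(struct temp_chunk *c, int node, int threshold)
{
	if (c->state != CHUNK_FAULTING)
		return;			/* decision already made */
	if (c->node == -1)
		c->node = node;		/* first touch claims the chunk */
	if (node != c->node) {
		c->state = CHUNK_GAVE_UP;
		return;
	}
	if (++c->ref_count >= threshold)
		c->state = CHUNK_THP;
}
```

With threshold = 1 the first fault collapses immediately (today's
behavior); with threshold = 16, sixteen same-node faults are needed, and a
single off-node fault before that point falls back to 4K pages for good.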

- Alex


Re: [RFC PATCH 2/3] Add tunable to control THP behavior

2013-12-12 Thread Alex Thorlton
> Right.  I like that behavior for my workload.  (Although I currently
> allocate huge pages -- when I wrote that code, THP interacted so badly
> with pagecache that it was a non-starter.  I think it's fixed now,
> though.)

In that case, it's probably best to just stick with current behavior,
and leave the threshold at 1, unless we implement something like I
discuss below.

> In that case, I guess I misunderstood your description.  Are saying
> that, once any node accesses this many pages in the potential THP,
> then the whole THP will be mapped?

Well, right now, this patch completely gives up on mapping a THP if two
different nodes take a page from our chunk before the threshold is hit,
so I think you're mostly understanding it correctly.

One thing we could consider is adding an option to map the THP on
the node with the *most* references to the potential THP, instead of
giving up on mapping the THP when multiple nodes reference it.  That
might be a good middle ground, but I can see some performance issues
coming into play there if the threshold is set too high, since we'll
have to move all the pages in the chunk to the node that hit the
threshold.
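
The middle-ground idea above, collapsing on whichever node accumulates the
most references instead of giving up, could look roughly like this.  This
is purely a sketch of the idea being floated; `chunk_fault_most_refs` and
the fixed `MAX_NODES` are invented for illustration:

```c
#include <assert.h>

#define MAX_NODES 16

struct chunk_refs {
	int refs[MAX_NODES];	/* per-node fault counts for one 2M chunk */
};

/*
 * Record a 4K fault from `node`; return the node to collapse the THP on
 * once any node reaches `threshold`, or -1 to keep faulting 4K pages.
 * Pages already faulted on other nodes would then have to be migrated
 * to the winning node, which is the cost noted above.
 */
static int chunk_fault_most_refs(struct chunk_refs *c, int node, int threshold)
{
	if (++c->refs[node] >= threshold)
		return node;
	return -1;
}
```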

- Alex


Re: [RFC PATCH 2/3] Add tunable to control THP behavior

2013-12-12 Thread Alex Thorlton
On Thu, Dec 12, 2013 at 04:37:11PM -0500, Rik van Riel wrote:
> On 12/12/2013 01:00 PM, Alex Thorlton wrote:
> >This part of the patch adds a tunable to
> >/sys/kernel/mm/transparent_hugepage called threshold.  This threshold
> >determines how many pages a user must fault in from a single node before
> >a temporary compound page is turned into a THP.
> 
> >+++ b/mm/huge_memory.c
> >@@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
> > (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
> >
> >+/* default to 1 page threshold for handing out thps; maintains old behavior */
> >+static int transparent_hugepage_threshold = 1;
> 
> I assume the motivation for writing all this code is that "1"
> was not a good value in your tests.

Yes, that's correct.

> That makes me wonder, why should 1 be the default value with
> your patches?

The main reason I set the default to 1 was because the majority of
jobs aren't hurt by the existing THP behavior.  I figured it would be
best to default to having things behave the same as they do now, but
provide the option to increase the threshold on systems that run jobs
that could be adversely affected by the current behavior.

> If there is a better value, why should we not use that?
>
> What is the upside of using a better value?
>
> What is the downside?

The problem here is that what the "better" value is can vary greatly
depending on how a particular task allocates memory.  Setting the
threshold too high can negatively affect the performance of jobs that
behave well with the current behavior, setting it too low won't yield a
performance increase for the jobs that are hurt by the current
behavior.  With some more thorough testing, I'm sure that we could
arrive at a value that will help out jobs which behave poorly under
current conditions, while having a minimal effect on jobs that already
perform well.  At this point, I'm looking more to ensure that everybody
likes this approach to solving the problem before putting the finishing
touches on the patches, and doing testing to find a good middle ground.
 
> Is there a value that would to bound the downside, so it
> is almost always smaller than the upside?

Again, the problem here is that, to find a good value, we have to know
quite a bit about why a particular value is bad for a particular job.
While, as stated above, I think we can probably find a good middle
ground to use as a default, in the end it will be the job of individual
sysadmins to determine what value works best for their particular
applications, and tune things accordingly.

- Alex


Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs

2013-12-16 Thread Alex Thorlton
> Please cc Andrea on this.

I'm going to clean up a few small things for a v2 pretty soon, I'll be
sure to cc Andrea there.

> > My proposed solution to the problem is to allow users to set a
> > threshold at which THPs will be handed out.  The idea here is that, when
> > a user faults in a page in an area where they would usually be handed a
> > THP, we pull 512 pages off the free list, as we would with a regular
> > THP, but we only fault in single pages from that chunk, until the user
> > has faulted in enough pages to pass the threshold we've set.  Once they
> > pass the threshold, we do the necessary work to turn our 512 page chunk
> > into a proper THP.  As it stands now, if the user tries to fault in
> > pages from different nodes, we completely give up on ever turning a
> > particular chunk into a THP, and just fault in the 4K pages as they're
> > requested.  We may want to make this tunable in the future (i.e. allow
> > them to fault in from only 2 different nodes).
> 
> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
> disabled those 512 pages would have resided closer to the CPUs which
> instantiated them.  

As it stands right now, yes: since we're pulling a 512-page contiguous
chunk off the free list, everything from that chunk will reside on the
same node.  However, as I (stupidly) forgot to mention in my original
e-mail, one piece I have yet to add is the functionality to put the
remaining unfaulted pages from our chunk *back* on the free list once we
give up on handing out a THP.  With that in place, things will behave
more like they do when THP is turned completely off, i.e. after we give
up on the THP, pages will get faulted in closer to the CPUs that first
reference them.
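To make the scheme concrete, here's a minimal userspace sketch of the
per-chunk bookkeeping described above.  All names here are hypothetical
and purely illustrative; the real patch works on struct page/mm state,
and returning the unfaulted pages to the free list is not modelled:

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_PAGES 512		/* one 2MB THP = 512 4K pages */

/*
 * Hypothetical bookkeeping for one reserved 512-page chunk; these are
 * not the patch's actual structures.
 */
struct thp_chunk {
	int first_node;		/* node of the first thread to fault */
	int faulted;		/* 4K pages faulted in so far */
	int abandoned;		/* remote fault seen: gave up on the THP */
	int collapsed;		/* threshold passed: became a real THP */
};

/*
 * Fault one 4K page into the chunk.  Collapse into a real THP once
 * 'threshold' pages have been touched; give up (and, in the real
 * patch, return the unfaulted pages to the free list) as soon as a
 * fault arrives from a different node.
 */
static void chunk_fault(struct thp_chunk *c, int node, int threshold)
{
	if (c->collapsed || c->abandoned)
		return;
	if (c->faulted == 0) {
		c->first_node = node;
	} else if (node != c->first_node) {
		c->abandoned = 1;
		return;
	}
	c->faulted++;
	if (c->faulted >= threshold)
		c->collapsed = 1;
}
```

With a threshold of 64, sixty-four local faults collapse the chunk,
while a single remote fault before that point abandons it.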

> So the expected result will be somewhere in between
> the 93 secs and the 76 secs?

Yes.  Due to the time it takes to search for the temporary THP, I'm sure
we won't get down to 76 secs, but hopefully we'll get close.  I'm also
considering switching the linked list that stores the temporary THPs
over to an rbtree to make that search faster, just fyi.
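For what it's worth, the cost difference is easy to sketch in
userspace: keying the reserved chunks by the address range they cover
and using an O(log n) search (here a sorted array with bsearch(3),
standing in for a kernel rbtree) beats walking a linked list on every
fault.  Names and layout are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

#define HPAGE_SIZE (512UL * 4096)	/* one 2MB chunk */

/* Illustrative record for one reserved 512-page chunk. */
struct temp_thp {
	unsigned long start;	/* first address covered by the chunk */
	int faulted;		/* pages faulted in so far */
};

static int cmp_start(const void *a, const void *b)
{
	const struct temp_thp *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * O(log n) lookup of the chunk covering 'addr' in an array kept
 * sorted by start address -- the same complexity an rbtree gives,
 * versus an O(n) scan of a linked list on every page fault.
 */
static struct temp_thp *find_chunk(struct temp_thp *v, size_t n,
				   unsigned long addr)
{
	struct temp_thp key = { .start = addr & ~(HPAGE_SIZE - 1), .faulted = 0 };

	return bsearch(&key, v, n, sizeof(*v), cmp_start);
}
```

The kernel version would use rb_insert_color()/rb_erase() on a tree
rooted in the mm rather than a flat array, but the lookup cost is the
point here.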

> That being said, I don't see a downside to the idea, apart from some
> additional setup cost in kernel code.

Good to hear.  I still need to address some of the issues that others
have raised, and finish up the few pieces that aren't fully
working/finished.  I'll get things polished up and get some more
informative test results out soon.

Thanks for looking at the patch!

- Alex


Re: [RFC PATCH 3/3] Change THP behavior

2013-12-16 Thread Alex Thorlton
> Hm. I think this part is not correct: you collapse temp thp page
> into real one only for current procees. What will happen if a process with
> temp thp pages was forked?

That's a scenario that I hadn't yet addressed, but definitely something
I'll consider going forward.  I think we can come up with a way to
appropriately handle that situation.

> And I don't think this problem is an easy one. khugepaged can't collapse
> pages with page->_count != 1 for the same reason: to make it properly you
> need to take mmap_sem for all processes and collapse all pages at once.
> And if a page is pinned, we also can't collapse.

Again, a few things here that I hadn't taken into account.  I'll look
for a way to address these concerns.

> Sorry, I don't think the whole idea has much potential. :(

I understand that there are some issues with this initial pass at the
idea, but I think we can get things corrected and come up with a
workable solution here.  When it comes down to it, there are relatively
few options to correct the issue that I'm focusing on here, and this one
seems to be the most appropriate one that we've tried so far.  We've
looked at disabling THP on a per-cpuset/per-process basis, and that has
been met with fairly strong resistance, for good reason.  I'll take
another pass at this and hopefully be able to address some of your
concerns with the idea.

Thanks for taking the time to look the patch over!

- Alex


[RFC PATCHv2 1/2] Add mm flag to control THP

2014-01-16 Thread Alex Thorlton
ddressed.  From what I know, get_user_pages
  calls down to handle_mm_fault, which should prevent THPs from being
  handed out where necessary.  If anybody can confirm that, it would be
  appreciated.
* Current behavior is to have fork()/exec()'d processes inherit the
  flag.  Andrew Morton pointed out some possible issues with this, so we
  may need to rethink this behavior.
  - If parent process has THP disabled, and forks off a child, the child
will also have THP disabled.  This may not be the desired behavior.
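For reference, userspace usage would look roughly like the wrappers
below.  The prctl values are the ones eventually merged into mainline
(v3.15), not necessarily this RFC's; also note this RFC's
PR_GET_THP_DISABLE reports the bit through an (int __user *)arg2,
whereas the getter sketched here follows the merged interface and reads
it from the return value:

```c
#include <assert.h>
#include <sys/prctl.h>

/* Values as merged into mainline; older headers may not define them. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/* Disable (or re-enable) THP for the calling process's mm.
 * Returns 0 on success, -1 on error. */
static int thp_set_disabled(int disabled)
{
	return prctl(PR_SET_THP_DISABLE, disabled, 0, 0, 0);
}

/* Returns 1 if THP is disabled for this mm, 0 if not, -1 on error. */
static int thp_get_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}
```

Since the flag lives in mm->flags and is inherited across fork(), a
child calling thp_get_disabled() after the parent set the flag would
also see 1, which is exactly the inheritance behavior under discussion.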

Signed-off-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Cc: Rik van Riel 
Cc: Naoya Horiguchi 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: Kees Cook 
Cc: Andrea Arcangeli 
Cc: linux-kernel@vger.kernel.org

---
 include/linux/huge_mm.h|  6 --
 include/linux/sched.h  |  6 +-
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 11 +++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 91672e2..475f59f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_HUGE_MM_H
 #define _LINUX_HUGE_MM_H
 
+#include <linux/sched.h>
+
 extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  struct vm_area_struct *vma,
  unsigned long address, pmd_t *pmd,
@@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&		\
 	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&		\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&		\
-	 !is_vma_temporary_stack(__vma))
+	 !is_vma_temporary_stack(__vma) &&			\
+	 !test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags))
 #define transparent_hugepage_defrag(__vma)			\
 	((transparent_hugepage_flags &				\
 	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||		\

[hunks for include/linux/sched.h, include/uapi/linux/prctl.h, and the
top of the kernel/sys.c hunk were mangled by the archive's HTML
stripping; the tail of the kernel/sys.c hunk follows]

 	case PR_GET_NO_NEW_PRIVS:
 		return current->no_new_privs ? 1 : 0;
+   case PR_SET_THP_DISABLE:
+   if (arg2)
+   set_bit(MMF_THP_DISABLE, &me->mm->flags);
+   else
+   clear_bit(MMF_THP_DISABLE, &me->mm->flags);
+   break;
+   case PR_GET_THP_DISABLE:
+   error = put_user(test_bit(MMF_THP_DISABLE,
+&me->mm->flags),
+(int __user *) arg2);
+   break;
default:
error = -EINVAL;
break;
-- 
1.7.12.4



[RFC PATCHv2 2/2] Change khugepaged to respect MMF_THP_DISABLE flag

2014-01-16 Thread Alex Thorlton
This just adds a simple check to get khugepaged to behave
appropriately when MMF_THP_DISABLE is set.

Signed-off-by: Alex Thorlton 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Andrew Morton 
Cc: "Kirill A. Shutemov" 
Cc: Benjamin Herrenschmidt 
Cc: Rik van Riel 
Cc: Naoya Horiguchi 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: Kees Cook 
Cc: Andrea Arcangeli 
Cc: linux-kernel@vger.kernel.org

---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c0b172..3cfe6b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2049,7 +2049,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-   return atomic_read(&mm->mm_users) == 0;
+   return atomic_read(&mm->mm_users) == 0 ||
+  (mm->flags & MMF_THP_DISABLE_MASK);
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
-- 
1.7.12.4


