Re: [PATCH 1/8] THP: Use real address for NUMA policy
> Here's more up-to-date version: https://lkml.org/lkml/2012/8/20/337

These don't seem to give us a noticeable performance change either:

With THP:

real    22m34.279s
user    10797m35.984s
sys     39m18.188s

Without THP:

real    4m48.957s
user    2118m23.208s
sys     113m12.740s

Looks like we got a few minutes faster in the with-THP case, but it's
still significantly slower, and that could just be a fluke result; we're
still floating at about a 5x performance degradation.

I talked with one of our performance/benchmarking experts last week, and
he's done a bit more research into the actual problem here, so I have a
bit more information:

The real performance hit, based on our testing, seems to come from the
increased latency on large NUMA systems when a process has to go
off-node to read from/write to memory.

To give an extreme example, say we have a 16-node system with 8 cores
per node. If we have a job that shares a 2MB data structure between 128
threads, then with THP on, the first thread to touch the structure will
allocate all 2MB of space for it in a single 2MB page, local to its
socket. This means that all the memory accesses for the 120 threads on
the other nodes will be remote accesses. With THP off, each thread can
locally allocate a number of 4K pages sufficient to hold the chunk of
the structure on which it needs to work, significantly reducing the
number of remote accesses that each thread will need to perform.

So, with that in mind, do we agree that a per-process tunable (or
something similar) to control THP seems like a reasonable method to
handle this issue? Just want to confirm that everyone likes this
approach before moving forward with another revision of the patch.

I'm currently in favor of moving this to a per-mm tunable, since that
seems to make more sense when it comes to threaded jobs. Also, a decent
chunk of the code I've already written can be reused with this approach,
and prctl will still be an appropriate place from which to control the
behavior.
Andrew Morton suggested possibly controlling this through the ELF
header, but I'm going to lean towards the per-mm route unless anyone has
a major objection to it.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 00/10] split page table lock for PMD tables
Kirill,

I've pasted in my results for 512 cores below. Things are looking
really good here. I don't have a test for HUGETLBFS, but if you want to
pass me the one you used, I can run that too. I suppose I could write
one, but why reinvent the wheel? :)

Sorry for the delay on these results. I hit some strange issues with
running thp_memscale on systems with either of the following
combinations of configuration options set:

[thp off]
HUGETLBFS=y
HUGETLB_PAGE=y
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

[thp on or off]
HUGETLBFS=n
HUGETLB_PAGE=n
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

I'm getting segfaults intermittently, as well as some weird RCU sched
errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
to do with your patches, but I thought I'd let you know. There didn't
used to be any issues with this test, so I think there's a subtle kernel
bug here. That's, of course, an entirely separate issue though.

As far as these patches go, I think everything looks good (save for the
bit of discussion you were having with Andrew earlier, which I think
you've worked out). My testing shows that the page fault rates are
actually better on this threaded test than in the non-threaded case!

- Alex

On Fri, Sep 27, 2013 at 04:16:17PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massively threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
>
> This patchset is based on work by Naoya Horiguchi.
>
> Please review and consider applying.
>
> Changes:
>  v4:
>   - convert hugetlb to new locking;
>  v3:
>   - fix USE_SPLIT_PMD_PTLOCKS;
>   - fix warning in fs/proc/task_mmu.c;
>  v2:
>   - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>   - s/huge_pmd_lock/pmd_lock/g;
>   - assume pgtable_pmd_page_ctor() can fail;
>   - fix format line in task_mem() for VmPTE;
>
> THP off, v3.12-rc2:
> -------------------
>
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
>     1037072.835207 task-clock              #  57.426 CPUs utilized            ( +- 3.59% )
>             95,093 context-switches        #   0.092 K/sec                    ( +- 3.93% )
>                140 cpu-migrations          #   0.000 K/sec                    ( +- 5.28% )
>         10,000,550 page-faults             #   0.010 M/sec                    ( +- 0.00% )
>  2,455,210,400,261 cycles                  #   2.367 GHz                      ( +- 3.62% ) [83.33%]
>  2,429,281,882,056 stalled-cycles-frontend #  98.94% frontend cycles idle     ( +- 3.67% ) [83.33%]
>  1,975,960,019,659 stalled-cycles-backend  #  80.48% backend cycles idle      ( +- 3.88% ) [66.68%]
>     46,503,296,013 instructions            #   0.02  insns per cycle
>                                            #  52.24  stalled cycles per insn  ( +- 3.21% ) [83.34%]
>      9,278,997,542 branches                #   8.947 M/sec                    ( +- 4.00% ) [83.34%]
>         89,881,640 branch-misses           #   0.97% of all branches          ( +- 1.17% ) [83.33%]
>
>       18.059261877 seconds time elapsed                                       ( +- 2.65% )
>
> THP on, v3.12-rc2:
> ------------------
>
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
>     3114745.395974 task-clock              #  73.875 CPUs utilized            ( +- 1.84% )
>            267,356 context-switches        #   0.086 K/sec                    ( +- 1.84% )
>                 99 cpu-migrations          #   0.000 K/sec                    ( +- 1.40% )
>             58,313 page-faults             #   0.019 K/sec                    ( +- 0.28% )
>  7,416,635,817,510 cycles                  #   2.381 GHz                      ( +- 1.83% ) [83.33%]
>  7,342,619,196,993 stalled-cycles-frontend #  99.00% frontend cycles idle     ( +- 1.88% ) [83.33%]
>  6,267,671,641,967 stalled-cycles-backend  #  84.51% backend cycles idle      ( +- 2.03% ) [66.67%]
>    117,819,935,165 instructions            #   0.02  insns per cycle
>                                            #  62.32  stalled cycles per insn  ( +- 4.39% ) [83.34%]
>     28,899,314,777 branches                #   9.278 M/sec                    ( +- 4.48% ) [83.34%]
>         71,787,032 branch-misses           #   0.25% of all branches          ( +- 1.03% ) [83.33%]
>
>       42.162306788 seconds time elapsed                                       ( +- 1.73% )
Re: [PATCHv4 00/10] split page table lock for PMD tables
On Fri, Oct 04, 2013 at 11:26:02PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > Kirill,
> >
> > I've pasted in my results for 512 cores below. Things are looking
> > really good here. I don't have a test for HUGETLBFS, but if you want to
> > pass me the one you used, I can run that too. I suppose I could write
> > one, but why reinvent the wheel? :)
>
> Patch below.

Good deal, thanks. I'll get some test results put up soon.

> > Sorry for the delay on these results. I hit some strange issues with
> > running thp_memscale on systems with either of the following
> > combinations of configuration options set:
> >
> > [thp off]
> > HUGETLBFS=y
> > HUGETLB_PAGE=y
> > NUMA_BALANCING=y
> > NUMA_BALANCING_DEFAULT_ENABLED=y
> >
> > [thp on or off]
> > HUGETLBFS=n
> > HUGETLB_PAGE=n
> > NUMA_BALANCING=y
> > NUMA_BALANCING_DEFAULT_ENABLED=y
> >
> > I'm getting segfaults intermittently, as well as some weird RCU sched
> > errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
> > to do with your patches, but I thought I'd let you know. There didn't
> > used to be any issues with this test, so I think there's a subtle kernel
> > bug here. That's, of course, an entirely separate issue though.
>
> I'll take a look next week, if nobody does it before.

I'm starting a bisect now. Not sure how long it'll take, but I'll keep
you posted.

> > As far as these patches go, I think everything looks good (save for the
> > bit of discussion you were having with Andrew earlier, which I think
> > you've worked out). My testing shows that the page fault rates are
> > actually better on this threaded test than in the non-threaded case!
> >
> > - Alex
> >
> > THP on, v3.12-rc2:
> > ------------------
> >
> >  Performance counter stats for './thp_memscale -C 0 -m 0 -c 512 -b 512m' (5 runs):
> >
> >      568668865.944994 task-clock              # 528.547 CPUs utilized            ( +- 0.21% ) [100.00%]
> >             1,491,589 context-switches        #   0.000 M/sec                    ( +- 0.25% ) [100.00%]
> >                 1,085 CPU-migrations          #   0.000 M/sec                    ( +- 1.80% ) [100.00%]
> >               400,822 page-faults             #   0.000 M/sec                    ( +- 0.41% )
> > 1,306,612,476,049,478 cycles                  #   2.298 GHz                      ( +- 0.23% ) [100.00%]
> > 1,277,211,694,318,724 stalled-cycles-frontend #  97.75% frontend cycles idle     ( +- 0.21% ) [100.00%]
> > 1,163,736,844,232,064 stalled-cycles-backend  #  89.07% backend cycles idle      ( +- 0.20% ) [100.00%]
> >    53,855,178,678,230 instructions            #   0.04  insns per cycle
> >                                               #  23.72  stalled cycles per insn  ( +- 1.15% ) [100.00%]
> >    21,041,661,816,782 branches                #  37.002 M/sec                    ( +- 0.64% ) [100.00%]
> >           606,665,092 branch-misses           #   0.00% of all branches          ( +- 0.63% )
> >
> >        1075.909782795 seconds time elapsed                                       ( +- 0.21% )
> >
> > THP on, patched:
> > ----------------
> >
> >  Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
> >
> >    15836198.490485 task-clock              # 533.304 CPUs utilized            ( +- 0.95% ) [100.00%]
> >            127,507 context-switches        #   0.000 M/sec                    ( +- 1.65% ) [100.00%]
> >              1,223 CPU-migrations          #   0.000 M/sec                    ( +- 3.23% ) [100.00%]
> >            302,080 page-faults             #   0.000 M/sec                    ( +- 6.88% )
> > 18,925,875,973,975 cycles                  #   1.195 GHz                      ( +- 0.43% ) [100.00%]
> > 18,325,469,464,007 stalled-cycles-frontend #  96.83% frontend cycles idle     ( +- 0.44% ) [100.00%]
> > 17,522,272,147,141 stalled-cycles-backend  #  92.58% backend cycles idle      ( +- 0.49% ) [100.00%]
> >  2,686,490,067,197 instructions            #   0.14  insns per cycle
> >                                            #   6.82  stalled cycles per insn  ( +- 2.16% ) [100.00%]
> >    944,712,646,402 branche
Re: [PATCHv2 0/9] split page table lock for PMD tables
On Mon, Sep 16, 2013 at 02:25:31PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massivly threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
>
> This patchset is based on work by Naoya Horiguchi.
>
> Changes:
>  v2:
>   - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>   - s/huge_pmd_lock/pmd_lock/g;
>   - assume pgtable_pmd_page_ctor() can fail;
>   - fix format line in task_mem() for VmPTE;
>
> Benchmark (from Alex):
> ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
> Run on 4 socket Westmere with 128 GiB of RAM.

Kirill,

I'm hitting some performance issues with these patches on our larger
machines (>=128 cores/256 threads). I've managed to livelock larger
systems with one of our tests (I'll publish this one soon), and I'm
actually seeing a performance hit on some of the smaller ones.

I'm currently collecting some results to show the problems I'm hitting,
and trying to research what's causing the livelock. For now I just
wanted to let you know that I'm seeing some issues. I'll be in touch
with more details.

Sorry for the delayed response; I wanted to be really sure that I was
seeing problems before I put up a red flag.

- Alex
Re: [PATCH 1/8] THP: Use real address for NUMA policy
On Tue, Aug 27, 2013 at 12:01:01PM -0500, Robin Holt wrote:
> Alex,
>
> Although the explanation seems plausible, have you verified this is
> actually possible? You could make a simple pthread test case which
> allocates a getpagesize() * area, prints its
> address and then each thread migrate and reference their page. Have
> the task then sleep() before exit. Look at the physical
> address space with dlook for those virtual addresses in both the THP
> and non-THP cases.
>
> Thanks,
> Robin

Robin,

I tweaked one of our other tests to behave pretty much exactly as I
described, and I can see a very significant increase in performance
with THP turned off. The test behaves as follows:

- malloc a large array
- Spawn a specified number of threads
- Have each thread touch small, evenly spaced chunks of the array (e.g.
  for 128 threads, the array is divided into 128 chunks, and each thread
  touches 1/128th of each chunk, dividing the array into 16,384 pieces)

With THP off, the majority of each thread's pages are node local. With
THP on, most of the pages end up as THPs on the first thread's node,
since it touches chunks that are close enough together to be collapsed
into THPs, which will, of course, remain on the first node for the
duration of the test.

Here are some timings for 128 threads, allocating a total of 64gb:

THP on:

real    1m6.394s
user    16m1.160s
sys     75m25.232s

THP off:

real    0m35.251s
user    26m37.316s
sys     3m28.472s

The performance hit here isn't as severe as shown with the SPEC workload
that we originally used, but the test still consistently takes about
twice as long with THP enabled.
Re: [PATCH 1/8] THP: Use real address for NUMA policy
> Robin,
>
> I tweaked one of our other tests to behave pretty much exactly as I
>
> - malloc a large array
> - Spawn a specified number of threads
> - Have each thread touch small, evenly spaced chunks of the array (e.g.
>   for 128 threads, the array is divided into 128 chunks, and each thread
>   touches 1/128th of each chunk, dividing the array into 16,384 pieces)

Forgot to mention that the threads don't touch their chunks of memory
concurrently; i.e., thread 2 has to wait for thread 1 to finish first.
This is important to note, since the pages won't all get stuck on the
first node without this behavior.
Re: [PATCH 2/2] thp: support split page table lock
On Thu, Sep 05, 2013 at 05:27:46PM -0400, Naoya Horiguchi wrote:
> Thp related code also uses per process mm->page_table_lock now.
> So making it fine-grained can provide better performance.
>
> This patch makes thp support split page table lock by using page->ptl
> of the pages storing "pmd_trans_huge" pmds.
>
> Some functions like pmd_trans_huge_lock() and page_check_address_pmd()
> are expected by their caller to pass back the pointer of ptl, so this
> patch adds to those functions new arguments for that. Rather than that,
> this patch gives only straightforward replacement.
>
> ChangeLog v3:
>  - fixed argument of huge_pmd_lockptr() in copy_huge_pmd()
>  - added missing declaration of ptl in do_huge_pmd_anonymous_page()

I've applied these and tested them using the same test program that I
used when I was working on the same issue, and I'm running into some
bugs. Here's a stack trace:

general protection fault: [#1] SMP
Modules linked in:
CPU: 268 PID: 32381 Comm: memscale Not tainted 3.11.0-medusa-03121-g757f8ca #184
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
task: 880fbdd82180 ti: 880fc0c5a000 task.ti: 880fc0c5a000
RIP: 0010:[] [] pgtable_trans_huge_withdraw+0x38/0x60
RSP: 0018:880fc0c5bc88 EFLAGS: 00010297
RAX: ea17cebe8838 RBX: 0015309bd000 RCX: ea01f623b028
RDX: dead00100100 RSI: 8dcf77d84c30 RDI: 880fbda67580
RBP: 880fc0c5bc88 R08: 0013 R09: 00014da0
R10: 880fc0c5bc88 R11: 888f7efda000 R12: 8dcf77d84c30
R13: 880fc0c5bdf8 R14: 85cf401ff067 R15: 8b4de5fabff8
FS: () GS:880fffd8() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7768b0b8 CR3: 01a0b000 CR4: 000407e0
Stack:
 880fc0c5bcc8 810f7643 880fc0c5bcc8 810d8297
 ea1456237510 7fc7b0e0 7fc7b0c0 880fc0c5bda8
 810d85ba 880fc0c5bd48 880fc0c5bd68
Call Trace:
 [] zap_huge_pmd+0x4c/0x101
 [] ? tlb_flush_mmu+0x58/0x75
 [] unmap_single_vma+0x306/0x7d6
 [] unmap_vmas+0x4f/0x82
 [] exit_mmap+0x8b/0x113
 [] ? __delayacct_add_tsk+0x170/0x182
 [] mmput+0x3e/0xc4
 [] do_exit+0x380/0x907
 [] ? vfs_write+0x149/0x1a3
 [] do_group_exit+0x72/0x9b
 [] SyS_exit_group+0x12/0x16
 [] system_call_fastpath+0x16/0x1b
Code: 51 20 48 8d 41 20 48 39 c2 75 0d 48 c7 87 28 03 00 00 00 00 00 00 eb 36 48 8d 42 e0 48 89 87 28 03 00 00 48 8b 51 20 48 8b 41 28 <48> 89 42 08 48 89 10 48 ba 00 01 10 00 00 00 ad de 48 b8 00 02
RIP [] pgtable_trans_huge_withdraw+0x38/0x60
RSP
---[ end trace e5413b388b6ea448 ]---
Fixing recursive fault but reboot is needed!

general protection fault: [#2] SMP
Modules linked in:
CPU: 268 PID: 1722 Comm: kworker/268:1 Tainted: G D 3.11.0-medusa-03121-g757f8ca #184
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
Workqueue: events vmstat_update
task: 880fc1a74280 ti: 880fc1a76000 task.ti: 880fc1a76000
RIP: 0010:[] [] free_pcppages_bulk+0x97/0x329
RSP: 0018:880fc1a77c98 EFLAGS: 00010082
RAX: 880fffd94d68 RBX: dead002001e0 RCX: 880fffd94d50
RDX: 880fffd94d68 RSI: 001f RDI: 888f7efdac68
RBP: 880fc1a77cf8 R08: 0400 R09: 81a8bf00
R10: 884f7efdac00 R11: 81009bae R12: dead00200200
R13: 888f7efdac00 R14: 001f R15:
FS: () GS:880fffd8() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7768b0b8 CR3: 01a0b000 CR4: 000407e0
Stack:
 880fc1a77ce8 880fffd94d68 0010 880fffd94d50
 001ff9276a68 880fffd94d60 001f 880fffd94d50
 0292 880fc1a77d38 880fffd95d05
Call Trace:
 [] drain_zone_pages+0x33/0x42
 [] refresh_cpu_vm_stats+0xcc/0x11e
 [] vmstat_update+0x11/0x43
 [] process_one_work+0x260/0x389
 [] worker_thread+0x1e2/0x332
 [] ? process_one_work+0x389/0x389
 [] kthread+0xb3/0xbd
 [] ? process_one_work+0x389/0x389
 [] ? kthread_freezable_should_stop+0x5b/0x5b
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_freezable_should_stop+0x5b/0x5b
Code: 48 89 55 c8 48 39 14 08 74 ce 41 83 fe 03 44 0f 44 75 c4 48 83 c2 08 48 89 45 b0 48 89 55 a8 48 8b 45 a8 4c 8b 20 49 8d 5c 24 e0 <48> 8b 53 20 48 8b 43 28 48 89 42 08 48 89 10 48 ba 00 01 10 00
RIP [] free_pcppages_bulk+0x97/0x329
RSP
---[ end trace e5413b388b6ea449 ]---

BUG: unable to handle kernel paging request at ffd8
IP: [] kthread_data+0xb/0x11
PGD 1a0c067 PUD 1a0e067 PMD 0
Oops: [#3] SMP
Modules linked in:
CPU: 268 PID: 1722 Comm: kworker/268:1 Tainted: G D 3.11.0-medusa-03121-g757f8ca #184
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
task: 880fc1a74280 ti: 880fc1a76000 task.ti: 880fc1a76000
RIP: 0010:[] [] kthread_data+0xb/0x11
RSP: 0
Re: [PATCH 2/2] thp: support split page table lock
On Mon, Sep 09, 2013 at 10:34:45AM +0800, Daniel J Blueman wrote:
> On Saturday, 7 September 2013 02:10:02 UTC+8, Naoya Horiguchi wrote:
> >Hi Alex,
> >
> >On Fri, Sep 06, 2013 at 11:04:23AM -0500, Alex Thorlton wrote:
> >> On Thu, Sep 05, 2013 at 05:27:46PM -0400, Naoya Horiguchi wrote:
> >> > Thp related code also uses per process mm->page_table_lock now.
> >> > So making it fine-grained can provide better performance.
> >> >
> >> > This patch makes thp support split page table lock by using page->ptl
> >> > of the pages storing "pmd_trans_huge" pmds.
> >> >
> >> > Some functions like pmd_trans_huge_lock() and page_check_address_pmd()
> >> > are expected by their caller to pass back the pointer of ptl, so this
> >> > patch adds to those functions new arguments for that. Rather than that,
> >> > this patch gives only straightforward replacement.
> >> >
> >> > ChangeLog v3:
> >> > - fixed argument of huge_pmd_lockptr() in copy_huge_pmd()
> >> > - added missing declaration of ptl in do_huge_pmd_anonymous_page()
> >>
> >> I've applied these and tested them using the same tests program that I
> >> used when I was working on the same issue, and I'm running into some
> >> bugs. Here's a stack trace:
> >
> >Thank you for helping testing. This bug is new to me.
>
> With 3.11, this patch series and CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
> I consistently hit the same failure when exiting one of my
> stress-testers [1] when using eg 24 cores.
>
> Doesn't happen with 8 cores, so likely needs enough virtual memory
> to use multiple split locks. Otherwise, this is very promising work!

Daniel,

Hillf Danton (dhi...@gmail.com) suggested putting the big
page_table_lock back into the two functions seen below. I re-tested
with this change, and it seems to resolve the issue.
- Alex

--- a/mm/pgtable-generic.c	Sat Sep  7 15:17:52 2013
+++ b/mm/pgtable-generic.c	Sat Sep  7 15:20:28 2013
@@ -127,12 +127,14 @@ void pmdp_splitting_flush(struct vm_area
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				pgtable_t pgtable)
 {
+	spin_lock(&mm->page_table_lock);
 	/* FIFO */
 	if (!mm->pmd_huge_pte)
 		INIT_LIST_HEAD(&pgtable->lru);
 	else
 		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
 	mm->pmd_huge_pte = pgtable;
+	spin_unlock(&mm->page_table_lock);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
@@ -144,6 +146,7 @@ pgtable_t pgtable_trans_huge_withdraw(st
 {
 	pgtable_t pgtable;
 
+	spin_lock(&mm->page_table_lock);
 	/* FIFO */
 	pgtable = mm->pmd_huge_pte;
 	if (list_empty(&pgtable->lru))
@@ -153,6 +156,7 @@ pgtable_t pgtable_trans_huge_withdraw(st
 					      struct page, lru);
 		list_del(&pgtable->lru);
 	}
+	spin_unlock(&mm->page_table_lock);
 
 	return pgtable;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
Re: [PATCH 1/8] THP: Use real address for NUMA policy
On Thu, Sep 05, 2013 at 01:15:10PM +0200, Ingo Molnar wrote:
>
> * Alex Thorlton wrote:
>
> > > Robin,
> > >
> > > I tweaked one of our other tests to behave pretty much exactly as I
> > >
> > > - malloc a large array
> > > - Spawn a specified number of threads
> > > - Have each thread touch small, evenly spaced chunks of the array (e.g.
> > >   for 128 threads, the array is divided into 128 chunks, and each thread
> > >   touches 1/128th of each chunk, dividing the array into 16,384 pieces)
> >
> > Forgot to mention that the threads don't touch their chunks of memory
> > concurrently, i.e. thread 2 has to wait for thread 1 to finish first.
> > This is important to note, since the pages won't all get stuck on the
> > first node without this behavior.
>
> Could you post the testcase please?
>
> Thanks,
>
> 	Ingo

Sorry for the delay here; I had to make sure that everything in my tests
was okay to push out to the public. Here's a pointer to the test I
wrote:

ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz

Everything needed to compile the test should be there (just run make in
the thp_pthread directory). To run the test, use something like:

time ./thp_pthread -C 0 -m 0 -c <threads> -b <memory>

I ran:

time ./thp_pthread -C 0 -m 0 -c 128 -b 128g

On a 256 core machine, with ~500gb of memory, and got these results:

THP off:

real    0m57.797s
user    46m22.156s
sys     6m14.220s

THP on:

real    1m36.906s
user    0m2.612s
sys     143m13.764s

I snagged some code from another test we use, so I can't vouch for the
usefulness/accuracy of all the output (actually, I know some of it is
wrong). I've mainly been looking at the total run time.

Don't want to bloat this e-mail up with too many test results, but I
found this one really interesting. Same machine, using all the cores,
with the same amount of memory.
This means that each cpu is actually doing *less* work, since the chunk
we reserve gets divided up evenly amongst the cpus:

time ./thp_pthread -C 0 -m 0 -c 256 -b 128g

THP off:

real    1m1.028s
user    104m58.448s
sys     8m52.908s

THP on:

real    2m26.072s
user    60m39.404s
sys     337m10.072s

Seems that the test scales really well in the THP-off case but, once
again, with THP on, we really see the performance start to degrade.

I'm planning to start investigating possible ways to split up THPs if we
detect that the majority of the references to a THP are off-node. I've
heard some horror stories about migrating pages in this situation (i.e.,
the process switches cpus and then all the pages follow it), but I think
we might be able to get some better results if we can cleverly determine
an appropriate time to split up pages.

I've heard a bit of talk about doing something similar to this from a
few people, but haven't seen any code/test results. If anybody has any
input on that topic, it would be greatly appreciated.

- Alex
Re: [PATCHv4 00/10] split page table lock for PMD tables
On Mon, Oct 07, 2013 at 12:48:20PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > > Sorry for the delay on these results. I hit some strange issues with
> > > > running thp_memscale on systems with either of the following
> > > > combinations of configuration options set:
> > > >
> > > > [thp off]
> > > > HUGETLBFS=y
> > > > HUGETLB_PAGE=y
> > > > NUMA_BALANCING=y
> > > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > > >
> > > > [thp on or off]
> > > > HUGETLBFS=n
> > > > HUGETLB_PAGE=n
> > > > NUMA_BALANCING=y
> > > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > > >
> > > > I'm getting segfaults intermittently, as well as some weird RCU sched
> > > > errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
> > > > to do with your patches, but I thought I'd let you know. There didn't
> > > > used to be any issues with this test, so I think there's a subtle kernel
> > > > bug here. That's, of course, an entirely separate issue though.
> > >
> > > I'll take a look next week, if nobody does it before.
> >
> > I'm starting a bisect now. Not sure how long it'll take, but I'll keep
> > you posted.
>
> I don't see the issue. Could you share your kernel config?

I put my kernel config up on ftp at:

ftp://shell.sgi.com/collect/atconfig/config_bug

I've been investigating the issue today, and the smallest run I've seen
the problem on was with 128 threads, so this might not be something that
most people will hit. With the config I've shared here, the problem
appears to be only intermittent at 128 threads. It happened on every run
of the test when I ran it with 512 threads.

Just for something to compare to, here's a config that seems to behave
just fine for any number of threads:

ftp://shell.sgi.com/collect/atconfig/config_good

It looks like this is a problem all the way back to the current 3.8
stable tree. I'm still working on tracing back to a kernel where this
problem doesn't show up.
- Alex
[RFC PATCH] Change THP code to use pud_page(pud)->ptl lock instead of page_table_lock
This patch changes out the page_table_lock for the pud_page ptl in the
THP fault path; pretty self-explanatory. I got lazy and commented out
the spinlock assertion in follow_trans_huge_pmd instead of digging up
the pud_page ptl in this function. This is just a proof of concept, so
I didn't feel that it was too important to keep around for now.

---
 mm/huge_memory.c | 4 ++--
 mm/memory.c      | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..d3b34e2f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1240,10 +1240,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   pmd_t *pmd,
 				   unsigned int flags)
 {
-	struct mm_struct *mm = vma->vm_mm;
+//	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
 
-	assert_spin_locked(&mm->page_table_lock);
+//	assert_spin_locked(&mm->page_table_lock);
 
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
index af84bc0..5b4e910 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1527,15 +1527,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 			split_huge_page_pmd(vma, address, pmd);
 			goto split_fallthrough;
 		}
-		spin_lock(&mm->page_table_lock);
+		spin_lock(&pud_page(*pud)->ptl);
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
-				spin_unlock(&mm->page_table_lock);
+				spin_unlock(&pud_page(*pud)->ptl);
 				wait_split_huge_page(vma->anon_vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-				spin_unlock(&mm->page_table_lock);
+				spin_unlock(&pud_page(*pud)->ptl);
 				*page_mask = HPAGE_PMD_NR - 1;
 				goto out;
 			}
-- 
1.7.12.4
[RFC PATCH] Increase locking granularity in THP page fault code
We have a threaded page fault scaling test that is performing very
poorly due to the use of the page_table_lock in the THP fault code path.
By replacing the page_table_lock with the ptl on the pud_page (this
needs CONFIG_SPLIT_PTLOCKS to work), we're able to increase the
granularity of the locking here by a factor of 512; i.e., instead of
locking all 256TB of addressable memory, we only lock the 512GB that is
handled by a single pud_page.

The test I'm running creates 512 threads, pins each thread to a cpu, has
the threads allocate 512mb of memory each, and then touch the first byte
of each 4k chunk of the allocated memory.

Here are the timings from this test on 3.11-rc7, clean, THP on:

real    22m50.904s
user    15m26.072s
sys     11430m19.120s

And here are the timings with my modified kernel, THP on:

real    0m37.018s
user    21m39.164s
sys     155m9.132s

As you can see, we get a huge performance boost by locking a more
targeted chunk of memory instead of locking the whole page table.

At this point, I'm comfortable saying that there are obvious benefits to
increasing the granularity of the locking, but I'm not sure that I've
done this in the best way possible. Mainly, I'm not positive that using
the pud_page lock actually protects everything that we're concerned
about locking here. I'm hoping that everyone can provide some input on
whether or not this seems like a reasonable move to make and, if so,
confirm that I've locked things appropriately.

As a side note, we still have some pretty significant scaling issues
with this test, both with THP on and off. I'm cleaning up the locking
here first, as this is causing the biggest performance hit.
Alex Thorlton (1):
  Change THP code to use pud_page(pud)->ptl lock instead of
    page_table_lock

 mm/huge_memory.c | 4 ++--
 mm/memory.c      | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

-- 
1.7.12.4
Re: [PATCHv2 0/9] split page table lock for PMD tables
> Let me guess: you have HUGETLBFS enabled in your config, right? ;)
>
> HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> HUGETLBFS is enabled.

Ah, that's got it! I double-checked my config a million times to make
sure that I wasn't going crazy, but I must've missed that. With that
fixed, it's performing exactly how I thought it should. Looking great
to me!

Sorry for the confusion!

- Alex
Re: [PATCHv2 0/9] split page table lock for PMD tables
On Fri, Sep 27, 2013 at 12:38:07AM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> > >
> > > HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> > > HUGETLBFS is enabled.
> >
> > Ah, that's got it! I double checked my config a million times to
> > make sure that I wasn't going crazy, but I must've missed that. With
> > that fixed, it's performing exactly how I thought it should. Looking
> > great to me!
>
> Can I use your Reviewed-by?

Sure, no problem.

- Alex
Re: [PATCHv2 0/9] split page table lock for PMD tables
On Fri, Sep 27, 2013 at 12:42:46AM +0300, Kirill A. Shutemov wrote: > Kirill A. Shutemov wrote: > > Alex Thorlton wrote: > > > > Let me guess: you have HUGETLBFS enabled in your config, right? ;) > > > > > > > > HUGETLBFS hasn't converted to new locking and we disable split pmd lock > > > > if > > > > HUGETLBFS is enabled. > > > > > > Ah, that's got it! I double checked my config a million times to > > > make sure that I wasn't going crazy, but I must've missed that. With > > > that fixed, it's performing exactly how I thought it should. Looking > > > great to me! > > > > Can I use your Reviewed-by? > > s/Reviewed-by/Tested-by/ That too. Whatever you need, haha. - Alex
Re: [PATCHv2 0/9] split page table lock for PMD tables
Kirill, I've merged a couple of our e-mails together for this reply: On Fri, Sep 20, 2013 at 03:31:37PM +0300, Kirill A. Shutemov wrote: > Alex Thorlton noticed that some massively threaded workloads work poorly, > if THP is enabled. This patchset fixes this by introducing split page table > lock for PMD tables. hugetlbfs is not covered yet. > > This patchset is based on work by Naoya Horiguchi. > > Changes: > v2: > - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock; > - s/huge_pmd_lock/pmd_lock/g; > - assume pgtable_pmd_page_ctor() can fail; > - fix format line in task_mem() for VmPTE; > > Benchmark (from Alex): > ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz > Run on 4-socket Westmere with 128 GiB of RAM. First off, this test, although still somewhat relevant here, wasn't exactly crafted to show the particular page fault scaling problem that I was looking to address by changing to the split PTLs. I wrote this one to show what happens when an application allocates memory in such a way that all of its pages get stuck on one node due to the use of THP. This problem is much more evident on large machines, although it appears to still be a significant issue even on this 4-socket system. 
> THP off: > > > Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 > runs): > > 1738259.808012 task-clock# 47.571 CPUs utilized >( +- 9.49% ) >147,359 context-switches #0.085 K/sec >( +- 9.67% ) > 14 cpu-migrations#0.000 K/sec >( +- 13.25% ) > 24,410,139 page-faults #0.014 M/sec >( +- 0.00% ) > 4,149,037,526,252 cycles#2.387 GHz >( +- 9.50% ) > 3,649,839,735,027 stalled-cycles-frontend # 87.97% frontend cycles idle >( +- 6.60% ) > 2,455,558,969,567 stalled-cycles-backend# 59.18% backend cycles idle >( +- 22.92% ) > 1,434,961,518,604 instructions #0.35 insns per cycle > #2.54 stalled cycles per > insn ( +- 92.86% ) >241,472,020,951 branches # 138.916 M/sec >( +- 91.72% ) > 84,022,172 branch-misses #0.03% of all branches >( +- 3.16% ) > > 36.540185552 seconds time elapsed >( +- 18.36% ) I'm assuming this was THP off, no patchset, correct? > THP on, no patchset: > > Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 > runs): > > 2528378.966949 task-clock# 50.715 CPUs utilized >( +- 11.86% ) >214,063 context-switches #0.085 K/sec >( +- 11.94% ) > 19 cpu-migrations#0.000 K/sec >( +- 22.72% ) > 49,226 page-faults #0.019 K/sec >( +- 0.33% ) > 6,034,640,598,498 cycles#2.387 GHz >( +- 11.91% ) > 5,685,933,794,081 stalled-cycles-frontend # 94.22% frontend cycles idle >( +- 7.67% ) > 4,414,381,393,353 stalled-cycles-backend# 73.15% backend cycles idle >( +- 2.09% ) >952,086,804,776 instructions #0.16 insns per cycle > #5.97 stalled cycles per > insn ( +- 89.59% ) >166,191,211,974 branches # 65.730 M/sec >( +- 85.52% ) > 33,341,022 branch-misses #0.02% of all branches >( +- 3.90% ) > > 49.854741504 seconds time elapsed >( +- 14.76% ) > > THP on, with patchset: > -- > > echo always > /sys/kernel/mm/transparent_hugepage/enabled > Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 > runs): > > 1538763.343568 task-clock# 45.386 CPUs utilized >( +- 7.21% ) >130,469 context-switches #0.085 K/sec >( +- 7.32% ) > 14 
cpu-migrations#0.000 K/sec >( +- 23.58% ) > 49,299 page-faults #0.032 K/sec >( +- 0.15% ) > 3,666,748,502,650 cycles#2.383 GHz >( +- 7.25% ) > 3,330,488,035,212 stalled-cycles-frontend # 90.83% frontend cycles
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Tue, Jun 11, 2013 at 03:20:09PM -0700, David Rientjes wrote: > On Tue, 11 Jun 2013, Alex Thorlton wrote: > > > This patch adds the ability to control THPs on a per cpuset basis. > > Please see > > the additions to Documentation/cgroups/cpusets.txt for more > > information. > > > > What's missing from both this changelog and the documentation you > point to > is why this change is needed. > > I can understand how you would want a subset of processes to not use > thp > when it is enabled. This is typically where MADV_NOHUGEPAGE is used > with > some type of malloc hook. > > I don't think we need to do this on a cpuset level, so unfortunately I > think this needs to be reworked. Would it make sense to add a > per-process > tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() > and > mmap() calls? Perhaps, but then you would need to justify why it > can't be > done with a malloc hook in userspace. > > This seems to just be working around a userspace issue or for a matter > of > convenience, right? David, Thanks for your input; however, I believe the method of using a malloc hook falls apart when it comes to static binaries, since we won't have any shared libraries to hook into. Although using a malloc hook is a perfectly suitable solution for most cases, we're looking to implement a solution that can be used in all situations. Aside from that particular shortcoming of the malloc hook solution, there are some other situations where having a cpuset-based option is a much simpler and more efficient solution than the alternatives. One such situation that comes to mind would be an environment where a batch scheduler is in use to ration system resources. If an administrator determines that a user's jobs run more efficiently with thp always on, the administrator can simply set that user's jobs to always run with that setting, instead of having to coordinate with that user to get them to run their jobs in a different way. 
I feel that, for cases such as this, this additional flag is in line with the other capabilities that cgroups and cpusets provide. While there are likely numerous other situations where having a flag to control thp on the cpuset level makes things a bit easier to manage, the one real show-stopper here is that we really have no other options when it comes to static binaries. - Alex Thorlton
[PATCH] Make transparent hugepages cpuset aware
This patch adds the ability to control THPs on a per cpuset basis. Please see the additions to Documentation/cgroups/cpusets.txt for more information. Signed-off-by: Alex Thorlton Reviewed-by: Robin Holt Cc: Li Zefan Cc: Rob Landley Cc: Andrew Morton Cc: Mel Gorman Cc: Rik van Riel Cc: Kirill A. Shutemov Cc: Johannes Weiner Cc: Xiao Guangrong Cc: David Rientjes Cc: linux-...@vger.kernel.org Cc: linux...@kvack.org --- Documentation/cgroups/cpusets.txt | 50 ++- include/linux/cpuset.h| 5 ++ include/linux/huge_mm.h | 25 +- kernel/cpuset.c | 181 ++ mm/huge_memory.c | 3 + 5 files changed, 261 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 12e01d4..b7b2c83 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt @@ -22,12 +22,14 @@ CONTENTS: 1.6 What is memory spread ? 1.7 What is sched_load_balance ? 1.8 What is sched_relax_domain_level ? - 1.9 How do I use cpusets ? + 1.9 What is thp_enabled ? + 1.10 How do I use cpusets ? 2. Usage Examples and Syntax 2.1 Basic Usage 2.2 Adding/removing cpus 2.3 Setting flags 2.4 Attaching processes + 2.5 Setting thp_enabled flags 3. Questions 4. Contact @@ -581,7 +583,34 @@ If your situation is: then increasing 'sched_relax_domain_level' would benefit you. -1.9 How do I use cpusets ? +1.9 What is thp_enabled ? +--- + +The thp_enabled file contained within each cpuset controls how transparent +hugepages are handled within that cpuset. + +The root cpuset's thp_enabled flags mirror the flags set in +/sys/kernel/mm/transparent_hugepage/enabled. The flags in the root cpuset can +only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. The +thp_enabled file for the root cpuset is read only. 
These flags cause the +root cpuset to behave as one might expect: + +- When set to always, THPs are used whenever practical +- When set to madvise, THPs are used only on chunks of memory that have the + MADV_HUGEPAGE flag set +- When set to never, THPs are never allowed for tasks in this cpuset + +The behavior of thp_enabled for children of the root cpuset is where things +become a bit more interesting. The child cpusets accept the same flags as the +root, but also have a default flag, which, when set, causes a cpuset to use the +behavior of its parent. When a child cpuset is created, its default flag is +always initially set. + +Since the flags on child cpusets are allowed to differ from the flags on their +parents, we are able to enable THPs for tasks in specific cpusets, and disable +them in others. + +1.10 How do I use cpusets ? -- In order to minimize the impact of cpusets on critical kernel @@ -733,6 +762,7 @@ cpuset.cpuscpuset.sched_load_balance cpuset.mem_exclusive cpuset.sched_relax_domain_level cpuset.mem_hardwallnotify_on_release cpuset.memory_migrate tasks +thp_enabled Reading them will give you information about the state of this cpuset: the CPUs and Memory Nodes it can use, the processes that are using @@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one after another: ... # /bin/echo PIDn > tasks +2.5 Setting thp_enabled flags +- + +The syntax for setting these flags is similar to setting thp flags in +/sys/kernel/mm/transparent_hugepage/enabled. In order to change the flags you +simply echo the name of the flag you want to set to the thp_enabled file of the +desired cpuset: + +# /bin/echo always > thp_enabled -> always use THPs when practical +# /bin/echo madvise > thp_enabled -> only use THPs in madvise sections +# /bin/echo never > thp_enabled-> never use THPs +# /bin/echo default > thp_enabled -> use parent cpuset's THP flags + +Note that the flags on the root cpuset cannot be changed in /dev/cpuset. 
These +flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can only +be modified there. 3. Questions diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index cc1b01c..624aafd 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -19,9 +19,12 @@ extern int number_of_cpusets;/* How many cpusets are defined in system? */ extern int cpuset_init(void); extern void cpuset_init_smp(void); +extern void cpuset_update_top_thp_flags(void); extern void cpuset_update_active_cpus(bool cpu_online); extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask); extern void cpuset_cpus_allowed_fallback(struct task_struct *p); +extern int cpuset_thp_always(struct task_struct *p); +extern int cpuset_thp_madvise(struct task_struct *p); extern nodemask_t cpuset_mems_allowed(struct task_struct *p)
Re: [PATCH] Make transparent hugepages cpuset aware
On Tue, Jun 11, 2013 at 09:55:18AM +0300, Kirill A. Shutemov wrote: > Alex Thorlton wrote: > > This patch adds the ability to control THPs on a per cpuset basis. Please > > see > > the additions to Documentation/cgroups/cpusets.txt for more information. > > > > Signed-off-by: Alex Thorlton > > Reviewed-by: Robin Holt > > Cc: Li Zefan > > Cc: Rob Landley > > Cc: Andrew Morton > > Cc: Mel Gorman > > Cc: Rik van Riel > > Cc: Kirill A. Shutemov > > Cc: Johannes Weiner > > Cc: Xiao Guangrong > > Cc: David Rientjes > > Cc: linux-...@vger.kernel.org > > Cc: linux...@kvack.org > > --- > > Documentation/cgroups/cpusets.txt | 50 ++- > > include/linux/cpuset.h| 5 ++ > > include/linux/huge_mm.h | 25 +- > > kernel/cpuset.c | 181 > > ++ > > mm/huge_memory.c | 3 + > > 5 files changed, 261 insertions(+), 3 deletions(-) > > > > diff --git a/Documentation/cgroups/cpusets.txt > > b/Documentation/cgroups/cpusets.txt > > index 12e01d4..b7b2c83 100644 > > --- a/Documentation/cgroups/cpusets.txt > > +++ b/Documentation/cgroups/cpusets.txt > > @@ -22,12 +22,14 @@ CONTENTS: > >1.6 What is memory spread ? > >1.7 What is sched_load_balance ? > >1.8 What is sched_relax_domain_level ? > > - 1.9 How do I use cpusets ? > > + 1.9 What is thp_enabled ? > > + 1.10 How do I use cpusets ? > > 2. Usage Examples and Syntax > >2.1 Basic Usage > >2.2 Adding/removing cpus > >2.3 Setting flags > >2.4 Attaching processes > > + 2.5 Setting thp_enabled flags > > 3. Questions > > 4. Contact > > > > @@ -581,7 +583,34 @@ If your situation is: > > then increasing 'sched_relax_domain_level' would benefit you. > > > > > > -1.9 How do I use cpusets ? > > +1.9 What is thp_enabled ? > > +--- > > + > > +The thp_enabled file contained within each cpuset controls how transparent > > +hugepages are handled within that cpuset. > > + > > +The root cpuset's thp_enabled flags mirror the flags set in > > +/sys/kernel/mm/transparent_hugepage/enabled. 
The flags in the root cpuset > > can > > +only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. > > The > > +thp_enabled file for the root cpuset is read only. These flags cause the > > +root cpuset to behave as one might expect: > > + > > +- When set to always, THPs are used whenever practical > > +- When set to madvise, THPs are used only on chunks of memory that have the > > + MADV_HUGEPAGE flag set > > +- When set to never, THPs are never allowed for tasks in this cpuset > > + > > +The behavior of thp_enabled for children of the root cpuset is where things > > +become a bit more interesting. The child cpusets accept the same flags as > > the > > +root, but also have a default flag, which, when set, causes a cpuset to > > use the > > +behavior of its parent. When a child cpuset is created, its default flag > > is > > +always initially set. > > + > > +Since the flags on child cpusets are allowed to differ from the flags on > > their > > +parents, we are able to enable THPs for tasks in specific cpusets, and > > disable > > +them in others. > > Should we have a way for the parent cgroup to enforce child behaviour? > Like a mask of allowed thp_enabled values children can choose. > We don't have a use case for that particular scenario, so we didn't include any such functionality. Our main goal here was to allow cpusets to override the /sys/kernel/mm/transparent_hugepage/enabled setting. If you have a use case for that scenario, then I think it would be more suitable to add that functionality in a separate patch. 
> > @@ -177,6 +177,29 @@ static inline struct page *compound_trans_head(struct > > page *page) > > return page; > > } > > > > +#ifdef CONFIG_CPUSETS > > +extern int cpuset_thp_always(struct task_struct *p); > > +extern int cpuset_thp_madvise(struct task_struct *p); > > + > > +static inline int transparent_hugepage_enabled(struct vm_area_struct *vma) > > +{ > > + if (cpuset_thp_always(current)) > > + return 1; > > Why do you ignore VM_NOHUGEPAGE? > And !is_vma_temporary_stack(__vma) is still relevant. > That was an oversight, on my part. I've fixed it and will submit the corrected patch shortly. Thanks for pointing that out.
[PATCH v2] Make transparent hugepages cpuset aware
This patch adds the ability to control THPs on a per cpuset basis. Please see the additions to Documentation/cgroups/cpusets.txt for more information. Signed-off-by: Alex Thorlton Reviewed-by: Robin Holt Cc: Li Zefan Cc: Rob Landley Cc: Andrew Morton Cc: Mel Gorman Cc: Rik van Riel Cc: Kirill A. Shutemov Cc: Johannes Weiner Cc: Xiao Guangrong Cc: David Rientjes Cc: linux-...@vger.kernel.org Cc: linux...@kvack.org --- Changes since last patch version: - Modified transparent_hugepage_enable to always check the vma for the VM_NOHUGEPAGE flag and to always check is_vma_temporary_stack. - Moved cpuset_update_child_thp_flags above cpuset_update_top_thp_flags Documentation/cgroups/cpusets.txt | 50 ++- include/linux/cpuset.h| 5 ++ include/linux/huge_mm.h | 27 +- kernel/cpuset.c | 181 ++ mm/huge_memory.c | 3 + 5 files changed, 263 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 12e01d4..b7b2c83 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt @@ -22,12 +22,14 @@ CONTENTS: 1.6 What is memory spread ? 1.7 What is sched_load_balance ? 1.8 What is sched_relax_domain_level ? - 1.9 How do I use cpusets ? + 1.9 What is thp_enabled ? + 1.10 How do I use cpusets ? 2. Usage Examples and Syntax 2.1 Basic Usage 2.2 Adding/removing cpus 2.3 Setting flags 2.4 Attaching processes + 2.5 Setting thp_enabled flags 3. Questions 4. Contact @@ -581,7 +583,34 @@ If your situation is: then increasing 'sched_relax_domain_level' would benefit you. -1.9 How do I use cpusets ? +1.9 What is thp_enabled ? +--- + +The thp_enabled file contained within each cpuset controls how transparent +hugepages are handled within that cpuset. + +The root cpuset's thp_enabled flags mirror the flags set in +/sys/kernel/mm/transparent_hugepage/enabled. The flags in the root cpuset can +only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. 
The +thp_enabled file for the root cpuset is read only. These flags cause the +root cpuset to behave as one might expect: + +- When set to always, THPs are used whenever practical +- When set to madvise, THPs are used only on chunks of memory that have the + MADV_HUGEPAGE flag set +- When set to never, THPs are never allowed for tasks in this cpuset + +The behavior of thp_enabled for children of the root cpuset is where things +become a bit more interesting. The child cpusets accept the same flags as the +root, but also have a default flag, which, when set, causes a cpuset to use the +behavior of its parent. When a child cpuset is created, its default flag is +always initially set. + +Since the flags on child cpusets are allowed to differ from the flags on their +parents, we are able to enable THPs for tasks in specific cpusets, and disable +them in others. + +1.10 How do I use cpusets ? -- In order to minimize the impact of cpusets on critical kernel @@ -733,6 +762,7 @@ cpuset.cpuscpuset.sched_load_balance cpuset.mem_exclusive cpuset.sched_relax_domain_level cpuset.mem_hardwallnotify_on_release cpuset.memory_migrate tasks +thp_enabled Reading them will give you information about the state of this cpuset: the CPUs and Memory Nodes it can use, the processes that are using @@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one after another: ... # /bin/echo PIDn > tasks +2.5 Setting thp_enabled flags +- + +The syntax for setting these flags is similar to setting thp flags in +/sys/kernel/mm/transparent_hugepage/enabled. 
In order to change the flags you +simply echo the name of the flag you want to set to the thp_enabled file of the +desired cpuset: + +# /bin/echo always > thp_enabled -> always use THPs when practical +# /bin/echo madvise > thp_enabled -> only use THPs in madvise sections +# /bin/echo never > thp_enabled-> never use THPs +# /bin/echo default > thp_enabled -> use parent cpuset's THP flags + +Note that the flags on the root cpuset cannot be changed in /dev/cpuset. These +flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can only +be modified there. 3. Questions diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index cc1b01c..624aafd 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -19,9 +19,12 @@ extern int number_of_cpusets;/* How many cpusets are defined in system? */ extern int cpuset_init(void); extern void cpuset_init_smp(void); +extern void cpuset_update_top_thp_flags(void); extern void cpuset_update_active_cpus(bool cpu_online); extern void cpuset_cpus_allowed(struct task
[PATCH 0/8] Re: [PATCH] Add per-process flag to control thp
Here are the results from one of the benchmarks that performs particularly poorly when thp is enabled. Unfortunately the vclear patches don't seem to provide a performance boost. I've attached the patches that include the changes I had to make to get the vclear patches applied to the latest kernel. This first set of tests was run on the latest community kernel, with the vclear patches: Kernel string: Kernel 3.11.0-rc5-medusa-00021-g1a15a96-dirty harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# time ./run.sh ... Done. Terminating the simulation. real 25m34.052s user 10769m7.948s sys 37m46.524s harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# echo never > /sys/kernel/mm/transparent_hugepage/enabled harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# cat /sys/kernel/mm/transparent_hugepage/enabled always madvise [never] harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l# time ./run.sh ... Done. Terminating the simulation. real 5m0.377s user 2202m0.684s sys 108m31.816s Here are the same tests on the clean kernel: Kernel string: Kernel 3.11.0-rc5-medusa-00013-g584d88b athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l> cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l> time ./run.sh ... Done. Terminating the simulation. 
real 21m44.052s user 10809m55.356s sys 39m58.300s harp31-sys:~ # echo never > /sys/kernel/mm/transparent_hugepage/enabled athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l> cat /sys/kernel/mm/transparent_hugepage/enabled always madvise [never] athorlton@harp31-sys:/home/estes04/athorlton/Testing/progs/thp_benchmarks/321.equake_l> time ./run.sh ... Done. Terminating the simulation. real 4m52.502s user 2127m18.548s sys 104m50.828s Working on getting some more information about the root of the performance issues now... Alex Thorlton (8): THP: Use real address for NUMA policy mm: make clear_huge_page tolerate non aligned address THP: Pass real, not rounded, address to clear_huge_page x86: Add clear_page_nocache mm: make clear_huge_page cache clear only around the fault address x86: switch the 64bit uncached page clear to SSE/AVX v2 remove KM_USER0 from kmap_atomic call fix up references to kernel_fpu_begin/end arch/x86/include/asm/page.h | 2 + arch/x86/include/asm/string_32.h | 5 ++ arch/x86/include/asm/string_64.h | 5 ++ arch/x86/lib/Makefile | 1 + arch/x86/lib/clear_page_nocache_32.S | 30 arch/x86/lib/clear_page_nocache_64.S | 92 arch/x86/mm/fault.c | 7 +++ mm/huge_memory.c | 17 +++ mm/memory.c | 31 ++-- 9 files changed, 179 insertions(+), 11 deletions(-) create mode 100644 arch/x86/lib/clear_page_nocache_32.S create mode 100644 arch/x86/lib/clear_page_nocache_64.S -- 1.7.12.4
[PATCH 3/8] THP: Pass real, not rounded, address to clear_huge_page
Conflicts: mm/huge_memory.c --- mm/huge_memory.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 55ec681..2e6f106 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -701,7 +701,8 @@ static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma) static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long haddr, pmd_t *pmd, + unsigned long haddr, + unsigned long address, pmd_t *pmd, struct page *page) { pgtable_t pgtable; @@ -711,12 +712,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, if (unlikely(!pgtable)) return VM_FAULT_OOM; - clear_huge_page(page, haddr, HPAGE_PMD_NR); /* * The memory barrier inside __SetPageUptodate makes sure that * clear_huge_page writes become visible before the set_pmd_at() * write. */ + clear_huge_page(page, address, HPAGE_PMD_NR); __SetPageUptodate(page); spin_lock(&mm->page_table_lock); @@ -825,8 +826,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, put_page(page); goto out; } - if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, - page))) { + if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, + address, pmd, page))) { mem_cgroup_uncharge_page(page); put_page(page); goto out; -- 1.7.12.4
[PATCH 7/8] remove KM_USER0 from kmap_atomic call
--- arch/x86/mm/fault.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index a088ed7..d3f6dc7 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1233,7 +1233,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) void clear_user_highpage_nocache(struct page *page, unsigned long vaddr) { - void *p = kmap_atomic(page, KM_USER0); + void *p = kmap_atomic(page); clear_page_nocache(p); kunmap_atomic(p); } -- 1.7.12.4
[PATCH 8/8] fix up references to kernel_fpu_begin/end
--- arch/x86/lib/clear_page_nocache_64.S | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S index a6d938c..d03408e 100644 --- a/arch/x86/lib/clear_page_nocache_64.S +++ b/arch/x86/lib/clear_page_nocache_64.S @@ -23,7 +23,7 @@ ENTRY(clear_page_nocache) CFI_STARTPROC push %rdi - call kernel_fpu_begin + call __kernel_fpu_begin pop%rdi sub$16,%rsp CFI_ADJUST_CFA_OFFSET 16 @@ -43,7 +43,7 @@ ENTRY(clear_page_nocache) movdqu (%rsp),%xmm0 addq $16,%rsp CFI_ADJUST_CFA_OFFSET -16 - jmp kernel_fpu_end + jmp __kernel_fpu_end CFI_ENDPROC ENDPROC(clear_page_nocache) @@ -65,7 +65,7 @@ ENDPROC(clear_page_nocache) ENTRY(clear_page_nocache_avx) CFI_STARTPROC push %rdi - call kernel_fpu_begin + call __kernel_fpu_begin pop%rdi sub$32,%rsp CFI_ADJUST_CFA_OFFSET 32 @@ -85,7 +85,7 @@ ENTRY(clear_page_nocache_avx) vmovdqu (%rsp),%ymm0 addq $32,%rsp CFI_ADJUST_CFA_OFFSET -32 - jmp kernel_fpu_end + jmp __kernel_fpu_end CFI_ENDPROC ENDPROC(clear_page_nocache_avx) -- 1.7.12.4
[PATCH 6/8] x86: switch the 64bit uncached page clear to SSE/AVX v2
--- arch/x86/lib/clear_page_nocache_64.S | 91 ++-- 1 file changed, 77 insertions(+), 14 deletions(-) diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S index ee16d15..a6d938c 100644 --- a/arch/x86/lib/clear_page_nocache_64.S +++ b/arch/x86/lib/clear_page_nocache_64.S @@ -1,29 +1,92 @@ +/* + * Clear pages with cache bypass. + * + * Copyright (C) 2011, 2012 Intel Corporation + * Author: Andi Kleen + * + * This software may be redistributed and/or modified under the terms of + * the GNU General Public License ("GPL") version 2 only as published by the + * Free Software Foundation. + */ + #include +#include +#include #include +#define SSE_UNROLL 128 + /* * Zero a page avoiding the caches * rdi page */ ENTRY(clear_page_nocache) CFI_STARTPROC - xorl %eax,%eax - movl $4096/64,%ecx + push %rdi + call kernel_fpu_begin + pop%rdi + sub$16,%rsp + CFI_ADJUST_CFA_OFFSET 16 + movdqu %xmm0,(%rsp) + xorpd %xmm0,%xmm0 + movl $4096/SSE_UNROLL,%ecx .p2align 4 .Lloop: decl%ecx -#define PUT(x) movnti %rax,x*8(%rdi) - movnti %rax,(%rdi) - PUT(1) - PUT(2) - PUT(3) - PUT(4) - PUT(5) - PUT(6) - PUT(7) - leaq64(%rdi),%rdi + .set x,0 + .rept SSE_UNROLL/16 + movntdq %xmm0,x(%rdi) + .set x,x+16 + .endr + leaqSSE_UNROLL(%rdi),%rdi jnz .Lloop - nop - ret + movdqu (%rsp),%xmm0 + addq $16,%rsp + CFI_ADJUST_CFA_OFFSET -16 + jmp kernel_fpu_end CFI_ENDPROC ENDPROC(clear_page_nocache) + +#ifdef CONFIG_AS_AVX + + .section .altinstr_replacement,"ax" +1: .byte 0xeb /* jmp */ + .byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b) + /* offset */ +2: + .previous + .section .altinstructions,"a" + altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\ +16, 2b-1b + .previous + +#define AVX_UNROLL 256 /* TUNE ME */ + +ENTRY(clear_page_nocache_avx) + CFI_STARTPROC + push %rdi + call kernel_fpu_begin + pop%rdi + sub$32,%rsp + CFI_ADJUST_CFA_OFFSET 32 + vmovdqu %ymm0,(%rsp) + vxorpd %ymm0,%ymm0,%ymm0 + movl $4096/AVX_UNROLL,%ecx + .p2align 4 +.Lloop_avx: + 
decl%ecx + .set x,0 + .rept AVX_UNROLL/32 + vmovntdq %ymm0,x(%rdi) + .set x,x+32 + .endr + leaqAVX_UNROLL(%rdi),%rdi + jnz .Lloop_avx + vmovdqu (%rsp),%ymm0 + addq $32,%rsp + CFI_ADJUST_CFA_OFFSET -32 + jmp kernel_fpu_end + CFI_ENDPROC +ENDPROC(clear_page_nocache_avx) + +#endif -- 1.7.12.4
[PATCH 2/8] mm: make clear_huge_page tolerate non aligned address
--- mm/memory.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 1ce2e2a..69a5a38 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4249,16 +4249,17 @@ void clear_huge_page(struct page *page, unsigned long addr, unsigned int pages_per_huge_page) { int i; + unsigned long haddr = addr & HPAGE_PMD_MASK; if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { - clear_gigantic_page(page, addr, pages_per_huge_page); + clear_gigantic_page(page, haddr, pages_per_huge_page); return; } might_sleep(); for (i = 0; i < pages_per_huge_page; i++) { cond_resched(); - clear_user_highpage(page + i, addr + i * PAGE_SIZE); + clear_user_highpage(page + i, haddr + i * PAGE_SIZE); } } -- 1.7.12.4
[PATCH 1/8] THP: Use real address for NUMA policy
--- mm/huge_memory.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a92012a..55ec681 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -746,11 +746,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp) static inline struct page *alloc_hugepage_vma(int defrag, struct vm_area_struct *vma, - unsigned long haddr, int nd, + unsigned long address, int nd, gfp_t extra_gfp) { return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp), - HPAGE_PMD_ORDER, vma, haddr, nd); + HPAGE_PMD_ORDER, vma, address, nd); } #ifndef CONFIG_NUMA @@ -815,7 +815,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, return 0; } page = alloc_hugepage_vma(transparent_hugepage_defrag(vma), - vma, haddr, numa_node_id(), 0); + vma, address, numa_node_id(), 0); if (unlikely(!page)) { count_vm_event(THP_FAULT_FALLBACK); goto out; @@ -1160,7 +1160,7 @@ alloc: if (transparent_hugepage_enabled(vma) && !transparent_hugepage_debug_cow()) new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma), - vma, haddr, numa_node_id(), 0); + vma, address, numa_node_id(), 0); else new_page = NULL; -- 1.7.12.4
[PATCH 4/8] x86: Add clear_page_nocache
Conflicts: arch/x86/mm/fault.c --- arch/x86/include/asm/page.h | 2 ++ arch/x86/include/asm/string_32.h | 5 + arch/x86/include/asm/string_64.h | 5 + arch/x86/lib/Makefile| 1 + arch/x86/lib/clear_page_nocache_32.S | 30 ++ arch/x86/lib/clear_page_nocache_64.S | 29 + arch/x86/mm/fault.c | 7 +++ 7 files changed, 79 insertions(+) create mode 100644 arch/x86/lib/clear_page_nocache_32.S create mode 100644 arch/x86/lib/clear_page_nocache_64.S diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h index c878924..ccec9c3 100644 --- a/arch/x86/include/asm/page.h +++ b/arch/x86/include/asm/page.h @@ -33,6 +33,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr, copy_page(to, from); } +void clear_user_highpage_nocache(struct page *page, unsigned long vaddr); + #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr) #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h index 3d3e835..3f2fbcf 100644 --- a/arch/x86/include/asm/string_32.h +++ b/arch/x86/include/asm/string_32.h @@ -3,6 +3,8 @@ #ifdef __KERNEL__ +#include + /* Let gcc decide whether to inline or use the out of line functions */ #define __HAVE_ARCH_STRCPY @@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern, #define __HAVE_ARCH_MEMSCAN extern void *memscan(void *addr, int c, size_t size); +#define ARCH_HAS_USER_NOCACHE 1 +asmlinkage void clear_page_nocache(void *page); + #endif /* __KERNEL__ */ #endif /* _ASM_X86_STRING_32_H */ diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index 19e2c46..ca23d1d 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -3,6 +3,8 @@ #ifdef __KERNEL__ +#include + /* Written 2002 by Andi Kleen */ /* Only used for special circumstances. 
Stolen from i386/string.h */ @@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src); char *strcat(char *dest, const char *src); int strcmp(const char *cs, const char *ct); +#define ARCH_HAS_USER_NOCACHE 1 +asmlinkage void clear_page_nocache(void *page); + #endif /* __KERNEL__ */ #endif /* _ASM_X86_STRING_64_H */ diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index 96b2c66..e48776c 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o lib-$(CONFIG_SMP) += rwlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o +lib-y += clear_page_nocache_$(BITS).o obj-y += msr.o msr-reg.o msr-reg-export.o diff --git a/arch/x86/lib/clear_page_nocache_32.S b/arch/x86/lib/clear_page_nocache_32.S new file mode 100644 index 000..2394e0c --- /dev/null +++ b/arch/x86/lib/clear_page_nocache_32.S @@ -0,0 +1,30 @@ +#include +#include + +/* + * Zero a page avoiding the caches + * rdi page + */ +ENTRY(clear_page_nocache) + CFI_STARTPROC + mov%eax,%edi + xorl %eax,%eax + movl $4096/64,%ecx + .p2align 4 +.Lloop: + decl%ecx +#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi) + PUT(0) + PUT(1) + PUT(2) + PUT(3) + PUT(4) + PUT(5) + PUT(6) + PUT(7) + lea 64(%edi),%edi + jnz .Lloop + nop + ret + CFI_ENDPROC +ENDPROC(clear_page_nocache) diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S new file mode 100644 index 000..ee16d15 --- /dev/null +++ b/arch/x86/lib/clear_page_nocache_64.S @@ -0,0 +1,29 @@ +#include +#include + +/* + * Zero a page avoiding the caches + * rdi page + */ +ENTRY(clear_page_nocache) + CFI_STARTPROC + xorl %eax,%eax + movl $4096/64,%ecx + .p2align 4 +.Lloop: + decl%ecx +#define PUT(x) movnti %rax,x*8(%rdi) + movnti %rax,(%rdi) + PUT(1) + PUT(2) + PUT(3) + PUT(4) + PUT(5) + PUT(6) + PUT(7) + leaq64(%rdi),%rdi + jnz .Lloop + nop + ret + CFI_ENDPROC +ENDPROC(clear_page_nocache) diff --git a/arch/x86/mm/fault.c 
b/arch/x86/mm/fault.c index 654be4a..a088ed7 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1230,3 +1230,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) __do_page_fault(regs, error_code); exception_exit(prev_state); } + +void clear_user_highpage_nocache(struct page *page, unsigned long vaddr) +{ + void *p = kmap_atomic(page, KM_USER0); + clear_page_nocache(p); + kunmap_atomic(p); +} -- 1.7.12.4
[PATCH 5/8] mm: make clear_huge_page cache clear only around the fault address
---
 mm/memory.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69a5a38..17d61f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4231,18 +4231,35 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
 {
 	int i;
 	struct page *p = page;
+	unsigned long vaddr;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int target = (addr - haddr) >> PAGE_SHIFT;
 
 	might_sleep();
+	vaddr = haddr;
 	for (i = 0; i < pages_per_huge_page;
 	     i++, p = mem_map_next(p, page, i)) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+		vaddr = haddr + i*PAGE_SIZE;
+		if (!ARCH_HAS_USER_NOCACHE || i == target)
+			clear_user_highpage(p, vaddr);
+		else
+			clear_user_highpage_nocache(p, vaddr);
 	}
 }
 
 void clear_huge_page(struct page *page,
@@ -4250,6 +4267,8 @@ void clear_huge_page(struct page *page,
 {
 	int i;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	unsigned long vaddr;
+	int target = (addr - haddr) >> PAGE_SHIFT;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, haddr, pages_per_huge_page);
@@ -4257,9 +4276,14 @@ void clear_huge_page(struct page *page,
 	}
 
 	might_sleep();
+	vaddr = haddr;
 	for (i = 0; i < pages_per_huge_page; i++) {
 		cond_resched();
-		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+		vaddr = haddr + i*PAGE_SIZE;
+		if (!ARCH_HAS_USER_NOCACHE || i == target)
+			clear_user_highpage(page + i, vaddr);
+		else
+			clear_user_highpage_nocache(page + i, vaddr);
 	}
 }
-- 
1.7.12.4
Re: [PATCH 0/8] Re: [PATCH] Add per-process flag to control thp
> This 321.equake_l thing is not public, right? Do you have anything
> that is public that shows the same problem?

The quake test comes from the SPEC OMP benchmarks. While I believe this
suite is available to anybody, I don't think it's free. I was given the
test by our benchmarking team, I'll have to ask them if this is publicly
available/if they know of any publicly available tests that can
reproduce the issue.

> Do you have any idea where all the extra time is spend?

I'm working on that as we speak. I was trying to eliminate the
possibility that the vclear patches would help solve the problem, before
digging any deeper. I should have some more information on exactly where
it's getting hung up in a few hours.

- Alex
Re: [PATCH 1/8] THP: Use real address for NUMA policy
> Could you add some actual descriptions to these patches that say why you
> are doing this, and why this particular patch is needed and implemented
> this way?
>
> You mention that THP is slow for you, then go on to implement some
> non-cached page zero'ing, but you never quite connect the dots.

I actually didn't write these patches (made a few tweaks to get them
running on the latest kernel though). They were submitted last July by
Peter Zijlstra. Andi Kleen suggested that I re-run some of my tests
using these patches to see whether they solved my issue. I just included
my updated patches so that people could confirm that I'd pulled them
forward properly. The messages from the original submission can be found
here:

https://lkml.org/lkml/2012/7/20/165

I did write the patch that these were submitted in response to, to
control THP on a per-process basis, but it looks like we're probably
going to end up taking this in a different direction, pending some more
test results.

- Alex
Re: [PATCH 1/8] THP: Use real address for NUMA policy
On Fri, Aug 16, 2013 at 09:46:35PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 16, 2013 at 01:17:28PM -0500, Alex Thorlton wrote:
> > I actually didn't write these patches (made a few tweaks to get them
> > running on the latest kernel though). They were submitted last July by
> > Peter Zijlstra.
>
> By Kirill, I don't think I've ever touched them.
>
> > The messages from the original submission can be found here:
> > https://lkml.org/lkml/2012/7/20/165
>
> As can be seen by this posting you reference ;-)

Sorry about that! Not sure where I got the idea that you wrote them,
must've gotten some names mixed up somewhere.
Re: [PATCH v2] Make transparent hugepages cpuset aware
We've re-evaluated the need for a patch to support some sort of finer
grained control over thp, and, based on some tests performed by our
benchmarking team, we're seeing that we'd definitely still like to
implement some method to support this.

Here's an e-mail from John Baron (jba...@sgi.com), on our benchmarking
team, containing some data which shows a decrease in performance for
some SPEC OMP benchmarks when thp is enabled:

> Here are results for SPEC OMP benchmarks on UV2 using 512 threads / 64
> sockets. These show the performance ratio for jobs run with THP
> disabled versus THP enabled (so > 1.0 means THP disabled is faster).
> One possible reason for lower performance with THP enabled is that the
> larger page granularity can result in more remote data accesses.
>
> SPEC OMP2012:
>
> 350.md        1.0
> 351.bwaves    1.3
> 352.nab       1.0
> 357.bt331     0.9
> 358.botsalgn  1.0
> 359.botsspar  1.1
> 360.ilbdc     1.8
> 362.fma3d     1.0
> 363.swim      1.4
> 367.imagick   0.9
> 370.mgrid331  1.1
> 371.applu331  0.9
> 372.smithwa   1.0
> 376.kdtree    1.0
>
> SPEC OMPL2001:
>
> 311.wupwise_l 1.1
> 313.swim_l    1.5
> 315.mgrid_l   1.0
> 317.applu_l   1.1
> 321.equake_l  5.8
> 325.apsi_l    1.5
> 327.gafort_l  1.0
> 329.fma3d_l   1.0
> 331.art_l     0.8
>
> One could argue that real-world applications could be modified to avoid
> these kinds of effects, but (a) it is not always possible to modify code
> (e.g. in benchmark situations) and (b) even if it is possible to do so,
> it is not necessarily easy to do so (e.g. for customers with large
> legacy Fortran codes).
>
> We have also observed on Intel Sandy Bridge processors that, as
> counter-intuitive as it may seem, local memory bandwidth is actually
> slightly lower with THP enabled (1-2%), even with unit stride data
> accesses. This may not seem like much of a performance hit but
> it is important for HPC workloads. No code modification will help here.
In light of the previous issues discussed in this thread, and some
suggestions from David Rientjes:

> why not make it per-process so users don't have to configure
> cpusets to control it?

Robin and I have come up with a proposal for a way to replicate behavior
similar to what this patch introduced, only on a per-process level
instead of at the cpuset level.

Our idea would be to add a flag somewhere in the task_struct to keep
track of whether or not thp is enabled for each task. The flag would be
controlled by an additional option included in prctl(), allowing
programmers to set/unset this flag via the prctl() syscall. We would
also introduce some code into the clone() syscall to ensure that this
flag is copied down from parent to child tasks when necessary. The flag
would be checked in the same place that the per-cpuset flag was checked
in my original patch, thereby allowing the same behavior to be
replicated on a per-process level.

In this way, we will also be able to get static binaries to behave
appropriately by setting this flag in a userland program, and then
having that program exec the static binary for which we need to disable
thp.

This solution allows us to incorporate the behavior that we're looking
for into the kernel, without abusing cpusets for the purpose of
containerization.

Please let me know if anyone has any objections to this approach, or if
you have any suggestions as to how we could improve upon this idea.
Thanks!

- Alex
[PATCH] Add per-process flag to control thp
This patch implements functionality to allow processes to disable the use of transparent hugepages through the prctl syscall. We've determined that some jobs perform significantly better with thp disabled, and we need a way to control thp on a per-process basis, without relying on madvise. --- include/linux/huge_mm.h| 14 +- include/linux/init_task.h | 8 include/linux/sched.h | 3 +++ include/uapi/linux/prctl.h | 3 +++ kernel/fork.c | 4 kernel/sys.c | 31 +++ 6 files changed, 62 insertions(+), 1 deletion(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b60de92..53af3ca 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -1,6 +1,8 @@ #ifndef _LINUX_HUGE_MM_H #define _LINUX_HUGE_MM_H +#include + extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, @@ -66,7 +68,7 @@ extern pmd_t *page_check_address_pmd(struct page *page, extern bool is_vma_temporary_stack(struct vm_area_struct *vma); -#define transparent_hugepage_enabled(__vma)\ +#define _transparent_hugepage_enabled(__vma) \ ((transparent_hugepage_flags & \ (1sequential_io_avg= 0; #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + p->thp_disabled = current->thp_disabled; +#endif + /* Perform scheduler related setup. Assign this task to a CPU. 
*/ sched_fork(p); diff --git a/kernel/sys.c b/kernel/sys.c index 771129b..416c8a6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1836,6 +1836,31 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static int prctl_set_thp_disabled(struct task_struct *me) +{ + me->thp_disabled = 1; + return 0; +} + +static int prctl_get_thp_disabled(struct task_struct *me, + int __user *thp_disabled) +{ + return put_user(me->thp_disabled, thp_disabled); +} +#else +static int prctl_set_thp_disabled(struct task_struct *me) +{ + return -EINVAL; +} + +static int prctl_get_thp_disabled(struct task_struct *me, + int __user *thp_disabled) +{ + return -EINVAL; +} +#endif + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -1999,6 +2024,12
Re: [PATCH] Add per-process flag to control thp
> Instead of blowing a whole int on this bool, we could add it
> to the bitfield a few pages up where we have other prctl things like..
>
> unsigned no_new_privs:1;

Definitely a better way to go. I'll tweak that and float another version
of the patch. Thanks for the input, Dave!
Re: [PATCH] Add per-process flag to control thp
> What kind of workloads are you talking about?

Our benchmarking team has a list of several of the SPEC OMP benchmarks
that perform significantly better when THP is disabled. I tried to get
the list but one of our servers is acting up and I can't get to it right
now :/

> What's wrong with madvise? Could you elaborate?

The main issue with using madvise is that it's not an option with static
binaries, but there are also some users who have legacy Fortran code
that they're not willing/able to change.

> And I think thp_disabled should be reset to 0 on exec.

The main purpose for this getting carried down from the parent process
is that we'd like to be able to have a userland program set this flag on
itself, and then spawn off children who will also carry the flag. This
allows us to set the flag for programs where we're unable to modify the
code, thus resolving the issues detailed above.
[PATCHv2] Add per-process flag to control thp
This patch implements functionality to allow processes to disable the use of transparent hugepages through the prctl syscall. We've determined that some jobs perform significantly better with thp disabled, and we needed a way to control thp on a per-process basis, without relying on madvise. v2 - tweaked thp_disabled flag to be a single bit instead of an int --- include/linux/huge_mm.h| 14 +- include/linux/init_task.h | 8 include/linux/sched.h | 4 include/uapi/linux/prctl.h | 3 +++ kernel/fork.c | 4 kernel/sys.c | 31 +++ 6 files changed, 63 insertions(+), 1 deletion(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b60de92..53af3ca 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -1,6 +1,8 @@ #ifndef _LINUX_HUGE_MM_H #define _LINUX_HUGE_MM_H +#include + extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, @@ -66,7 +68,7 @@ extern pmd_t *page_check_address_pmd(struct page *page, extern bool is_vma_temporary_stack(struct vm_area_struct *vma); -#define transparent_hugepage_enabled(__vma)\ +#define _transparent_hugepage_enabled(__vma) \ ((transparent_hugepage_flags & \ (1sequential_io_avg= 0; #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + p->thp_disabled = current->thp_disabled; +#endif + /* Perform scheduler related setup. Assign this task to a CPU. 
*/ sched_fork(p); diff --git a/kernel/sys.c b/kernel/sys.c index 771129b..416c8a6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1836,6 +1836,31 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static int prctl_set_thp_disabled(struct task_struct *me) +{ + me->thp_disabled = 1; + return 0; +} + +static int prctl_get_thp_disabled(struct task_struct *me, + int __user *thp_disabled) +{ + return put_user(me->thp_disabled, thp_disabled); +} +#else +static int prctl_set_thp_disabled(struct task_struct *me) +{ + return -EINVAL; +} + +static int prctl_get_thp_disabled(struct task_struct *me, + int __user *thp_disabled) +{ + return -EINVAL; +} +#endif + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, ar
Re: [PATCHv2] Add per-process flag to control thp
> I think the changelog should explain why madvise() is bad.

No problem. I wanted to keep it simple for the original submission, but
that's probably something that should be included.

> But I simply can't understand why this flag is per-thread. It should
> be mm flag, no?

This is something that we (Robin and I) had discussed a while back, and,
upon review, I'm beginning to agree that this might be the more sensible
route to take. I'm going to try and gather a bit more data to see if we
can get some more exact answers as to why THP is performing so poorly
under certain conditions before trying to push this particular patch any
further.

Thanks for the input!

- Alex
Re: [PATCH] Add per-process flag to control thp
> Please try it with the vclear patches.

Thanks, Andi. I'll give that a shot and see if it makes any difference.

- Alex
Re: [PATCHv2] Add per-process flag to control thp
> It would be good to describe which jobs and why thp cannot be fixed for
> these.

I'm going to see if I can get the benchmarks that are causing trouble up
and running on my own to get some more information (problem was
originally identified by our benchmarking team). Thanks for the input!

- Alex
Re: [PATCHv2] Add per-process flag to control thp
> What everyone else said, plus...
>
> I worry about the inherit-across-fork thing. A scenario we should
> think about is where the user doesn't build his own executables. So he
> writes a little wrapper which runs prctl(PR_SET_THP_DISABLED, 0) then
> execs the third-party-app. But what happens if that app execs children
> who *do* want THP? He has to hunt down and wrap those as well?
>
> It all seems a bit unwieldy. I wonder if controlling it via the ELF
> header would be more practical.

Thanks, Andrew. I'm doing some more testing and looking into using a
different method for controlling this. At this point, I think it's fair
to say that we don't want to control this using the method that I've
proposed here, no matter how we look at it.

I've gotten my hands on some of the benchmarks/code that were used to
originally uncover the performance issues we're seeing. I'm currently
trying to separate out the performance issues that are being caused by
the kernel code from issues involving hardware - the cost of remote
memory accesses is a bit higher on our systems with node controllers vs.
glueless QPI/Hypertransport-based systems.

At this point, it's difficult to say whether or not the issue can be
solved by "fixing the performance issues with thp," as several have
suggested. Don't get me wrong, I like the idea of that solution as well;
we're just not sure, right now, if that's going to solve all of our
problems. I'll be back when I've dug up a bit more info on the issue.

Thanks for the input, everyone!
Re: [PATCH] Skip UV runtime services mapping in the efi_runtime_disabled case
On Mon, Aug 01, 2016 at 02:49:57PM +0100, Matt Fleming wrote: > On Tue, 26 Jul, at 05:38:33PM, Alex Thorlton wrote: > > This problem has actually been in the UV code for a while, but we didn't > > catch it until recently, because we had been relying on EFI_OLD_MEMMAP > > to allow our systems to boot for a period of time. We noticed the issue > > when trying to kexec a recent community kernel, where we hit this NULL > > pointer dereference in efi_sync_low_kernel_mappings: > > > > [0.337515] BUG: unable to handle kernel NULL pointer dereference at > > 0880 > > [0.346276] IP: [] > > efi_sync_low_kernel_mappings+0x5d/0x1b0 > > > > The problem doesn't show up with EFI_OLD_MEMMAP because we skip the > > chunk of setup_efi_state that sets the efi_loader_signature for the > > kexec'd kernel. When the kexec'd kernel boots, it won't set EFI_BOOT in > > setup_arch, so we completely avoid the bug. > > > > We always kexec with noefi on the command line, so this shouldn't be an > > issue, but since we're not actually checking for efi_runtime_disabled in > > uv_bios_init, we end up trying to do EFI runtime callbacks when we > > shouldn't be. This patch just adds a check for efi_runtime_disabled in > > uv_bios_init so that we don't map in uv_systab when runtime_disabled == > > true. > > > > Signed-off-by: Alex Thorlton > > Cc: Russ Anderson > > Cc: Mike Travis > > Cc: Matt Fleming > > Cc: Borislav Petkov > > Cc: Thomas Gleixner > > Cc: Ingo Molnar > > Cc: "H. 
Peter Anvin" > > Cc: x...@kernel.org > > --- > > arch/x86/platform/uv/bios_uv.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c > > index 66b2166..0df8a03 100644 > > --- a/arch/x86/platform/uv/bios_uv.c > > +++ b/arch/x86/platform/uv/bios_uv.c > > @@ -187,7 +187,8 @@ EXPORT_SYMBOL_GPL(uv_bios_set_legacy_vga_target); > > void uv_bios_init(void) > > { > > uv_systab = NULL; > > - if ((efi.uv_systab == EFI_INVALID_TABLE_ADDR) || !efi.uv_systab) { > > + if ((efi.uv_systab == EFI_INVALID_TABLE_ADDR) || > > + !efi.uv_systab || efi_runtime_disabled()) { > > pr_crit("UV: UVsystab: missing\n"); > > return; > > } > > The fix looks fine, but I'm losing track of which kernels this patch > should be applied to. Does it just need to be applied for v4.8 or > earlier kernels too? Well, we *have* to boot v4.6 and v4.7 with efi=old_map, which will avoid our kexec problem entirely, so while the patch would apply just fine on those kernels, and achieve the desired effect, we wouldn't really get any benefit from it. So, it definitely needs to go in for v4.8, but it's kind of a toss-up for the older kernels. I'll discuss this with the other guys around here to see what they think, and get back to you a bit later, if that's alright? - Alex
Re: [RFC PATCH] Fix EFI callbacks on UV during kexec
On Mon, Aug 01, 2016 at 02:39:26PM +0100, Matt Fleming wrote: > On Tue, 26 Jul, at 05:38:32PM, Alex Thorlton wrote: > > > > After investigating the problem here and figuring out the proper way to > > get the noefi parameter working again, I noticed that there appears to > > be support for EFI runtime callbacks in a kexec'd kernel now... I > > think we need some more cleanup here to get that all working entirely. > > Without noefi, we hit a bad paging request when we try to do EFI > > callbacks: > > [...] > > > [0.341531] BUG: unable to handle kernel paging request at > > 6a1ab938 > > [0.349319] IP: [<6a1ab938>] 0x6a1ab938 > > [0.354386] PGD 354e0063 PUD 0 > > [0.357910] Oops: 0010 [#1] SMP > > [0.361414] Modules linked in: > > [0.364833] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > > 4.7.0-runtime-check+ #713 > > [...] > > > This is due to the fact that the efi_map_region_fixed calls in > > kexec_enter_virtual_mode, which map in the EFI runtime memory > > descriptors, only map the virtual address of the descriptor. > > Unfortunately, since we're still relying on the physical address of our > > EFI runtime code being mapped in, we don't have access to that code in > > the kexec scenario. > > > > A potential fix for this would be to map in the physical addresses of > > the descriptors as well as the virtual addresses in > > efi_map_region_fixed, but the more "correct" fix would be to update > > our system table pointer to its new virtual address during > > SetVirtualAddressMap. We intend to get that piece fixed up relatively > > soon, but haven't quite gotten around to it yet. > > I don't think it would be so bad if we did the 1:1 mappings in the > kexec kernel too, we've got our own page tables after all and the VA > space is available. It would be required if people ever want to use > kexec with mixed mode kernels too. Hmm... That's a good point. It certainly would be nice for us to have those mappings in the kexec kernel, at least for the time being. 
If you're not opposed to it, I can write up the patch. Pretty sure it's a one-liner. - Alex
Re: [RFC PATCH] Fix EFI callbacks on UV during kexec
On Thu, Aug 04, 2016 at 10:25:32AM +0100, Matt Fleming wrote: > On Mon, 01 Aug, at 09:34:10AM, Alex Thorlton wrote: > > > > Hmm... That's a good point. It certainly would be nice for us to have > > those mappings in the kexec kernel, at least for the time being. If > > you're not opposed to it, I can write up the patch. Pretty sure it's a > > one-liner. > > If it's that trivial, sure, please go ahead and submit. Sure thing. I played around with this a bit before I sent this patch up. I'm pretty sure it all worked as expected, but I'll need to double check everything. I'll try and get it out in the next day or two. Thanks, Matt! - Alex
[PATCH] Map in physical addresses in efi_map_region_fixed
This is a simple change to add in the physical mappings as well as the
virtual mappings in efi_map_region_fixed. The motivation here is to get
access to EFI runtime code that is only available via the 1:1 mappings
on a kexec'd kernel.

The added call is essentially the kexec analog of the first
__map_region that Boris put in efi_map_region in commit d2f7cbe7b26a
("x86/efi: Runtime services virtual mapping").

Signed-off-by: Alex Thorlton
Cc: Russ Anderson
Cc: Dimitri Sivanich
Cc: Mike Travis
Cc: Matt Fleming
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: x...@kernel.org
Cc: linux-...@vger.kernel.org
---
 arch/x86/platform/efi/efi_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 459bcbb..b206126 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -363,6 +363,7 @@ void __init efi_map_region(efi_memory_desc_t *md)
  */
 void __init efi_map_region_fixed(efi_memory_desc_t *md)
 {
+	__map_region(md, md->phys_addr);
 	__map_region(md, md->virt_addr);
 }
-- 
1.8.5.6
[BUG] Boot hangs at clocksource_done_booting on large configs
Hey guys,

We've been hitting issues with machines with 4096 or more logical cores
hanging up at clocksource_done_booting, seemingly forever. The problem
occurs somewhat intermittently, though it hits more consistently as you
increase the core count. We've observed this problem on kernels starting
at 3.10, all the way up through 4.2.

We initially thought there was some issue with the stop_machine code,
but Thomas G. cleared that up for us with a thorough explanation of the
stop_machine mechanism. After ruling out issues in stop_machine, we
started investigating possible hardware related scaling issues.

I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
out any scaling issues related to multiple concurrent reads to our RTC's
MMR. We start getting hung task timeouts during clocksource_done_booting:

8<
calling clocksource_done_booting+0x0/0x42 @ 1
INFO: task swapper/0:1 blocked for more than 480 seconds.
Not tainted 4.2.0-rc1-nortc+ #384
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper/0       D 881efa933b48     0     1     0 0x
 881efa933b48 881ef45e b5de7a13 00fdfa933b38
 881efa934000 7fff 881efa933ca0 b5de7a13
 881efa933b68 8158eae7 881efa933b68
Call Trace:
 [] schedule+0x37/0x80
 [] schedule_timeout+0x1e7/0x270
 [] ? _find_next_bit.part.0+0x19/0x70
 [] wait_for_completion+0xa0/0x100
 [] ? wake_up_q+0x70/0x70
 [] ? cpu_stop_should_run+0x50/0x50
 [] __stop_cpus+0x50/0x70
 [] ? console_unlock+0x2dd/0x4a0
 [] ? cpu_stop_should_run+0x50/0x50
 [] stop_cpus+0x35/0x50
 [] ? tk_setup_internals.constprop.10+0x140/0x140
 [] __stop_machine+0xb2/0xe0
 [] ? tk_setup_internals.constprop.10+0x140/0x140
 [] ? tk_setup_internals.constprop.10+0x140/0x140
 [] stop_machine+0x2e/0x50
 [] timekeeping_notify+0x2d/0x50
 [] __clocksource_select+0xcb/0x120
 [] ? boot_override_clock+0x4d/0x4d
 [] clocksource_done_booting+0x32/0x42
 [] do_one_initcall+0x169/0x1d0
 [] kernel_init_freeable+0x1bb/0x248
 [] ? rest_init+0x80/0x80
 [] kernel_init+0xe/0xe0
 [] ret_from_fork+0x3f/0x70
 [] ? rest_init+0x80/0x80
>8

After this I NMId the machine to get backtraces for the other CPUs. Most
are stuck around here:

8<
CPU: 4867 PID: 19730 Comm: migration/4867 Tainted: G D 4.2.0-rc1-nortc+ #384
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
task: ad5e798cc500 ti: ad5e798d4000 task.ti: ad5e798d4000
RIP: 0010:[] [] uv_handle_nmi+0x19/0x790
RSP: :c90020156da8 EFLAGS: 00010092
RAX: 02fe65b95384 RBX: 002d RCX: c90020162a98
RDX: 02fe65b95384 RSI: c90020156ef8 RDI: 0001
RBP: c90020156e18 R08: 0001 R09:
R10: 0003 R11: 0020 R12: 02fe65b95384
R13: R14: 81a38880 R15: c90020156ef8
FS: () GS:c90020151000() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: da80 CR3: 01a0b000 CR4: 001406e0
Stack:
 002d 02fe65b95384
Call Trace:
 [] ? cpu_stop_should_run+0x50/0x50
 [] nmi_handle+0x79/0x100
 [] ? nmi_handle+0x83/0x100
 [] ? cpu_stop_should_run+0x50/0x50
 [] unknown_nmi_error+0x1c/0x90
 [] default_do_nmi+0xd8/0xf0
 [] do_nmi+0x83/0xb0
 [] end_repeat_nmi+0x1e/0x2e
 [] ? cpu_stop_should_run+0x50/0x50
 [] ? read_hpet+0x16/0x20
 [] ? read_hpet+0x16/0x20
 [] ? read_hpet+0x16/0x20
 <>
 [] ktime_get+0x37/0xb0
 [] clockevents_program_event+0x4b/0x130
 [] tick_handle_periodic+0x57/0x70
 [] local_apic_timer_interrupt+0x3c/0x70
 [] smp_apic_timer_interrupt+0x41/0x60
 [] apic_timer_interrupt+0x6b/0x70
 [] ? update_blocked_averages+0x381/0x860
 [] ? multi_cpu_stop+0x98/0xf0
 [] ? __switch_to+0x26a/0x590
 [] ? cpu_stop_should_run+0x50/0x50
 [] cpu_stopper_thread+0x88/0x110
 [] ? __schedule+0x338/0x900
 [] smpboot_thread_fn+0x135/0x190
 [] ? sort_range+0x30/0x30
 [] kthread+0xdb/0x100
 [] ? kthread_create_on_node+0x180/0x180
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x180/0x180
Code: 31 c0 e8 a0 c2 52 00 e9 39 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f7 53 48 83 ec 48 <48> 8b 1d 00 f2 fa 7e 65 44 8b 2d 70 b8 fa 7e 9c 58 0f 1f 44 00
>8

I can provide full logs if desired.
Anyways, based on my limited understanding of what's going on here, I tried disabling interrupts a bit further out in the cpu_stopper_thread code, to try and avoid that apic_timer_interrupt during the clocksource change. change_clocksource already disables interrupts, but, from the backtraces, it appears that we're taking a
Re: [BUG] Boot hangs at clocksource_done_booting on large configs
On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote: > I can provide full logs if desired. Full log is here: http://oss.sgi.com/projects/athorlton/harp50-6144-nortc I'd use wget to download it. Probably don't want to try to open the 15M file in your browser :) Also, I've filed a kernel BugZilla for the same issue, in case anyone would rather use BugZilla: https://bugzilla.kernel.org/show_bug.cgi?id=103821 - Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] Boot hangs at clocksource_done_booting on large configs
On Mon, Aug 31, 2015 at 10:32:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index fd643d8..8502521 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
> >  {
> >  	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
> >  	struct cpu_stop_work *work;
> > +	unsigned long flags;
> >  	int ret;
> >
> > +	local_irq_save(flags);
> > +
> >  repeat:
> >  	work = NULL;
> >  	spin_lock_irq(&stopper->lock);
> > @@ -452,6 +455,8 @@ repeat:
> >  		cpu_stop_signal_done(done, true);
> >  		goto repeat;
> >  	}
> > +
> > +	local_irq_restore(flags);
> >  }
>
> So I should probably just go sleep and not say anything.. _but_
> *confused*.
>
> That local_irq_save() will disable IRQs over:
>
> 	work = NULL;
>
> But that is _all_. The spin_unlock_irq() will re-enable IRQs, after
> which things will run as usual.
>
> That local_irq_restore() is a total NOP, IRQs are guaranteed enabled at
> the local_irq_save() (otherwise lockdep would've complained about
> spin_unlock_irq() unconditionally enabling them) and by the time we get
> to the restore that same unlock_irq will have enabled them already.

Ahh, right. Well that code is worthless then :) Either way though, I guess that means that slight change just fudged the timing enough in that area to avoid the lockup we're seeing.

Ignoring my useless code change, does anything else jump out at you as interesting, here?
Re: [BUG] Boot hangs at clocksource_done_booting on large configs
On Mon, Aug 31, 2015 at 11:12:12PM +0200, Thomas Gleixner wrote: > On Mon, 31 Aug 2015, Alex Thorlton wrote: > > I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule > > out any scaling issues related to multiple concurrent reads to our > > RTC's MMR. > > And to rule out scaling issues you replaced the RTC MMR with HPET. Not > a very good choice: > > HPET does not scale either. It's uncached memory mapped I/O. See > below. Ahhh. I wondered if there might be an issue there. I can say that I've seen it hit a very similar issue with some other possible solutions that we've tried, but this was the most recent, fairly unmodified kernel that I've played with, so I went with this. > > I'm hoping to get some input from the experts in this area, first of > > all, on whether the problem I'm seeing is actually what I think it is, > > and, if so, if I've solved it in the correct way. > > I fear both the analysis and the solution is wrong. > > Up to the point where the actual clocksource change happens there is > no reason why timer interrupts should not happen. And the code which > actually changes the clocksource is definitely called with interrupts > disabled. When that function returns the new clocksource is fully > functional and interrupts can happen again. > > Now looking at your backtraces. Most CPUs are in the migration thread > and a few (3073,3078,3079,3082) are in the idle task. > > From the trace artifacts (? read_hpet) it looks like the clock source > change has been done and the cpus are on the way back from stop > machine. That's good to note. I'll keep that in mind as I continue gathering info. > But they are obviously held off by something. And that something looks > like the timekeeper sequence lock. Ok. IIRC, that lock prevents reads to the clocksource until all necessary write operations have finished, correct? i.e. tk_core.seq? 
I looked at that as being a possible place to get stuck in ktime_get:

	do {
		seq = read_seqcount_begin(&tk_core.seq);
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);
	} while (read_seqcount_retry(&tk_core.seq, seq));

But I don't have enough evidence to point there for sure.

> Too bad, that we don't have a backtrace for CPU0 in the log.

I think that may have gotten swallowed up with all the lost printks that you'll notice in there. Does the hung task backtrace not cover that though?

> I really wonder how a machine that large works with HPET as
> clocksource at all. hpet_read() is uncached memory mapped IO which
> takes thousands of CPU cycles. Last time I looked it was around
> 1us. Let's take that number to do some math.
>
> If all CPUs do that access at the same time, then it takes NCPUS
> microseconds to complete if the memory mapped I/O scheduling is
> completely fair, which I doubt. So with 4k CPUs thats whopping 4.096ms
> and it gets worse if you go larger. That's more than a tick with
> HZ=250.
>
> I'm quite sure that you are staring at the HPET scalability bottleneck
> and not at some actual kernel bug.

That could well be the case.

> Your patch shifts some timing around so the issue does not happen, but
> that's certainly not a solution.

I was fairly sure of that, but it's good to know that you agree. I'll try and get a backtrace from a completely unmodified kernel here in the next few days. Unfortunately, as mentioned, we only hit the issue intermittently. I was able to reproduce pretty regularly on kernels from 4.0 and before, but it seems to be less common on newer kernels, though we definitely still manage to hit it from time to time.

Thanks for all the help!

- Alex
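Thomas's back-of-the-envelope math above can be sanity-checked with a trivial model. This is purely illustrative: the ~1us cost of an uncached HPET read and the fair-serialization assumption are his ballpark figures, not measurements, and the function names here are made up for the sketch.

```c
/* Rough model of the estimate above: each uncached HPET MMIO read costs
 * about 1us, and concurrent readers serialize, so draining NCPUS
 * simultaneous reads takes roughly NCPUS microseconds in the fair case. */
static unsigned long hpet_drain_us(unsigned long ncpus,
				   unsigned long read_cost_us)
{
	return ncpus * read_cost_us;
}

/* A tick at the given HZ lasts 1000000/hz microseconds; when the drain
 * takes longer than that, the next tick fires before the reads finish. */
static int drain_exceeds_tick(unsigned long ncpus, unsigned long hz)
{
	return hpet_drain_us(ncpus, 1) > 1000000UL / hz;
}
```

At 4096 CPUs the drain is ~4096us against a 4000us tick at HZ=250, matching the 4.096ms figure above; at a few hundred CPUs the reads comfortably fit inside a tick, which would fit the observation that smaller configs don't hang.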
Re: [BUG] kzalloc overflow in lpfc driver on 6k core system
On Sat, Dec 06, 2014 at 03:14:45PM -0500, James Smart wrote:
> Myself and several others here at Emulex maintain the code. The
> recommendations look fine. Feel free to post something if you beat
> us to the work.

Great! Thanks for letting me know, James. I will probably have some time to look at this later this week. Please let me know if somebody over there gets around to looking at it - no need to have two groups chasing the same issue :)

Thanks again!

- Alex
Re: [PATCH] Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
On Mon, Dec 08, 2014 at 11:42:14AM +0100, Ingo Molnar wrote: > This patch fails to build with certain configs: > > kernel/sched/core.c:7130:33: error: incompatible types when assigning to type > ‘cpumask_var_t’ from type ‘void *’ Thanks for letting us know, Ingo. I believe George has something in mind to fix this. We'll take a look and get another version out ASAP! - Alex
Re: [BUG] Possible locking issues in stop_machine code on 6k core machine
On Thu, Dec 04, 2014 at 12:34:04AM +0100, Thomas Gleixner wrote:
> first of all, could you please format your mail so it is actually
> readable and can be replied to?

My fault there - stupid mistake.

> If you would provide real data instead of half-baked assumptions
> we might actually be able to help you.
>
> There might be some hidden issue in the stomp machine crap, but surely
> not the one you described.

I'll return with some actual data so that others can get a better picture of what we're seeing. I'll also see if I can come up with a way to reproduce the issue on a smaller configuration; might not be possible, but attempting to do so might provide some more insight into our problem.

Thanks for your detailed description of the code path! I definitely understand that our assumption can't be the actual problem here. I'll dig back into the issue and try to get some better information.

Thanks again!

- Alex
[PATCH 2/3] mm, thp: Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It also adds a prctl control which allows us to set the THP disable bit in mm->def_flags so that VMs will pick up the setting as they are created.

Signed-off-by: Alex Thorlton
Suggested-by: Oleg Nesterov
Cc: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Andrew Morton
Cc: Paolo Bonzini
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Alexander Viro
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
---
 include/linux/mm.h         |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c              | 11 ++++++++---
 kernel/sys.c               | 15 +++++++++++++++
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..0314e4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -177,6 +177,9 @@ extern unsigned int kobjsize(const void *objp);
  */
 #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
 
+/* This mask defines which mm->def_flags a process can inherit its parent */
+#define VM_INIT_DEF_MASK	VM_NOHUGEPAGE
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 289760f..58afc04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -149,4 +149,7 @@
 #define PR_GET_TID_ADDRESS	40
 
+#define PR_SET_THP_DISABLE	41
+#define PR_GET_THP_DISABLE	42
+
 #endif /* _LINUX_PRCTL_H */

diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..9fc0a30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
-	mm->flags = (current->mm) ?
-		(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
 	mm->core_state = NULL;
 	atomic_long_set(&mm->nr_ptes, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
@@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_owner(mm, p);
 	clear_tlb_flush_pending(mm);
 
-	if (likely(!mm_alloc_pgd(mm))) {
+	if (current->mm) {
+		mm->flags = current->mm->flags & MMF_INIT_MASK;
+		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
+	} else {
+		mm->flags = default_dump_filter;
 		mm->def_flags = 0;
+	}
+
+	if (likely(!mm_alloc_pgd(mm))) {
 		mmu_notifier_mm_init(mm);
 		return mm;
 	}

diff --git a/kernel/sys.c b/kernel/sys.c
index c0a58be..aa8d3a0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1996,6 +1996,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
 		return current->no_new_privs ? 1 : 0;
+	case PR_GET_THP_DISABLE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+		break;
+	case PR_SET_THP_DISABLE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		down_write(&me->mm->mmap_sem);
+		if (arg2)
+			me->mm->def_flags |= VM_NOHUGEPAGE;
+		else
+			me->mm->def_flags &= ~VM_NOHUGEPAGE;
+		up_write(&me->mm->mmap_sem);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
1.7.12.4
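For anyone wanting to try the new interface, here is a minimal userspace sketch. The prctl constants are taken from the patch above; the helper names are mine, and on kernels without this patch the calls simply fail with -1/EINVAL.

```c
#include <sys/prctl.h>

/* Constants from the patch's include/uapi/linux/prctl.h hunk; defined
 * here so the sketch also builds against older headers. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/* Set (1) or clear (0) VM_NOHUGEPAGE in this process's mm->def_flags.
 * Returns 0 on success, -1 with EINVAL on kernels without the patch. */
static int thp_disable(int disable)
{
	return prctl(PR_SET_THP_DISABLE, (unsigned long)disable, 0, 0, 0);
}

/* Returns 1 if THP is disabled for this mm, 0 if enabled, -1 on error. */
static int thp_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}
```

Because the bit lives in mm->def_flags and is copied through VM_INIT_DEF_MASK when a new mm is created, calling thp_disable(1) in a small launcher before spawning a threaded job should disable THP for the whole job, which is presumably how a wrapper around a test program would use it.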
[PATCH 1/3] Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85c, and adds in code to fix up any issues caused by the revert. The revert is necessary because hugepage_madvise would return -EINVAL when VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of this code, and providing justification for the revert:

    The intent of 8e72033f2a48 was to guard against any future programming
    errors that may result in an madvise(MADV_HUGEPAGE) on guest mappings,
    which would crash the kernel.

    Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
    8e72033f2a48 was to be reverted, because that check will also prevent
    a kernel crash in the case described above, it will now send a SIGSEGV
    instead. This would now also allow to do the madvise on other parts,
    if needed, so it is a more flexible approach. One could also say that
    it would have been better to do it this way right from the beginning...

Signed-off-by: Alex Thorlton
Suggested-by: Oleg Nesterov
Cc: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Andrew Morton
Cc: Paolo Bonzini
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Alexander Viro
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
---
 arch/s390/mm/pgtable.c | 3 +++
 mm/huge_memory.c       | 4 ----
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3584ed9..a87cdb4 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
 	if (!pmd_present(*pmd) &&
 	    __pte_alloc(mm, vma, pmd, vmaddr))
 		return -ENOMEM;
+	/* large pmds cannot yet be handled */
+	if (pmd_large(*pmd))
+		return -EFAULT;
 	/* pmd now points to a valid segment table entry. */
 	rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
 	if (!rmap)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..a4310a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;
-
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		/*
@@ -1977,8 +1975,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
-		if (mm->def_flags & VM_NOHUGEPAGE)
-			return -EINVAL;
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*
-- 
1.7.12.4
[PATCH 3/3] exec: kill the unnecessary mm->def_flags setting in load_elf_binary()
load_elf_binary() sets current->mm->def_flags = def_flags and def_flags is always zero. Not only this looks strange, this is unnecessary because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton
Suggested-by: Oleg Nesterov
Cc: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Andrew Morton
Cc: Paolo Bonzini
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Alexander Viro
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
---
 fs/binfmt_elf.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..d09bd9c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -579,7 +579,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long reloc_func_desc __maybe_unused = 0;
 	int executable_stack = EXSTACK_DEFAULT;
-	unsigned long def_flags = 0;
 	struct pt_regs *regs = current_pt_regs();
 	struct {
 		struct elfhdr elf_ex;
@@ -719,9 +718,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (retval)
 		goto out_free_dentry;
 
-	/* OK, This is the point of no return */
-	current->mm->def_flags = def_flags;
-
 	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
 	   may depend on the personality.  */
 	SET_PERSONALITY(loc->elf_ex);
-- 
1.7.12.4
[PATCHv4 0/3] [RESEND] mm, thp: Add mm flag to control THP
     194.536 M/sec                   [100.00%]
       714,066,351 branch-misses             # 0.00% of all branches

     216.196778807 seconds time elapsed

As with previous versions of the patch, we're getting about a 2x performance increase here.

Here's a link to the test case I used, along with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

Let me know if anybody has any further suggestions here. Thanks!

(Sorry for the big cc list!)

Signed-off-by: Alex Thorlton
Suggested-by: Oleg Nesterov
Cc: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Andrew Morton
Cc: Paolo Bonzini
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Alexander Viro
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org

Alex Thorlton (3):
  Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
  Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
  exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

 arch/s390/mm/pgtable.c     |  3 +++
 fs/binfmt_elf.c            |  4 ----
 include/linux/mm.h         |  3 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/fork.c              | 11 ++++++++---
 kernel/sys.c               | 15 +++++++++++++++
 mm/huge_memory.c           |  4 ----
 7 files changed, 32 insertions(+), 11 deletions(-)

-- 
1.7.12.4
Re: [PATCHv4 0/3] [RESEND] mm, thp: Add mm flag to control THP
On Tue, Feb 25, 2014 at 02:54:03PM -0600, Alex Thorlton wrote: > (First send had too big of a cc list to make it into all the mailing lists.) Sorry for the double-send to a couple of lists/people. I got bumped from linux-api for having too long of a cc list, and I figured it had gotten bumped elsewhere as well. - Alex
Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
On Wed, Feb 26, 2014 at 05:57:59PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> Do we want a comment here, explaining why s390 is special again?

Here's what I've got, with everybody's suggestions spun together:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..7f01491 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;
-
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		/*
@@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
-		if (mm->def_flags & VM_NOHUGEPAGE)
+
+/*
+ * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
+ * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
+ */
+#ifdef CONFIG_S390
+		if (mm_has_pgste(vma->vm_mm))
 			return -EINVAL;
+#endif
+
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*

I've compiled and tested and it works ok on my machines (I don't have an s390 to test on though :). Is everybody okay with this solution?

BTW, Kirill, I looked at using IS_ENABLED to clean up the ifdef, but it won't work here, since mm_has_pgste isn't defined unless CONFIG_S390 is defined. i.e.:

This line:

	if (IS_ENABLED(CONFIG_S390) && mm_has_pgste(vma->vm_mm))

won't compile.
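The IS_ENABLED() limitation mentioned above is easy to demonstrate outside the kernel. The sketch below uses made-up names (it is not kernel code): even when the guard is compile-time false, the compiler must still resolve every identifier in the expression, so the conventional fix is to provide a stub definition for the disabled config, after which the dead branch is discarded anyway.

```c
/* Toy stand-in for a kernel config symbol; uncomment to flip configs. */
/* #define CONFIG_TOY_S390 1 */

#ifdef CONFIG_TOY_S390
#define TOY_S390_ENABLED 1
/* "real" implementation, only meaningful on the enabled config */
static int toy_has_pgste(void) { return 1; }
#else
#define TOY_S390_ENABLED 0
/* stub: exists so the expression below always parses, but is never true */
static int toy_has_pgste(void) { return 0; }
#endif

static int toy_madvise_rejected(void)
{
	/* Compiles on both configs because the stub always exists; when
	 * TOY_S390_ENABLED is 0 the && short-circuits and the compiler
	 * eliminates the call entirely. */
	return TOY_S390_ENABLED && toy_has_pgste();
}
```

Without the `#else` stub, the `TOY_S390_ENABLED && toy_has_pgste()` form fails to compile on the disabled config with an undefined-identifier error, which is exactly the problem with `IS_ENABLED(CONFIG_S390) && mm_has_pgste(...)` when mm_has_pgste only exists under CONFIG_S390.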
Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
On Wed, Feb 26, 2014 at 08:27:36PM +0100, Christian Borntraeger wrote:
> On 26/02/14 19:06, Oleg Nesterov wrote:
> > On 02/26, Alex Thorlton wrote:
> >>
> >> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> >> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> >> + */
> >> +#ifdef CONFIG_S390
> >> +	if (mm_has_pgste(vma->vm_mm))
> >> 		return -EINVAL;
> >> +#endif
> >
> > The comment is not really right...
> >
> > And personally I think that
> >
> > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> > 	if (!pmd_present(*pmd) &&
> > 	    __pte_alloc(mm, vma, pmd, vmaddr))
> > 		return -ENOMEM;
> > +	/* large pmds cannot yet be handled */
> > +	if (pmd_large(*pmd))
> > +		return -EFAULT;
> >
> > change still makes sense, so that we can simply revert this s390-
> > specific hack in hugepage_madvise().
>
> Yes, it still makes sense to cover existing THPs here.

Yes, it does. I just snipped the chunk of the original patch that I actually changed in my last e-mail. I didn't intend to actually remove the above portion in the final patch - sorry for the confusion!

> > I'd suggest the patch below on top of your changes, but I won't argue.

I like this approach, and your updated comment as well. This should probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 with this new patch included in the series, or will Andrew pull this in?

> > It would be nice to also change thp_split_mm() to not not play with
> > mm->def_flags, but I am not sure if we can do this.
> >
> > Oleg.
> >
> > ---
> >
> > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
> >
> > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
> >
> > Reported-by: Christian Borntraeger
> > Suggested-by: Christian Borntraeger
> > Signed-off-by: Oleg Nesterov
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a4310a5..0e08d92 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >  {
> >  	switch (advice) {
> >  	case MADV_HUGEPAGE:
> > +#ifdef CONFIG_S390
> > +		/*
> > +		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> > +		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> > +		 * and expects it must fail on s390. Avoid a possible SIGSEGV
> > +		 * until qemu is changed.
>
> I prefer:
> 		 * until kvm/s390 can handle large pages in the host.
>
> Otherwise qemu has to be changed again, if we get THP working for kvm.
>
> > +		 */
> > +		if (mm_has_pgste(vma->vm_mm))
> > +			return -EINVAL;
> > +#endif
> >  		/*
> >  		 * Be somewhat over-protective like KSM for now!
> >  		 */
> >  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> >  			return -EINVAL;
> > +
>
> Unrelated white space?
>
> >  		*vm_flags &= ~VM_NOHUGEPAGE;
> >  		*vm_flags |= VM_HUGEPAGE;
> >  		/*
>
> With the comment and white space change:
>
> Tested-by: Christian Borntraeger
> Reviewed-by: Christian Borntraeger
>
> Thanks for the quick patch

Yes, thank you all for the quick responses, and for looking over these patches!
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Mon, Nov 04, 2013 at 02:58:28PM +, Mel Gorman wrote:
> On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > Hi guys,
> >
> > I ran into a bug a week or so ago, that I believe has something to do
> > with NUMA balancing, but I'm having a tough time tracking down exactly
> > what is causing it. When running with the following configuration
> > options set:
>
> Can you test with patches
> cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> applied? They fix some known memory corruption problems, were merged for
> 3.12 (so alternatively just test 3.12) and have been tagged for -stable.

I just finished testing with 3.12, and I'm still seeing the same issue. This is actually a bit strange to me, because, when I tested with 3.12-rc5 a while back, everything seemed to be ok (see previous e-mail in this thread, to Bob Liu). I guess, embarrassingly enough, I must have been playing with a screwed up config that day, and managed to somehow avoid the problem... Either way, it appears that we still have a problem here.

I'll poke around a bit more on this in the next few days and see if I can come up with any more information. In the meantime, let me know if you have any other suggestions.

Thanks,

- Alex

> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote: > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton wrote: > > > > > > > > > > > I'd suggest the patch below on top of your changes, but I won't argue. > > > > I like this approach, and your updated comment as well. This should > > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 > > with this new patch included in the series, or will Andrew pull this in? > > Paolo's comments on Oleg's patch remain unaddressed, so please take a > look at that and send out the final version? Got it. I'll get that to you tonight/tomorrow morning. > An incremental patch would be preferred, please. I'll later fold that > into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to > avoid any bisection holes. Understood. I'll keep them separate. Thanks, Andrew! - Alex
[PATCH 1/4] mm, s390: Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
This reverts commit 8e72033f2a489b6c98c4e3c7cc281b1afd6cb85c, and adds in code to fix up any issues caused by the revert. The revert is necessary because hugepage_madvise would return -EINVAL when VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of this code, and providing justification for the revert:

    The intent of 8e72033f2a48 was to guard against any future programming
    errors that may result in an madvise(MADV_HUGEPAGE) on guest mappings,
    which would crash the kernel.

    Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
    8e72033f2a48 was to be reverted, because that check will also prevent
    a kernel crash in the case described above, it will now send a SIGSEGV
    instead. This would now also allow to do the madvise on other parts,
    if needed, so it is a more flexible approach. One could also say that
    it would have been better to do it this way right from the beginning...

Signed-off-by: Alex Thorlton
Suggested-by: Oleg Nesterov
Cc: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Andrew Morton
Cc: Paolo Bonzini
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Alexander Viro
Cc: linux...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
---
 arch/s390/mm/pgtable.c | 3 +++
 mm/huge_memory.c       | 4 ----
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3584ed9..a87cdb4 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
 	if (!pmd_present(*pmd) &&
 	    __pte_alloc(mm, vma, pmd, vmaddr))
 		return -ENOMEM;
+	/* large pmds cannot yet be handled */
+	if (pmd_large(*pmd))
+		return -EFAULT;
 	/* pmd now points to a valid segment table entry. */
 	rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
 	if (!rmap)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..a4310a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;
-
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		/*
@@ -1977,8 +1975,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
-		if (mm->def_flags & VM_NOHUGEPAGE)
-			return -EINVAL;
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*
-- 
1.7.12.4
[PATCH 2/4] mm, s390: Ignore MADV_HUGEPAGE on s390 to prevent SIGSEGV in qemu
As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"' breaks qemu: it does QEMU_MADV_HUGEPAGE for all kvm pages, but this doesn't work after s390_enable_sie/thp_split_mm. Paolo suggested that instead of failing on the call to madvise, we simply ignore the call (return 0). Reported-by: Christian Borntraeger Suggested-by: Paolo Bonzini Suggested-by: Oleg Nesterov Signed-off-by: Alex Thorlton Cc: Gerald Schaefer Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Christian Borntraeger Cc: Andrew Morton Cc: Paolo Bonzini Cc: "Kirill A. Shutemov" Cc: Mel Gorman Cc: Rik van Riel Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Alexander Viro Cc: linux...@de.ibm.com Cc: linux-s...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org --- mm/huge_memory.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a4310a5..61d234d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1970,6 +1970,15 @@ int hugepage_madvise(struct vm_area_struct *vma, { switch (advice) { case MADV_HUGEPAGE: +#ifdef CONFIG_S390 + /* +* qemu blindly sets MADV_HUGEPAGE on all allocations, but s390 +* can't handle this properly after s390_enable_sie, so we simply +* ignore the madvise to prevent qemu from causing a SIGSEGV. +*/ + if (mm_has_pgste(vma->vm_mm)) + return 0; +#endif /* * Be somewhat over-protective like KSM for now! */ -- 1.7.12.4
Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
On Wed, Feb 26, 2014 at 06:01:53PM -0600, Alex Thorlton wrote: > On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote: > > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton wrote: > > > > > > > > > > > > > > > I'd suggest the patch below on top of your changes, but I won't argue. > > > > > > I like this approach, and your updated comment as well. This should > > > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 > > > with this new patch included in the series, or will Andrew pull this in? > > > > Paolo's comments on Oleg's patch remain unaddressed, so please take a > > look at that and send out the final version? > > Got it. I'll get that to you tonight/tomorrow morning. Just kicked out another version of the patch that should cover all comments/suggestions. Let me know if you need anything else from me! - Alex
[PATCHv5 0/4] mm, thp: Add mm flag to control THP
Here's a link to the test case I used, along with the little wrapper to activate the flag: http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz Let me know if anybody has any further suggestions here. Thanks! Signed-off-by: Alex Thorlton Suggested-by: Oleg Nesterov Cc: Gerald Schaefer Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Christian Borntraeger Cc: Andrew Morton Cc: Paolo Bonzini Cc: "Kirill A. Shutemov" Cc: Mel Gorman Cc: Rik van Riel Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Alexander Viro Cc: linux...@de.ibm.com Cc: linux-s...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Alex Thorlton (4): Revert "thp: make MADV_HUGEPAGE check for mm->def_flags" Ignore MADV_HUGEPAGE on s390 to prevent SIGSEGV in qemu Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE exec: kill the unnecessary mm->def_flags setting in load_elf_binary() arch/s390/mm/pgtable.c | 3 +++ fs/binfmt_elf.c | 4 ---- include/linux/mm.h | 3 +++ include/uapi/linux/prctl.h | 3 +++ kernel/fork.c | 11 ++++++++--- kernel/sys.c | 15 +++++++++++++++ mm/huge_memory.c | 13 +++++++++---- 7 files changed, 41 insertions(+), 11 deletions(-) -- 1.7.12.4
[PATCH 3/4] mm, thp: Add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE
This patch adds a VM_INIT_DEF_MASK, to allow us to set the default flags for vmas. It also adds a prctl control which allows us to set the THP disable bit in mm->def_flags so that vmas will pick up the setting as they are created. Signed-off-by: Alex Thorlton Suggested-by: Oleg Nesterov Cc: Gerald Schaefer Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Christian Borntraeger Cc: Andrew Morton Cc: Paolo Bonzini Cc: "Kirill A. Shutemov" Cc: Mel Gorman Cc: Rik van Riel Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Alexander Viro Cc: linux...@de.ibm.com Cc: linux-s...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org --- include/linux/mm.h | 3 +++ include/uapi/linux/prctl.h | 3 +++ kernel/fork.c | 11 ++++++++--- kernel/sys.c | 15 +++++++++++++++ 4 files changed, 29 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f28f46e..0314e4c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -177,6 +177,9 @@ extern unsigned int kobjsize(const void *objp); */ #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP) +/* This mask defines which mm->def_flags a process can inherit from its parent */ +#define VM_INIT_DEF_MASK VM_NOHUGEPAGE /* * mapping from the currently active vm_flags protection bits (the * low four bits) to a page protection mask.. 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 289760f..58afc04 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -149,4 +149,7 @@ #define PR_GET_TID_ADDRESS 40 +#define PR_SET_THP_DISABLE 41 +#define PR_GET_THP_DISABLE 42 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/fork.c b/kernel/fork.c index a17621c..9fc0a30 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -529,8 +529,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) atomic_set(&mm->mm_count, 1); init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); - mm->flags = (current->mm) ? - (current->mm->flags & MMF_INIT_MASK) : default_dump_filter; mm->core_state = NULL; atomic_long_set(&mm->nr_ptes, 0); memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); @@ -539,8 +537,15 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) mm_init_owner(mm, p); clear_tlb_flush_pending(mm); - if (likely(!mm_alloc_pgd(mm))) { + if (current->mm) { + mm->flags = current->mm->flags & MMF_INIT_MASK; + mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK; + } else { + mm->flags = default_dump_filter; mm->def_flags = 0; + } + + if (likely(!mm_alloc_pgd(mm))) { mmu_notifier_mm_init(mm); return mm; } diff --git a/kernel/sys.c b/kernel/sys.c index c0a58be..aa8d3a0 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1996,6 +1996,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, if (arg2 || arg3 || arg4 || arg5) return -EINVAL; return current->no_new_privs ? 
1 : 0; + case PR_GET_THP_DISABLE: + if (arg2 || arg3 || arg4 || arg5) + return -EINVAL; + error = !!(me->mm->def_flags & VM_NOHUGEPAGE); + break; + case PR_SET_THP_DISABLE: + if (arg3 || arg4 || arg5) + return -EINVAL; + down_write(&me->mm->mmap_sem); + if (arg2) + me->mm->def_flags |= VM_NOHUGEPAGE; + else + me->mm->def_flags &= ~VM_NOHUGEPAGE; + up_write(&me->mm->mmap_sem); + break; default: error = -EINVAL; break; -- 1.7.12.4
[PATCH 4/4] exec: kill the unnecessary mm->def_flags setting in load_elf_binary()
load_elf_binary() sets current->mm->def_flags = def_flags, and def_flags is always zero. Not only does this look strange, it is unnecessary, because mm_init() has already set ->def_flags = 0. Signed-off-by: Alex Thorlton Suggested-by: Oleg Nesterov Cc: Gerald Schaefer Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Christian Borntraeger Cc: Andrew Morton Cc: Paolo Bonzini Cc: "Kirill A. Shutemov" Cc: Mel Gorman Cc: Rik van Riel Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Alexander Viro Cc: linux...@de.ibm.com Cc: linux-s...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org --- fs/binfmt_elf.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 67be295..d09bd9c 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -579,7 +579,6 @@ static int load_elf_binary(struct linux_binprm *bprm) unsigned long start_code, end_code, start_data, end_data; unsigned long reloc_func_desc __maybe_unused = 0; int executable_stack = EXSTACK_DEFAULT; - unsigned long def_flags = 0; struct pt_regs *regs = current_pt_regs(); struct { struct elfhdr elf_ex; @@ -719,9 +718,6 @@ static int load_elf_binary(struct linux_binprm *bprm) if (retval) goto out_free_dentry; - /* OK, This is the point of no return */ - current->mm->def_flags = def_flags; - /* Do this immediately, since STACK_TOP as used in setup_arg_pages may depend on the personality. */ SET_PERSONALITY(loc->elf_ex); -- 1.7.12.4
Re: [BUG] Linux 3.14 fails to boot with new EFI changes
On Thu, Feb 06, 2014 at 12:15:40AM +0100, Borislav Petkov wrote: > On Wed, Feb 05, 2014 at 03:45:36PM -0600, Alex Thorlton wrote: > > While working on an answer to this question, I ran across another issue > > on some newer hardware, that looks like it's definitely related to this > > problem, and might be the root cause. > > > > When booting on a UV2 we die in efi_enter_virtual_mode: > > > > BUG: unable to handle kernel paging request at 008f7e848020 > > IP: [<7dadb6a9>] 0x7dadb6a8 > > PGD 0 > > That looks very much like this other issue we're debugging right now: Have there been any developments on this since last week, Boris? Just trying to make sure that we stay in the loop on this issue. Let me know if there's anything else we can do from our end to help expedite the process. I'm always available to test new ideas. Thanks! - Alex
[BUG] mm: thp: hugepage_vma_check has a blind spot
hugepage_vma_check is called during khugepaged_scan_mm_slot to ensure that khugepaged doesn't try to allocate THPs in vmas where they are disallowed, either due to THPs being disabled system-wide, or through MADV_NOHUGEPAGE. The logic that hugepage_vma_check uses doesn't seem to cover all cases, in my opinion. Looking at the original code: if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) || (vma->vm_flags & VM_NOHUGEPAGE)) We can see that it's possible to have THP disabled system-wide, but still receive THPs in this vma. It seems to be assumed that just because khugepaged_always() == false, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG must be set, which is not the case. We could have VM_HUGEPAGE set, but have THP set to "never" system-wide, in which case the condition presented in the if will evaluate to false, and (provided the other checks pass) we can end up giving out a THP even though the behavior is set to "never." While we do properly check these flags in khugepaged_has_work, it looks like it's possible to sleep after we check khugepaged_has_work, but before hugepage_vma_check, during which time hugepages could have been disabled system-wide, in which case we could hand out THPs when we shouldn't be. This small fix makes hugepage_vma_check work more like transparent_hugepage_enabled, checking if THPs are set to "always" system-wide, then checking if THPs are set to "madvise," as well as making sure that VM_HUGEPAGE is set for this vma. 
Signed-off-by: Alex Thorlton Reported-by: Alex Thorlton Cc: Andrew Morton Cc: Mel Gorman Cc: Rik van Riel Cc: Ingo Molnar Cc: Peter Zijlstra Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 95d1acb..f62fba9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2394,7 +2394,8 @@ static struct page static bool hugepage_vma_check(struct vm_area_struct *vma) { - if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) || + if ((!khugepaged_always() || +(!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_req_madv())) || (vma->vm_flags & VM_NOHUGEPAGE)) return false; -- 1.7.12.4
Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
On Wed, Jan 22, 2014 at 10:26:21AM +, Mel Gorman wrote: > On Tue, Jan 14, 2014 at 01:38:01PM -0600, Alex Thorlton wrote: > > On Tue, Jan 14, 2014 at 03:44:57PM +, Mel Gorman wrote: > > > On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote: > > > > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote: > > > > > We already have the information to determine if a page is shared > > > > > across > > > > > nodes, Mel even had some prototype code to do splits under those > > > > > conditions. > > > > > > > > I'm aware that we can determine if pages are shared across nodes, but I > > > > thought that Mel's code to split pages under these conditions had some > > > > performance issues. I know I've seen the code that Mel wrote to do > > > > this, but I can't seem to dig it up right now. Could you point me to > > > > it? > > > > > > > > > > It was a lot of revisions ago! The git branches no longer exist but the > > > diff from the monolithic patches is below. The baseline was v3.10 and > > > this will no longer apply but you'll see the two places where I added a > > > split_huge_page and prevented khugepaged collapsing them again. > > > > Thanks, Mel. I remember seeing this code a while back when we were > > discussing THP/locality issues. > > > > > At the time, the performance with it applied was much worse but it was a > > > 10 minute patch as a distraction. There was a range of basic problems that > > > had to be tackled before there was any point looking at splitting THP due > > > to locality. I did not pursue it further and have not revisited it since. > > > > So, in your opinion, is this something we should look into further > > before moving towards the per-mm switch that I propose here? > > No because they have different purposes. Any potential split of THP from > automatic NUMA balancing context is due to it detecting that threads running > on CPUs on different nodes are accessing a THP. 
You are proposing to have > a per-mm flag that prevents THP being allocated in the first place. They > are two separate problems with decisions that are made at completely > different times. Makes sense. That discussion doesn't really fit here. > > I > > personally think that it will be tough to get this to perform as well as > > a method that totally disables THP when requested, or a method that > > tries to prevent THPs from being handed out in certain situations, since > > we'll be doing the work of both making and splitting a THP in the case > > where remote accesses are made to the page. > > I would expect that the alternative solution to a per-mm switch is to > reserve the naturally aligned pages for a THP promotion. Have a threshold > of pages that must be faulted before the full THP's worth of pages > is allocated, zero'd and a huge pmd established. That would defer the > THP setup costs until it was detected that it was necessary. I have some half-finished patches that I was working on a month or so ago, to do exactly this (I think you were involved with some of the discussion, maybe? I'd have to dig up the e-mails to be sure). After cycling through numerous other methods of handling this problem, I still like that idea, but I think it's going to require a decent amount of effort to get finished. > The per-mm THP switch is a massive hammer but not necessarily a bad one. I agree that it's a big hammer, but I suppose it's a bit more of a surgical strike than using the system-wide switch, and can be especially useful in some cases I've discussed, where madvise isn't really an option. > > I also think there could be some issues with over-zealously splitting > > pages, since it sounds like we can only determine if an access is from a > > remote node. We don't have a good way of determining how many accesses > > are remote vs. local, or how many separate nodes are accessing a page. > > > > Indeed not, but it's a different problem. 
We also do not know if the > remote accesses are to a single page in which case splitting it would > have zero benefit anyway. Agreed. There are several possible problems with that approach, but, as you've stated, that's a different problem, so no reason to discuss here :) Thanks for the input, Mel! - Alex
Re: [BUG] mm: thp: hugepage_vma_check has a blind spot
On Tue, Jan 21, 2014 at 03:24:08PM -0800, David Rientjes wrote: > On Tue, 21 Jan 2014, Alex Thorlton wrote: > > > hugepage_vma_check is called during khugepaged_scan_mm_slot to ensure > > that khugepaged doesn't try to allocate THPs in vmas where they are > > disallowed, either due to THPs being disabled system-wide, or through > > MADV_NOHUGEPAGE. > > > > The logic that hugepage_vma_check uses doesn't seem to cover all cases, > > in my opinion. Looking at the original code: > > > >if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) || > >(vma->vm_flags & VM_NOHUGEPAGE)) > > > > We can see that it's possible to have THP disabled system-wide, but still > > receive THPs in this vma. It seems that it's assumed that just because > > khugepaged_always == false, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG must be > > set, which is not the case. We could have VM_HUGEPAGE set, but have THP > > set to "never" system-wide, in which case, the condition presented in the > > if will evaluate to false, and (provided the other checks pass) we can > > end up giving out a THP even though the behavior is set to "never." > > > > You should be able to add a > > BUG_ON(current != khugepaged_thread); > > here since khugepaged is supposed to be the only caller to the function. > > > While we do properly check these flags in khugepaged_has_work, it looks > > like it's possible to sleep after we check khugepaged_has_work, but > > before hugepage_vma_check, during which time, hugepages could have been > > disabled system-wide, in which case, we could hand out THPs when we > > shouldn't be. > > > > You're talking about when thp is set to "never" and before khugepaged has > stopped, correct? Yes, that's correct. > That doesn't seem like a bug to me or anything that needs to be fixed, the > sysfs knob could be switched even after hugepage_vma_check() is called and > before a hugepage is actually collapsed so you have the same race. 
> > The only thing that's guaranteed is that, upon writing "never" to > /sys/kernel/mm/transparent_hugepage/enabled, no more thp memory will be > collapsed after khugepaged has stopped. That makes sense; I wasn't aware that that's the expected behavior here. I suppose this isn't something that needs to be changed, in that case. I needed the logic broken out a bit more explicitly (the madvise/never cases need to be handled separately) for a patch that I'm working on - that's when this caught my attention. Good to know that a change to the system-wide switch shouldn't affect khugepaged if it's already running. I would've screwed up that behavior with my patch :) Thanks, David! - Alex
Re: [PATCH 0/2] mm->def_flags cleanups (Was: Change khugepaged to respect MMF_THP_DISABLE flag)
On Wed, Jan 22, 2014 at 06:45:53PM +0100, Oleg Nesterov wrote: > Alex, Andrew, I think this simple series makes sense in any case, > but _perhaps_ it can also help THP_DISABLE. > > On 01/20, Alex Thorlton wrote: > > > > On Mon, Jan 20, 2014 at 09:15:25PM +0100, Oleg Nesterov wrote: > > > > > > Although I got lost a bit, and probably misunderstood... but it > > > seems to me that whatever you do this patch should not touch > > > khugepaged_scan_mm_slot. > > > > Maybe I've gotten myself confused as well :) After looking through the > > code some more, my understanding is that khugepaged_test_exit is used to > > make sure that __khugepaged_exit isn't running from underneath at certain > > times, so to have khugepaged_test_exit return true when __khugepaged_exit > > is not necessarily running, seems incorrect to me. > > Still can't understand... probably I need to see v3. I think the appropriate place to check this is actually in hugepage_vma_check, so you're correct that we don't need to directly touch khugepaged_scan_mm_slot, we just need to change a different function underneath it. We'll table that for now, though. I'll put out a v3 later today, so you can see what I'm talking about, but I think your idea looks like it will be better in the end. > But you know, I have another idea. Not sure you will like it, and probably > I missed something. > > Can't we simply add VM_NOHUGEPAGE into ->def_flags? See the (untested) > patch below, on top of this series. > > What do you think? At a glance, without testing, it looks like a good idea to me. By using def_flags, we leverage functionality that's already in place to achieve the same result. We don't need to add any new checks into the fault path or into khugepaged, since we're just leveraging the VM_HUGEPAGE/NOHUGEPAGE flag, which we already check for. We also get the behavior that you suggested (madvise is still respected, even with the new THP disable prctl set), for free with this method. 
I like the idea, but I think that it should probably be a separate change from the other few cleanups that you proposed along with it, since they're somewhat unrelated to this particular issue. Do you agree? Also, one small note on the code: > diff --git a/kernel/sys.c b/kernel/sys.c > index ac1842e..eb8b0fc 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2029,6 +2029,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, > arg2, unsigned long, arg3, > if (arg2 || arg3 || arg4 || arg5) > return -EINVAL; > return current->no_new_privs ? 1 : 0; > + case PR_SET_THP_DISABLE: > + case PR_GET_THP_DISABLE: > + down_write(&me->mm->mmap_sem); > + if (option == PR_SET_THP_DISABLE) { > + if (arg2) > + me->mm->def_flags |= VM_NOHUGEPAGE; > + else > + me->mm->def_flags &= ~VM_NOHUGEPAGE; > + } else { > + error = !!(me->mm->flags && VM_NOHUGEPAGE); Should be: error = !!(me->mm->def_flags && VM_NOHUGEPAGE); Correct? - Alex
Re: [PATCH 0/2] mm->def_flags cleanups (Was: Change khugepaged to respect MMF_THP_DISABLE flag)
On Wed, Jan 22, 2014 at 08:43:27PM +0100, Oleg Nesterov wrote: > On 01/22, Oleg Nesterov wrote: > > > > On 01/22, Alex Thorlton wrote: > > > > + case PR_SET_THP_DISABLE: > > > > + case PR_GET_THP_DISABLE: > > > > + down_write(&me->mm->mmap_sem); > > > > + if (option == PR_SET_THP_DISABLE) { > > > > + if (arg2) > > > > + me->mm->def_flags |= VM_NOHUGEPAGE; > > > > + else > > > > + me->mm->def_flags &= ~VM_NOHUGEPAGE; > > > > + } else { > > > > + error = !!(me->mm->flags && VM_NOHUGEPAGE); > > > > > > Should be: > > > > > > error = !!(me->mm->def_flags && VM_NOHUGEPAGE); > > > > No, we need to return 1 if this bit is set ;) > > Damn, you are right of course, we need "&". I didn't notice "&&" > in the patch I sent and misunderstood your "&&" above ;) Sorry. Actually, I didn't catch that either! Looking at it, though, we definitely do want bitwise AND here, not logical. However, what I was originally referring to is: Shouldn't we be checking mm->***def_flags*** for the VM_NOHUGEPAGE bit, as opposed to mm->flags? i.e. I think we want this: error = !!(me->mm->def_flags & VM_NOHUGEPAGE); As opposed to: error = !!(me->mm->flags && VM_NOHUGEPAGE); The way I understand it, the VM_NOHUGEPAGE bit is defined for mm->vma->flags, but not for mm->flags. Am I wrong here? - Alex
[BUG] Linux 3.14 fails to boot with new EFI changes
We've been hitting the following bug in the latest kernel, during boot: kernel BUG at arch/x86/mm/init_64.c:351! invalid opcode: [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-medusa-04156-g90804ed-dirty #750 Hardware name: Intel Corp. Stoutland Platform, BIOS 2.0 UEFI2.10 PI1.0 X64 2013-09-16 task: 88107c96c010 ti: 88107c96e000 task.ti: 88107c96e000 RIP: 0010:[] [] __init_extra_mapping+0x111/0x143 RSP: :88107c96fd18 EFLAGS: 00010206 RAX: 0e00 RBX: 880001c4a018 RCX: 0002 RDX: 88107fcd1e00 RSI: 0400 RDI: f800 RBP: 88107c96fd48 R08: R09: R10: 007e3b3c R11: R12: f800 R13: 0400 R14: 8800f800 R15: 8000 FS: () GS:88007320() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 8827fefff000 CR3: 017d7000 CR4: 06f0 Stack: 81fb 0040 b018 b010 b008 88107c96fd58 818aa71d 88107c96fe28 818a5e20 000cb748 0282 Call Trace: [] init_extra_mapping_uc+0x13/0x15 [] uv_system_init+0x22b/0x124b [] ? clockevents_register_device+0x138/0x13d [] ? setup_APIC_timer+0xc5/0xc7 [] ? clockevent_delta2ns+0xb/0xd [] ? setup_boot_APIC_clock+0x4a8/0x4b7 [] ? printk+0x72/0x74 [] native_smp_prepare_cpus+0x389/0x3d6 [] kernel_init_freeable+0xb7/0x1fb [] ? rest_init+0x74/0x74 [] kernel_init+0x9/0xff [] ret_from_fork+0x7c/0xb0 [] ? rest_init+0x74/0x74 Code: ff ff ff 3f 00 00 48 23 13 48 b8 00 00 00 00 00 88 ff ff 48 01 c2 4c 89 e0 48 c1 e8 12 25 f8 0f 00 00 48 01 c2 48 83 3a 00 74 04 <0f> 0b eb fe 48 8b 45 d0 49 81 ed 00 00 20 00 4c 09 e0 49 81 c4 RIP [] __init_extra_mapping+0x111/0x143 RSP I've bisected the issue down to this commit: commit d2f7cbe7b26a74dbbbf8f325b2a6fd01bc34032c Author: Borislav Petkov Date: Thu Oct 31 17:25:08 2013 +0100 x86/efi: Runtime services virtual mapping We map the EFI regions needed for runtime services non-contiguously, with preserved alignment on virtual addresses starting from -4G down for a total max space of 64G. This way, we provide for stable runtime services addresses across kernels so that a kexec'd kernel can still use
[RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent hugepages using prctl. It is based on my original patch to add a per-task_struct flag to disable THP: v1 - https://lkml.org/lkml/2013/8/2/671 v2 - https://lkml.org/lkml/2013/8/2/703 After looking at alternate methods of modifying how THPs are handed out, it sounds like people might be more in favor of this type of approach, so I'm re-introducing the patch. It seemed that everyone was in favor of moving this control over to the mm_struct, if it is to be implemented. That's the only major change here, aside from the added ability to both set and clear the flag from prctl. The main motivation behind this patch is to provide a way to disable THP for jobs where the code cannot be modified and using a malloc hook with madvise is not an option (i.e. statically allocated data). This patch allows us to do just that, without affecting other jobs running on the system. Here are some results showing the improvement that my test case gets when the MMF_THP_DISABLE flag is clear vs. 
set: MMF_THP_DISABLE clear: # perf stat -a -r 3 ./prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Performance counter stats for './prctl_wrapper_mm 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs): 267694862.049279 task-clock# 641.100 CPUs utilized ( +- 0.23% ) [100.00%] 908,846 context-switches #0.000 M/sec ( +- 0.23% ) [100.00%] 874 CPU-migrations#0.000 M/sec ( +- 4.01% ) [100.00%] 131,966 page-faults #0.000 M/sec ( +- 2.75% ) 351,127,909,744,906 cycles#1.312 GHz ( +- 0.27% ) [100.00%] 523,537,415,562,692 stalled-cycles-frontend # 149.10% frontend cycles idle ( +- 0.26% ) [100.00%] 392,400,753,609,156 stalled-cycles-backend# 111.75% backend cycles idle ( +- 0.29% ) [100.00%] 147,467,956,557,895 instructions #0.42 insns per cycle #3.55 stalled cycles per insn ( +- 0.09% ) [100.00%] 26,922,737,309,021 branches # 100.572 M/sec ( +- 0.24% ) [100.00%] 1,308,714,545 branch-misses #0.00% of all branches ( +- 0.18% ) 417.555688399 seconds time elapsed ( +- 0.23% ) MMF_THP_DISABLE set: # perf stat -a -r 3 ./prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Performance counter stats for './prctl_wrapper_mm 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs): 141674994.160138 task-clock# 642.107 CPUs utilized ( +- 0.23% ) [100.00%] 1,190,415 context-switches #0.000 M/sec ( +- 42.87% ) [100.00%] 688 CPU-migrations#0.000 M/sec ( +- 2.47% ) [100.00%] 62,394,646 page-faults #0.000 M/sec ( +- 0.00% ) 156,748,225,096,919 cycles#1.106 GHz ( +- 0.20% ) [100.00%] 211,440,354,290,433 stalled-cycles-frontend # 134.89% frontend cycles idle ( +- 0.40% ) [100.00%] 114,304,536,881,102 stalled-cycles-backend# 72.92% backend cycles idle ( +- 0.88% ) [100.00%] 179,939,084,230,732 instructions #1.15 insns per cycle #1.18 stalled cycles per insn ( +- 0.26% ) [100.00%] 26,659,099,949,509 branches # 188.171 M/sec ( +- 0.72% ) [100.00%] 762,772,361 branch-misses #0.00% of all branches ( +- 0.97% ) 220.640905073 seconds time elapsed ( +- 0.23% ) As you can see, this particular 
test gets about a 2x performance boost when THP is turned off. Here's a link to the test, along with the wrapper that I used: http://oss.sgi.com/projects/memtests/thp_pthread_mmprctl.tar.gz There are still a few things that might need tweaked here, but I wanted to get the patch out there to get a discussion started. Two things I noted from the old patch discussion that will likely need to be addressed are: * Patch doesn't currently account for get_user_pages or khugepaged allocations. Easy enough to fix, but it's not there yet. * Current behavior is to have fork()/exec()'d processes inherit the flag. Andrew Morton pointed out some possible issues with this, so we may need to rethink this behavior. - If parent process has THP disabled, and forks off a child, the child will also have THP disabled. This may not be the desired behavior. Signed-off-by: Alex Thorlton Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrew Morton Cc: "Kirill A. Shutemov" Cc: Benjamin He
Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
On Fri, Jan 10, 2014 at 10:23:10PM +0200, Kirill A. Shutemov wrote:
> Do you know what cause the difference? I prefer to fix THP instead of
> adding new knob to disable it.

The issue is that when you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be handed out, and the THP will be stuck on whatever node the chunk was originally referenced from. If many remote nodes need to do work on that same chunk, they'll be making remote accesses. With THP disabled, 4K pages can be handed out to separate nodes as they're needed, greatly reducing the number of remote accesses to memory. I give a bit better description here:

https://lkml.org/lkml/2013/8/27/397

I had been looking into better ways to handle this issue, but after spinning through a few other ideas:

- Per-cpuset flag to control THP: https://lkml.org/lkml/2013/6/10/331
- Threshold to determine when to hand out THPs: https://lkml.org/lkml/2013/12/12/394

we've arrived back here. Andrea seemed to think that this is an acceptable approach to solve the problem, at least as a starting point:

https://lkml.org/lkml/2013/12/17/397

I agree that we should, ideally, come up with a way to appropriately handle this problem in the kernel, but as of right now, it appears that that might be a rather large undertaking.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote:
> We already have the information to determine if a page is shared across
> nodes, Mel even had some prototype code to do splits under those
> conditions.

I'm aware that we can determine if pages are shared across nodes, but I thought that Mel's code to split pages under these conditions had some performance issues. I know I've seen the code that Mel wrote to do this, but I can't seem to dig it up right now. Could you point me to it?

- Alex
Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote: > On 01/10, Alex Thorlton wrote: > > > > This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent > > hugepages using prctl. It is based on my original patch to add a > > per-task_struct flag to disable THP: > > I leave the "whether we need this feature" to other reviewers, although > personally I think it probably makes sense anyway. > > But the patch doesn't look nice imho. > > > @@ -373,7 +373,15 @@ extern int get_dumpable(struct mm_struct *mm); > > #define MMF_HAS_UPROBES19 /* has uprobes */ > > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > +#define MMF_THP_DISABLE21 /* disable THP for this mm */ > > +#define MMF_THP_DISABLE_MASK (1 << MMF_THP_DISABLE) > > + > > +#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | > > MMF_DUMP_FILTER_MASK | MMF_THP_DISABLE_MASK) > > +#else > > #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | > > MMF_DUMP_FILTER_MASK) > > +#endif > > It would be nice to lessen the number of ifdef's. Why we can't define > MMF_THP_DISABLE unconditionally and include it into MMF_INIT_MASK? > Or define it == 0 if !CONFIG_THP. But this is minor. That's a good idea. I guess I was thinking that we don't want to define any THP specific stuff when THP isn't configured, but I guess it doesn't make much of a difference since the flag will never be set if THP isn't configured. > > +#define PR_SET_THP_DISABLE 41 > > +#define PR_CLEAR_THP_DISABLE 42 > > +#define PR_GET_THP_DISABLE 43 > > Why we can't add 2 PR_'s, set and get? See response below. > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -818,6 +818,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk) > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS > > mm->pmd_huge_pte = NULL; > > #endif > > + > > if (!mm_init(mm, tsk)) > > goto fail_nomem; > > Why? looks like the accidental change. Ah, yes. Didn't catch that when I looked over the patch. 
I'll fix that. > > > --- a/kernel/sys.c > > +++ b/kernel/sys.c > > @@ -1835,6 +1835,42 @@ static int prctl_get_tid_address(struct task_struct > > *me, int __user **tid_addr) > > } > > #endif > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > +static int prctl_set_thp_disable(struct task_struct *me) > > +{ > > + set_bit(MMF_THP_DISABLE, &me->mm->flags); > > + return 0; > > +} > > + > > +static int prctl_clear_thp_disable(struct task_struct *me) > > +{ > > + clear_bit(MMF_THP_DISABLE, &me->mm->flags); > > + return 0; > > +} > > + > > +static int prctl_get_thp_disable(struct task_struct *me, > > + int __user *thp_disabled) > > +{ > > + return put_user(test_bit(MMF_THP_DISABLE, &me->mm->flags), > > thp_disabled); > > +} > > +#else > > +static int prctl_set_thp_disable(struct task_struct *me) > > +{ > > + return -EINVAL; > > +} > > + > > +static int prctl_clear_thp_disable(struct task_struct *me) > > +{ > > + return -EINVAL; > > +} > > + > > +static int prctl_get_thp_disable(struct task_struct *me, > > + int __user *thp_disabled) > > +{ > > + return -EINVAL; > > +} > > +#endif > > + > > SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, > > arg3, > > unsigned long, arg4, unsigned long, arg5) > > { > > @@ -1998,6 +2034,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, > > arg2, unsigned long, arg3, > > if (arg2 || arg3 || arg4 || arg5) > > return -EINVAL; > > return current->no_new_privs ? 1 : 0; > > + case PR_SET_THP_DISABLE: > > + error = prctl_set_thp_disable(me); > > + break; > > + case PR_CLEAR_THP_DISABLE: > > + error = prctl_clear_thp_disable(me); > > + break; > > + case PR_GET_THP_DISABLE: > > + error = prctl_get_thp_disable(me, (int __user *) arg2); > > + break; > > default: > > error = -EINVAL; > > break; > > I simply can't understand, this all looks like overkill. Can't you simply add > > #idfef CONFIG_TRANSPARENT_HUGEPAGE
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Sat, Nov 23, 2013 at 12:09:24AM +, Mel Gorman wrote:
> On Fri, Nov 22, 2013 at 11:05:24PM +, Mel Gorman wrote:
> > On Fri, Nov 22, 2013 at 03:28:07PM -0600, Alex Thorlton wrote:
> > > > If the warning added by that patch does *not* trigger than can you also
> > > > test this patch? It removes the barriers which should not be necessary
> > > > and takes a reference tot he page before waiting on the lock. The
> > > > previous version did not take the reference because otherwise the
> > > > WARN_ON could not distinguish between a migration waiter and a surprise
> > > > gup.
> > >
> > > Sorry for the delay; been a bit busy. I tested both of these patches on
> > > top of this one (separately, of course):
> > >
> > > http://www.spinics.net/lists/linux-mm/msg63919.html
> > >
> > > I think that's the one you were referring to, if not send me a pointer
> > > to the correct one and I'll give it another shot. Both patches still
> > > segfaulted, so it doesn't appear that either of these solved the
> > > problem.
> >
> > I see. Does THP have to be enabled or does it segfault even with THP
> > disabled?

It occurs with both THP on, and off. I get RCU stalls with THP on though. That's probably related to not having Kirill/Naoya's patches applied though.

> On a semi-related note, is the large machine doing anything with xpmem
> or anything that depends on MMU notifiers to work properly? I noted
> while looking at this that THP migration is not invalidating pages which
> might be confusing a driver depending on it.

I'm not using xpmem on any of the machines that I've been testing on, and I don't think that anything should be using MMU notifiers.

- Alex
Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
On Sun, Jan 12, 2014 at 02:56:00PM +0100, Oleg Nesterov wrote:
> On 01/11, Alex Thorlton wrote:
> >
> > On Sat, Jan 11, 2014 at 04:53:37PM +0100, Oleg Nesterov wrote:
> > >
> > > I simply can't understand, this all looks like overkill. Can't you simply
> > > add
> > >
> > > #idfef CONFIG_TRANSPARENT_HUGEPAGE
> > > case GET:
> > > 	error = test_bit(MMF_THP_DISABLE);
> > > 	break;
> > > case PUT:
> > > 	if (arg2)
> > > 		set_bit();
> > > 	else
> > > 		clear_bit();
> > > 	break;
> > > #endif
> > >
> > > into sys_prctl() ?
> >
> > That's probably a better solution. I wasn't sure whether or not it was
> > better to have two functions to handle this, or to have one function
> > handle both. If you think it's better to just handle both with one,
> > that's easy enough to change.
>
> Personally I think sys_prctl() can handle this itself, without a helper.
> But of course I won't argue, this is up to you.
>
> My only point is, the kernel is already huge ;) Imho it makes sense to
> try to lessen the code size, when the logic is simple.

I agree with you here as well. There was a mixed bag of PRCTLs using helpers vs. ones that put the code right into sys_prctl. I just arbitrarily chose to use a helper here. I'll switch that over for v2.

- Alex
Re: [RFC PATCH] mm: thp: Add per-mm_struct flag to control THP
On Tue, Jan 14, 2014 at 03:44:57PM +, Mel Gorman wrote: > On Fri, Jan 10, 2014 at 04:39:09PM -0600, Alex Thorlton wrote: > > On Fri, Jan 10, 2014 at 11:10:10PM +0100, Peter Zijlstra wrote: > > > We already have the information to determine if a page is shared across > > > nodes, Mel even had some prototype code to do splits under those > > > conditions. > > > > I'm aware that we can determine if pages are shared across nodes, but I > > thought that Mel's code to split pages under these conditions had some > > performance issues. I know I've seen the code that Mel wrote to do > > this, but I can't seem to dig it up right now. Could you point me to > > it? > > > > It was a lot of revisions ago! The git branches no longer exist but the > diff from the monolithic patches is below. The baseline was v3.10 and > this will no longer apply but you'll see the two places where I added a > split_huge_page and prevented khugepaged collapsing them again. Thanks, Mel. I remember seeing this code a while back when we were discussing THP/locality issues. > At the time, the performance with it applied was much worse but it was a > 10 minute patch as a distraction. There was a range of basic problems that > had to be tackled before there was any point looking at splitting THP due > to locality. I did not pursue it further and have not revisited it since. So, in your opinion, is this something we should look into further before moving towards the per-mm switch that I propose here? I personally think that it will be tough to get this to perform as well as a method that totally disables THP when requested, or a method that tries to prevent THPs from being handed out in certain situations, since we'll be doing the work of both making and splitting a THP in the case where remote accesses are made to the page. I also think there could be some issues with over-zealously splitting pages, since it sounds like we can only determine if an access is from a remote node. 
We don't have a good way of determining how many accesses are remote vs. local, or how many separate nodes are accessing a page. For example, I can see this being a problem if we have a large multi-node system, where only two nodes are accessing a THP. We might end up splitting that THP, but if relatively few remote nodes are accessing it, it may not be worth the time. The split only seems worthwhile to me if the majority of accesses are remote, which sounds like it would be hard to determine.

One thing we could possibly do would be to add some structures to do a bit of accounting work into the mm_struct or some other appropriate location, then we could keep track of how many distinct remote nodes are accessing a THP and decide to split based on that. However, there's still the overhead to creating/splitting the THP, and the extra space/time needed to do the proper accounting work may be counterproductive (if this is even possible, I'm just thinking out loud here).

- Alex
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Wed, Nov 06, 2013 at 01:10:48PM +, Mel Gorman wrote:
> On Mon, Nov 04, 2013 at 02:03:46PM -0600, Alex Thorlton wrote:
> > On Mon, Nov 04, 2013 at 02:58:28PM +, Mel Gorman wrote:
> > > On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > > > Hi guys,
> > > >
> > > > I ran into a bug a week or so ago, that I believe has something to do
> > > > with NUMA balancing, but I'm having a tough time tracking down exactly
> > > > what is causing it. When running with the following configuration
> > > > options set:
> > > >
> > >
> > > Can you test with patches
> > > cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> > > applied? They fix some known memory corruption problems, were merged for
> > > 3.12 (so alternatively just test 3.12) and have been tagged for -stable.
> >
> > I just finished testing with 3.12, and I'm still seeing the same issue.
>
> Ok, I plugged your test into mmtests and ran it a few times but was not
> able to reproduce the same issue. It's a much smaller machine which
> might be a factor.
>
> > I'll poke around a bit more on this in the next few days and see if I
> > can come up with any more information. In the meantime, let me know if
> > you have any other suggestions.
>
> Try the following patch on top of 3.12. It's a patch that is expected to
> be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> -stable but corruption trumps performance and the full series is not
> going to be considered acceptable for -stable

I gave this patch a shot, and it didn't seem to solve the problem. Actually I'm running into what appear to be *worse* problems on the 3.12 kernel.
Here're a couple stack traces of what I get when I run the test on 3.12, 512 cores (these are just two of the CPUs, obviously, but most of the memscale processes appeared to be in one of these two spots):

Nov 7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6
Nov 7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-6-g0255d49 #381
Nov 7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
Nov 7 13:54:39 uvpsw1 kernel: task: 8810647e0300 ti: 88106413e000 task.ti: 88106413e000
Nov 7 13:54:39 uvpsw1 kernel: RIP: 0010:[] [] _raw_spin_lock+0x1a/0x25
Nov 7 13:54:39 uvpsw1 kernel: RSP: 0018:88106413fd38 EFLAGS: 0283
Nov 7 13:54:39 uvpsw1 kernel: RAX: a1a9a0fe RBX: 0206 RCX: 8800
Nov 7 13:54:41 uvpsw1 kernel: RDX: a1a9 RSI: 3000 RDI: 8907ded35494
Nov 7 13:54:41 uvpsw1 kernel: RBP: 88106413fd38 R08: 0006 R09: 0002
Nov 7 13:54:41 uvpsw1 kernel: R10: 0007 R11: 88106413ff40 R12: 8907ded35494
Nov 7 13:54:42 uvpsw1 kernel: R13: 88106413fe1c R14: 8810637a05f0 R15: 0206
Nov 7 13:54:42 uvpsw1 kernel: FS: 7fffd5def700() GS:88107d98() knlGS:
Nov 7 13:54:42 uvpsw1 kernel: CS: 0010 DS: ES: CR0: 8005003b
Nov 7 13:54:42 uvpsw1 kernel: CR2: 7fffd5ded000 CR3: 0107dfbcf000 CR4: 07e0
Nov 7 13:54:42 uvpsw1 kernel: Stack:
Nov 7 13:54:42 uvpsw1 kernel: 88106413fda8 810d670a 0002 0006
Nov 7 13:54:42 uvpsw1 kernel: 7fff57dde000 8810640e1cc0 02006413fe10 8907ded35440
Nov 7 13:54:45 uvpsw1 kernel: 88106413fda8 0206 0002
Nov 7 13:54:45 uvpsw1 kernel: Call Trace:
Nov 7 13:54:45 uvpsw1 kernel: [] follow_page_mask+0x123/0x3f1
Nov 7 13:54:45 uvpsw1 kernel: [] __get_user_pages+0x3e3/0x488
Nov 7 13:54:45 uvpsw1 kernel: [] get_user_pages+0x4d/0x4f
Nov 7 13:54:45 uvpsw1 kernel: [] SyS_get_mempolicy+0x1a9/0x3e0
Nov 7 13:54:45 uvpsw1 kernel: [] system_call_fastpath+0x16/0x1b
Nov 7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00

Nov 7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8
Nov 7 13:55:59 uvpsw1 kernel: INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs
Nov 7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-6-g0255d49 #381
Nov 7 13:56:04 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
Nov 7 13:56:04 uvpsw1 kernel: task: 881063c56380 ti: 8810621b8000 task.ti: 8810621b8000
Nov 7 13:56:04 uvpsw1 kernel: RIP: 0010
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> Hi guys,
>
> I ran into a bug a week or so ago, that I believe has something to do
> with NUMA balancing, but I'm having a tough time tracking down exactly
> what is causing it. When running with the following configuration
> options set:
>
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> CONFIG_NUMA_BALANCING=y
> # CONFIG_HUGETLBFS is not set
> # CONFIG_HUGETLB_PAGE is not set
>
> I get intermittent segfaults when running the memscale test that we've
> been using to test some of the THP changes. Here's a link to the test:
>
> ftp://shell.sgi.com/collect/memscale/

For anyone who's interested, this test has been moved to:

http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz

It should remain there permanently.

> I typically run the test with a line similar to this:
>
> ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
>
> Where <cores> is the number of cores to spawn threads on, and <memory>
> is the amount of memory to reserve from each core. The <memory> field
> can accept values like 512m or 1g, etc. I typically run 256 cores and
> 512m, though I think the problem should be reproducible on anything with
> 128+ cores.
>
> The test never seems to have any problems when running with hugetlbfs
> on and NUMA balancing off, but it segfaults every once in a while with
> the config options above. It seems to occur more frequently, the more
> cores you run on. It segfaults on about 50% of the runs at 256 cores,
> and on almost every run at 512 cores. The fewest number of cores I've
> seen a segfault on has been 128, though it seems to be rare on this many
> cores.
>
> At this point, I'm not familiar enough with NUMA balancing code to know
> what could be causing this, and we don't typically run with NUMA
> balancing on, so I don't see this in my everyday testing, but I felt
> that it was definitely worth bringing up.
> If anybody has any ideas of where I could poke around to find a
> solution, please let me know.
>
> - Alex
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Fri, Nov 08, 2013 at 11:20:54AM +, Mel Gorman wrote: > On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote: > > > Try the following patch on top of 3.12. It's a patch that is expected to > > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in > > > -stable but corruption trumps performance and the full series is not > > > going to be considered acceptable for -stable > > > > I gave this patch a shot, and it didn't seem to solve the problem. > > Actually I'm running into what appear to be *worse* problems on the 3.12 > > kernel. Here're a couple stack traces of what I get when I run the test > > on 3.12, 512 cores: > > > > Ok, so there are two issues at least. Whatever is causing your > corruption (which I still cannot reproduce) and the fact that 3.12 is > worse. The largest machine I've tested with is 40 cores. I'm trying to > get time on a 60 core machine to see if has a better chance. I will not > be able to get access to anything resembling 512 cores. At this point, the smallest machine I've been able to recreate this issue on has been a 128 core, but it's rare on a machine that small. I'll kick off a really long run on a 64 core over the weekend and see if I can hit it on there at all, but I haven't been able to previously. > > > (These are just two of the CPUs, obviously, but most of the memscale > > processes appeared to be in one of these two spots) > > > > Nov 7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6 > > Nov 7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not > > tainted 3.12.0-rc7-medusa-6-g0255d49 #381 > > Nov 7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. 
Stoutland > > Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20 > > Nov 7 13:54:39 uvpsw1 kernel: task: 8810647e0300 ti: 88106413e000 > > task.ti: 88106413e000 > > Nov 7 13:54:39 uvpsw1 kernel: RIP: 0010:[] > > [] _raw_spin_lock+0x1a/0x25 > > Nov 7 13:54:39 uvpsw1 kernel: RSP: 0018:88106413fd38 EFLAGS: 0283 > > Nov 7 13:54:39 uvpsw1 kernel: RAX: a1a9a0fe RBX: 0206 > > RCX: 8800 > > Nov 7 13:54:41 uvpsw1 kernel: RDX: a1a9 RSI: 3000 > > RDI: 8907ded35494 > > Nov 7 13:54:41 uvpsw1 kernel: RBP: 88106413fd38 R08: 0006 > > R09: 0002 > > Nov 7 13:54:41 uvpsw1 kernel: R10: 0007 R11: 88106413ff40 > > R12: 8907ded35494 > > Nov 7 13:54:42 uvpsw1 kernel: R13: 88106413fe1c R14: 8810637a05f0 > > R15: 0206 > > Nov 7 13:54:42 uvpsw1 kernel: FS: 7fffd5def700() > > GS:88107d98() knlGS: > > Nov 7 13:54:42 uvpsw1 kernel: CS: 0010 DS: ES: CR0: > > 8005003b > > Nov 7 13:54:42 uvpsw1 kernel: CR2: 7fffd5ded000 CR3: 0107dfbcf000 > > CR4: 07e0 > > Nov 7 13:54:42 uvpsw1 kernel: Stack: > > Nov 7 13:54:42 uvpsw1 kernel: 88106413fda8 810d670a > > 0002 0006 > > Nov 7 13:54:42 uvpsw1 kernel: 7fff57dde000 8810640e1cc0 > > 02006413fe10 8907ded35440 > > Nov 7 13:54:45 uvpsw1 kernel: 88106413fda8 0206 > > 0002 > > Nov 7 13:54:45 uvpsw1 kernel: Call Trace: > > Nov 7 13:54:45 uvpsw1 kernel: [] > > follow_page_mask+0x123/0x3f1 > > Nov 7 13:54:45 uvpsw1 kernel: [] > > __get_user_pages+0x3e3/0x488 > > Nov 7 13:54:45 uvpsw1 kernel: [] > > get_user_pages+0x4d/0x4f > > Nov 7 13:54:45 uvpsw1 kernel: [] > > SyS_get_mempolicy+0x1a9/0x3e0 > > Nov 7 13:54:45 uvpsw1 kernel: [] > > system_call_fastpath+0x16/0x1b > > Nov 7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 > > 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 > > 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 > > 00 00 > > > > That is probably the mm->page_table_lock being contended on. Kirill has > patches that split this which will help the scalability when THP is > enabled. 
They should be merged for 3.13-rc1. In itself it should not > cause bugs other than maybe soft lockups. > > > Nov 7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8 > > Nov 7 13:55:59 uvpsw1 kernel: INFO: NMI handler > > (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs > > Nov 7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not > > tainted 3.12.0-rc7-medusa-6-g0255d49 #381 &
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Fri, Nov 08, 2013 at 04:13:29PM -0600, Alex Thorlton wrote:
> On Fri, Nov 08, 2013 at 11:20:54AM +, Mel Gorman wrote:
> > On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > > -stable but corruption trumps performance and the full series is not
> > > > going to be considered acceptable for -stable
> > >
> > > I gave this patch a shot, and it didn't seem to solve the problem.
> > > Actually I'm running into what appear to be *worse* problems on the 3.12
> > > kernel. Here're a couple stack traces of what I get when I run the test
> > > on 3.12, 512 cores:
> >
> > Ok, so there are two issues at least. Whatever is causing your
> > corruption (which I still cannot reproduce) and the fact that 3.12 is
> > worse. The largest machine I've tested with is 40 cores. I'm trying to
> > get time on a 60 core machine to see if has a better chance. I will not
> > be able to get access to anything resembling 512 cores.
>
> At this point, the smallest machine I've been able to recreate this
> issue on has been a 128 core, but it's rare on a machine that small.
> I'll kick off a really long run on a 64 core over the weekend and see if
> I can hit it on there at all, but I haven't been able to previously.

Just a quick update, I ran this test 500 times on 64 cores, allocating 512m per core, and every single test completed successfully. At this point, it looks like you definitely need at least 128 cores to reproduce the issue.

- Alex
BUG: mm, numa: test segfaults, only when NUMA balancing is on
Hi guys,

I ran into a bug a week or so ago, that I believe has something to do with NUMA balancing, but I'm having a tough time tracking down exactly what is causing it. When running with the following configuration options set:

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA_BALANCING=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set

I get intermittent segfaults when running the memscale test that we've been using to test some of the THP changes. Here's a link to the test:

ftp://shell.sgi.com/collect/memscale/

I typically run the test with a line similar to this:

./thp_memscale -C 0 -m 0 -c <cores> -b <memory>

Where <cores> is the number of cores to spawn threads on, and <memory> is the amount of memory to reserve from each core. The <memory> field can accept values like 512m or 1g, etc. I typically run 256 cores and 512m, though I think the problem should be reproducible on anything with 128+ cores.

The test never seems to have any problems when running with hugetlbfs on and NUMA balancing off, but it segfaults every once in a while with the config options above. It seems to occur more frequently, the more cores you run on. It segfaults on about 50% of the runs at 256 cores, and on almost every run at 512 cores. The fewest number of cores I've seen a segfault on has been 128, though it seems to be rare on this many cores.

At this point, I'm not familiar enough with the NUMA balancing code to know what could be causing this, and we don't typically run with NUMA balancing on, so I don't see this in my everyday testing, but I felt that it was definitely worth bringing up.

If anybody has any ideas of where I could poke around to find a solution, please let me know.

- Alex
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
On Thu, Oct 17, 2013 at 07:30:58PM +0800, Bob Liu wrote:
> Hi Alex,
>
> On Wed, Oct 16, 2013 at 11:54 PM, Alex Thorlton wrote:
> > Hi guys,
> >
> > I ran into a bug a week or so ago, that I believe has something to do
> > with NUMA balancing, but I'm having a tough time tracking down exactly
> > what is causing it. When running with the following configuration
> > options set:
> >
> > CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> > CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> > CONFIG_NUMA_BALANCING=y
> > # CONFIG_HUGETLBFS is not set
> > # CONFIG_HUGETLB_PAGE is not set
>
> What's your kernel version?
> And did you enable CONFIG_TRANSPARENT_HUGEPAGE ?

Ah, two important things that I forgot to include! The kernel I originally spotted the problem on was 3.11, and it continued to be an issue up through 3.12-rc4, but after running a 30-trial run of the test today, it appears that the issue must have cleared up after the 3.12-rc5 release on Monday. I'll still include the requested information, but I guess this is no longer an issue.

I rolled all the way back to 3.7 while researching the issue, and that appears to be the last kernel where the problem didn't show up. 3.8 is the first kernel where the bug appears; I believe this is also the kernel where NUMA balancing was officially introduced.

Here are my settings related to THP:

CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set

I did most of my testing with THP set to "always", although the problem still occurs with THP set to "never".

> > I get intermittent segfaults when running the memscale test that we've
> > been using to test some of the THP changes.
> > Here's a link to the test:
> >
> > ftp://shell.sgi.com/collect/memscale/
> >
> > I typically run the test with a line similar to this:
> >
> > ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
> >
> > Where <cores> is the number of cores to spawn threads on, and <memory>
> > is the amount of memory to reserve from each core. The <memory> field
> > can accept values like 512m or 1g, etc. I typically run 256 cores and
> > 512m, though I think the problem should be reproducible on anything with
> > 128+ cores.
> >
> > The test never seems to have any problems when running with hugetlbfs
> > on and NUMA balancing off, but it segfaults every once in a while with
> > the config options above. It seems to occur more frequently, the more
> > cores you run on. It segfaults on about 50% of the runs at 256 cores,
> > and on almost every run at 512 cores. The fewest number of cores I've
> > seen a segfault on has been 128, though it seems to be rare on this many
> > cores.
>
> Could you please attach some logs?

Here are the relevant chunks from the syslog for a 10-shot run at 256 cores; each chunk is from a separate run.
4 out of 10 failed with segfaults:

Oct 17 11:36:41 harp83-sys kernel: thp_memscale[21566]: segfault at 0 ip (null) sp 7ff8531fcdc0 error 14 in thp_memscale[40+5000]
Oct 17 11:36:41 harp83-sys kernel: thp_memscale[21565]: segfault at 0 ip (null) sp 7ff8539fddc0 error 14 in thp_memscale[40+5000]
---
Oct 17 12:08:14 harp83-sys kernel: thp_memscale[22893]: segfault at 0 ip (null) sp 7ff69cffddc0 error 14 in thp_memscale[40+5000]
---
Oct 17 12:26:30 harp83-sys kernel: thp_memscale[23995]: segfault at 0 ip (null) sp 7fe7af1fcdc0 error 14 in thp_memscale[40+5000]
Oct 17 12:26:30 harp83-sys kernel: thp_memscale[23994]: segfault at 0 ip (null) sp 7fe7af9fddc0 error 14 in thp_memscale[40+5000]
---
Oct 17 12:32:29 harp83-sys kernel: thp_memscale[24116]: segfault at 0 ip (null) sp 7ff77a9fcdc0 error 14 in thp_memscale[40+5000]

Since this has cleared up in the latest release, I won't be pursuing the issue any further (though I'll keep an eye out to make sure that it doesn't show back up). I am, however, still curious as to what the cause of the problem was...

> > At this point, I'm not familiar enough with NUMA balancing code to know
> > what could be causing this, and we don't typically run with NUMA
> > balancing on, so I don't see this in my everyday testing, but I felt
> > that it was definitely worth bringing up.
> >
> > If anybody has any ideas of where I could poke around to find a
> > solution, please let me know.
>
> --
> Regards,
> --Bob
Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
> If the warning added by that patch does *not* trigger, then can you also
> test this patch? It removes the barriers, which should not be necessary,
> and takes a reference to the page before waiting on the lock. The
> previous version did not take the reference because otherwise the
> WARN_ON could not distinguish between a migration waiter and a surprise
> gup.

Sorry for the delay; been a bit busy.

I tested both of these patches on top of this one (separately, of course):

http://www.spinics.net/lists/linux-mm/msg63919.html

I think that's the one you were referring to; if not, send me a pointer to the correct one and I'll give it another shot.

Both patches still segfaulted, so it doesn't appear that either of these solved the problem. If you have anything else you'd like for me to try, let me know.

- Alex
[RFC PATCH 2/3] Add tunable to control THP behavior
This part of the patch adds a tunable to /sys/kernel/mm/transparent_hugepage called threshold. This threshold determines how many pages a user must fault in from a single node before a temporary compound page is turned into a THP. Signed-off-by: Alex Thorlton Cc: Andrew Morton Cc: Nate Zimmer Cc: Cliff Wickman Cc: "Kirill A. Shutemov" Cc: Benjamin Herrenschmidt Cc: Rik van Riel Cc: Wanpeng Li Cc: Mel Gorman Cc: Michel Lespinasse Cc: Benjamin LaHaise Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Al Viro Cc: David Rientjes Cc: Zhang Yanfei Cc: Peter Zijlstra Cc: Johannes Weiner Cc: Michal Hocko Cc: Jiang Liu Cc: Cody P Schafer Cc: Glauber Costa Cc: Kamezawa Hiroyuki Cc: David Rientjes Cc: Naoya Horiguchi Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/huge_mm.h | 2 ++ include/linux/mm_types.h | 1 + mm/huge_memory.c | 30 ++ 3 files changed, 33 insertions(+) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 3935428..0943b1b6 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -177,6 +177,8 @@ static inline struct page *compound_trans_head(struct page *page) extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pmd_t pmd, pmd_t *pmdp); +extern int thp_threshold_check(void); + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d9851ee..b5efa23 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -408,6 +408,7 @@ struct mm_struct { #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE pgtable_t pmd_huge_pte; /* protected by page_table_lock */ + int thp_threshold; #endif #ifdef CONFIG_CPUMASK_OFFSTACK struct cpumask cpumask_allocation; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cca80d9..5d388e4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -44,6 +44,9 @@ unsigned 
long transparent_hugepage_flags __read_mostly = (1< HPAGE_PMD_NR)
+		return -EINVAL;
+
+	transparent_hugepage_threshold = value;
+
+	return count;
+}
+static struct kobj_attribute threshold_attr =
+	__ATTR(threshold, 0644, threshold_show, threshold_store);
 #ifdef CONFIG_DEBUG_VM
 static ssize_t debug_cow_show(struct kobject *kobj,
 			      struct kobj_attribute *attr, char *buf)
@@ -397,6 +426,7 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
+	&threshold_attr.attr,
 	&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
--
1.7.12.4
[RFC PATCH 0/3] Change how we determine when to hand out THPs
to be cleaned up a bit Just to let people know, I've been doing most of my testing with the memscale test: http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz The pthread test hits the first bug I mentioned here much more often, but the patch seems to be more stable when tested with memscale. I typically run something like this to test: # ./thp_memscale -C 0 -m 0 -c 32 -b 16m As you increase the amount of memory/number of cores, you become more likely to run into issues. Although there's still work to be done here, I wanted to get an early version of the patch out so that everyone could give their opinions/suggestions. The patch should apply cleanly to the 3.12 kernel. I'll rebase it as soon as some of the remaining issues have been sorted out, this will also mean changing over to the split PTL where appropriate. Signed-off-by: Alex Thorlton Cc: Andrew Morton Cc: Nate Zimmer Cc: Cliff Wickman Cc: "Kirill A. Shutemov" Cc: Benjamin Herrenschmidt Cc: Rik van Riel Cc: Wanpeng Li Cc: Mel Gorman Cc: Michel Lespinasse Cc: Benjamin LaHaise Cc: Oleg Nesterov Cc: "Eric W. 
Biederman" Cc: Andy Lutomirski Cc: Al Viro Cc: David Rientjes Cc: Zhang Yanfei Cc: Peter Zijlstra Cc: Johannes Weiner Cc: Michal Hocko Cc: Jiang Liu Cc: Cody P Schafer Cc: Glauber Costa Cc: Kamezawa Hiroyuki Cc: David Rientjes Cc: Naoya Horiguchi Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org

Alex Thorlton (3):
  Add flags for temporary compound pages
  Add tunable to control THP behavior
  Change THP behavior

 include/linux/gfp.h      |   5 +
 include/linux/huge_mm.h  |   8 ++
 include/linux/mm_types.h |  14 +++
 kernel/fork.c            |   1 +
 mm/huge_memory.c         | 313 +++
 mm/internal.h            |   1 +
 mm/memory.c              |  29 -
 mm/page_alloc.c          |  66 +-
 8 files changed, 430 insertions(+), 7 deletions(-)

--
1.7.12.4
[RFC PATCH 3/3] Change THP behavior
This patch implements the functionality we're really going for here. It adds the decision making behavior to determine when to grab a temporary compound page, and whether or not to fault in single pages or to turn the temporary page into a THP. This one is rather large, might split it up a bit more for later versions I've left most of my comments in here just to provide people with some insight into what I may have been thinking when I chose to do something in a certain way. These will probably be trimmed down in later versions of the patch. Signed-off-by: Alex Thorlton Cc: Andrew Morton Cc: Nate Zimmer Cc: Cliff Wickman Cc: "Kirill A. Shutemov" Cc: Benjamin Herrenschmidt Cc: Rik van Riel Cc: Wanpeng Li Cc: Mel Gorman Cc: Michel Lespinasse Cc: Benjamin LaHaise Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Al Viro Cc: David Rientjes Cc: Zhang Yanfei Cc: Peter Zijlstra Cc: Johannes Weiner Cc: Michal Hocko Cc: Jiang Liu Cc: Cody P Schafer Cc: Glauber Costa Cc: Kamezawa Hiroyuki Cc: David Rientjes Cc: Naoya Horiguchi Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/huge_mm.h | 6 + include/linux/mm_types.h | 13 +++ kernel/fork.c| 1 + mm/huge_memory.c | 283 +++ mm/internal.h| 1 + mm/memory.c | 29 - mm/page_alloc.c | 66 ++- 7 files changed, 392 insertions(+), 7 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0943b1b6..c1e407d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags); +extern struct temp_hugepage *find_pmd_mm_freelist(struct mm_struct *mm, +pmd_t *pmd); +extern int do_huge_pmd_temp_page(struct mm_struct *mm, +struct vm_area_struct *vma, +unsigned long address, pmd_t *pmd, +unsigned int flags); extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned 
long addr, struct vm_area_struct *vma); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index b5efa23..d48c6ab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -322,6 +322,17 @@ struct mm_rss_stat { atomic_long_t count[NR_MM_COUNTERS]; }; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +struct temp_hugepage { + pmd_t *pmd; + struct page *page; + spinlock_t temp_hugepage_lock; + int node; /* node id of the first page in the chunk */ + int ref_count; /* number of pages faulted in from the chunk */ + struct list_head list; /* pointers to next/prev chunks */ +}; +#endif + struct kioctx_table; struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ @@ -408,7 +419,9 @@ struct mm_struct { #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE pgtable_t pmd_huge_pte; /* protected by page_table_lock */ + spinlock_t thp_list_lock; /* lock to protect thp_temp_list */ int thp_threshold; + struct list_head thp_temp_list; /* list of 512 page chunks for THPs */ #endif #ifdef CONFIG_CPUMASK_OFFSTACK struct cpumask cpumask_allocation; diff --git a/kernel/fork.c b/kernel/fork.c index 086fe73..a3ccf857 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -816,6 +816,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk) #ifdef CONFIG_TRANSPARENT_HUGEPAGE mm->pmd_huge_pte = NULL; + INIT_LIST_HEAD(&mm->thp_temp_list); #endif #ifdef CONFIG_NUMA_BALANCING mm->first_nid = NUMA_PTE_SCAN_INIT; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5d388e4..43ea095 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -788,6 +788,20 @@ static inline struct page *alloc_hugepage_vma(int defrag, HPAGE_PMD_ORDER, vma, haddr, nd); } +static inline gfp_t alloc_temp_hugepage_gfpmask(gfp_t extra_gfp) +{ + return GFP_TEMP_TRANSHUGE | extra_gfp; +} + +static inline struct page *alloc_temp_hugepage_vma(int defrag, + struct vm_area_struct *vma, + unsigned long haddr, int nd, + gfp_t extra_gfp) +{ + return alloc_pages_vma(alloc_temp_hugepage_gfpmask(extra_gfp), + 
HPAGE_PMD_ORDER, vma, haddr, nd); +} + #ifndef CONFIG_NUMA static inline struct page *alloc_hugepa
[RFC PATCH 1/3] Add flags for temporary compound pages
This chunk of the patch just adds the necessary page flags to support what I call a temporary compound page. Temporary compound pages provide us with the ability to turn a chunk of pages into a proper THP at some point after pulling them off the free list, but also allow us to fault in single 4K pages from the chunk. Signed-off-by: Alex Thorlton Cc: Andrew Morton Cc: Nate Zimmer Cc: Cliff Wickman Cc: "Kirill A. Shutemov" Cc: Benjamin Herrenschmidt Cc: Rik van Riel Cc: Wanpeng Li Cc: Mel Gorman Cc: Michel Lespinasse Cc: Benjamin LaHaise Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Al Viro Cc: David Rientjes Cc: Zhang Yanfei Cc: Peter Zijlstra Cc: Johannes Weiner Cc: Michal Hocko Cc: Jiang Liu Cc: Cody P Schafer Cc: Glauber Costa Cc: Kamezawa Hiroyuki Cc: David Rientjes Cc: Naoya Horiguchi Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org --- include/linux/gfp.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 9b4dd49..6ec63dd 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x40u #define ___GFP_OTHER_NODE 0x80u #define ___GFP_WRITE 0x100u +#define ___GFP_COMP_TEMP 0x200u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -77,6 +78,7 @@ struct vm_area_struct; #define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) /* See above */ #define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */ #define __GFP_COMP ((__force gfp_t)___GFP_COMP)/* Add compound page metadata */ +#define __GFP_COMP_TEMP((__force gfp_t)___GFP_COMP_TEMP) /* Treat as temp compound page */ #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)/* Return zeroed page on success */ #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves. 
* This takes precedence over the @@ -121,6 +123,9 @@ struct vm_area_struct; #define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ __GFP_NO_KSWAPD) +#define GFP_TEMP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP_TEMP | \ +__GFP_NOMEMALLOC | __GFP_NORETRY | \ +__GFP_NOWARN | __GFP_NO_KSWAPD) #ifdef CONFIG_NUMA #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) -- 1.7.12.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
Ugggh. Looks like mutt clobbered the message ID on the cover letter, screwing up the reply chain. Sorry about that :/

- Alex
Re: [RFC PATCH 2/3] Add tunable to control THP behavior
> Is there a setting that will turn off the must-be-the-same-node
> behavior? There are workloads where TLB matters more than cross-node
> traffic (or where all the pages are hopelessly shared between nodes,
> but hugepages are still useful).

That's pretty much how THPs already behave in the kernel, so if you want to allow THPs to be handed out to one node, but referenced from many others, you'd just set the threshold to 1, and let the existing code take over.

As for the must-be-the-same-node behavior: I'd actually say it's more like a "must have so much on one node" behavior, in that, if you set the threshold to 16, for example, 16 4K pages must be faulted in on the same node, in the same contiguous 2M chunk, before a THP will be created. What happens after that THP is created is out of our control; it could be referenced from anywhere.

The idea here is that we can tune things so that jobs that behave poorly with THP on will not be given THPs, but the jobs that like THPs can still get them. Granted, there are still issues with this approach, but I think it's a bit better than just handing out a THP because we touched one byte in a 2M chunk.

- Alex
Re: [RFC PATCH 2/3] Add tunable to control THP behavior
> Right. I like that behavior for my workload. (Although I currently
> allocate huge pages -- when I wrote that code, THP interacted so badly
> with pagecache that it was a non-starter. I think it's fixed now,
> though.)

In that case, it's probably best to just stick with current behavior, and leave the threshold at 1, unless we implement something like I discuss below.

> In that case, I guess I misunderstood your description. Are you saying
> that, once any node accesses this many pages in the potential THP,
> then the whole THP will be mapped?

Well, right now, this patch completely gives up on mapping a THP if two different nodes take a page from our chunk before the threshold is hit, so I think you're mostly understanding it correctly.

One thing we could consider is adding an option to map the THP on the node with the *most* references to the potential THP, instead of giving up on mapping the THP when multiple nodes reference it. That might be a good middle ground, but I can see some performance issues coming into play there if the threshold is set too high, since we'll have to move all the pages in the chunk to the node that hit the threshold.

- Alex
Re: [RFC PATCH 2/3] Add tunable to control THP behavior
On Thu, Dec 12, 2013 at 04:37:11PM -0500, Rik van Riel wrote: > On 12/12/2013 01:00 PM, Alex Thorlton wrote: > >This part of the patch adds a tunable to > >/sys/kernel/mm/transparent_hugepage called threshold. This threshold > >determines how many pages a user must fault in from a single node before > >a temporary compound page is turned into a THP. > > >+++ b/mm/huge_memory.c > >@@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly = > > (1< > (1< > > >+/* default to 1 page threshold for handing out thps; maintains old behavior > >*/ > >+static int transparent_hugepage_threshold = 1; > > I assume the motivation for writing all this code is that "1" > was not a good value in your tests. Yes, that's correct. > That makes me wonder, why should 1 be the default value with > your patches? The main reason I set the default to 1 was because the majority of jobs aren't hurt by the existing THP behavior. I figured it would be best to default to having things behave the same as they do now, but provide the option to increase the threshold on systems that run jobs that could be adversely affected by the current behavior. > If there is a better value, why should we not use that? > > What is the upside of using a better value? > > What is the downside? The problem here is that what the "better" value is can vary greatly depending on how a particular task allocates memory. Setting the threshold too high can negatively affect the performance of jobs that behave well with the current behavior, setting it too low won't yield a performance increase for the jobs that are hurt by the current behavior. With some more thorough testing, I'm sure that we could arrive at a value that will help out jobs which behave poorly under current conditions, while having a minimal effect on jobs that already perform well. 
At this point, I'm looking more to ensure that everybody likes this approach to solving the problem before putting the finishing touches on the patches, and doing testing to find a good middle ground.

> Is there a value that would bound the downside, so it
> is almost always smaller than the upside?

Again, the problem here is that, to find a good value, we have to know quite a bit about why a particular value is bad for a particular job. While, as stated above, I think we can probably find a good middle ground to use as a default, in the end it will be the job of individual sysadmins to determine what value works best for their particular applications, and tune things accordingly.

- Alex
Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
> Please cc Andrea on this. I'm going to clean up a few small things for a v2 pretty soon, I'll be sure to cc Andrea there. > > My proposed solution to the problem is to allow users to set a > > threshold at which THPs will be handed out. The idea here is that, when > > a user faults in a page in an area where they would usually be handed a > > THP, we pull 512 pages off the free list, as we would with a regular > > THP, but we only fault in single pages from that chunk, until the user > > has faulted in enough pages to pass the threshold we've set. Once they > > pass the threshold, we do the necessary work to turn our 512 page chunk > > into a proper THP. As it stands now, if the user tries to fault in > > pages from different nodes, we completely give up on ever turning a > > particular chunk into a THP, and just fault in the 4K pages as they're > > requested. We may want to make this tunable in the future (i.e. allow > > them to fault in from only 2 different nodes). > > OK. But all 512 pages reside on the same node, yes? Whereas with thp > disabled those 512 pages would have resided closer to the CPUs which > instantiated them. As it stands right now, yes, since we're pulling a 512 page contiguous chunk off the free list, everything from that chunk will reside on the same node, but as I (stupidly) forgot to mention in my original e-mail, one piece I have yet to add is the functionality to put the remaining unfaulted pages from our chunk *back* on the free list after we give up on handing out a THP. Once this is in there, things will behave more like they do when THP is turned completely off, i.e. pages will get faulted in closer to the CPU that first referenced them once we give up on handing out the THP. > So the expected result will be somewhere in between > the 93 secs and the 76 secs? Yes. Due to the time it takes to search for the temporary THP, I'm sure we won't get down to 76 secs, but hopefully we'll get close. 
I'm also considering switching the linked list that stores the temporary THPs over to an rbtree to make that search faster, just fyi.

> That being said, I don't see a downside to the idea, apart from some
> additional setup cost in kernel code.

Good to hear. I still need to address some of the issues that others have raised, and finish up the few pieces that aren't fully working/finished. I'll get things polished up and get some more informative test results out soon.

Thanks for looking at the patch!

- Alex
Re: [RFC PATCH 3/3] Change THP behavior
> Hm. I think this part is not correct: you collapse the temp thp page
> into a real one only for the current process. What will happen if a
> process with temp thp pages is forked?

That's a scenario that I hadn't yet addressed, but definitely something I'll consider going forward. I think we can come up with a way to appropriately handle that situation.

> And I don't think this problem is an easy one. khugepaged can't collapse
> pages with page->_count != 1 for the same reason: to do it properly you
> need to take mmap_sem for all processes and collapse all pages at once.
> And if a page is pinned, we also can't collapse.

Again, a few things here that I hadn't taken into account. I'll look for a way to address these concerns.

> Sorry, I don't think the whole idea has much potential. :(

I understand that there are some issues with this initial pass at the idea, but I think we can get things corrected and come up with a workable solution here. When it comes down to it, there are relatively few options to correct the issue that I'm focusing on here, and this one seems to be the most appropriate one that we've tried so far. We've looked at disabling THP on a per-cpuset/per-process basis, and that has been met with fairly strong resistance, for good reason.

I'll take another pass at this and hopefully be able to address some of your concerns with the idea. Thanks for taking the time to look the patch over!

- Alex
[RFC PATCHv2 1/2] Add mm flag to control THP
addressed. From what I know, get_user_pages() calls down to handle_mm_fault, which should prevent THPs from being handed out where necessary. If anybody can confirm that, it would be appreciated.

* Current behavior is to have fork()/exec()'d processes inherit the flag. Andrew Morton pointed out some possible issues with this, so we may need to rethink this behavior.
  - If the parent process has THP disabled, and forks off a child, the child will also have THP disabled. This may not be the desired behavior.

Signed-off-by: Alex Thorlton Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrew Morton Cc: "Kirill A. Shutemov" Cc: Benjamin Herrenschmidt Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Al Viro Cc: Kees Cook Cc: Andrea Arcangeli Cc: linux-kernel@vger.kernel.org --- include/linux/huge_mm.h| 6 -- include/linux/sched.h | 6 +- include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 11 +++ 4 files changed, 23 insertions(+), 3 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 91672e2..475f59f 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -1,6 +1,8 @@ #ifndef _LINUX_HUGE_MM_H #define _LINUX_HUGE_MM_H +#include + extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, @@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma); (1<vm_flags & VM_HUGEPAGE))) && \ !((__vma)->vm_flags & VM_NOHUGEPAGE) &&\ -!is_vma_temporary_stack(__vma)) +!is_vma_temporary_stack(__vma) && \ +!test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags)) #define transparent_hugepage_defrag(__vma) \ ((transparent_hugepage_flags & \ (1<no_new_privs ?
 			1 : 0;
+	case PR_SET_THP_DISABLE:
+		if (arg2)
+			set_bit(MMF_THP_DISABLE, &me->mm->flags);
+		else
+			clear_bit(MMF_THP_DISABLE, &me->mm->flags);
+		break;
+	case PR_GET_THP_DISABLE:
+		error = put_user(test_bit(MMF_THP_DISABLE,
+					  &me->mm->flags),
+				 (int __user *) arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
--
1.7.12.4
[RFC PATCHv2 2/2] Change khugepaged to respect MMF_THP_DISABLE flag
This just adds a simple check to get khugepaged to behave appropriately when MMF_THP_DISABLE is set.

Signed-off-by: Alex Thorlton Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Andrew Morton Cc: "Kirill A. Shutemov" Cc: Benjamin Herrenschmidt Cc: Rik van Riel Cc: Naoya Horiguchi Cc: Oleg Nesterov Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Al Viro Cc: Kees Cook Cc: Andrea Arcangeli Cc: linux-kernel@vger.kernel.org

---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c0b172..3cfe6b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2049,7 +2049,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	return atomic_read(&mm->mm_users) == 0 ||
+	       (mm->flags & MMF_THP_DISABLE_MASK);
 }

 int __khugepaged_enter(struct mm_struct *mm)
--
1.7.12.4