Re: [RFC PATCH 8/8] powerpc/64s/radix: Only flush local TLB for spurious fault flushes

2017-09-08 Thread Nicholas Piggin
On Fri, 8 Sep 2017 14:44:37 +1000
Nicholas Piggin  wrote:

> On Fri, 08 Sep 2017 08:05:38 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > On Fri, 2017-09-08 at 00:51 +1000, Nicholas Piggin wrote:  
> > > When permissiveness is relaxed, or found to have been relaxed by
> > > another thread, we flush that address out of the TLB to avoid a
> > > future fault or micro-fault due to a stale TLB entry.
> > > 
> > > Currently for processes with TLBs on other CPUs, this flush is always
> > > done with a global tlbie. Although that could reduce faults on remote
> > > CPUs, a broadcast operation seems to be wasteful for something that
> > > can be handled in-core by the remote CPU if it comes to it.
> > > 
> > > This is not benchmarked yet. It does seem to cut some tlbie operations
> > > from the bus.
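> > > 
> > > In the flush-on-spurious-fault path the change is essentially the
> > > following (sketch only, not the exact hunk):
> > > 
> > > 	/*
> > > 	 * Sketch: after relaxing a PTE's permissions, flush only the
> > > 	 * local TLB instead of calling radix__flush_tlb_page() (which
> > > 	 * may broadcast a tlbie). A remote CPU still holding the stale
> > > 	 * entry will take its own spurious fault and repeat this locally.
> > > 	 */
> > > 	radix__local_flush_tlb_page(vma, address);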
> > 
> > What happens with the nest MMU here ?  
> 
> Good question, I'm not sure. I can't tell from the UM whether the
> agent and NMMU must discard a cached translation when it takes a
> permission fault. It's not clear from what I've read whether they
> rely on the host to send back a tlbie.
> 
> I'll keep digging.

Okay, after talking to Alistair: for NPU/CAPI this is a concern, because
the NMMU may not walk page tables if it has a cached translation. Have
to confirm that with hardware.

So we've got them wanting to hook into the core code and make it give
them tlbies.

For now I'll put this on hold until NPU, CAPI, and VAS get their
patches in and working, then maybe revisit. Might be good to get them
all moved to using a common nmmu.c driver that gives them a page fault
handler and uses mmu notifiers to do their nmmu invalidations --
something like the sketch below.
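
Roughly the shape I have in mind, purely as a sketch (all the nmmu_*
names are made up):

#include <linux/mmu_notifier.h>

/*
 * Sketch only: a common nmmu.c driver would register one of these per
 * attached mm and issue the nMMU/agent invalidations itself, instead of
 * relying on tlbie coming out of the core MMU flush code.
 */
struct nmmu_context {
	struct mmu_notifier mn;
	/* per-agent state needed to launch invalidations would live here */
};

static void nmmu_invalidate_range(struct mmu_notifier *mn,
				  struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	/* issue the nMMU/agent invalidation covering [start, end) */
}

static const struct mmu_notifier_ops nmmu_notifier_ops = {
	.invalidate_range	= nmmu_invalidate_range,
};

static int nmmu_attach_context(struct nmmu_context *ctx, struct mm_struct *mm)
{
	ctx->mn.ops = &nmmu_notifier_ops;
	return mmu_notifier_register(&ctx->mn, mm);
}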

Thanks,
Nick


Re: [PATCH] powerpc/mm: Fix missing mmap_sem release

2017-09-08 Thread Laurent Dufour
On 07/09/2017 22:51, Davidlohr Bueso wrote:
> On Thu, 07 Sep 2017, Laurent Dufour wrote:
> 
>> The commit b5c8f0fd595d ("powerpc/mm: Rework mm_fault_error()") reworked
>> the way the error path is managed in __do_page_fault(), but it was a bit too
>> aggressive in one case, returning without releasing the mmap_sem.
>>
>> While at it, replace current->mm->mmap_sem with mm->mmap_sem, since mm is
>> already set to current->mm.
>>
>> Fixes: b5c8f0fd595d ("powerpc/mm: Rework mm_fault_error()")
>> Cc: Benjamin Herrenschmidt 
>> Signed-off-by: Laurent Dufour 
>> ---
>> arch/powerpc/mm/fault.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 4797d08581ce..f799ccf37d27 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
> 
> But... here:
> 
> /*
>  * If we need to retry the mmap_sem has already been released,
>  * and if there is a fatal signal pending there is no guarantee
>  * that we made any progress. Handle this case first.
>  */
> 
>> @@ -521,10 +521,11 @@ static int __do_page_fault(struct pt_regs *regs,
>> unsigned long address,
>>  * User mode? Just return to handle the fatal exception otherwise
>>  * return to bad_page_fault
>>  */
>> +up_read(&mm->mmap_sem);
>> return is_user ? 0 : SIGBUS;
>> }
> 
> Per the above comment, for that case handle_mm_fault()
> has already released mmap_sem. The same occurs in x86,
> for example.
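> 
> That is, the pattern is roughly this (illustrative fragment, not the
> exact powerpc code):
> 
> 	fault = handle_mm_fault(vma, address, flags);
> 	if (unlikely(fault & VM_FAULT_RETRY)) {
> 		/*
> 		 * handle_mm_fault() has already dropped mmap_sem on this
> 		 * path, so returning without up_read() is correct; adding
> 		 * one here (as the patch did) would unlock twice.
> 		 */
> 		if (fatal_signal_pending(current))
> 			return is_user ? 0 : SIGBUS;
> 		/* ... otherwise retake mmap_sem and retry ... */
> 	}
> 	up_read(&mm->mmap_sem);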

Oops, my bad.

Please forget about this stupid patch...


Re: [PATCH v3 2/2] cxl: Enable global TLBIs for cxl contexts

2017-09-08 Thread Frederic Barrat



On 08/09/2017 at 08:56, Nicholas Piggin wrote:

On Sun,  3 Sep 2017 20:15:13 +0200
Frederic Barrat  wrote:


The PSL and nMMU need to see all TLB invalidations for the memory
contexts used on the adapter. For the hash memory model, it is done by
making all TLBIs global as soon as the cxl driver is in use. For
radix, we need something similar, but we can refine and only convert
to global the invalidations for contexts actually used by the device.

The new mm_context_add_copro() API increments the 'active_cpus' count
for the contexts attached to the cxl adapter. As soon as there's more
than 1 active cpu, the TLBIs for the context become global. Active cpu
count must be decremented when detaching to restore locality if
possible and to avoid overflowing the counter.

The hash memory model support is somewhat limited, as we can't
decrement the active cpus count when mm_context_remove_copro() is
called, because we can't flush the TLB for a mm on hash. So TLBIs
remain global on hash.
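
(For illustration, the new API boils down to roughly the following; the
implementation and the active_cpus bookkeeping are simplified
assumptions, not the exact code:)

static inline void mm_context_add_copro(struct mm_struct *mm)
{
	/* Count the coprocessor as an extra active "cpu" on this context,
	 * which is what makes the TLBIs for it global. */
	atomic_inc(&mm->context.active_cpus);
}

static inline void mm_context_remove_copro(struct mm_struct *mm)
{
	/*
	 * Flush the whole mm first so no stale translations remain in the
	 * nMMU/PSL, then drop the count (radix only -- on hash we cannot
	 * flush a single mm, so the count is never decremented there).
	 */
	flush_all_mm(mm);
	atomic_dec(&mm->context.active_cpus);
}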


Sorry I didn't look at this earlier and just wading in here a bit, but
what do you think of using mmu notifiers for invalidating nMMU and
coprocessor caches, rather than put the details into the host MMU
management? npu-dma.c already looks to have almost everything covered
with its notifiers (in that it wouldn't have to rely on tlbie coming
from host MMU code).


Does npu-dma.c really do mmio nMMU invalidations? My understanding was 
that those atsd_launch operations are really targeted at the device 
behind the NPU, i.e. the nvidia card.
At some point, it was not possible to do mmio invalidations on the nMMU. 
At least on dd1. I'm checking with the nMMU team the status on dd2.


Alistair: is your code really doing a nMMU invalidation? Considering 
you're trying to also reuse the mm_context_add_copro() from this patch, 
I think I know the answer.


There are also other components relying on broadcasted invalidations 
from hardware: the PSL (for capi FPGA) and the XSL on the Mellanox CX5 
card, when in capi mode. They rely on hardware TLBIs, snooped and 
forwarded to them by the CAPP.
For the PSL, we do have a mmio interface to do targeted invalidations, 
but it was removed from the capi architecture (and left as a debug 
feature for our PSL implementation), because the nMMU would be out of 
sync with the PSL (due to the lack of interface to sync the nMMU, as 
mentioned above).
For the XSL on the Mellanox CX5, it's even more complicated. AFAIK, they 
do have a way to trigger invalidations through software, though the 
interface is private and Mellanox would have to be involved. They've 
also stated the performance is much worse through software invalidation.


Another consideration is performance. Which is best? Short of having 
real numbers, it's probably hard to know for sure.


So the road of getting rid of hardware invalidations for external 
components, if at all possible or even desirable, may be long.


  Fred




This change is not too bad today, but if we get to more complicated
MMU/nMMU TLB management like directed invalidation of particular units,
then putting more knowledge into the host code will end up being
complex I think.

I also want to do optimizations on the core code that assume we
only have to take care of other CPUs, e.g.,

https://patchwork.ozlabs.org/patch/811068/

Or, another example, directed IPI invalidations from the mm_cpumask
bitmap.

I realize you want to get something merged! For the merge window and
backports this seems fine. I think it would be nice soon afterwards to
get nMMU knowledge out of the core code... Though I also realize that,
with our tlbie instruction that does everything, it may be tricky to
make a really optimal notifier.

Thanks,
Nick



Signed-off-by: Frederic Barrat 
Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
---
Changelog:
v3: don't decrement active cpus count with hash, as we don't know how to flush
v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
and PWCs (thanks to Ben)






Re: [PATCH] ASoC: fsl_ssi: Override bit clock rate based on slot number

2017-09-08 Thread Łukasz Majewski

Hi Nicolin,


On Thu, Sep 07, 2017 at 10:23:43PM -0700, Nicolin Chen wrote:

The set_sysclk() is now used to override the output bit clock rate.
But this is not a common way to implement a set_dai_sysclk(). And
this creates a problem when a general machine driver (simple-card
for example) tries to do set_dai_sysclk() by passing an input clock
rate for the baud clock instead of setting the bit clock rate as
fsl_ssi driver expected.

So this patch solves this problem by firstly removing set_sysclk()
since the hw_params() can calculate the bit clock rate. Secondly,
in order not to break those TDM use cases which previously might
have been using set_sysclk() to override the bit clock rate, this
patch changes the driver to override it based on the slot number.

The patch also removes an obsolete comment about the dir parameter.

Signed-off-by: Nicolin Chen 
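
For reference, the idea boils down to something like the sketch below
(illustrative only; the slot bookkeeping and the helper name are
assumptions, not the driver's real code):

#include <sound/pcm_params.h>

/*
 * Sketch: derive the bit clock in hw_params() from the TDM slot setup
 * instead of having set_sysclk() override it. "slots"/"slot_width"
 * stand in for whatever the driver saved in set_tdm_slot().
 */
static unsigned int fsl_ssi_calc_bclk(struct snd_pcm_hw_params *hw_params,
				      unsigned int slots,
				      unsigned int slot_width)
{
	unsigned int width = slot_width ? slot_width : params_width(hw_params);
	unsigned int nslots = slots ? slots : params_channels(hw_params);

	return params_rate(hw_params) * nslots * width;
}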


Forgot to mention, I think that it's better to wait for a couple of
Tested-by from those who use the TDM mode of SSI before applying it.


Although I'm not a PCM user, I've tested your patch and it works :-)

Tested-by: Łukasz Majewski 

Test HW: imx6q + tfa9879 codec



Thanks
Nicolin




--
Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: w...@denx.de


Unable to handle kernel paging request for unaligned access

2017-09-08 Thread Michal Sojka
Hi all,

commit 31bfdb036f12 ("powerpc: Use instruction emulation infrastructure
to handle alignment faults", 2017-08-30) breaks my MPC5200B system. Boot
log is below. Let me know if you need more information to debug the
problem.

Best regards,
-Michal Sojka


Linux version 4.13.0-rc2+ (wsh@steelpick) (gcc version 4.7.2 
(OSELAS.Toolchain-2012.12.1)) #33 Fri Sep 8 09:36:26 CEST 2017
Found initrd at 0xc7c38000:0xc7e5f499
Using mpc5200-simple-platform machine description
-
Hash_size = 0x0
phys_mem_size = 0x800
dcache_bsize  = 0x20
icache_bsize  = 0x20
cpu_features  = 0x00020460
  possible= 0x05a6fd77
  always  = 0x0002
cpu_user_features = 0x8c00 0x
mmu_features  = 0x0001
-
PCI host bridge /pci@fd00 (primary) ranges:
 MEM 0x8000..0x9fff -> 0x8000 Prefetch
 MEM 0xa000..0xafff -> 0xa000 
  IO 0xb000..0xb0ff -> 0x
Zone ranges:
  DMA  [mem 0x-0x07ff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x07ff]
Initmem setup node 0 [mem 0x-0x07ff]
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32512
Kernel command line: console=ttyPSC0,115200
PID hash table entries: 512 (order: -1, 2048 bytes)
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
Memory: 125136K/131072K available (1924K kernel code, 88K rwdata, 188K rodata, 
108K init, 187K bss, 5936K reserved, 0K cma-reserved)
Kernel virtual memory layout:
  * 0xfffdf000..0xf000  : fixmap
  * 0xfcffb000..0xfe00  : early ioremap
  * 0xc900..0xfcffb000  : vmalloc & ioremap
SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
MPC52xx PIC is up and running!
clocksource: timebase: mask: 0x max_cycles: 0x79c5e18f3, 
max_idle_ns: 440795202740 ns
clocksource: timebase mult[1e4d9365] shift[24] registered
console [ttyPSC0] enabled
pid_max: default: 4096 minimum: 301
Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 
764504178510 ns
random: get_random_u32 called from 0xc00d06f4 with crng_init=0
NET: Registered protocol family 16
mpc52xx_irqhost_map: Critical IRQ #3 is unsupported! Nopping it.
PCI: Probing PCI hardware
PCI host bridge to bus 0100:00
pci_bus 0100:00: root bus resource [io  0x-0xff]
pci_bus 0100:00: root bus resource [mem 0x8000-0x9fff pref]
pci_bus 0100:00: root bus resource [mem 0xa000-0xafff]
pci_bus 0100:00: root bus resource [bus 00-ff]
DMA: MPC52xx BestComm driver
Unable to handle kernel paging request for unaligned access at address 
0xc9018020
Faulting instruction address: 0xc0013a38
Oops: Kernel access of bad area, sig: 7 [#1]
BE mpc5200-simple-platform
CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-rc2+ #33
task: c781c000 task.stack: c782
NIP:  c0013a38 LR: c00f5d70 CTR: 000f
REGS: c7821ce0 TRAP: 0600   Not tainted  (4.13.0-rc2+)
MSR:  9032   CR: 4888  XER: 205f
DAR: c9018020 DSISR: 00017c07 
GPR00: 0007 c7821d90 c781c000 c9018000  0200 c901801c 0004 
GPR08: 0200 000f 0700 c9018000 2444  c00043c8  
GPR16:        001c 
GPR24: c024   c0230f10 c01f2808  c024 c78035a0 
Call Trace:
[c7821d90] [c00f5d2c] 0xc00f5d2c (unreliable)
[c7821de0] [c0d8] 0xc0d8
[c7821df0] [c010ff48] 0xc010ff48
[c7821e20] [c0110074] 0xc0110074
[c7821e40] [c010e4a4] 0xc010e4a4
[c7821e70] [c010ecd8] 0xc010ecd8
[c7821e90] [c01107fc] 0xc01107fc
[c7821ea0] [c02119b0] 0xc02119b0
[c7821f00] [c0211b70] 0xc0211b70
[c7821f30] [c00043e0] 0xc00043e0
[c7821f40] [c000e2e4] 0xc000e2e4
--- interrupt: 0 at   (null)
LR =   (null)
Instruction dump:
7d072a14 5509d97e 3529 40810038 68e0001c 5400f0bf 41820010 7c0903a6 
94860004 4200fffc 7d2903a6 38e4 <7c0737ec> 38c60020 4200fff8 550506fe 
---[ end trace 1e206a9c64fbd101 ]---

Kernel panic - not syncing: Attempted to kill init! exitcode=0x0007

Rebooting in 180 seconds..


Re: [PATCH v2 00/20] Speculative page faults

2017-09-08 Thread Laurent Dufour
On 21/08/2017 04:26, Sergey Senozhatsky wrote:
> Hello,
> 
> On (08/18/17 00:04), Laurent Dufour wrote:
>> This is a port on kernel 4.13 of the work done by Peter Zijlstra to
>> handle page fault without holding the mm semaphore [1].
>>
>> The idea is to try to handle user space page faults without holding the
>> mmap_sem. This should allow better concurrency for massively threaded
>> process since the page fault handler will not wait for other threads memory
>> layout change to be done, assuming that this change is done in another part
>> of the process's memory space. This type of page fault is named a speculative
>> page fault. If the speculative page fault fails because a concurrent change is
>> detected or because the underlying PMD or PTE tables are not yet allocated, it
>> fails its processing and a classic page fault is then tried.
>>
>> The speculative page fault (SPF) has to look for the VMA matching the fault
>> address without holding the mmap_sem, so the VMA list is now managed using
>> SRCU allowing lockless walking. The only impact would be the deferred file
>> derefencing in the case of a file mapping, since the file pointer is
>> released once the SRCU cleaning is done.  This patch relies on the change
>> done recently by Paul McKenney in SRCU which now runs a callback per CPU
>> instead of per SRCU structure [1].
>>
>> The VMA's attributes checked during the speculative page fault processing
>> have to be protected against parallel changes. This is done by using a per
>> VMA sequence lock. This sequence lock allows the speculative page fault
>> handler to fast check for parallel changes in progress and to abort the
>> speculative page fault in that case.
>>
>> Once the VMA is found, the speculative page fault handler would check for
>> the VMA's attributes to verify that the page fault has to be handled
>> correctly or not. Thus the VMA is protected through a sequence lock which
>> allows fast detection of concurrent VMA changes. If such a change is
>> detected, the speculative page fault is aborted and a *classic* page fault
>> is tried.  VMA sequence locks are added when VMA attributes which are
>> checked during the page fault are modified.
>>
>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>> so once the page table is locked, the VMA is valid, so any other changes
>> leading to touching this PTE will need to lock the page table, so no
>> parallel change is possible at this time.
> 
> [ 2311.315400] ==
> [ 2311.315401] WARNING: possible circular locking dependency detected
> [ 2311.315403] 4.13.0-rc5-next-20170817-dbg-00039-gaf11d7500492-dirty #1743 
> Not tainted
> [ 2311.315404] --
> [ 2311.315406] khugepaged/43 is trying to acquire lock:
> [ 2311.315407]  (&mapping->i_mmap_rwsem){}, at: [] 
> rmap_walk_file+0x5a/0x147
> [ 2311.315415] 
>but task is already holding lock:
> [ 2311.315416]  (fs_reclaim){+.+.}, at: [] 
> fs_reclaim_acquire+0x12/0x35
> [ 2311.315420] 
>which lock already depends on the new lock.
> 
> [ 2311.315422] 
>the existing dependency chain (in reverse order) is:
> [ 2311.315423] 
>-> #3 (fs_reclaim){+.+.}:
> [ 2311.315427]fs_reclaim_acquire+0x32/0x35
> [ 2311.315429]__alloc_pages_nodemask+0x8d/0x217
> [ 2311.315432]pte_alloc_one+0x13/0x5e
> [ 2311.315434]__pte_alloc+0x1f/0x83
> [ 2311.315436]move_page_tables+0x2c9/0x5ac
> [ 2311.315438]move_vma.isra.25+0xff/0x2a2
> [ 2311.315439]SyS_mremap+0x41b/0x49e
> [ 2311.315442]entry_SYSCALL_64_fastpath+0x18/0xad
> [ 2311.315443] 
>-> #2 (&vma->vm_sequence/1){+.+.}:
> [ 2311.315449]write_seqcount_begin_nested+0x1b/0x1d
> [ 2311.315451]__vma_adjust+0x1b7/0x5d6
> [ 2311.315453]__split_vma+0x142/0x1a3
> [ 2311.315454]do_munmap+0x128/0x2af
> [ 2311.315455]vm_munmap+0x5a/0x73
> [ 2311.315458]elf_map+0xb1/0xce
> [ 2311.315459]load_elf_binary+0x8e0/0x1348
> [ 2311.315462]search_binary_handler+0x70/0x1f3
> [ 2311.315464]load_script+0x1a6/0x1b5
> [ 2311.315466]search_binary_handler+0x70/0x1f3
> [ 2311.315468]do_execveat_common+0x461/0x691
> [ 2311.315471]kernel_init+0x5a/0xf0
> [ 2311.315472]ret_from_fork+0x27/0x40
> [ 2311.315473] 
>-> #1 (&vma->vm_sequence){+.+.}:
> [ 2311.315478]write_seqcount_begin_nested+0x1b/0x1d
> [ 2311.315480]__vma_adjust+0x19c/0x5d6
> [ 2311.315481]__split_vma+0x142/0x1a3
> [ 2311.315482]do_munmap+0x128/0x2af
> [ 2311.315484]vm_munmap+0x5a/0x73
> [ 2311.315485]elf_map+0xb1/0xce
> [ 2311.315487]load_elf_binary+0x8e0/0x1348
> [ 2311.315489]search_binary_handler+0x70/0x1f3
> [ 2311.315490]load_script+0x1a6/0x1b5
> [ 2311.315492]search_binary_h

Re: Machine Check in P2010(e500v2)

2017-09-08 Thread Joakim Tjernlund
On Thu, 2017-09-07 at 18:54 +, Leo Li wrote:
> > -Original Message-
> > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > Sent: Thursday, September 07, 2017 3:41 AM
> > To: linuxppc-dev@lists.ozlabs.org; Leo Li ; York Sun
> > 
> > Subject: Re: Machine Check in P2010(e500v2)
> > 
> > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> > > On Wed, 2017-09-06 at 21:13 +, Leo Li wrote:
> > > > > -Original Message-
> > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > > > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ;
> > > > > York Sun 
> > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > 
> > > > > On Wed, 2017-09-06 at 20:28 +, Leo Li wrote:
> > > > > > > -Original Message-
> > > > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li
> > > > > > > ; York Sun 
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > 
> > > > > > > On Wed, 2017-09-06 at 19:31 +, Leo Li wrote:
> > > > > > > > > -Original Message-
> > > > > > > > > From: York Sun
> > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > > > To: Joakim Tjernlund ;
> > > > > > > > > linuxppc- d...@lists.ozlabs.org; Leo Li
> > > > > > > > > 
> > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > 
> > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > > > 
> > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > > > So after some debugging I found this bug:
> > > > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > > > >  if (is_in_pci_mem_space(addr)) {
> > > > > > > > > >  if (user_mode(regs)) {
> > > > > > > > > >  pagefault_disable();
> > > > > > > > > > -   ret = get_user(regs->nip, &inst);
> > > > > > > > > > +   ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > > > >  pagefault_enable();
> > > > > > > > > >  } else {
> > > > > > > > > >  ret =
> > > > > > > > > > probe_kernel_address(regs->nip, inst);
> > > > > > > > > > 
> > > > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > > > Now I wonder why this fixup is there in the first place?
> > > > > > > > > > The routine will not really fixup the insn, just return
> > > > > > > > > > 0x for the failing read and then advance the 
> > > > > > > > > > process NIP.
> > > > > > > > 
> > > > > > > > You are right.  The code here only gives 0x to the
> > > > > > > > load instructions and continues with the next instruction when the
> > > > > > > > load instruction is causing the machine check.  This will prevent a
> > > > > > > > system lockup when reading from a PCI/RapidIO device which is link down.
> > > > > > > > 
> > > > > > > > I don't know what the actual problem is in your case.  Maybe it
> > > > > > > > is a write instruction instead of a read?  Or is the code in an
> > > > > > > > infinite loop waiting for a valid read result?  Are you able to do
> > > > > > > > some further debugging with the NIP correctly printed?
> > > > > > > > 
> > > > > > > 
> > > > > > > According to the MC it is a Read and the NIP also leads to a
> > > > > > > read in the program.
> > > > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > > > Question, is it safe to add a small printk when this MC
> > > > > > > happens (after fixing up)? I need to see that it has happened
> > > > > > > as the error is somewhat random.
> > > > > > 
> > > > > > I think it is safe to add printk as the current machine check
> > > > > > handlers are also using printk.
> > > > > 
> > > > > I hope so, but if the fixup fires there is no printk at all so I was 
> > > > > a bit unsure.
> > > > > Don't like this fixup though, is there not a better way than
> > > > > faking a read to user space (or kernel for that matter)?
> > > > 
> > > > I don't have a better idea.  Without the fixup, the offending load
> > > > instruction will never finish if there is anything wrong with the backing
> > > > device and freeze the whole system.  Do you have any suggestion in mind?
> > > > 
> > > 
> > > But it never finishes the load, it just fakes a load of 0xf,
> > > for user space I'd rather have it signal a SIGBUS, but that does not seem
> > > to work either, at least not for us, but that could be a bug in the general
> > > MC code maybe.
> > > This fixup migh

Re: [PATCH v2 2/2] powerpc/powernv/npu: Don't explicitly flush nmmu tlb

2017-09-08 Thread kbuild test robot
Hi Alistair,

[auto build test ERROR on powerpc/next]
[also build test ERROR on next-20170907]
[cannot apply to v4.13]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Alistair-Popple/powerpc-npu-Use-flush_all_mm-instead-of-flush_tlb_mm/20170908-080828
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/platforms/powernv/npu-dma.c: In function 'mmio_invalidate':
   arch/powerpc/platforms/powernv/npu-dma.c:555:3: error: implicit declaration 
of function 'flush_all_mm' [-Werror=implicit-function-declaration]
  flush_all_mm(npu_context->mm);
  ^~~~
   arch/powerpc/platforms/powernv/npu-dma.c: In function 
'pnv_npu2_init_context':
>> arch/powerpc/platforms/powernv/npu-dma.c:744:3: error: implicit declaration 
>> of function 'mm_context_add_copro' [-Werror=implicit-function-declaration]
  mm_context_add_copro(mm);
  ^~~~
   arch/powerpc/platforms/powernv/npu-dma.c: In function 
'pnv_npu2_release_context':
>> arch/powerpc/platforms/powernv/npu-dma.c:758:3: error: implicit declaration 
>> of function 'mm_context_remove_copro' [-Werror=implicit-function-declaration]
  mm_context_remove_copro(npu_context->mm);
  ^~~
   cc1: some warnings being treated as errors

vim +/mm_context_add_copro +744 arch/powerpc/platforms/powernv/npu-dma.c

   534  
   535  /*
   536   * Invalidate either a single address or an entire PID depending on
   537   * the value of va.
   538   */
   539  static void mmio_invalidate(struct npu_context *npu_context, int va,
   540  unsigned long address, bool flush)
   541  {
   542  int i, j;
   543  struct npu *npu;
   544  struct pnv_phb *nphb;
   545  struct pci_dev *npdev;
   546  struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
   547  unsigned long pid = npu_context->mm->context.id;
   548  
   549  if (npu_context->nmmu_flush)
   550  /*
   551   * Unfortunately the nest mmu does not support flushing 
specific
   552   * addresses so we have to flush the whole mm once 
before
   553   * shooting down the GPU translation.
   554   */
 > 555  flush_all_mm(npu_context->mm);
   556  
   557  /*
   558   * Loop over all the NPUs this process is active on and launch
   559   * an invalidate.
   560   */
   561  for (i = 0; i <= max_npu2_index; i++) {
   562  mmio_atsd_reg[i].reg = -1;
   563  for (j = 0; j < NV_MAX_LINKS; j++) {
   564  npdev = npu_context->npdev[i][j];
   565  if (!npdev)
   566  continue;
   567  
   568  nphb = 
pci_bus_to_host(npdev->bus)->private_data;
   569  npu = &nphb->npu;
   570  mmio_atsd_reg[i].npu = npu;
   571  
   572  if (va)
   573  mmio_atsd_reg[i].reg =
   574  mmio_invalidate_va(npu, 
address, pid,
   575  flush);
   576  else
   577  mmio_atsd_reg[i].reg =
   578  mmio_invalidate_pid(npu, pid, 
flush);
   579  
   580  /*
   581   * The NPU hardware forwards the shootdown to 
all GPUs
   582   * so we only have to launch one shootdown per 
NPU.
   583   */
   584  break;
   585  }
   586  }
   587  
   588  mmio_invalidate_wait(mmio_atsd_reg, flush);
   589  if (flush)
   590  /* Wait for the flush to complete */
   591  mmio_invalidate_wait(mmio_atsd_reg, false);
   592  }
   593  
   594  static void pnv_npu2_mn_release(struct mmu_notifier *mn,
   595  struct mm_struct *mm)
   596  {
   597  struct npu_context *npu_context = mn_to_npu_context(mn);
   598  
   599  /* Call into device driver to stop requests to the NMMU */
   600  if (npu_con

Re: [PATCH] ASoC: fsl_ssi: Override bit clock rate based on slot number

2017-09-08 Thread Arnaud Mouiche


On 08/09/2017 07:42, Nicolin Chen wrote:

On Thu, Sep 07, 2017 at 10:23:43PM -0700, Nicolin Chen wrote:

The set_sysclk() is now used to override the output bit clock rate.
But this is not a common way to implement a set_dai_sysclk(). And
this creates a problem when a general machine driver (simple-card
for example) tries to do set_dai_sysclk() by passing an input clock
rate for the baud clock instead of setting the bit clock rate as
fsl_ssi driver expected.

So this patch solves this problem by firstly removing set_sysclk()
since the hw_params() can calculate the bit clock rate. Secondly,
in order not to break those TDM use cases which previously might
have been using set_sysclk() to override the bit clock rate, this
patch changes the driver to override it based on the slot number.

The patch also removes an obsolete comment about the dir parameter.

Signed-off-by: Nicolin Chen 

Forgot to mention, I think that it's better to wait for a couple of
Tested-by from those who use the TDM mode of SSI before applying it.

I can check next Monday or Tuesday.
Arnaud

Thanks
Nicolin




Re: [PATCH v3 2/2] cxl: Enable global TLBIs for cxl contexts

2017-09-08 Thread Nicholas Piggin
On Fri, 8 Sep 2017 09:34:39 +0200
Frederic Barrat  wrote:

> On 08/09/2017 at 08:56, Nicholas Piggin wrote:
> > On Sun,  3 Sep 2017 20:15:13 +0200
> > Frederic Barrat  wrote:
> >   
> >> The PSL and nMMU need to see all TLB invalidations for the memory
> >> contexts used on the adapter. For the hash memory model, it is done by
> >> making all TLBIs global as soon as the cxl driver is in use. For
> >> radix, we need something similar, but we can refine and only convert
> >> to global the invalidations for contexts actually used by the device.
> >>
> >> The new mm_context_add_copro() API increments the 'active_cpus' count
> >> for the contexts attached to the cxl adapter. As soon as there's more
> >> than 1 active cpu, the TLBIs for the context become global. Active cpu
> >> count must be decremented when detaching to restore locality if
> >> possible and to avoid overflowing the counter.
> >>
> >> The hash memory model support is somewhat limited, as we can't
> >> decrement the active cpus count when mm_context_remove_copro() is
> >> called, because we can't flush the TLB for a mm on hash. So TLBIs
> >> remain global on hash.  
> > 
> > Sorry I didn't look at this earlier and just wading in here a bit, but
> > what do you think of using mmu notifiers for invalidating nMMU and
> > coprocessor caches, rather than put the details into the host MMU
> > management? npu-dma.c already looks to have almost everything covered
> > with its notifiers (in that it wouldn't have to rely on tlbie coming
> > from host MMU code).  
> 
> Does npu-dma.c really do mmio nMMU invalidations?

No, but it does do a flush_tlb_mm there to do a tlbie (probably
buggy in some cases and does tlbiel without this patch of yours).
But the point is when you control the flushing you don't have to
mess with making the core flush code give you tlbies.

Just add a flush_nmmu_mm or something that does what you need.

If you can make a more targeted nMMU invalidate, then that's
even better.
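
Something like the sketch below is all I mean (the name is made up, and
a real version would have to guarantee a broadcast invalidation that
the nMMU actually snoops):

void flush_nmmu_mm(struct mm_struct *mm)
{
	/*
	 * Sketch only: whatever the nMMU needs -- e.g. the new whole-mm
	 * flush, with the caveat that it must end up as a global tlbie
	 * (not tlbiel) so the nest MMU and agents see it.
	 */
	flush_all_mm(mm);
}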

One downside, I thought at first, is that the core code might already
do a broadcast tlbie; the mmu notifier does not easily know about that,
so it will do a second one, which would be suboptimal.

Possibly we could add some flag or state so the nmmu flush can
avoid the second one.

But now that I look again, the NPU code has this comment:

/*
 * Unfortunately the nest mmu does not support flushing specific
 * addresses so we have to flush the whole mm.
 */

Which seems to indicate that you can't rely on core code to give
you full flushes because for range flushing it is possible that the
core code will do it with address flushes. Or am I missing something?

So it seems you really do need to always issue a full PID tlbie from
a notifier.

> My understanding was 
> that those atsd_launch operations are really targeted at the device 
> behind the NPU, i.e. the nvidia card.
> At some point, it was not possible to do mmio invalidations on the nMMU. 
> At least on dd1. I'm checking with the nMMU team the status on dd2.
> 
> Alistair: is your code really doing a nMMU invalidation? Considering 
> you're trying to also reuse the mm_context_add_copro() from this patch, 
> I think I know the answer.
> 
> There are also other components relying on broadcasted invalidations 
> from hardware: the PSL (for capi FPGA) and the XSL on the Mellanox CX5 
> card, when in capi mode. They rely on hardware TLBIs, snooped and 
> forwarded to them by the CAPP.
> For the PSL, we do have a mmio interface to do targeted invalidations, 
> but it was removed from the capi architecture (and left as a debug 
> feature for our PSL implementation), because the nMMU would be out of 
> sync with the PSL (due to the lack of interface to sync the nMMU, as 
> mentioned above).
> For the XSL on the Mellanox CX5, it's even more complicated. AFAIK, they 
> do have a way to trigger invalidations through software, though the 
> interface is private and Mellanox would have to be involved. They've 
> also stated the performance is much worse through software invalidation.

Okay, the point is that I think the nMMU and agent drivers will be in a
better position to handle all that. I don't see that flushing from your notifier
means that you can't issue a tlbie to do it.

> 
> Another consideration is performance. Which is best? Short of having 
> real numbers, it's probably hard to know for sure.

Let's come to that if we agree on a way to go. I *think* we can make it
at least no worse than we have today, using tlbie and possibly some small
changes to generic code callers.

Thanks,
Nick


Re: [PATCH v3 2/2] cxl: Enable global TLBIs for cxl contexts

2017-09-08 Thread Nicholas Piggin
On Fri, 8 Sep 2017 20:54:02 +1000
Nicholas Piggin  wrote:

> On Fri, 8 Sep 2017 09:34:39 +0200
> Frederic Barrat  wrote:
> 
> > On 08/09/2017 at 08:56, Nicholas Piggin wrote:
> > > On Sun,  3 Sep 2017 20:15:13 +0200
> > > Frederic Barrat  wrote:
> > > 
> > >> The PSL and nMMU need to see all TLB invalidations for the memory
> > >> contexts used on the adapter. For the hash memory model, it is done by
> > >> making all TLBIs global as soon as the cxl driver is in use. For
> > >> radix, we need something similar, but we can refine and only convert
> > >> to global the invalidations for contexts actually used by the device.
> > >>
> > >> The new mm_context_add_copro() API increments the 'active_cpus' count
> > >> for the contexts attached to the cxl adapter. As soon as there's more
> > >> than 1 active cpu, the TLBIs for the context become global. Active cpu
> > >> count must be decremented when detaching to restore locality if
> > >> possible and to avoid overflowing the counter.
> > >>
> > >> The hash memory model support is somewhat limited, as we can't
> > >> decrement the active cpus count when mm_context_remove_copro() is
> > >> called, because we can't flush the TLB for a mm on hash. So TLBIs
> > >> remain global on hash.
> > > 
> > > Sorry I didn't look at this earlier and just wading in here a bit, but
> > > what do you think of using mmu notifiers for invalidating nMMU and
> > > coprocessor caches, rather than put the details into the host MMU
> > > management? npu-dma.c already looks to have almost everything covered
> > > with its notifiers (in that it wouldn't have to rely on tlbie coming
> > > from host MMU code).
> > 
> > Does npu-dma.c really do mmio nMMU invalidations?  
> 
> No, but it does do a flush_tlb_mm there to do a tlbie (probably
> buggy in some cases and does tlbiel without this patch of yours).
> But the point is when you control the flushing you don't have to
> mess with making the core flush code give you tlbies.
> 
> Just add a flush_nmmu_mm or something that does what you need.
> 
> If you can make a more targeted nMMU invalidate, then that's
> even better.
> 
> One downside, I thought at first, is that the core code might already
> do a broadcast tlbie; the mmu notifier does not easily know about that,
> so it will do a second one, which would be suboptimal.
> 
> Possibly we could add some flag or state so the nmmu flush can
> avoid the second one.
> 
> But now that I look again, the NPU code has this comment:
> 
> /*
>  * Unfortunately the nest mmu does not support flushing specific
>  * addresses so we have to flush the whole mm.
>  */
> 
> Which seems to indicate that you can't rely on core code to give
> you full flushes because for range flushing it is possible that the
> core code will do it with address flushes. Or am I missing something?
> 
> So it seems you really do need to always issue a full PID tlbie from
> a notifier.

Oh I see, actually it's fixed in newer firmware and there's a patch
out for it.

Okay, so the nMMU can take an address tlbie; in that case it's not a
correctness issue (except for old firmware that still has the bug).


[PATCH V3] cxl: Add support for POWER9 DD2

2017-09-08 Thread Christophe Lombard
The PSL initialization sequence has been updated to DD2.
This patch adapts to the changes, retaining compatibility with DD1.
The patch includes some changes to DD1 fix-ups as well.

Tests were performed on some of the old and new hardware.

The function is_page_fault(), for POWER9, lists the Translation Checkout
Responses where the page fault will be handled by copro_handle_mm_fault().
This list is too restrictive and not necessary.

This patch removes this restriction, and all page faults, whatever the
reason, will be handled. In this case, the interrupt is always
acknowledged.

The following features will be added soon:
- phb reset when switching to capi mode.
- cxllib update to support new functions.

Signed-off-by: Christophe Lombard 

Acked-by: Frederic Barrat 

---
Changelog[v3]
 - Update commit message

Changelog[v2]
 - Rebase to latest upstream.
 - Update the function is_page_fault()
---
 drivers/misc/cxl/cxl.h   |  2 ++
 drivers/misc/cxl/fault.c | 15 ++-
 drivers/misc/cxl/pci.c   | 47 ---
 3 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index b1afecc..0167df8 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -100,6 +100,8 @@ static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
 static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
 /* PSL registers - CAIA 2 */
 static const cxl_p1_reg_t CXL_PSL9_CONTROL  = {0x0020};
+static const cxl_p1_reg_t CXL_XSL9_INV  = {0x0110};
+static const cxl_p1_reg_t CXL_XSL9_DEF  = {0x0140};
 static const cxl_p1_reg_t CXL_XSL9_DSNCTL   = {0x0168};
 static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300};
 static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308};
diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 6eed7d0..0cf7f4a 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -204,22 +204,11 @@ static bool cxl_is_segment_miss(struct cxl_context *ctx, 
u64 dsisr)
 
 static bool cxl_is_page_fault(struct cxl_context *ctx, u64 dsisr)
 {
-   u64 crs; /* Translation Checkout Response Status */
-
if ((cxl_is_power8()) && (dsisr & CXL_PSL_DSISR_An_DM))
return true;
 
-   if (cxl_is_power9()) {
-   crs = (dsisr & CXL_PSL9_DSISR_An_CO_MASK);
-   if ((crs == CXL_PSL9_DSISR_An_PF_SLR) ||
-   (crs == CXL_PSL9_DSISR_An_PF_RGC) ||
-   (crs == CXL_PSL9_DSISR_An_PF_RGP) ||
-   (crs == CXL_PSL9_DSISR_An_PF_HRH) ||
-   (crs == CXL_PSL9_DSISR_An_PF_STEG) ||
-   (crs == CXL_PSL9_DSISR_An_URTCH)) {
-   return true;
-   }
-   }
+   if (cxl_is_power9())
+   return true;
 
return false;
 }
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index d18b3d9..a5baab3 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -401,7 +401,8 @@ int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
*capp_unit_id = get_capp_unit_id(np, *phb_index);
of_node_put(np);
if (!*capp_unit_id) {
-   pr_err("cxl: invalid capp unit id\n");
+   pr_err("cxl: invalid capp unit id (phb_index: %d)\n",
+  *phb_index);
return -ENODEV;
}
 
@@ -475,37 +476,37 @@ static int init_implementation_adapter_regs_psl9(struct 
cxl *adapter,
psl_fircntl |= 0x1ULL; /* ce_thresh */
cxl_p1_write(adapter, CXL_PSL9_FIR_CNTL, psl_fircntl);
 
-   /* vccredits=0x1  pcklat=0x4 */
-   cxl_p1_write(adapter, CXL_PSL9_DSNDCTL, 0x1810ULL);
-
-   /*
-* For debugging with trace arrays.
-* Configure RX trace 0 segmented mode.
-* Configure CT trace 0 segmented mode.
-* Configure LA0 trace 0 segmented mode.
-* Configure LA1 trace 0 segmented mode.
+   /* Setup the PSL to transmit packets on the PCIe before the
+* CAPP is enabled
 */
-   cxl_p1_write(adapter, CXL_PSL9_TRACECFG, 0x804080008000ULL);
-   cxl_p1_write(adapter, CXL_PSL9_TRACECFG, 0x804080008003ULL);
-   cxl_p1_write(adapter, CXL_PSL9_TRACECFG, 0x804080008005ULL);
-   cxl_p1_write(adapter, CXL_PSL9_TRACECFG, 0x804080008006ULL);
+   cxl_p1_write(adapter, CXL_PSL9_DSNDCTL, 0x000100102A10ULL);
 
/*
 * A response to an ASB_Notify request is returned by the
 * system as an MMIO write to the address defined in
-* the PSL_TNR_ADDR register
+* the PSL_TNR_ADDR register.
+* keep the Reset Value: 0x0002E000
 */
-   /* PSL_TNR_ADDR */
 
-   /* NORST */
-   cxl_p1_write(adapter, CXL_PSL9_DEBUG, 0x8000ULL);
+   /* Enable XSL rty limit */
+   cxl_p1_write(adapter, CXL_XSL9_DEF, 0x51F80005ULL);
+
+   /* Change XSL_INV dummy read threshold */
+   cxl_p1_write(adapter, 

Re: Machine Check in P2010(e500v2)

2017-09-08 Thread Joakim Tjernlund
On Fri, 2017-09-08 at 11:54 +0200, Joakim Tjernlund wrote:
> On Thu, 2017-09-07 at 18:54 +, Leo Li wrote:
> > > -Original Message-
> > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > > Sent: Thursday, September 07, 2017 3:41 AM
> > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ; York Sun
> > > 
> > > Subject: Re: Machine Check in P2010(e500v2)
> > > 
> > > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> > > > On Wed, 2017-09-06 at 21:13 +, Leo Li wrote:
> > > > > > -Original Message-
> > > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > > > > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ;
> > > > > > York Sun 
> > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > 
> > > > > > On Wed, 2017-09-06 at 20:28 +, Leo Li wrote:
> > > > > > > > -Original Message-
> > > > > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li
> > > > > > > > ; York Sun 
> > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > 
> > > > > > > > On Wed, 2017-09-06 at 19:31 +, Leo Li wrote:
> > > > > > > > > > -Original Message-
> > > > > > > > > > From: York Sun
> > > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > > > > To: Joakim Tjernlund ;
> > > > > > > > > > linuxppc- d...@lists.ozlabs.org; Leo Li
> > > > > > > > > > 
> > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > > 
> > > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > > > > 
> > > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > > > > So after some debugging I found this bug:
> > > > > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > > > > >  if (is_in_pci_mem_space(addr)) {
> > > > > > > > > > >  if (user_mode(regs)) {
> > > > > > > > > > >  pagefault_disable();
> > > > > > > > > > > -   ret = get_user(regs->nip, &inst);
> > > > > > > > > > > +   ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > > > > >  pagefault_enable();
> > > > > > > > > > >  } else {
> > > > > > > > > > >  ret =
> > > > > > > > > > > probe_kernel_address(regs->nip, inst);
> > > > > > > > > > > 
> > > > > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > > > > Now I wonder why this fixup is there in the first place?
> > > > > > > > > > > The routine will not really fixup the insn, just return
> > > > > > > > > > > 0x for the failing read and then advance the 
> > > > > > > > > > > process NIP.
> > > > > > > > > 
> > > > > > > > > You are right.  The code here only gives 0x to the
> > > > > > > > > load instructions and continues with the next instruction when the
> > > > > > > > > load instruction is causing the machine check.  This will prevent a
> > > > > > > > > system lockup when reading from a PCI/RapidIO device which is link down.
> > > > > > > > > 
> > > > > > > > > I don't know what the actual problem is in your case.  Maybe it
> > > > > > > > > is a write instruction instead of a read?  Or is the code in an
> > > > > > > > > infinite loop waiting for a valid read result?  Are you able to do
> > > > > > > > > some further debugging with the NIP correctly printed?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > According to the MC it is a Read and the NIP also leads to a
> > > > > > > > read in the program.
> > > > > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > > > > Question, is it safe to add a small printk when this MC
> > > > > > > > happens (after fixing up)? I need to see that it has happened
> > > > > > > > as the error is somewhat random.
> > > > > > > 
> > > > > > > I think it is safe to add printk as the current machine check
> > > > > > > handlers are also using printk.
> > > > > > 
> > > > > > I hope so, but if the fixup fires there is no printk at all so I 
> > > > > > was a bit unsure.
> > > > > > Don't like this fixup though, is there not a better way than
> > > > > > faking a read to user space (or kernel for that matter)?
> > > > > 
> > > > > I don't have a better idea.  Without the fixup, the offending load
> > > > > instruction will never finish if there is anything wrong with the backing
> > > > > device and freeze the whole system.  Do you have any suggestion in mind?
> > > > > 
> > > > 
> > > 

[PATCH v3 00/20] Speculative page faults

2017-09-08 Thread Laurent Dufour
This is a port on kernel 4.13 of the work done by Peter Zijlstra to
handle page fault without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will not wait for other threads'
memory layout changes to be done, assuming that such a change is done in
another part of the process's memory space. This type of page fault is
named a speculative page fault. If the speculative page fault fails because
a concurrent change is detected or because the underlying PMD or PTE tables
are not yet allocated, its processing fails and a classic page fault is
then tried.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem, so the VMA list is now managed using
SRCU, allowing lockless walking. The only impact would be the deferred file
dereferencing in the case of a file mapping, since the file pointer is
released once the SRCU cleanup is done.  This patch relies on the change
done recently by Paul McKenney in SRCU, which now runs a callback per CPU
instead of per SRCU structure [1].
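
For illustration, the lookup side of the speculative path becomes roughly
the following (simplified sketch; the helper and variable names are not
necessarily the ones used in the series):

	int idx;
	struct vm_area_struct *vma;

	/* Walk the VMA list under SRCU instead of taking mmap_sem. */
	idx = srcu_read_lock(&vma_srcu);	/* SRCU domain protecting the VMAs */
	vma = find_vma_srcu(mm, address);	/* lockless VMA lookup */
	/* ... speculative handling, validated against the VMA sequence count ... */
	srcu_read_unlock(&vma_srcu, idx);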

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a per
VMA sequence lock. This sequence lock allows the speculative page fault
handler to fast check for parallel changes in progress and to abort the
speculative page fault in that case.

Once the VMA is found, the speculative page fault handler would check for
the VMA's attributes to verify that the page fault has to be handled
correctly or not. Thus the VMA is protected through a sequence lock which
allows fast detection of concurrent VMA changes. If such a change is
detected, the speculative page fault is aborted and a *classic* page fault
is tried.  VMA sequence locks are added when VMA attributes which are
checked during the page fault are modified.

When the PTE is fetched, the VMA is checked again to see if it has been
changed, so once the page table is locked the VMA is known to be valid;
any other change touching this PTE would need to lock the page table, so
no parallel change is possible at this time.
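
Concretely, the validation pattern is along these lines (simplified
sketch, not the exact helpers from the series):

	unsigned int seq;

	/* Sample the per-VMA sequence count before using the VMA fields. */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		goto classic_fault;	/* a writer is touching the VMA right now */

	/* ... read vma->vm_flags, vma->vm_page_prot, walk the page tables ... */

	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto classic_fault;	/* concurrent VMA change: abort speculation */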

Compared to Peter's initial work, this series introduces a spin_trylock
when dealing with the speculative page fault. This is required to avoid a
deadlock when handling a page fault while a TLB invalidation is requested
by another CPU holding the PTE. Another change is due to a lock dependency
issue with mapping->i_mmap_rwsem.

In addition, some VMA field values which are used once the PTE is unlocked
at the end of the page fault path are saved into the vm_fault structure, so
that the values matching the VMA at the time the PTE was locked are used.

This series only supports VMAs with no vm_ops defined, so huge pages and
mapped files are not managed with the speculative path. In addition,
transparent huge pages are not supported. Once this series is accepted
upstream I'll extend the support to mapped files and transparent huge pages.

This series builds on top of v4.13.9-mm1 and is functional on x86 and
PowerPC.

Tests have been made using a large commercial in-memory database on a
PowerPC system with 752 CPUs, using a previous version of this series
(RFC v5). The results are very encouraging, since the loading of the 2TB
database was 14% faster with the speculative page fault.

Using the ebizzy test [3], which spawns a lot of threads, the results are
good when running on either a large or a small system. When using kernbench,
the results are quite similar, which is expected as not that many
multithreaded processes are involved. But there is no performance
degradation either, which is good.

--
Benchmark results

Note these tests have been made on top of 4.13.0-mm1.

Ebizzy:
---
The test counts the number of records per second it can manage; the higher
the better. I run it like this: 'ebizzy -mTRp'. To get consistent results I
repeated the test 100 times and measured the average result, mean deviation,
max and min.

- 16 CPUs x86 VM
Records/s        4.13.0-mm1   4.13.0-mm1-spf   delta
Average          13217.90     65765.94         +397.55%
Mean deviation   690.37       2609.36          +277.97%
Max              16726        77675            +364.40%
Min              12194        61634            +405.45%

- 80 CPUs Power 8 node:
Records/s        4.13.0-mm1   4.13.0-mm1-spf   delta
Average          38175.40     67635.55         +77.17%
Mean deviation   600.09       2349.66          +291.55%
Max              39563        74292            +87.78%
Min              35846        62657            +74.79%

The number of records per second is far better with the speculative page
fault.
The mean deviation is higher with the speculative page fault, maybe
because sometimes the faults are not handled in a speculative way, leading
to more variation.
The numbers for the x86 guest are really i

[PATCH v3 01/20] mm: Don't assume page-table invariance during faults

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) 
---
 mm/memory.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ec4e15494901..30bccfa00630 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2270,30 +2270,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
-{
-   int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-   if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
-   }
-#endif
-   pte_unmap(page_table);
-   return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
 {
debug_dma_assert_idle(src);
@@ -2854,11 +2830,6 @@ int do_swap_page(struct vm_fault *vmf)
 
if (vma_readahead)
page = swap_readahead_detect(vmf, &swap_ra);
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
-   if (page)
-   put_page(page);
-   goto out;
-   }
 
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
-- 
2.7.4



Re: [PATCH v3 00/20] Speculative page faults

2017-09-08 Thread Laurent Dufour
On 08/09/2017 19:32, Laurent Dufour wrote:
> This is a port on kernel 4.13 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore [1].

Sorry for the noise, I had trouble sending the whole series through this
email. I will try again.

Cheers,
Laurent.



[PATCH v3 00/20] Speculative page faults

2017-09-08 Thread Laurent Dufour
This is a port on kernel 4.13 of the work done by Peter Zijlstra to
handle page fault without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
process since the page fault handler will not wait for other threads memory
layout change to be done, assuming that this change is done in another part
of the process's memory space. This type page fault is named speculative
page fault. If the speculative page fault fails because of a concurrency is
detected or because underlying PMD or PTE tables are not yet allocating, it
is failing its processing and a classic page fault is then tried.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem, so the VMA list is now managed using
SRCU allowing lockless walking. The only impact would be the deferred file
derefencing in the case of a file mapping, since the file pointer is
released once the SRCU cleaning is done.  This patch relies on the change
done recently by Paul McKenney in SRCU which now runs a callback per CPU
instead of per SRCU structure [1].

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a per
VMA sequence lock. This sequence lock allows the speculative page fault
handler to fast check for parallel changes in progress and to abort the
speculative page fault in that case.

Once the VMA is found, the speculative page fault handler would check for
the VMA's attributes to verify that the page fault has to be handled
correctly or not. Thus the VMA is protected through a sequence lock which
allows fast detection of concurrent VMA changes. If such a change is
detected, the speculative page fault is aborted and a *classic* page fault
is tried.  VMA sequence locks are added when VMA attributes which are
checked during the page fault are modified.

When the PTE is fetched, the VMA is checked to see if it has been changed,
so once the page table is locked, the VMA is valid, so any other changes
leading to touching this PTE will need to lock the page table, so no
parallel change is possible at this time.

Compared to the Peter's initial work, this series introduces a spin_trylock
when dealing with speculative page fault. This is required to avoid dead
lock when handling a page fault while a TLB invalidate is requested by an
other CPU holding the PTE. Another change due to a lock dependency issue
with mapping->i_mmap_rwsem.

In addition some VMA field values which are used once the PTE is unlocked
at the end the page fault path are saved into the vm_fault structure to
used the values matching the VMA at the time the PTE was locked.

This series only support VMA with no vm_ops define, so huge page and mapped
file are not managed with the speculative path. In addition transparent
huge page are not supported. Once this series will be accepted upstream
I'll extend the support to mapped files, and transparent huge pages.

This series builds on top of v4.13.9-mm1 and is functional on x86 and
PowerPC.

Tests have been made using a large commercial in-memory database on a
PowerPC system with 752 CPU using RFC v5 using a previous version of this
series. The results are very encouraging since the loading of the 2TB
database was faster by 14% with the speculative page fault.

Using ebizzy test [3], which spreads a lot of threads, the result are good
when running on both a large or a small system. When using kernbench, the
result are quite similar which expected as not so much multithreaded
processes are involved. But there is no performance degradation neither
which is good.

--
Benchmarks results

Note these test have been made on top of 4.13.0-mm1.

Ebizzy:
---
The test is counting the number of records per second it can manage, the
higher is the best. I run it like this 'ebizzy -mTRp'. To get consistent
result I repeated the test 100 times and measure the average result, mean
deviation, max and min.

- 16 CPUs x86 VM
Records/s   4.13.0-mm1  4.13.0-mm1-spf  delta
Average 13217.9065765.94+397.55%
Mean deviation  690.37  2609.36 +277.97%
Max 16726   77675   +364.40%
Min 12194   616340  +405.45%

- 80 CPUs Power 8 node:
Records/s        4.13.0-mm1   4.13.0-mm1-spf   delta
Average          38175.40     67635.55         +77.17%
Mean deviation   600.09       2349.66          +291.55%
Max              39563        74292            +87.78%
Min              35846        62657            +74.79%

The number of records per second is far better with the speculative page
fault.
The mean deviation is higher with the speculative page fault, maybe
because sometimes the faults are not handled in a speculative way, leading
to more variation.
The numbers for the x86 guest are really impressive.

[PATCH v3 01/20] mm: Dont assume page-table invariance during faults

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) 
---
 mm/memory.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ec4e15494901..30bccfa00630 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2270,30 +2270,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
-{
-   int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-   if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
-   }
-#endif
-   pte_unmap(page_table);
-   return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
 {
debug_dma_assert_idle(src);
@@ -2854,11 +2830,6 @@ int do_swap_page(struct vm_fault *vmf)
 
if (vma_readahead)
page = swap_readahead_detect(vmf, &swap_ra);
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
-   if (page)
-   put_page(page);
-   goto out;
-   }
 
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
-- 
2.7.4



[PATCH v3 02/20] mm: Prepare for FAULT_FLAG_SPECULATIVE

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
 implemented as the vm_fault structure in the kernel]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 55 ++
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 68be41b31ad0..46e769a5a7ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -291,6 +291,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER0x40/* The fault originated in 
userspace */
 #define FAULT_FLAG_REMOTE  0x80/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100  /* The fault was during an instruction 
fetch */
+#define FAULT_FLAG_SPECULATIVE 0x200   /* Speculative fault, not holding 
mmap_sem */
 
 #define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
diff --git a/mm/memory.c b/mm/memory.c
index 30bccfa00630..13c8c3c8b5e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2408,6 +2408,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, 
&vmf->ptl);
+   return true;
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2435,6 +2441,7 @@ static int wp_page_copy(struct vm_fault *vmf)
const unsigned long mmun_start = vmf->address & PAGE_MASK;
const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;
+   int ret = VM_FAULT_OOM;
 
if (unlikely(anon_vma_prepare(vma)))
goto oom;
@@ -2462,7 +2469,11 @@ static int wp_page_copy(struct vm_fault *vmf)
/*
 * Re-check the pte - we dropped the lock
 */
-   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   mem_cgroup_cancel_charge(new_page, memcg, false);
+   ret = VM_FAULT_RETRY;
+   goto oom_free_new;
+   }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2550,7 +2561,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 oom:
if (old_page)
put_page(old_page);
-   return VM_FAULT_OOM;
+   return ret;
 }
 
 /**
@@ -2571,8 +2582,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-  &vmf->ptl);
+   if (!pte_map_lock(vmf))
+   return VM_FAULT_RETRY;
/*
 * We might have raced with another page fault while we released the
 * pte_offset_map_lock.
@@ -2690,8 +2701,11 @@ static int do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
lock_page(vmf->page);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   unlock_page(vmf->page);
+   put_page(vmf->page);
+   return VM_FAULT_RETRY;
+   }
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2868,8 +2882,10 @@ int do_swap_page(struct vm_fault *vmf)
 * Back out if somebody else faulted in this pte
 * while we released the pte lock.
 */
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   return VM_FAULT_RETRY;
+   }
if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2925

[PATCH v3 03/20] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2017-09-08 Thread Laurent Dufour
When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking will have to be done while ensuring that
the VMA is not modified behind our back.

So move the fetch and locking operations into a dedicated function.

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 13c8c3c8b5e4..530d887ca885 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2408,6 +2408,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, 
&vmf->ptl);
@@ -3717,8 +3724,8 @@ static int do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -3910,8 +3917,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
-- 
2.7.4



[PATCH v3 04/20] mm: VMA sequence count

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm_types.h |  1 +
 mm/memory.c  |  2 ++
 mm/mmap.c| 21 ++---
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 46f4ecf5479a..df9a530c8ca1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -344,6 +344,7 @@ struct vm_area_struct {
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+   seqcount_t vm_sequence;
 } __randomize_layout;
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 530d887ca885..f250e7c92948 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1499,6 +1499,7 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long next;
 
BUG_ON(addr >= end);
+   write_seqcount_begin(&vma->vm_sequence);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -1508,6 +1509,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
+   write_seqcount_end(&vma->vm_sequence);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..0a0012c7e50c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -558,6 +558,8 @@ void __vma_link_rb(struct mm_struct *mm, struct 
vm_area_struct *vma,
else
mm->highest_vm_end = vm_end_gap(vma);
 
+   seqcount_init(&vma->vm_sequence);
+
/*
 * vma->vm_prev wasn't known when we followed the rbtree to find the
 * correct insertion point for that vma. As a result, we could not
@@ -799,6 +801,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
}
}
 
+   write_seqcount_begin(&vma->vm_sequence);
+   if (next && next != vma)
+   write_seqcount_begin_nested(&next->vm_sequence,
+   SINGLE_DEPTH_NESTING);
+
anon_vma = vma->anon_vma;
if (!anon_vma && adjust_next)
anon_vma = next->anon_vma;
@@ -903,6 +910,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
mm->map_count--;
mpol_put(vma_policy(next));
kmem_cache_free(vm_area_cachep, next);
+   write_seqcount_end(&next->vm_sequence);
/*
 * In mprotect's case 6 (see comments on vma_merge),
 * we must remove another next too. It would clutter
@@ -932,11 +940,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned 
long start,
if (remove_next == 2) {
remove_next = 1;
end = next->vm_end;
+   write_seqcount_end(&vma->vm_sequence);
goto again;
-   }
-   else if (next)
+   } else if (next) {
+   if (next != vma)
+   write_seqcount_begin_nested(&next->vm_sequence,
+   
SINGLE_DEPTH_NESTING);
vma_gap_update(next);
-   else {
+   } else {
/*
 * If remove_next == 2 we obviously can't
 * reach this path.
@@ -962,6 +973,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
if (insert && file)
uprobe_mmap(insert);
 
+   if (next && next != vma)
+   write_seqcount_end(&next->vm_sequence);
+   write_seqcount_end(&vma->vm_sequence);
+
validate_mm(mm);
 
return 0;
-- 
2.7.4



[PATCH v3 05/20] mm: Protect VMA modifications using VMA sequence count

2017-09-08 Thread Laurent Dufour
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modification done in :
- madvise()
- mremap()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE to prevent the write from being
split and intermediate values from being seen by other CPUs.

Signed-off-by: Laurent Dufour 
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++-
 mm/mlock.c | 13 -
 mm/mmap.c  | 17 ++---
 mm/mprotect.c  |  4 +++-
 mm/mremap.c|  7 +++
 9 files changed, 87 insertions(+), 36 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4bd4b85..550bbc852143 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1152,8 +1152,11 @@ static ssize_t clear_refs_write(struct file *file, const 
char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   write_seqcount_begin(&vma->vm_sequence);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & 
~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   write_seqcount_end(&vma->vm_sequence);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index ef4b48d1ea42..856570f327c3 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -634,8 +634,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct 
list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   write_seqcount_begin(&vma->vm_sequence);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   write_seqcount_end(&vma->vm_sequence);
return 0;
}
 
@@ -860,8 +863,10 @@ static int userfaultfd_release(struct inode *inode, struct 
file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   write_seqcount_begin(&vma->vm_sequence);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   write_seqcount_end(&vma->vm_sequence);
}
up_write(&mm->mmap_sem);
mmput(mm);
@@ -1379,8 +1384,10 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   write_seqcount_begin(&vma->vm_sequence);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   write_seqcount_end(&vma->vm_sequence);
 
skip:
prev = vma;
@@ -1537,8 +1544,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   write_seqcount_begin(&vma->vm_sequence);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   write_seqcount_end(&vma->vm_sequence);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c01f177a1120..56dd994c05d0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   write_seqcount_begin(&vma->vm_sequence);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1040,6 +1041,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm,

[PATCH v3 06/20] mm: RCU free VMAs

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback, this keeps files
valid during speculative faults, this is possible due to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (this could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations; although
we can rely on the termination of such walks -- tree rotations
guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes, we rely on the WMB from
write_sequnlock() to separate the tree removal and clearing the node.

Provide find_vma_srcu() which wraps the required magic.

Signed-off-by: Peter Zijlstra (Intel) 

[Remove the warnings in description about the SRCU global lock which
 has been removed now]
[Rename vma_is_dead() to vma_has_changed() and move its adding to the next
 patch]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm_types.h |   2 +
 kernel/fork.c|   1 +
 mm/init-mm.c |   1 +
 mm/internal.h|   5 +++
 mm/mmap.c| 100 +++
 5 files changed, 83 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index df9a530c8ca1..19ed70f79873 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -345,6 +345,7 @@ struct vm_area_struct {
 #endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
seqcount_t vm_sequence;
+   struct rcu_head vm_rcu_head;
 } __randomize_layout;
 
 struct core_thread {
@@ -362,6 +363,7 @@ struct kioctx_table;
 struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+   seqlock_t mm_seq;
u32 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 59c7a8775dd7..bb8205faf3c4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -808,6 +808,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+   seqlock_init(&mm->mm_seq);
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 975e49f00f34..2b1fa061684f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,6 +16,7 @@
 
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+   .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 0aaa05af7833..84360184eafd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,11 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm,
+   unsigned long addr);
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 04e72314274d..ad125d8c2e14 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -160,6 +160,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+   struct vm_area_struct *vma =
+   container_of(head, struct vm_area_struct, vm_rcu_head);
+
+   if (vma->vm_file)
+   fput(vma->vm_file);
+   kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+   call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -170,10 +187,8 @@ static struct vm_area_struct *remove_vma(struct 
vm_area_struct *vma)
m
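
[As a usage illustration only, not part of the patch: a hypothetical
sketch of how a caller would use the SRCU-protected lookup declared
above; the real find_vma_srcu() in this patch also retries the rb-tree
walk when mm->mm_seq reports a concurrent update.]

	struct vm_area_struct *vma;
	int idx;

	idx = srcu_read_lock(&vma_srcu);
	vma = find_vma_srcu(mm, address);	/* lockless rb-tree walk */
	if (vma) {
		/* ... speculative processing, VMA kept alive by SRCU ... */
	}
	srcu_read_unlock(&vma_srcu, idx);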

[PATCH v3 07/20] mm: Cache some VMA fields in the vm_fault structure

2017-09-08 Thread Laurent Dufour
When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released. So
there is no longer a guarantee that these fields will not change behind
our back. They are therefore saved in the vm_fault structure before the
VMA is checked for changes.

This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if it is not needed by the callee.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  6 ++
 mm/hugetlb.c   |  2 ++
 mm/khugepaged.c|  2 ++
 mm/memory.c| 38 --
 4 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46e769a5a7ab..5fd90ac31317 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -350,6 +350,12 @@ struct vm_fault {
 * page table to avoid allocation from
 * atomic context.
 */
+   /*
+* These entries are required when handling speculative page fault.
+* This way the page handling is done using consistent field values.
+*/
+   unsigned long vma_flags;
+   pgprot_t vma_page_prot;
 };
 
 /* page entry size for vm->huge_fault() */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 424b0ef08a60..da82a86a4761 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3683,6 +3683,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
.vma = vma,
.address = address,
.flags = flags,
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
/*
 * Hard to debug if it ends up being
 * used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 56dd994c05d0..0525a0e74535 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -881,6 +881,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
*mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, address),
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
};
 
/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index f250e7c92948..f008042ab24e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2556,7 +2556,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * Don't let another task, with possibly unlocked vma,
 * keep the mlocked page.
 */
-   if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+   if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
lock_page(old_page);/* LRU manipulation */
if (PageMlocked(old_page))
munlock_vma_page(old_page);
@@ -2590,7 +2590,7 @@ static int wp_page_copy(struct vm_fault *vmf)
  */
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
-   WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+   WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
/*
@@ -2692,7 +2692,7 @@ static int do_wp_page(struct vm_fault *vmf)
 * We should not cow pages in a shared writeable mapping.
 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 */
-   if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
 (VM_WRITE|VM_SHARED))
return wp_pfn_shared(vmf);
 
@@ -2739,7 +2739,7 @@ static int do_wp_page(struct vm_fault *vmf)
return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
@@ -2975,7 +2975,7 @@ int do_swap_page(struct vm_fault *vmf)
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-   pte = mk_pte(page, vma->vm_page_prot);
+   pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -2999,7 +2999,7 @@ int do_swap_page(struct vm_fault *vmf)
 
swap_free(entry);
if (mem_cgroup_swap_full(page) ||
-   (vma->vm_flags & VM_LOCKED) || PageMlocked(page

[PATCH v3 08/20] mm: Protect SPF handler against anon_vma changes

2017-09-08 Thread Laurent Dufour
The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try the speculative page fault if the VMA doesn't have
an anon_vma structure allocated, because its allocation should be
protected by the mmap_sem.

In __vma_adjust() when importer->anon_vma is set, there is no need to
protect against speculative page faults since speculative page fault
is aborted if the vma->anon_vma is not set.

When calling page_add_new_anon_rmap() vma->anon_vma is necessarily
valid since we checked for it when locking the pte, and the anon_vma is
only removed once the pte is unlocked. So even if the speculative page
fault handler is running concurrently with do_munmap(), the pte is
locked in unmap_region() - through unmap_vmas() - and the anon_vma is
unlinked later. Because we check the vma sequence counter, which is
updated in unmap_page_range() before locking the pte and then again in
free_pgtables(), the change will be detected when locking the pte.

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index f008042ab24e..401b13cbfc3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -617,7 +617,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   write_seqcount_begin(&vma->vm_sequence);
unlink_anon_vmas(vma);
+   write_seqcount_end(&vma->vm_sequence);
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
@@ -631,7 +633,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+   write_seqcount_begin(&vma->vm_sequence);
unlink_anon_vmas(vma);
+   write_seqcount_end(&vma->vm_sequence);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
-- 
2.7.4



[PATCH v3 09/20] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2017-09-08 Thread Laurent Dufour
migrate_misplaced_page() is only called during the page fault handling so
it's better to pass the pointer to the struct vm_fault instead of the vma.

This way, during the speculative page fault path, the saved vma->vm_flags
value can be used.

Signed-off-by: Laurent Dufour 
---
 include/linux/migrate.h | 4 ++--
 mm/memory.c | 2 +-
 mm/migrate.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 643c7ae7d7b4..a815ed6838b3 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -126,14 +126,14 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_fault *vmf, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 401b13cbfc3c..a4982917c16e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3790,7 +3790,7 @@ static int do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6954c1435833..4e8e22b9df8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1939,7 +1939,7 @@ bool pmd_trans_migrating(pmd_t pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
   int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
@@ -1952,7 +1952,7 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
 * with execute permissions as they are probably shared libraries.
 */
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
-   (vma->vm_flags & VM_EXEC))
+   (vmf->vma_flags & VM_EXEC))
goto out;
 
/*
-- 
2.7.4



[PATCH v3 10/20] mm: Introduce __lru_cache_add_active_or_unevictable

2017-09-08 Thread Laurent Dufour
The speculative page fault handler, which is run without holding the
mmap_sem, calls lru_cache_add_active_or_unevictable() but the vm_flags
value is not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable() which takes the vma
flags value as a parameter instead of the vma pointer.

Signed-off-by: Laurent Dufour 
---
 include/linux/swap.h | 11 +--
 mm/memory.c  |  8 
 mm/swap.c| 12 ++--
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8a807292037f..9b4dbb98af89 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -323,8 +323,15 @@ extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
-   struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+   unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+   struct vm_area_struct *vma)
+{
+   return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
+
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index a4982917c16e..4583f354be94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2509,7 +2509,7 @@ static int wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(new_page, vma);
+   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
 * We call the notify macro here because, when using secondary
 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2998,7 +2998,7 @@ int do_swap_page(struct vm_fault *vmf)
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
}
 
swap_free(entry);
@@ -3144,7 +3144,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3396,7 +3396,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup 
*memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 9295ae960d66..b084bb16d769 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -470,21 +470,21 @@ void add_page_to_unevictable_list(struct page *page)
 }
 
 /**
- * lru_cache_add_active_or_unevictable
- * @page:  the page to be added to LRU
- * @vma:   vma in which page is mapped for determining reclaimability
+ * __lru_cache_add_active_or_unevictable
+ * @page:  the page to be added to LRU
+ * @vma_flags:  vma in which page is mapped for determining reclaimability
  *
  * Place @page on the active or unevictable LRU list, depending on its
  * evictability.  Note that if the page is not evictable, it goes
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
-struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+  unsigned long vma_flags)
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+   if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
SetPageActive(page);
lru_cache_add(page);
return;
-- 
2.7.4



[PATCH v3 11/20] mm: Introduce __maybe_mkwrite()

2017-09-08 Thread Laurent Dufour
The current maybe_mkwrite() is passed a pointer to the vma structure in
order to fetch the vm_flags field.

When dealing with the speculative page fault handler, it is better to
rely on the cached vm_flags value stored in the vm_fault structure.

This patch introduces a __maybe_mkwrite() service which can be called by
passing the value of the vm_flags field directly.

There are no functional changes expected for the other callers of
maybe_mkwrite().

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 9 +++--
 mm/memory.c| 6 +++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5fd90ac31317..bb0c87f1c725 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -673,13 +673,18 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
 {
-   if (likely(vma->vm_flags & VM_WRITE))
+   if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
 }
 
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+   return __maybe_mkwrite(pte, vma->vm_flags);
+}
+
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
 int finish_fault(struct vm_fault *vmf);
diff --git a/mm/memory.c b/mm/memory.c
index 4583f354be94..c306f1f64c9e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2408,7 +2408,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2498,8 +2498,8 @@ static int wp_page_copy(struct vm_fault *vmf)
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-   entry = mk_pte(new_page, vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = mk_pte(new_page, vmf->vma_page_prot);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/*
 * Clear the pte entry and flush it first, before updating the
 * pte with the new entry. This will avoid a race condition
-- 
2.7.4



[PATCH v3 12/20] mm: Introduce __vm_normal_page()

2017-09-08 Thread Laurent Dufour
When dealing with the speculative fault path we should use the VMA's
cached field values stored in the vm_fault structure.

Currently vm_normal_page() is using the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which
receives the vm_flags value as a parameter.

Note: the speculative path is only turned on for architectures providing
support for the special PTE flag, so only the first block of
vm_normal_page() is used during the speculative path.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  7 +--
 mm/memory.c| 18 ++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb0c87f1c725..a2857aaa03f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1246,8 +1246,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap 
*/
 };
 
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device);
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+#define _vm_normal_page(vma, addr, pte, with_public_device) \
+   __vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags);
 #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
 
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index c306f1f64c9e..a5b5fe833ed3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -822,8 +822,9 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
 {
unsigned long pfn = pte_pfn(pte);
 
@@ -832,7 +833,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -864,8 +865,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
 
/* !HAVE_PTE_SPECIAL case follows: */
 
-   if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-   if (vma->vm_flags & VM_MIXEDMAP) {
+   if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+   if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -874,7 +875,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
-   if (!is_cow_mapping(vma->vm_flags))
+   if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2687,7 +2688,8 @@ static int do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
-   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+vmf->vma_flags);
if (!vmf->page) {
/*
 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -3749,7 +3751,7 @@ static int do_numa_page(struct vm_fault *vmf)
ptep_modify_prot_commit(vma->vm_mm, vmf->address, vmf->pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
 
-   page = vm_normal_page(vma, vmf->address, pte);
+   page = __vm_normal_page(vma, vmf->address, pte, false, vmf->vma_flags);
if (!page) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
-- 
2.7.4



[PATCH v3 13/20] mm: Introduce __page_add_new_anon_rmap()

2017-09-08 Thread Laurent Dufour
When dealing with the speculative page fault handler, we may race with a
VMA being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address at which the page fault occurred.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma() the new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA in which it has been merged
must have the same anon_vma pointer otherwise the merge can't be done.

So in all cases we know that the anon_vma is valid, since we have
checked before starting the speculative page fault that the anon_vma
pointer is valid for this VMA. Since there is an anon_vma, this means
that at some time a page has been backed by it, and before the VMA is
cleaned the page table lock would have to be grabbed to clean the
PTE; the anon_vma field is checked once the PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one which
does the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.

Signed-off-by: Laurent Dufour 
---
 include/linux/rmap.h | 12 ++--
 mm/memory.c  |  8 
 mm/rmap.c|  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 733d3d8181e2..d91be69c1c60 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -173,8 +173,16 @@ void page_add_anon_rmap(struct page *, struct 
vm_area_struct *,
unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-   unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/memory.c b/mm/memory.c
index a5b5fe833ed3..479b47a8ed7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2508,7 +2508,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * thread doing COW.
 */
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -2998,7 +2998,7 @@ int do_swap_page(struct vm_fault *vmf)
mem_cgroup_commit_charge(page, memcg, true, false);
activate_page(page);
} else { /* ksm created a completely new copy */
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
}
@@ -3144,7 +3144,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
}
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
@@ -3396,7 +3396,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup 
*memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
index b874c4761e84..5d657329191e 100644
--- a/mm/rmap.c
+++ b/

[PATCH v3 14/20] mm: Provide speculative fault infrastructure

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) 

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
[Set fe.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explictly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Introduce __HAVE_ARCH_CALL_SPF to declare the SPF handler only when
 architecture is supporting it]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
Signed-off-by: Laurent Dufour 
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |   5 +
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  14 +++
 mm/memory.c| 249 -
 5 files changed, 266 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index a4e7ca0f3585..6cfdfca4cc2a 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -7,7 +7,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2857aaa03f1..966b69f10f57 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -320,6 +320,7 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations 
*/
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+   unsigned int sequence;
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1342,6 +1343,10 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);
+#ifdef __HAVE_ARCH_CALL_SPF
+extern int handle_speculative_fault(struct mm_struct *mm,
+   unsigned long address, unsigned int flags);
+#endif /* __HAVE_ARCH_CALL_SPF */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 5bbd6780f205..832aa3ec7d00 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -451,8 +451,8 @@ static inline pgoff_t linear_page_index(struct 
vm_area_struct *vma,
pgoff_t pgoff;
if (unlikely(is_vm_hugetlb_page(vma)))
return linear_hugepage_index(vma, address);
-   pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
-   pgoff += vma->vm_pgoff;
+   pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
+   pgoff += READ_ONCE(vma->vm_pgoff);
return pgoff;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 84360184eafd..4ddadc440c26 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -45,6 +45,20 @@ extern struct srcu_struct vma_srcu;
 extern struct v
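
[Illustration only, not part of the patch: a minimal, hypothetical sketch
of the vma_has_changed() helper mentioned in the notes above; the actual
helper added here may perform additional checks, e.g. that the VMA is
still linked in the rb-tree.]

	static inline bool vma_has_changed(struct vm_fault *vmf)
	{
		/* vmf->sequence was snapshotted when the fault started */
		return read_seqcount_retry(&vmf->vma->vm_sequence,
					   vmf->sequence);
	}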

[PATCH v3 15/20] mm: Try spin lock in speculative path

2017-09-08 Thread Laurent Dufour
There is a deadlock when a CPU is doing a speculative page fault and
another one is calling do_munmap().

The deadlock occurs because the speculative path tries to spinlock the
pte while interrupts are disabled, when the other CPU in the unmap
path has locked the pte and is waiting for all the CPUs to invalidate
the TLB. As the CPU doing the speculative fault has interrupts
disabled, it can't invalidate the TLB, and can't get the lock.

Since we are in a speculative path, we can race with other mm actions.
So let's assume that the lock may not get acquired and fail the
speculative page fault.

Here are the stacks captured during the deadlock:

CPU 0
native_flush_tlb_others+0x7c/0x260
flush_tlb_mm_range+0x6a/0x220
tlb_flush_mmu_tlbonly+0x63/0xc0
unmap_page_range+0x897/0x9d0
? unmap_single_vma+0x7d/0xe0
? release_pages+0x2b3/0x360
unmap_single_vma+0x7d/0xe0
unmap_vmas+0x51/0xa0
unmap_region+0xbd/0x130
do_munmap+0x279/0x460
SyS_munmap+0x53/0x70

CPU 1
do_raw_spin_lock+0x14e/0x160
_raw_spin_lock+0x5d/0x80
? pte_map_lock+0x169/0x1b0
pte_map_lock+0x169/0x1b0
handle_pte_fault+0xbf2/0xd80
? trace_hardirqs_on+0xd/0x10
handle_speculative_fault+0x272/0x280
handle_speculative_fault+0x5/0x280
__do_page_fault+0x187/0x580
trace_do_page_fault+0x52/0x260
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5e98259c7ac0..18b39f930ce1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2432,7 +2432,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
goto out;
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (unlikely(!spin_trylock(vmf->ptl)))
+   goto out;
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
@@ -2468,8 +2469,20 @@ static bool pte_map_lock(struct vm_fault *vmf)
if (vma_has_changed(vmf))
goto out;
 
-   pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
- vmf->address, &ptl);
+   /*
+* Same as pte_offset_map_lock() except that we call
+* spin_trylock() in place of spin_lock() to avoid race with
+* unmap path which may have the lock and wait for this CPU
+* to invalidate TLB but this CPU has irq disabled.
+* Since we are in a speculative patch, accept it could fail
+*/
+   ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   pte = pte_offset_map(vmf->pmd, vmf->address);
+   if (unlikely(!spin_trylock(ptl))) {
+   pte_unmap(pte);
+   goto out;
+   }
+
if (vma_has_changed(vmf)) {
pte_unmap_unlock(pte, ptl);
goto out;
-- 
2.7.4



[PATCH v3 16/20] mm: Adding speculative page fault failure trace events

2017-09-08 Thread Laurent Dufour
This patch adds a set of new trace events to collect the speculative page
fault failure events.

Signed-off-by: Laurent Dufour 
---
 include/trace/events/pagefault.h | 87 
 mm/memory.c  | 59 ++-
 2 files changed, 135 insertions(+), 11 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..d7d56f8102d1
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,87 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include 
+#include 
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lu-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_pte_lock,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_dead,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/mm/memory.c b/mm/memory.c
index 18b39f930ce1..c2a52ddc850b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -81,6 +81,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
 #endif
@@ -2428,15 +2431,20 @@ static bool pte_spinlock(struct vm_fault *vmf)
}
 
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   if (unlikely(!spin_trylock(vmf->ptl)))
+   if (unlikely(!spin_trylock(vmf->ptl))) {
+   trace_spf_pte_lock(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2466,8 +2474,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * block on the PTL and thus we're safe.
 */
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
/*
 * Same as pte_offset_map_lock() except that we call
@@ -2480,11 +2490,13 @@ static bool pte_map_lock(struct vm_fault *vmf)
pte = pte_offset_map(vmf->pmd, vmf->address);
if (unlikely(!spin_trylock(ptl))) {
pte_unmap(pte);
+   trace_spf_pte_lock(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
if (vma_has_changed(vmf)) {
pte_unmap_unlock(pte, ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -4164,32 +4176,45 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * Validate the VMA found by the lockless lookup.
 */
dead =

[PATCH v3 17/20] perf: Add a speculative page fault sw event

2017-09-08 Thread Laurent Dufour
Add a new software event to count successful speculative page faults.

Signed-off-by: Laurent Dufour 
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 140ae638cfd6..101e509ee39b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -111,6 +111,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
-- 
2.7.4
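
[Usage illustration, not part of the patch: a minimal user-space sketch
counting the new event for the current process. It assumes headers that
carry the PERF_COUNT_SW_SPF definition added above.]

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdio.h>

	static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
					int cpu, int group_fd, unsigned long flags)
	{
		return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
	}

	int main(void)
	{
		struct perf_event_attr attr;
		long long count;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_SPF;	/* event added by this patch */

		fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
		if (fd < 0) {
			perror("perf_event_open");
			return 1;
		}

		/* ... run the workload to be measured here ... */

		if (read(fd, &count, sizeof(count)) == sizeof(count))
			printf("speculative faults: %lld\n", count);
		close(fd);
		return 0;
	}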



[PATCH v3 18/20] perf tools: Add support for the SPF perf event

2017-09-08 Thread Laurent Dufour
Add support for the new speculative faults event.

Signed-off-by: Laurent Dufour 
---
 tools/include/uapi/linux/perf_event.h | 1 +
 tools/perf/util/evsel.c   | 1 +
 tools/perf/util/parse-events.c| 4 
 tools/perf/util/parse-events.l| 1 +
 tools/perf/util/python.c  | 1 +
 5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 2a37ae925d85..8ee9a661018d 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -111,6 +111,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d9bd632ed7db..983e8104b523 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -432,6 +432,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+   "speculative-faults",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index f44aeba51d1f..ac2ebf99e965 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -135,6 +135,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias  = "",
},
+   [PERF_COUNT_SW_SPF] = {
+   .symbol = "speculative-faults",
+   .alias  = "spf",
+   },
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index c42edeac451f..af3e52fb9103 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -289,6 +289,7 @@ alignment-faults{ return 
sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_AL
 emulation-faults   { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); }
 dummy  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
 
/*
 * We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index c129e99114ae..12209adb7cb5 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1141,6 +1141,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+   PERF_CONST(COUNT_SW_SPF),
 
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
-- 
2.7.4
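
[Usage illustration, not part of the patch: with both the kernel event
(patch 17) and this tooling support applied, the new counter can be
read with the standard perf commands, e.g.

    perf stat -e speculative-faults,page-faults,minor-faults ./workload
    perf stat -e spf ./workload        # short alias from parse-events.l

No new perf options are added; only the event-table entries above.]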



[PATCH v3 19/20] x86/mm: Add speculative pagefault handling

2017-09-08 Thread Laurent Dufour
From: Peter Zijlstra 

Try a speculative fault before acquiring mmap_sem, if it returns with
VM_FAULT_RETRY continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) 

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with usual fault path in the case VM_ERROR is returned by
 handle_speculative_fault(). This allows signal to be delivered]
Signed-off-by: Laurent Dufour 
---
 arch/x86/include/asm/pgtable_types.h |  7 +++
 arch/x86/mm/fault.c  | 19 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index f1492473f10e..aadc8ecb91fb 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -257,6 +257,13 @@ enum page_cache_mode {
 #define PGD_IDENT_ATTR  0x001  /* PRESENT (no other attributes) */
 #endif
 
+/*
+ * Advertise that we call the Speculative Page Fault handler.
+ */
+#ifdef CONFIG_X86_64
+#define __HAVE_ARCH_CALL_SPF
+#endif
+
 #ifdef CONFIG_X86_32
 # include 
 #else
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b836a7274e12..652af5524f42 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,6 +1365,24 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
if (error_code & PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef __HAVE_ARCH_CALL_SPF
+   if (error_code & PF_USER) {
+   fault = handle_speculative_fault(mm, address, flags);
+
+   /*
+* We also check against VM_FAULT_ERROR because we have to
+* raise a signal by calling later mm_fault_error() which
+* requires the vma pointer to be set. So in that case,
+* we fall through the normal path.
+*/
+   if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   }
+#endif /* __HAVE_ARCH_CALL_SPF */
+
/*
 * When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in
@@ -1474,6 +1492,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
return;
}
 
+done:
/*
 * Major/minor page fault accounting. If any of the events
 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4



[PATCH v3 20/20] powerpc/mm: Add speculative page fault

2017-09-08 Thread Laurent Dufour
This patch enables the speculative page fault on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem,
if it returns with VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

Support is only provided for BOOK3S_64 currently because:
- it requires CONFIG_PPC_STD_MMU because of checks done in
  set_access_flags_filter()
- it requires BOOK3S because we can't support book3e_hugetlb_preload()
  called by update_mmu_cache()

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +
 arch/powerpc/mm/fault.c  | 15 +++
 2 files changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b9aff515b4de..470203daf97e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -314,6 +314,11 @@ extern unsigned long pci_io_base;
 /* Advertise support for _PAGE_SPECIAL */
 #define __HAVE_ARCH_PTE_SPECIAL
 
+/* Advertise that we call the Speculative Page Fault handler */
+#if defined(CONFIG_PPC_BOOK3S_64)
+#define __HAVE_ARCH_CALL_SPF
+#endif
+
 #ifndef __ASSEMBLY__
 
 /*
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 4797d08581ce..97c7de242627 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -442,6 +442,20 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#if defined(__HAVE_ARCH_CALL_SPF)
+   if (is_user) {
+   /* let's try a speculative page fault without grabbing the
+* mmap_sem.
+*/
+   fault = handle_speculative_fault(mm, address, flags);
+   if (!(fault & VM_FAULT_RETRY)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   }
+#endif /* defined(__HAVE_ARCH_CALL_SPF) */
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -526,6 +540,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
up_read(&current->mm->mmap_sem);
 
+done:
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
-- 
2.7.4



[PATCH V14 0/4] powerpc/vphn: Update CPU topology when VPHN enabled

2017-09-08 Thread Michael Bringmann
powerpc/numa: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources.  This patch
addresses some of those problems.

First, it corrects the currently broken capability to set the
topology for shared CPUs in LPARs.  At boot time for shared CPU
lpars, the topology for each CPU was being set to node zero.  Now
when numa_update_cpu_topology() is called appropriately, the Virtual
Processor Home Node (VPHN) capabilities information provided by the
pHyp allows the appropriate node in the shared configuration to be
selected for the CPU.

Next, it updates the initialization checks to independently recognize
PRRN or VPHN support.

Next, during hotplug CPU operations, it resets the timer on topology
update work function to a small value to better ensure that the CPU
topology is detected and configured sooner.

Also, fix an end-of-updates processing problem observed occasionally
in numa_update_cpu_topology().

Signed-off-by: Michael Bringmann 

Michael Bringmann (4):
  powerpc/vphn: Update CPU topology when VPHN enabled
  powerpc/vphn: Improve recognition of PRRN/VPHN
  powerpc/hotplug: Improve responsiveness of hotplug change
  powerpc/vphn: Fix numa update end-loop bug
  ---
Changes in V14:
  -- Cleanup and compress some code after split.



[PATCH V14 1/4] powerpc/vphn: Update CPU topology when VPHN enabled

2017-09-08 Thread Michael Bringmann
powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources.  This patch
corrects the currently broken capability to set the topology for
shared CPUs in LPARs.  At boot time for shared CPU lpars, the
topology for each CPU was being set to node zero.  Now when
numa_update_cpu_topology() is called appropriately, the Virtual
Processor Home Node (VPHN) capabilities information provided by the
pHyp allows the appropriate node in the shared configuration to be
selected for the CPU.

Signed-off-by: Michael Bringmann 
---
Changes in V14:
  -- Cleanup some duplicate code. Change a trace stmt to debug.
---
 arch/powerpc/mm/numa.c |   22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b95c584..3ae031d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1153,6 +1153,8 @@ struct topology_update_data {
 static int vphn_enabled;
 static int prrn_enabled;
 static void reset_topology_timer(void);
+static int topology_inited;
+static int topology_update_needed;
 
 /*
  * Store the current values of the associativity change counters in the
@@ -1246,6 +1248,10 @@ static long vphn_get_associativity(unsigned long cpu,
"hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
stop_topology_update();
+   break;
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   break;
}
 
return rc;
@@ -1323,8 +1329,11 @@ int numa_update_cpu_topology(bool cpus_locked)
struct device *dev;
int weight, new_nid, i = 0;
 
-   if (!prrn_enabled && !vphn_enabled)
+   if (!prrn_enabled && !vphn_enabled) {
+   if (!topology_inited)
+   topology_update_needed = 1;
return 0;
+   }
 
weight = cpumask_weight(&cpu_associativity_changes_mask);
if (!weight)
@@ -1363,6 +1372,8 @@ int numa_update_cpu_topology(bool cpus_locked)
cpumask_andnot(&cpu_associativity_changes_mask,
&cpu_associativity_changes_mask,
cpu_sibling_mask(cpu));
+   dbg("Assoc chg gives same node %d for cpu%d\n",
+   new_nid, cpu);
cpu = cpu_last_thread_sibling(cpu);
continue;
}
@@ -1433,6 +1444,7 @@ int numa_update_cpu_topology(bool cpus_locked)
 
 out:
kfree(updates);
+   topology_update_needed = 0;
return changed;
 }
 
@@ -1613,9 +1625,17 @@ static int topology_update_init(void)
if (topology_updates_enabled)
start_topology_update();
 
+   if (vphn_enabled)
+   topology_schedule_update();
+
if (!proc_create("powerpc/topology_updates", 0644, NULL, &topology_ops))
return -ENOMEM;
 
+   topology_inited = 1;
+   if (topology_update_needed)
+   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+   nr_cpumask_bits);
+
return 0;
 }
 device_initcall(topology_update_init);



[PATCH V14 2/4] powerpc/vphn: Improve recognition of PRRN/VPHN

2017-09-08 Thread Michael Bringmann
powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources.  This patch
updates the initialization checks to independently recognize PRRN
or VPHN support.

Signed-off-by: Michael Bringmann 
---
Changes in V14:
  -- Code cleanup to reduce patch sizes
---
 arch/powerpc/mm/numa.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3ae031d..5f5ff46 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1531,15 +1531,14 @@ int start_topology_update(void)
if (firmware_has_feature(FW_FEATURE_PRRN)) {
if (!prrn_enabled) {
prrn_enabled = 1;
-   vphn_enabled = 0;
 #ifdef CONFIG_SMP
rc = of_reconfig_notifier_register(&dt_update_nb);
 #endif
}
-   } else if (firmware_has_feature(FW_FEATURE_VPHN) &&
+   }
+   if (firmware_has_feature(FW_FEATURE_VPHN) &&
   lppaca_shared_proc(get_lppaca())) {
if (!vphn_enabled) {
-   prrn_enabled = 0;
vphn_enabled = 1;
setup_cpu_associativity_change_counters();
init_timer_deferrable(&topology_timer);
@@ -1562,7 +1561,8 @@ int stop_topology_update(void)
 #ifdef CONFIG_SMP
rc = of_reconfig_notifier_unregister(&dt_update_nb);
 #endif
-   } else if (vphn_enabled) {
+   }
+   if (vphn_enabled) {
vphn_enabled = 0;
rc = del_timer_sync(&topology_timer);
}



[PATCH V14 3/4] powerpc/hotplug: Improve responsiveness of hotplug change

2017-09-08 Thread Michael Bringmann
powerpc/hotplug: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources.  During hotplug
CPU operations, this patch resets the timer on topology update work
function to a small value to better ensure that the CPU topology is
detected and configured sooner.

Signed-off-by: Michael Bringmann 
---
Changes in V14:
  -- Restore accidentally deleted reset of topology timer.
---
 arch/powerpc/include/asm/topology.h  |8 
 arch/powerpc/mm/numa.c   |   23 ++-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |2 ++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index dc4e159..beb9bca 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,6 +98,14 @@ static inline int prrn_is_enabled(void)
 }
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_NEED_MULTIPLE_NODES)
+#if defined(CONFIG_PPC_SPLPAR)
+extern int timed_topology_update(int nsecs);
+#else
+#define timed_topology_update(nsecs)
+#endif /* CONFIG_PPC_SPLPAR */
+#endif /* CONFIG_HOTPLUG_CPU || CONFIG_NEED_MULTIPLE_NODES */
+
 #include 
 
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5f5ff46..32f5f8d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1148,15 +1148,35 @@ struct topology_update_data {
int new_nid;
 };
 
+#define TOPOLOGY_DEF_TIMER_SECS 60
+
 static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS];
 static cpumask_t cpu_associativity_changes_mask;
 static int vphn_enabled;
 static int prrn_enabled;
 static void reset_topology_timer(void);
+static int topology_timer_secs = 1;
 static int topology_inited;
 static int topology_update_needed;
 
 /*
+ * Change polling interval for associativity changes.
+ */
+int timed_topology_update(int nsecs)
+{
+   if (vphn_enabled) {
+   if (nsecs > 0)
+   topology_timer_secs = nsecs;
+   else
+   topology_timer_secs = TOPOLOGY_DEF_TIMER_SECS;
+
+   reset_topology_timer();
+   }
+
+   return 0;
+}
+
+/*
  * Store the current values of the associativity change counters in the
  * hypervisor.
  */
@@ -1251,6 +1271,7 @@ static long vphn_get_associativity(unsigned long cpu,
break;
case H_SUCCESS:
dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
break;
}
 
@@ -1481,7 +1502,7 @@ static void topology_timer_fn(unsigned long ignored)
 static void reset_topology_timer(void)
 {
topology_timer.data = 0;
-   topology_timer.expires = jiffies + 60 * HZ;
+   topology_timer.expires = jiffies + topology_timer_secs * HZ;
mod_timer(&topology_timer, topology_timer.expires);
 }
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 6afd1ef..5a7fb1e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -356,6 +356,7 @@ static int dlpar_online_cpu(struct device_node *dn)
BUG_ON(get_cpu_current_state(cpu)
!= CPU_STATE_OFFLINE);
cpu_maps_update_done();
+   timed_topology_update(1);
rc = device_online(get_cpu_device(cpu));
if (rc)
goto out;
@@ -522,6 +523,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
set_preferred_offline_state(cpu,
CPU_STATE_OFFLINE);
cpu_maps_update_done();
+   timed_topology_update(1);
rc = device_offline(get_cpu_device(cpu));
if (rc)
goto out;



[PATCH V14 4/4] powerpc/vphn: Fix numa update end-loop bug

2017-09-08 Thread Michael Bringmann
powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources.  This patch
fixes an end-of-updates processing problem observed occasionally
in numa_update_cpu_topology().

Signed-off-by: Michael Bringmann 
---
Changes in V14:
  -- More code cleanup
---
 arch/powerpc/mm/numa.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 32f5f8d..ec098b3 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1401,16 +1401,22 @@ int numa_update_cpu_topology(bool cpus_locked)
 
for_each_cpu(sibling, cpu_sibling_mask(cpu)) {
ud = &updates[i++];
+   ud->next = &updates[i];
ud->cpu = sibling;
ud->new_nid = new_nid;
ud->old_nid = numa_cpu_lookup_table[sibling];
cpumask_set_cpu(sibling, &updated_cpus);
-   if (i < weight)
-   ud->next = &updates[i];
}
cpu = cpu_last_thread_sibling(cpu);
}
 
+   /*
+* Prevent processing of 'updates' from overflowing array
+* where last entry filled in a 'next' pointer.
+*/
+   if (i)
+   updates[i-1].next = NULL;
+
pr_debug("Topology update for the following CPUs:\n");
if (cpumask_weight(&updated_cpus)) {
for (ud = &updates[0]; ud; ud = ud->next) {



Re: [PATCH] powerpc/powernv: Increase memory block size to 1GB on radix

2017-09-08 Thread Balbir Singh
On Thu, Sep 7, 2017 at 5:21 PM, Benjamin Herrenschmidt
 wrote:
> On Thu, 2017-09-07 at 15:17 +1000, Anton Blanchard wrote:
>> Hi,
>>
>> > There is a similar issue being worked on w.r.t pseries.
>> >
>> > https://lkml.kernel.org/r/1502357028-27465-1-git-send-email-bhar...@linux.vnet.ibm.com
>> >
>> > The question is should we map these regions ? ie, we need to tell the
>> > kernel memory region that we would like to hot unplug later so that
>> > we avoid doing kernel allocations from that. If we do that, then we
>> > can possibly map them via 2M size ?
>>
>> But all of memory on PowerNV should be able to be hot unplugged, so

For this ideally we need movable mappings for the regions we intend
to hot-unplug - no? Otherwise, there is no guarantee that hot-unplug
will work

>> there are two options as I see it - either increase the memory block
>> size, or map everything with 2MB pages.
>
> Or be smarter and map with 1G when blocks of 1G are available and break
> down to 2M where necessary, it shouldn't be too hard.
>

strict_rwx patches added helpers to do this

Balbir Singh.
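
[A minimal sketch of the kind of helper being suggested above -- not
taken from any posted patch; the name best_map_size() and its exact
shape are assumptions:

	/* pick the largest mapping step the alignment and remaining size allow */
	static unsigned long best_map_size(unsigned long addr, unsigned long end)
	{
		if (IS_ALIGNED(addr, PUD_SIZE) && (end - addr) >= PUD_SIZE)
			return PUD_SIZE;	/* 1G mapping on radix */
		if (IS_ALIGNED(addr, PMD_SIZE) && (end - addr) >= PMD_SIZE)
			return PMD_SIZE;	/* 2M mapping */
		return PAGE_SIZE;
	}

The linear-mapping loop would then advance by best_map_size() at each
step instead of a fixed block size.]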


RE: Machine Check in P2010(e500v2)

2017-09-08 Thread Leo Li


> -Original Message-
> From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> Sent: Friday, September 08, 2017 7:51 AM
> To: linuxppc-dev@lists.ozlabs.org; Leo Li ; York Sun
> 
> Subject: Re: Machine Check in P2010(e500v2)
> 
> On Fri, 2017-09-08 at 11:54 +0200, Joakim Tjernlund wrote:
> > On Thu, 2017-09-07 at 18:54 +, Leo Li wrote:
> > > > -Original Message-
> > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com]
> > > > Sent: Thursday, September 07, 2017 3:41 AM
> > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ;
> > > > York Sun 
> > > > Subject: Re: Machine Check in P2010(e500v2)
> > > >
> > > > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> > > > > On Wed, 2017-09-06 at 21:13 +, Leo Li wrote:
> > > > > > > -Original Message-
> > > > > > > From: Joakim Tjernlund
> > > > > > > [mailto:joakim.tjernl...@infinera.com]
> > > > > > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li
> > > > > > > ; York Sun 
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > >
> > > > > > > On Wed, 2017-09-06 at 20:28 +, Leo Li wrote:
> > > > > > > > > -Original Message-
> > > > > > > > > From: Joakim Tjernlund
> > > > > > > > > [mailto:joakim.tjernl...@infinera.com]
> > > > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li
> > > > > > > > > ; York Sun 
> > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > >
> > > > > > > > > On Wed, 2017-09-06 at 19:31 +, Leo Li wrote:
> > > > > > > > > > > -Original Message-
> > > > > > > > > > > From: York Sun
> > > > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > > > > > To: Joakim Tjernlund
> > > > > > > > > > > ;
> > > > > > > > > > > linuxppc- d...@lists.ozlabs.org; Leo Li
> > > > > > > > > > > 
> > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > > > >
> > > > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > > > > >
> > > > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > > > > > So after some debugging I found this bug:
> > > > > > > > > > > > @@ -996,7 +998,7 @@ int
> > > > > > > > > > > > fsl_pci_mcheck_exception(struct pt_regs
> > > > > > >
> > > > > > > *regs)
> > > > > > > > > > > >  if (is_in_pci_mem_space(addr)) {
> > > > > > > > > > > >  if (user_mode(regs)) {
> > > > > > > > > > > >  pagefault_disable();
> > > > > > > > > > > > -   ret = get_user(regs->nip, 
> > > > > > > > > > > > &inst);
> > > > > > > > > > > > +   ret = get_user(inst,
> > > > > > > > > > > > + (__u32 __user *)regs->nip);
> > > > > > > > > > > >  pagefault_enable();
> > > > > > > > > > > >  } else {
> > > > > > > > > > > >  ret =
> > > > > > > > > > > > probe_kernel_address(regs->nip, inst);
> > > > > > > > > > > >
> > > > > > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > > > > > Now I wonder why this fixup is there in the first place?
> > > > > > > > > > > > The routine will not really fixup the insn, just
> > > > > > > > > > > > return 0x for the failing read and then advance 
> > > > > > > > > > > > the
> process NIP.
> > > > > > > > > >
> > > > > > > > > > You are right.  The code here only gives 0x to
> > > > > > > > > > the load instructions and
> > > > > > > > >
> > > > > > > > > continue with the next instruction when the load
> > > > > > > > > instruction is causing the machine check.  This will
> > > > > > > > > prevent a system lockup when reading from PCI/RapidIO device
> which is link down.
> > > > > > > > > >
> > > > > > > > > > I don't know what is actual problem in your case.
> > > > > > > > > > Maybe it is a write
> > > > > > > > >
> > > > > > > > > instruction instead of read?   Or the code is in a infinite 
> > > > > > > > > loop
> waiting for
> > > >
> > > > a
> > > > > > >
> > > > > > > valid
> > > > > > > > > read result?  Are you able to do some further debugging
> > > > > > > > > with the NIP correctly printed?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > According to the MC it is a Read and the NIP also leads
> > > > > > > > > to a read in the
> > > > > > >
> > > > > > > program.
> > > > > > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > > > > > Question, is it safe add a small printk when this MC
> > > > > > > > > happens(after fixing up)? I need to see that it has
> > > > > > > > > happened as the error is somewhat
> > > > > > >
> > > > > > > random.
> > > > > > > >
> > > > > > > > I think it is safe to add printk as the current machine
> > > > > > > > check handlers are also
> > > > > > >
> > > > > > > using printk.
> > > > > > >
> > > > > > > I hope so, but if the fixup fires

[PATCH 0/7] powerpc: Free up RPAGE_RSV bits

2017-09-08 Thread Ram Pai
RPAGE_RSV0..4 pte bits are currently used for hpte slot
tracking. We  need  these bits   for  memory-protection
keys. Luckily these  four bits   are  relatively easier 
to move among all the other candidate bits.

For  64K   linux-ptes   backed  by 4k hptes, these bits
are   used for tracking the  validity of the slot value
stored   in the second-part-of-the-pte. We devise a new
mechanism for  tracking   the   validity  without using
those bits. The mechanism  is explained in the patch.

For 64K  linux-pte  backed by 64K hptes, we simply move
the   slot  tracking bits to the second-part-of-the-pte.

The above  mechanism  is also used to free the bits for
hugetlb linux-ptes.

For 4k linux-pte, we  have only 3 free  bits  available.
We swizzle around the bits and release RPAGE_RSV{2,3,4}
for memory protection keys.

Testing:

has survived  kernel  compilation on multiple platforms
p8 powernv hash-mode, p9 powernv hash-mode,  p7 powervm,
p8-powervm, p8-kvm-guest.

Has survived git-bisect on p8  power-nv  with  64K page
and 4K page.

History:
---
This patchset  is  a  spin-off from the memkey patchset.

version v9:
(1) rearranged the patch order. First the helper
routines are defined followed   by  the
patches that make use of the helpers.

version v8:
(1) an  additional  patch   added  to  free  up
 RSV{2,3,4} on 4K linux-pte.

version v7:
(1) GIX bit reset change  moved  to  the second
patch  -- noticed by Aneesh.
(2) Separated this patches from memkey patchset
(3) merged a  bunch  of  patches, that used the
helper function, into one.
version v6:
(1) No changes related to pte.

version v5:
(1) No changes related to pte.

version v4:
(1) No changes related to pte.

version v3:
(1) split the patches into smaller consumable
patches.
(2) A bug fix while  invalidating a hpte slot
in __hash_page_4K()
-- noticed by Aneesh


version v2:
(1) fixed a  bug  in 4k  hpte  backed 64k pte
where  pageinvalidation   was not
done  correctly,  and  initialization
ofsecond-part-of-the-pte  was not
donecorrectly  if the pte was not
yet Hashed with a hpte.
   --   Reported by Aneesh.


version v1: Initial version

Ram Pai (7):
  powerpc: introduce pte_set_hash_slot() helper
  powerpc: introduce pte_get_hash_gslot() helper
  powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
  powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
  powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6
  powerpc: use helper functions to get and set hash slots
  powerpc: capture the PTE format changes in the dump pte report

 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   21 
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   61 
 arch/powerpc/include/asm/book3s/64/hash.h |8 +-
 arch/powerpc/mm/dump_linuxpagetables.c|3 +-
 arch/powerpc/mm/hash64_4k.c   |   14 +--
 arch/powerpc/mm/hash64_64k.c  |  131 +
 arch/powerpc/mm/hash_utils_64.c   |   35 +--
 arch/powerpc/mm/hugetlbpage-hash64.c  |   18 ++--
 8 files changed, 171 insertions(+), 120 deletions(-)



[PATCH 1/7] powerpc: introduce pte_set_hash_slot() helper

2017-09-08 Thread Ram Pai
Introduce pte_set_hash_slot(). It sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX)
bits at the appropriate location in a 4K PTE. For a
64K PTE, it  sets  the  bits  in  the  second  part  of  the  PTE. Though
the implementation  for the former just needs the slot parameter, it does
take some additional parameters to keep the prototype consistent.

This function  will  be  handy  as  we   work   towards  re-arranging the
bits in the later patches.

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   15 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   25 +
 2 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 0c4e470..8909039 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -48,6 +48,21 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
 }
 #endif
 
+/*
+ * 4k pte format is  different  from  64k  pte  format.  Saving  the
+ * hash_slot is just a matter of returning the pte bits that need to
+ * be modified. On 64k pte, things are a  little  more  involved and
+ * hence  needs   many   more  parameters  to  accomplish  the  same.
+ * However we  want  to abstract this out from the caller by keeping
+ * the prototype consistent across the two formats.
+ */
+static inline unsigned long pte_set_hash_slot(pte_t *ptep, real_pte_t rpte,
+   unsigned int subpg_index, unsigned long slot)
+{
+   return (slot << H_PAGE_F_GIX_SHIFT) &
+   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..6652669 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -74,6 +74,31 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, 
unsigned long index)
return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
 }
 
+/*
+ * Commit the hash slot and return pte bits that needs to be modified.
+ * The caller is expected to modify the pte bits accordingly and
+ * commit the pte to memory.
+ */
+static inline unsigned long pte_set_hash_slot(pte_t *ptep, real_pte_t rpte,
+   unsigned int subpg_index, unsigned long slot)
+{
+   unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+   rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+   *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
+   /*
+* Commit the hidx bits to memory before returning.
+* Anyone reading  pte  must  ensure hidx bits are
+* read  only  after  reading the pte by using the
+* read-side  barrier  smp_rmb(). __real_pte() can
+* help ensure that.
+*/
+   smp_wmb();
+
+   /* no pte bits to be modified, return 0x0UL */
+   return 0x0UL;
+}
+
 #define __rpte_to_pte(r)   ((r).pte)
 extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
 /*
-- 
1.7.1



[PATCH 2/7] powerpc: introduce pte_get_hash_gslot() helper

2017-09-08 Thread Ram Pai
Introduce pte_get_hash_gslot(), which returns the slot number of the
HPTE in the global hash table.

This function will come in handy as we work towards re-arranging the
PTE bits in the later patches.

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash.h |3 +++
 arch/powerpc/mm/hash_utils_64.c   |   18 ++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index f884520..060c059 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -166,6 +166,9 @@ static inline int hash__pte_none(pte_t pte)
return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
 }
 
+unsigned long pte_get_hash_gslot(unsigned long vpn, unsigned long shift,
+   int ssize, real_pte_t rpte, unsigned int subpg_index);
+
 /* This low level function performs the actual PTE insertion
  * Setting the PTE depends on the MMU type and other factors. It's
  * an horrible mess that I'm not going to try to clean up now but
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 67ec2e9..e68f053 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1591,6 +1591,24 @@ static inline void tm_flush_hash_page(int local)
 }
 #endif
 
+/*
+ * return the global hash slot, corresponding to the given
+ * pte, which contains the hpte.
+ */
+unsigned long pte_get_hash_gslot(unsigned long vpn, unsigned long shift,
+   int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+   unsigned long hash, slot, hidx;
+
+   hash = hpt_hash(vpn, shift, ssize);
+   hidx = __rpte_to_hidx(rpte, subpg_index);
+   if (hidx & _PTEIDX_SECONDARY)
+   hash = ~hash;
+   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+   slot += hidx & _PTEIDX_GROUP_IX;
+   return slot;
+}
+
 /* WARNING: This is called from hash_low_64.S, if you change this prototype,
  *  do not forget to update the assembly call site !
  */
-- 
1.7.1



[PATCH 3/7] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-09-08 Thread Ram Pai
Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6,
in the 4K backed HPTE pages. These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The  bit  numbers are big-endian  as
defined in the ISA3.0

The patch does the following change to the 4k hpte backed
64K PTE format.

H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
below)
V0 which occupied bit 4 is not used anymore.
V1 which occupied bit 5 is not used anymore.
V2 which occupied bit 6 is not used anymore.
V3 which occupied bit 7 is not used anymore.

Before the patch, the 4k backed 64k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10...63
 : : : : :  :  :  :  : : ::
 v v v v v  v  v  v  v v vv

,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
|x|x|x|B|V0|V1|V2|V3|x| | |x|x||x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'

After the patch, the 4k backed 64k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10...63
 : : : : :  :  :  :  : : ::
 v v v v v  v  v  v  v v vv

,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
|x|x|x| |  |  |  |  |x|B| |x|x||.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'

The four  bits S,G,I,X (one quadruplet per 4k HPTE) that
cache  the  hash-bucket  slot  value are initialized  to
1,1,1,1 indicating -- an invalid slot.   If  a HPTE gets
cached in a 0xF slot (i.e. 7th  slot  of  secondary hash
bucket), it is  released  immediately. In  other  words,
even  though 0xF  is  a valid slot  value in the hash
bucket, we consider it invalid and  release the slot and
the HPTE.  This  gives  us  the opportunity to determine
the validity of S,G,I,X  bits  based on its contents and
not on any of the bits V0,V1,V2 or V3 in the primary PTE.

When   we  release  a  HPTE  cached in the 0xF slot
we also  release  a  legitimate   slot  in the primary
hash bucket  and  unmap  its  corresponding  HPTE.  This
is  to  ensure   that  we do get a HPTE cached in a slot
of the primary hash bucket, the next time we retry.

Though  treating the 0xF  slot  as  invalid  reduces  the
number of  available  slots  in the hash bucket and  may
have  an  effect   on the performance, the probability of
hitting a 0xF slot is extremely low.
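
[A minimal sketch of the check this scheme enables -- not code from
this patch; the helper name hpte_soft_invalid() is an assumption:

	/*
	 * A quadruplet starts out as all ones and a genuine 0xF slot is
	 * never left cached, so hidx == 0xF can serve as the "no valid
	 * slot cached" marker; the V0..V3 bits in the primary PTE are
	 * no longer needed for that.
	 */
	static inline bool hpte_soft_invalid(unsigned long hidx)
	{
		return ((hidx & 0xfUL) == 0xfUL);
	}
]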

Compared  to  the   currentscheme,  the above scheme
reduces  the   number  of   false   hash  table  updates
significantly and  has the  added advantage of releasing
four  valuable  PTE bits for other purpose.

NOTE: even though bits 3, 4, 5, 6, 7 are  not  used  when
the  64K  PTE is backed by 4k HPTE,  they continue to be
used  if  the  PTE  gets  backed  by 64k HPTE.  The next
patch will decouple that as well, and truly  release the
bits.

This idea was jointly developed by Paul Mackerras,
Aneesh, Michael Ellermen and myself.

4K PTE format remains unchanged currently.

The patch does the following code changes
a) PTE flags are split between 64k and 4k  header files.
b) __hash_page_4K()  is  reimplemented   to reflect the
   above logic.

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |2 +
 arch/powerpc/include/asm/book3s/64/hash-64k.h |8 +--
 arch/powerpc/include/asm/book3s/64/hash.h |1 -
 arch/powerpc/mm/hash64_64k.c  |  106 +
 arch/powerpc/mm/hash_utils_64.c   |4 +-
 5 files changed, 63 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 8909039..e66bfeb 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,8 @@
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
+
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
 H_PAGE_F_SECOND | H_PAGE_F_GIX)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 6652669..e038f1c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,18 +12,14 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_BUS

[PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages

2017-09-08 Thread Ram Pai
Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
in the 64K backed HPTE pages. This along with the earlier
patch will  entirely free  up the four bits from 64K PTE.
The bit numbers are  big-endian as defined in the  ISA3.0

This patch  does  the  following change to 64K PTE backed
by 64K HPTE.

H_PAGE_F_SECOND (S) which  occupied  bit  4  moves to the
second part of the pte to bit 60.
H_PAGE_F_GIX (G,I,X) which  occupied  bit 5, 6 and 7 also
moves  to  the   second part of the pte to bit 61,
62 and 63 respectively

since bit 7 is now freed up, we move H_PAGE_BUSY (B) from
bit  9  to  bit  7.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
NOTE: none of the bits in the secondary PTE were used
by the 64k-HPTE backed PTE until now.

Before the patch, the 64K HPTE backed 64k PTE format was
as follows

 0 1 2 3 4  5  6  7  8 9 10...63
 : : : : :  :  :  :  : : ::
 v v v v v  v  v  v  v v vv

,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
|x|x|x| |S |G |I |X |x|B| |x|x||x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
| | | | |  |  |  |  | | | | |..| | | | | <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'

After the patch, the 64k HPTE backed 64k PTE format is
as follows

 0 1 2 3 4  5  6  7  8 9 10...63
 : : : : :  :  :  :  : : ::
 v v v v v  v  v  v  v v vv

,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
|x|x|x| |  |  |  |B |x| | |x|x||.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
| | | | |  |  |  |  | | | | |..|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'

The above PTE changes are applicable to hugetlb pages as well.

The patch does the following code changes:

a) moves  the  H_PAGE_F_SECOND and  H_PAGE_F_GIX to the 4k PTE
header   since they are no longer needed by the 64k PTEs.
b) abstracts  out __real_pte() and __rpte_to_hidx() so the
caller  need not know the bit location of the slot.
c) moves the slot bits to the secondary pte.

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |3 ++
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   29 +++-
 arch/powerpc/include/asm/book3s/64/hash.h |3 --
 arch/powerpc/mm/hash64_64k.c  |   23 ---
 arch/powerpc/mm/hugetlbpage-hash64.c  |   18 ++-
 5 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index e66bfeb..dc153c6 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,9 @@
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+#define H_PAGE_F_GIX_SHIFT 56
+#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
 #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
 
 /* PTE flags to conserve for HPTE identification */
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index e038f1c..89ef5a9 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,7 +12,7 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
 
 /*
  * We need to differentiate between explicit huge page and THP huge
@@ -21,8 +21,7 @@
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
-H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
 /*
  * we support 16 fragments per PTE page of 64K size.
  */
@@ -50,24 +49,22 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
unsigned long *hidxp;
 
rpte.pte = pte;
-   rpte.hidx = 0;
-   if (pte_val(pte) & H_PAGE_COMBO) {
-   /*
-* Make sure we order the hidx load against the H_PAGE_COMBO
-* check. The store side ordering is done in __hash_page_4K
-*/
-   smp_rmb();
-   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-   rpte.hidx = *hidxp;
-   }

[PATCH 5/7] powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6

2017-09-08 Thread Ram Pai
We  need  PTE bits  3 ,4, 5, 6 and 57 to support protection-keys,
because these are  the bits we want to consolidate on across all
configuration to support protection keys.

Bit 3,4,5 and 6 are currently used on 4K-pte kernels.  But bit 9
and 10 are available.  Hence  we  use the two available bits and
free up bit 5 and 6.  We will still not be able to free up bit 3
and 4. In the absence  of  any  other free bits, we will have to
stay satisfied  with  what we have :-(.   This means we will not
be  able  to support  32  protection  keys, but only 8.  The bit
numbers are  big-endian as defined in the  ISA3.0

This patch  does  the  following change to 4K PTE.

H_PAGE_F_SECOND (S) which   occupied  bit  4   moves  to  bit  7.
H_PAGE_F_GIX (G,I,X)  which  occupied  bit 5, 6 and 7 also moves
to  bit 8,9, 10 respectively.
H_PAGE_HASHPTE (H)  which   occupied   bit  8  moves  to  bit  4.

Before the patch, the 4k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 1057.63
 : : : : :  :  :  :  : : :  : :
 v v v v v  v  v  v  v v v  v v
,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
|x|x|x|B|S |G |I |X |H| | |x|x|| |x|x|x|
'_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'

After the patch, the 4k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 1057.63
 : : : : :  :  :  :  : : :  : :
 v v v v v  v  v  v  v v v  v v
,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
|x|x|x|B|H |  |  |S |G|I|X|x|x|| |.|.|.|
'_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'

The patch has no code changes; just swizzles around bits.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |7 ---
 arch/powerpc/include/asm/book3s/64/hash-64k.h |1 +
 arch/powerpc/include/asm/book3s/64/hash.h |1 -
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index dc153c6..5187249 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,10 +16,11 @@
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
-#define H_PAGE_F_GIX_SHIFT 56
-#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 53
+#define H_PAGE_F_SECOND _RPAGE_RPN44 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RPN43 | _RPAGE_RPN42 | _RPAGE_RPN41)
 #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RSV2 /* software: PTE & hash are busy */
 
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 89ef5a9..8576060 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -13,6 +13,7 @@
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
 #define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
 
 /*
  * We need to differentiate between explicit huge page and THP huge
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 46f3a23..953795e 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,7 +8,6 @@
  *
  */
 #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
-#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
 
 #ifdef CONFIG_PPC_64K_PAGES
 #include 
-- 
1.7.1



[PATCH 6/7] powerpc: use helper functions to get and set hash slots

2017-09-08 Thread Ram Pai
Replace redundant code in __hash_page_4K()   and   flush_hash_page()
with the helper functions pte_get_hash_gslot() and   pte_set_hash_slot().

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash64_4k.c |   14 ++
 arch/powerpc/mm/hash_utils_64.c |   13 -
 2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..a1eebc1 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
   pte_t *ptep, unsigned long trap, unsigned long flags,
   int ssize, int subpg_prot)
 {
+   real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * need to add in 0x1 if it's a read-only user page
 */
rflags = htab_convert_pte_flags(new_pte);
+   rpte = __real_pte(__pte(old_pte), ptep);
 
if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
/*
 * There MIGHT be an HPTE for this pte
 */
-   hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+   unsigned long gslot = pte_get_hash_gslot(vpn, shift,
+   ssize, rpte, 0);
 
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
+   if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
   MMU_PAGE_4K, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
@@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
return -1;
}
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
-   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
-   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+   new_pte |= pte_set_hash_slot(ptep, rpte, 0, slot);
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index a40c7bc..0dff57b 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1617,23 +1617,18 @@ unsigned long pte_get_hash_gslot(unsigned long vpn, 
unsigned long shift,
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 unsigned long flags)
 {
-   unsigned long hash, index, shift, hidx, slot;
+   unsigned long index, shift, gslot;
int local = flags & HPTE_LOCAL_UPDATE;
 
DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
-   hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(pte, index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-   DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
+   gslot = pte_get_hash_gslot(vpn, shift, ssize, pte, index);
+   DBG_LOW(" sub %ld: gslot=%lx\n", index, gslot);
/*
 * We use same base page size and actual psize, because we don't
 * use these functions for hugepage
 */
-   mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
+   mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
 ssize, local);
} pte_iterate_hashed_end();
 
-- 
1.7.1



[PATCH 7/7] powerpc: capture the PTE format changes in the dump pte report

2017-09-08 Thread Ram Pai
The H_PAGE_F_SECOND and H_PAGE_F_GIX bits are no longer in the 64K main PTE.
Capture these changes in the dump pte report.

Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/dump_linuxpagetables.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
b/arch/powerpc/mm/dump_linuxpagetables.c
index c9282d2..0323dc4 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
.val= H_PAGE_4K_PFN,
.set= "4K_pfn",
}, {
-#endif
+#else /* CONFIG_PPC_64K_PAGES */
.mask   = H_PAGE_F_GIX,
.val= H_PAGE_F_GIX,
.set= "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
.val= H_PAGE_F_SECOND,
.set= "f_second",
}, {
+#endif /* CONFIG_PPC_64K_PAGES */
 #endif
.mask   = _PAGE_SPECIAL,
.val= _PAGE_SPECIAL,
-- 
1.7.1



[PATCH 00/25] powerpc: Memory Protection Keys

2017-09-08 Thread Ram Pai
Memory protection keys enable an application to protect its
address  space from inadvertent access or corruption
by itself.

These patches along with the pte-bit freeing patch series
enables the protection key feature on powerpc; 4k and 64k
hashpage kernels. A subsequent patch series that  changes
the generic  and  x86  code  will  expose memkey features
through sysfs and  provide  testcases  and  Documentation
updates.

Patches are based on powerpc -next branch.  All   patches
can be found at --
https://github.com/rampai/memorykeys.git memkey.v8

The overall idea:
-
 A process allocates a   key  and associates it with
 an  address  range  within its   address   space.
 The process  then  can  dynamically  set read/write
 permissions on  the   key   without  involving  the
 kernel. Any  code that  violates   the  permissions
 of  the address range, as defined by its associated
 key, will receive a segmentation fault.

This  patch series enables the feature on PPC64 HPTE
platform.

ISA3.0   section  5.7.13   describes  the  detailed
specifications.
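
[For illustration only -- a minimal userspace sketch, not part of this
series; it assumes the generic pkey_alloc(2)/pkey_mprotect(2) system
calls and their glibc wrappers (glibc >= 2.27) are available:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <stdlib.h>

	int main(void)
	{
		size_t len = 65536;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		/* allocate a key that starts out denying writes */
		int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
		if (pkey < 0)
			return 1;

		/* tag the range with the key; page permissions stay RW */
		pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey);

		char c = buf[0];	/* reads still succeed          */
		/* buf[0] = c; */	/* a write here would segfault  */

		pkey_set(pkey, 0);	/* relax the key -- no syscall  */
		buf[0] = c;		/* now allowed                  */

		pkey_free(pkey);
		return 0;
	}
]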


Highlevel view of the design:
---
When  an  application associates a key with an address
range,  program  the key in the Linux PTE.
When the MMU   detects  a page fault, allocate a hash
page  and   program  the  key into the HPTE.  And finally
when the  MMU detects  a  key  violation  due  to
invalid application  access, invoke the registered
signal   handler and provide the violated  key number.
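
[Again a sketch, not from this series; it assumes a libc that exposes
SEGV_PKUERR and the si_pkey member of siginfo_t, which carry the
violated key number to the handler:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <unistd.h>

	static void segv_handler(int sig, siginfo_t *info, void *uctx)
	{
		/* si_pkey is the key number reported on a key violation */
		if (info->si_code == SEGV_PKUERR)
			_exit(128 + info->si_pkey);
		_exit(1);
	}

	/* installed with sigaction(SIGSEGV, &sa, NULL), sa.sa_flags |= SA_SIGINFO */
]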


Testing:
---
This  patch  series has passed all the protection key
tests   available in   the selftests directory. The
tests are updated  to work on both x86 and powerpc.
NOTE: All the selftest related patches will be   part
of  a separate patch series.


History:
---
version v8:
(1) Contents of the AMR register withdrawn from
the siginfo  structure. Applications can always
read the AMR register.
(2) AMR/IAMR/UAMOR are  now  available  through 
ptrace system call. -- thanks to Thiago
(3) code  changes  to  handle legacy power cpus
that do not support execute-disable.
(4) incorporates many code improvement
suggestions.

version v7:
(1) refers to device tree property to enable
protection keys.
(2) adds 4K PTE support.
(3) fixes a couple of bugs noticed by Thiago
(4) decouples this patch series from   arch-
independent code. This patch series can
now stand by itself, with one kludge
patch(2).

version v6:
(1) selftest changes  are broken down into 20
incremental patches.
(2) A  separate   key  allocation  mask  that
includesPKEY_DISABLE_EXECUTE   is 
added for powerpc
(3) pkey feature  is enabled for 64K HPT case
only.  RPT and 4k HPT is disabled.
(4) Documentation   is   updated   to  better 
capture the semantics.
(5) introduced   arch_pkeys_enabled() to find
if an arch enables pkeys.  Correspond-
ing change the  logic   that displays
key value in smaps.
(6) code  rearranged  in many places based on
comments from   Dave Hansen,   Balbir,
Anshuman.   
(7) fixed  one bug where a bogus key could be
associated successfullyin
pkey_mprotect().

version v5:
(1) reverted back to the old  design -- store
the key in the pte,  instead of bypassing
it.  The v4  design  slowed down the hash
page path.
(2) detects key violation when kernel is told
to access user pages.
(3) further  refined the patches into smaller
consumable units
(4) page fault   handlers capture the faulting key
from the pte   instead of   the vma. This
closes  a  race  between  where  the  key 
update in the  vma and a key fault caused
by the key programmed in the pte.
(5) a key created   with access-denied should
also set it up to deny write. Fixed it.
(6) protection-key   number   is displayed in
smaps the x86 way.

version v4:
(1) patches no more depend on the pte bits
to program the hpte
-- comment by Balbir
(2) documentation updates
(3) fixed a bug in the selftest.
(4) 

[PATCH 01/25] powerpc: initial pkey plumbing

2017-09-08 Thread Ram Pai
Basic  plumbing  to   initialize  the   pkey  system.
Nothing is enabled yet. A later patch will enable it
once all the infrastructure is in place.

Signed-off-by: Ram Pai 
---
 arch/powerpc/Kconfig   |   16 +++
 arch/powerpc/include/asm/mmu_context.h |5 +++
 arch/powerpc/include/asm/pkeys.h   |   45 
 arch/powerpc/kernel/setup_64.c |4 +++
 arch/powerpc/mm/Makefile   |1 +
 arch/powerpc/mm/hash_utils_64.c|1 +
 arch/powerpc/mm/pkeys.c|   33 +++
 7 files changed, 105 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9fc3c0b..a4cd210 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -864,6 +864,22 @@ config SECCOMP
 
  If unsure, say Y. Only embedded should say N here.
 
+config PPC64_MEMORY_PROTECTION_KEYS
+   prompt "PowerPC Memory Protection Keys"
+   def_bool y
+   # Note: only available in 64-bit mode
+   depends on PPC64
+   select ARCH_USES_HIGH_VMA_FLAGS
+   select ARCH_HAS_PKEYS
+   ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/vm/protection-keys.txt
+
+ If unsure, say y.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 3095925..7badf29 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -141,5 +141,10 @@ static inline bool arch_vma_access_permitted(struct 
vm_area_struct *vma,
/* by default, allow everything */
return true;
 }
+
+#ifndef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define pkey_initialize()
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 000..c02305a
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,45 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+extern bool pkey_inited;
+extern bool pkey_execute_disable_support;
+#define ARCH_VM_PKEY_FLAGS 0
+
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+   return (pkey == 0);
+}
+
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+   return -1;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+   return -EINVAL;
+}
+
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+   return 0;
+}
+
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+   int prot, int pkey)
+{
+   return 0;
+}
+
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+   unsigned long init_val)
+{
+   return 0;
+}
+
+extern void pkey_initialize(void);
+#endif /*_ASM_PPC64_PKEYS_H */
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index b89c6aa..3b67014 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -316,6 +317,9 @@ void __init early_setup(unsigned long dt_ptr)
/* Initialize the hash table or TLB handling */
early_init_mmu();
 
+   /* initialize the key subsystem */
+   pkey_initialize();
+
/*
 * At this point, we can let interrupts switch to virtual mode
 * (the MMU has been setup), so adjust the MSR in the PACA to
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index fb844d2..927620a 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -43,3 +43,4 @@ obj-$(CONFIG_PPC_COPRO_BASE)  += copro_fault.o
 obj-$(CONFIG_SPAPR_TCE_IOMMU)  += mmu_context_iommu.o
 obj-$(CONFIG_PPC_PTDUMP)   += dump_linuxpagetables.o
 obj-$(CONFIG_PPC_HTDUMP)   += dump_hashpagetable.o
+obj-$(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) += pkeys.o
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0dff57b..67f62b5 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
new file mode 100644
index 000..418a05b
--- /dev/null
+++ b/arch/powerpc/mm/pkeys.c
@@ -0,0 +1,33 @@
+/*
+ * PowerPC Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ * Copyright (c) 2017, IBM Corporatio

[PATCH 02/25] powerpc: define an additional vma bit for protection keys.

2017-09-08 Thread Ram Pai
powerpc needs an additional vma bit to support 32 keys.
Till the additional vma bit lands in include/linux/mm.h
we have to define it in a powerpc-specific header file.
This is needed to get pkeys working on power.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/pkeys.h |   18 ++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index c02305a..44e01a2 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -3,6 +3,24 @@
 
 extern bool pkey_inited;
 extern bool pkey_execute_disable_support;
+
+/*
+ * powerpc needs an additional vma bit to support 32 keys.
+ * Till the additional vma bit lands in include/linux/mm.h
+ * we have to carry the hunk below. This is  needed to get
+ * pkeys working on power. -- Ram
+ */
+#ifndef VM_HIGH_ARCH_BIT_4
+#define VM_HIGH_ARCH_BIT_4 36
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
+#define VM_PKEY_BIT0   VM_HIGH_ARCH_0
+#define VM_PKEY_BIT1   VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2   VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3   VM_HIGH_ARCH_3
+#define VM_PKEY_BIT4   VM_HIGH_ARCH_4
+#endif
+
 #define ARCH_VM_PKEY_FLAGS 0
 
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
-- 
1.7.1



[PATCH 03/25] powerpc: track allocation status of all pkeys

2017-09-08 Thread Ram Pai
Total 32 keys are available on power7 and above. However
pkeys 0 and 1 are reserved. So effectively we have 30 pkeys.

On 4K kernels, we do not have 5 bits in the PTE to
represent all the keys; we only have 3 bits. Two of those
keys are reserved; pkey 0 and pkey 1. So effectively we
have 6 pkeys.

This patch keeps track of reserved keys, allocated keys
and keys that are currently free.

Also it adds skeletal functions and macros that the
architecture-independent code expects to be available.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |9 
 arch/powerpc/include/asm/mmu_context.h   |1 +
 arch/powerpc/include/asm/pkeys.h |   72 --
 arch/powerpc/mm/mmu_context_book3s64.c   |2 +
 arch/powerpc/mm/pkeys.c  |   28 
 5 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index c3b00e8..55950f4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -107,6 +107,15 @@ struct patb_entry {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
struct list_head iommu_group_mem_list;
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /*
+* Each bit represents one protection key.
+* bit set   -> key allocated
+* bit unset -> key available for allocation
+*/
+   u32 pkey_allocation_map;
+#endif
 } mm_context_t;
 
 /*
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 7badf29..c705a5d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -144,6 +144,7 @@ static inline bool arch_vma_access_permitted(struct 
vm_area_struct *vma,
 
 #ifndef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
 #define pkey_initialize()
+#define pkey_mm_init(mm)
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 44e01a2..133f8c4 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -3,6 +3,8 @@
 
 extern bool pkey_inited;
 extern bool pkey_execute_disable_support;
+extern int pkeys_total; /* total pkeys as per device tree */
+extern u32 initial_allocation_mask;/* bits set for reserved keys */
 
 /*
  * powerpc needs an additional vma bit to support 32 keys.
@@ -21,21 +23,76 @@
 #define VM_PKEY_BIT4   VM_HIGH_ARCH_4
 #endif
 
-#define ARCH_VM_PKEY_FLAGS 0
+#define arch_max_pkey()  pkeys_total
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
+   VM_PKEY_BIT3 | VM_PKEY_BIT4)
+
+#define pkey_alloc_mask(pkey) (0x1 << pkey)
+
+#define mm_pkey_allocation_map(mm) (mm->context.pkey_allocation_map)
+
+#define mm_set_pkey_allocated(mm, pkey) {  \
+   mm_pkey_allocation_map(mm) |= pkey_alloc_mask(pkey); \
+}
+
+#define mm_set_pkey_free(mm, pkey) {   \
+   mm_pkey_allocation_map(mm) &= ~pkey_alloc_mask(pkey);   \
+}
+
+#define mm_set_pkey_is_allocated(mm, pkey) \
+   (mm_pkey_allocation_map(mm) & pkey_alloc_mask(pkey))
+
+#define mm_set_pkey_is_reserved(mm, pkey) (initial_allocation_mask & \
+   pkey_alloc_mask(pkey))
 
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
-   return (pkey == 0);
+   /* a reserved key is never considered as 'explicitly allocated' */
+   return ((pkey < arch_max_pkey()) &&
+   !mm_set_pkey_is_reserved(mm, pkey) &&
+   mm_set_pkey_is_allocated(mm, pkey));
 }
 
+/*
+ * Returns a positive, 5-bit key on success, or -1 on failure.
+ */
 static inline int mm_pkey_alloc(struct mm_struct *mm)
 {
-   return -1;
+   /*
+* Note: this is the one and only place we make sure
+* that the pkey is valid as far as the hardware is
+* concerned.  The rest of the kernel trusts that
+* only good, valid pkeys come out of here.
+*/
+   u32 all_pkeys_mask = (u32)(~(0x0));
+   int ret;
+
+   if (!pkey_inited)
+   return -1;
+   /*
+* Are we out of pkeys?  We must handle this specially
+* because ffz() behavior is undefined if there are no
+* zeros.
+*/
+   if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+   return -1;
+
+   ret = ffz((u32)mm_pkey_allocation_map(mm));
+   mm_set_pkey_allocated(mm, ret);
+   return ret;
 }
 
 static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
 {
-   return -EINVAL;
+   if (!pkey_inited)
+   return -1;
+
+   if (!mm_pkey_is_allocated(mm, pkey))
+   return -EINVAL;
+
+   mm_set_pkey_free(mm, pkey);
+
+   return 0;
 }
 
 /*
@@ -59,5 +116,12 @@ static inline int arch_set_user_pkey_access(struct 
task_struct *tsk, int pkey,
return 0;
 }
 
+static inli
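
To make the bookkeeping above concrete, here is a minimal standalone model of
the allocation bitmap (illustrative userspace C with hypothetical names; the
kernel's ffz() is approximated by __builtin_ctz() on the inverted map):

/*
 * Standalone model of the pkey allocation bitmap (illustrative only).
 */
#include <stdint.h>
#include <stdio.h>

/* pkeys 0 and 1 are reserved, so their bits start out set */
static uint32_t allocation_map = 0x3;

static int model_pkey_alloc(void)
{
	if (allocation_map == ~0u)
		return -1;				/* no free keys left */

	int pkey = __builtin_ctz(~allocation_map);	/* first zero bit */
	allocation_map |= 1u << pkey;
	return pkey;
}

static void model_pkey_free(int pkey)
{
	allocation_map &= ~(1u << pkey);
}

int main(void)
{
	int a = model_pkey_alloc();			/* expect 2 */
	int b = model_pkey_alloc();			/* expect 3 */

	printf("allocated %d and %d\n", a, b);
	model_pkey_free(a);
	printf("after free, next alloc gives %d\n", model_pkey_alloc());
	return 0;
}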

[PATCH 04/25] powerpc: helper function to read, write AMR, IAMR, UAMOR registers

2017-09-08 Thread Ram Pai
Implement helper functions to read and write the key-related
registers: AMR, IAMR, UAMOR.

The AMR register tracks the read/write permissions of a key.
The IAMR register tracks the execute permission of a key.
The UAMOR register enables and disables a key.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   31 ++
 1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b9aff51..73ed52c 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -438,6 +438,37 @@ static inline void huge_ptep_set_wrprotect(struct 
mm_struct *mm,
pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
+#include 
+static inline u64 read_amr(void)
+{
+   return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+   mtspr(SPRN_AMR, value);
+}
+extern bool pkey_execute_disable_support;
+static inline u64 read_iamr(void)
+{
+   if (pkey_execute_disable_support)
+   return mfspr(SPRN_IAMR);
+   else
+   return 0x0UL;
+}
+static inline void write_iamr(u64 value)
+{
+   if (pkey_execute_disable_support)
+   mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+   return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+   mtspr(SPRN_UAMOR, value);
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
   unsigned long addr, pte_t *ptep)
-- 
1.7.1



[PATCH 05/25] powerpc: helper functions to initialize AMR, IAMR and UAMOR registers

2017-09-08 Thread Ram Pai
Introduce helper functions that can initialize the bits in the AMR,
IAMR and UAMOR registers that correspond to the given pkey.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/pkeys.h |1 +
 arch/powerpc/mm/pkeys.c  |   46 ++
 2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 133f8c4..5a83ed7 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -26,6 +26,7 @@
 #define arch_max_pkey()  pkeys_total
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
VM_PKEY_BIT3 | VM_PKEY_BIT4)
+#define AMR_BITS_PER_PKEY 2
 
 #define pkey_alloc_mask(pkey) (0x1 << pkey)
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index ebc9e84..178aa33 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -59,3 +59,49 @@ void __init pkey_initialize(void)
for (i = 2; i < (pkeys_total - os_reserved); i++)
initial_allocation_mask &= ~(0x1<

[PATCH 06/25] powerpc: cleanup AMR, IAMR when a key is allocated or freed

2017-09-08 Thread Ram Pai
Clean up the bits corresponding to a key in the AMR and IAMR
registers when the key is newly allocated/activated or is freed.
We don't want residual bits to cause the hardware to enforce
unintended behavior when the key is activated or freed.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/pkeys.h |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 5a83ed7..53bf13b 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -54,6 +54,8 @@ static inline bool mm_pkey_is_allocated(struct mm_struct *mm, 
int pkey)
mm_set_pkey_is_allocated(mm, pkey));
 }
 
+extern void __arch_activate_pkey(int pkey);
+extern void __arch_deactivate_pkey(int pkey);
 /*
  * Returns a positive, 5-bit key on success, or -1 on failure.
  */
@@ -80,6 +82,12 @@ static inline int mm_pkey_alloc(struct mm_struct *mm)
 
ret = ffz((u32)mm_pkey_allocation_map(mm));
mm_set_pkey_allocated(mm, ret);
+
+   /*
+* enable the key in the hardware
+*/
+   if (ret > 0)
+   __arch_activate_pkey(ret);
return ret;
 }
 
@@ -91,6 +99,10 @@ static inline int mm_pkey_free(struct mm_struct *mm, int 
pkey)
if (!mm_pkey_is_allocated(mm, pkey))
return -EINVAL;
 
+   /*
+* Disable the key in the hardware
+*/
+   __arch_deactivate_pkey(pkey);
mm_set_pkey_free(mm, pkey);
 
return 0;
-- 
1.7.1



[PATCH 07/25] powerpc: implementation for arch_set_user_pkey_access()

2017-09-08 Thread Ram Pai
This patch provides the detailed implementation for
a user to allocate a key and enable it in the hardware.

It provides the plumbing, but it cannot be used until
the system call is implemented. The next patch will
do so.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/pkeys.h |9 -
 arch/powerpc/mm/pkeys.c  |   28 
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 53bf13b..7fd48a4 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -24,6 +24,9 @@
 #endif
 
 #define arch_max_pkey()  pkeys_total
+#define AMR_RD_BIT 0x1UL
+#define AMR_WR_BIT 0x2UL
+#define IAMR_EX_BIT 0x1UL
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
VM_PKEY_BIT3 | VM_PKEY_BIT4)
 #define AMR_BITS_PER_PKEY 2
@@ -123,10 +126,14 @@ static inline int arch_override_mprotect_pkey(struct 
vm_area_struct *vma,
return 0;
 }
 
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+   unsigned long init_val);
 static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
-   return 0;
+   if (!pkey_inited)
+   return -EINVAL;
+   return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
 static inline void pkey_mm_init(struct mm_struct *mm)
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 178aa33..cc5be6a 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -12,6 +12,7 @@
  * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
  * more details.
  */
+#include 
 #include /* PKEY_*   */
 
 bool pkey_inited;
@@ -63,6 +64,11 @@ void __init pkey_initialize(void)
 #define PKEY_REG_BITS (sizeof(u64)*8)
 #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
 
+static bool is_pkey_enabled(int pkey)
+{
+   return !!(read_uamor() & (0x3ul << pkeyshift(pkey)));
+}
+
 static inline void init_amr(int pkey, u8 init_bits)
 {
u64 new_amr_bits = (((u64)init_bits & 0x3UL) << pkeyshift(pkey));
@@ -105,3 +111,25 @@ void __arch_deactivate_pkey(int pkey)
 {
pkey_status_change(pkey, false);
 }
+
+/*
+ * set the access right in AMR IAMR and UAMOR register
+ * for @pkey to that specified in @init_val.
+ */
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+   unsigned long init_val)
+{
+   u64 new_amr_bits = 0x0ul;
+
+   if (!is_pkey_enabled(pkey))
+   return -EINVAL;
+
+   /* Set the bits we need in AMR:  */
+   if (init_val & PKEY_DISABLE_ACCESS)
+   new_amr_bits |= AMR_RD_BIT | AMR_WR_BIT;
+   else if (init_val & PKEY_DISABLE_WRITE)
+   new_amr_bits |= AMR_WR_BIT;
+
+   init_amr(pkey, new_amr_bits);
+   return 0;
+}
-- 
1.7.1
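
The AMR packs two bits per key, with key 0 occupying the most significant
pair. A minimal standalone sketch of the shift arithmetic used by
pkeyshift()/init_amr() above (illustrative only, not kernel code):

/* Standalone model of the 2-bits-per-key AMR layout (illustrative only). */
#include <stdint.h>
#include <stdio.h>

#define AMR_BITS_PER_PKEY	2
#define PKEY_REG_BITS		(sizeof(uint64_t) * 8)
#define pkeyshift(pkey)		(PKEY_REG_BITS - (((pkey) + 1) * AMR_BITS_PER_PKEY))

#define AMR_RD_BIT		0x1ULL
#define AMR_WR_BIT		0x2ULL

int main(void)
{
	uint64_t amr = 0;
	int pkey = 2;

	/* deny writes (but not reads) through key 2 */
	amr |= AMR_WR_BIT << pkeyshift(pkey);

	printf("pkey %d sits at bits %zu-%zu of the AMR, AMR = 0x%016llx\n",
	       pkey, pkeyshift(pkey), pkeyshift(pkey) + 1,
	       (unsigned long long)amr);
	return 0;
}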



[PATCH 08/25] powerpc: sys_pkey_alloc() and sys_pkey_free() system calls

2017-09-08 Thread Ram Pai
Finally this patch provides the ability for a process to
allocate and free a protection key.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/systbl.h  |2 ++
 arch/powerpc/include/asm/unistd.h  |4 +---
 arch/powerpc/include/uapi/asm/unistd.h |2 ++
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 1c94708..22dd776 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -388,3 +388,5 @@
 COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
+SYSCALL(pkey_alloc)
+SYSCALL(pkey_free)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 9ba11db..e0273bc 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,13 +12,11 @@
 #include 
 
 
-#define NR_syscalls	384
+#define NR_syscalls	386
 
 #define __NR__exit __NR_exit
 
 #define __IGNORE_pkey_mprotect
-#define __IGNORE_pkey_alloc
-#define __IGNORE_pkey_free
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index b85f142..7993a07 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -394,5 +394,7 @@
 #define __NR_pwritev2  381
 #define __NR_kexec_file_load   382
 #define __NR_statx 383
+#define __NR_pkey_alloc384
+#define __NR_pkey_free 385
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
1.7.1
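
Until libc grows wrappers, the new calls can be exercised with raw
syscall(2). A minimal sketch, assuming the syscall numbers added by this
patch (384/385 on powerpc):

/* Exercise the new syscalls directly; sketch only. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pkey_alloc
#define __NR_pkey_alloc	384	/* numbers from this patch (powerpc) */
#define __NR_pkey_free	385
#endif

int main(void)
{
	long pkey = syscall(__NR_pkey_alloc, 0UL, 0UL);	/* flags, init rights */

	if (pkey < 0) {
		perror("pkey_alloc");
		return 1;
	}
	printf("allocated pkey %ld\n", pkey);

	if (syscall(__NR_pkey_free, (unsigned long)pkey) < 0)
		perror("pkey_free");
	return 0;
}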



[PATCH 09/25] powerpc: ability to create execute-disabled pkeys

2017-09-08 Thread Ram Pai
powerpc has hardware support to disable execute on a pkey.
This patch enables the ability to create execute-disabled
keys.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/uapi/asm/mman.h |6 ++
 arch/powerpc/mm/pkeys.c  |   16 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/mman.h 
b/arch/powerpc/include/uapi/asm/mman.h
index ab45cc2..f272b09 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -45,4 +45,10 @@
 #define MAP_HUGE_1GB   (30 << MAP_HUGE_SHIFT)  /* 1GB   HugeTLB Page */
 #define MAP_HUGE_16GB  (34 << MAP_HUGE_SHIFT)  /* 16GB  HugeTLB Page */
 
+/* override any generic PKEY Permission defines */
+#define PKEY_DISABLE_EXECUTE   0x4
+#undef PKEY_ACCESS_MASK
+#define PKEY_ACCESS_MASK   (PKEY_DISABLE_ACCESS |\
+   PKEY_DISABLE_WRITE  |\
+   PKEY_DISABLE_EXECUTE)
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index cc5be6a..2282864 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -24,6 +24,14 @@ void __init pkey_initialize(void)
 {
int os_reserved, i;
 
+   /*
+* we define PKEY_DISABLE_EXECUTE in addition to the arch-neutral
+* generic defines for PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE.
+* Ensure that the bits are distinct.
+*/
+   BUILD_BUG_ON(PKEY_DISABLE_EXECUTE &
+(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
/* disable the pkey system till everything
 * is in place. A patch further down the
 * line will enable it.
@@ -120,10 +128,18 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
unsigned long init_val)
 {
u64 new_amr_bits = 0x0ul;
+   u64 new_iamr_bits = 0x0ul;
 
if (!is_pkey_enabled(pkey))
return -EINVAL;
 
+   if ((init_val & PKEY_DISABLE_EXECUTE)) {
+   if (!pkey_execute_disable_support)
+   return -EINVAL;
+   new_iamr_bits |= IAMR_EX_BIT;
+   }
+   init_iamr(pkey, new_iamr_bits);
+
/* Set the bits we need in AMR:  */
if (init_val & PKEY_DISABLE_ACCESS)
new_amr_bits |= AMR_RD_BIT | AMR_WR_BIT;
-- 
1.7.1
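
A minimal sketch of requesting an execute-disabled key from userspace,
assuming the raw syscall number from earlier in the series and the
PKEY_DISABLE_EXECUTE value (0x4) defined here:

/* Sketch: ask for a key that additionally denies execution. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pkey_alloc
#define __NR_pkey_alloc		384	/* powerpc, from this series */
#endif
#define PKEY_DISABLE_EXECUTE	0x4	/* value defined by this patch */

int main(void)
{
	long pkey = syscall(__NR_pkey_alloc, 0UL,
			    (unsigned long)PKEY_DISABLE_EXECUTE);

	if (pkey < 0)
		perror("pkey_alloc(PKEY_DISABLE_EXECUTE)");
	else
		printf("execute-disabled pkey: %ld\n", pkey);
	return 0;
}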



[PATCH 10/25] powerpc: store and restore the pkey state across context switches

2017-09-08 Thread Ram Pai
Store and restore the AMR, IAMR and UAMOR register state of the task
before scheduling out and after scheduling in, respectively.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/pkeys.h |4 +++
 arch/powerpc/include/asm/processor.h |5 
 arch/powerpc/kernel/process.c|   10 
 arch/powerpc/mm/pkeys.c  |   39 ++
 4 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7fd48a4..78c5362 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -143,5 +143,9 @@ static inline void pkey_mm_init(struct mm_struct *mm)
mm_pkey_allocation_map(mm) = initial_allocation_mask;
 }
 
+extern void thread_pkey_regs_save(struct thread_struct *thread);
+extern void thread_pkey_regs_restore(struct thread_struct *new_thread,
+   struct thread_struct *old_thread);
+extern void thread_pkey_regs_init(struct thread_struct *thread);
 extern void pkey_initialize(void);
 #endif /*_ASM_PPC64_PKEYS_H */
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index fab7ff8..de9d9ba 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -309,6 +309,11 @@ struct thread_struct {
struct thread_vr_state ckvr_state; /* Checkpointed VR state */
unsigned long   ckvrsave; /* Checkpointed VRSAVE */
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   unsigned long   amr;
+   unsigned long   iamr;
+   unsigned long   uamor;
+#endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
void*   kvm_shadow_vcpu; /* KVM internal data */
 #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index a0c74bb..ba80002 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1085,6 +1086,9 @@ static inline void save_sprs(struct thread_struct *t)
t->tar = mfspr(SPRN_TAR);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   thread_pkey_regs_save(t);
+#endif
 }
 
 static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1120,6 +1124,9 @@ static inline void restore_sprs(struct thread_struct 
*old_thread,
mtspr(SPRN_TAR, new_thread->tar);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   thread_pkey_regs_restore(new_thread, old_thread);
+#endif
 }
 
 #ifdef CONFIG_PPC_BOOK3S_64
@@ -1705,6 +1712,9 @@ void start_thread(struct pt_regs *regs, unsigned long 
start, unsigned long sp)
current->thread.tm_tfiar = 0;
current->thread.load_tm = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   thread_pkey_regs_init(¤t->thread);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL(start_thread);
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 2282864..7cd1be4 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -149,3 +149,42 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
init_amr(pkey, new_amr_bits);
return 0;
 }
+
+void thread_pkey_regs_save(struct thread_struct *thread)
+{
+   if (!pkey_inited)
+   return;
+
+   /* @TODO skip saving any registers if the thread
+* has not used any keys yet.
+*/
+
+   thread->amr = read_amr();
+   thread->iamr = read_iamr();
+   thread->uamor = read_uamor();
+}
+
+void thread_pkey_regs_restore(struct thread_struct *new_thread,
+   struct thread_struct *old_thread)
+{
+   if (!pkey_inited)
+   return;
+
+   /* @TODO just reset uamor to zero if the new_thread
+* has not used any keys yet.
+*/
+
+   if (old_thread->amr != new_thread->amr)
+   write_amr(new_thread->amr);
+   if (old_thread->iamr != new_thread->iamr)
+   write_iamr(new_thread->iamr);
+   if (old_thread->uamor != new_thread->uamor)
+   write_uamor(new_thread->uamor);
+}
+
+void thread_pkey_regs_init(struct thread_struct *thread)
+{
+   write_amr(0x0ul);
+   write_iamr(0x0ul);
+   write_uamor(0x0ul);
+}
-- 
1.7.1



[PATCH 11/25] powerpc: introduce execute-only pkey

2017-09-08 Thread Ram Pai
This patch provides the implementation of the execute-only pkey.
The architecture-independent layer expects the arch-dependent
layer to support the ability to create and enable a special
key which has execute-only permission.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |1 +
 arch/powerpc/include/asm/pkeys.h |9 -
 arch/powerpc/mm/pkeys.c  |   57 ++
 3 files changed, 66 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 55950f4..ee18ba0 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -115,6 +115,7 @@ struct patb_entry {
 * bit unset -> key available for allocation
 */
u32 pkey_allocation_map;
+   s16 execute_only_pkey; /* key holding execute-only protection */
 #endif
 } mm_context_t;
 
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 78c5362..0cf115f 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -115,11 +115,16 @@ static inline int mm_pkey_free(struct mm_struct *mm, int 
pkey)
  * Try to dedicate one of the protection keys to be used as an
  * execute-only protection key.
  */
+extern int __execute_only_pkey(struct mm_struct *mm);
 static inline int execute_only_pkey(struct mm_struct *mm)
 {
-   return 0;
+   if (!pkey_inited || !pkey_execute_disable_support)
+   return -1;
+
+   return __execute_only_pkey(mm);
 }
 
+
 static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
int prot, int pkey)
 {
@@ -141,6 +146,8 @@ static inline void pkey_mm_init(struct mm_struct *mm)
if (!pkey_inited)
return;
mm_pkey_allocation_map(mm) = initial_allocation_mask;
+   /* -1 means unallocated or invalid */
+   mm->context.execute_only_pkey = -1;
 }
 
 extern void thread_pkey_regs_save(struct thread_struct *thread);
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 7cd1be4..8a24983 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -188,3 +188,60 @@ void thread_pkey_regs_init(struct thread_struct *thread)
write_iamr(0x0ul);
write_uamor(0x0ul);
 }
+
+static inline bool pkey_allows_readwrite(int pkey)
+{
+   int pkey_shift = pkeyshift(pkey);
+
+   if (!(read_uamor() & (0x3UL << pkey_shift)))
+   return true;
+
+   return !(read_amr() & ((AMR_RD_BIT|AMR_WR_BIT) << pkey_shift));
+}
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+   bool need_to_set_mm_pkey = false;
+   int execute_only_pkey = mm->context.execute_only_pkey;
+   int ret;
+
+   /* Do we need to assign a pkey for mm's execute-only maps? */
+   if (execute_only_pkey == -1) {
+   /* Go allocate one to use, which might fail */
+   execute_only_pkey = mm_pkey_alloc(mm);
+   if (execute_only_pkey < 0)
+   return -1;
+   need_to_set_mm_pkey = true;
+   }
+
+   /*
+* We do not want to go through the relatively costly
+* dance to set AMR if we do not need to.  Check it
+* first and assume that if the execute-only pkey is
+* readwrite-disabled then we do not have to set it
+* ourselves.
+*/
+   if (!need_to_set_mm_pkey &&
+   !pkey_allows_readwrite(execute_only_pkey))
+   return execute_only_pkey;
+
+   /*
+* Set up AMR so that it denies access for everything
+* other than execution.
+*/
+   ret = __arch_set_user_pkey_access(current, execute_only_pkey,
+   (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+   /*
+* If the AMR-set operation failed somehow, just return
+* 0 and effectively disable execute-only support.
+*/
+   if (ret) {
+   mm_set_pkey_free(mm, execute_only_pkey);
+   return -1;
+   }
+
+   /* We got one, store it and use it from here on out */
+   if (need_to_set_mm_pkey)
+   mm->context.execute_only_pkey = execute_only_pkey;
+   return execute_only_pkey;
+}
-- 
1.7.1
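
No new user-visible API is involved: a plain PROT_EXEC-only mapping is what
causes the kernel to pick the execute-only pkey. A minimal sketch of how
userspace ends up using it:

/* Sketch: the kernel assigns the execute-only pkey behind the scenes
 * when a mapping ends up with PROT_EXEC and nothing else. */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long ps = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, ps, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* ... copy code into p here ... */
	if (mprotect(p, ps, PROT_EXEC) == 0)
		printf("%p is now execute-only; loads/stores should fault\n", p);
	return 0;
}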



[PATCH 12/25] powerpc: ability to associate pkey to a vma

2017-09-08 Thread Ram Pai
Arch-independent code expects the arch to map
a pkey into the vma's protection bit setting.
This patch provides that ability.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mman.h  |8 +++-
 arch/powerpc/include/asm/pkeys.h |   18 ++
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..067eec2 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -22,7 +23,12 @@
 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
unsigned long pkey)
 {
-   return (prot & PROT_SAO) ? VM_SAO : 0;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (((prot & PROT_SAO) ? VM_SAO : 0) |
+   pkey_to_vmflag_bits(pkey));
+#else
+   return ((prot & PROT_SAO) ? VM_SAO : 0);
+#endif
 }
 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0cf115f..f13e913 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -23,6 +23,24 @@
 #define VM_PKEY_BIT4   VM_HIGH_ARCH_4
 #endif
 
+/* override any generic PKEY Permission defines */
+#define PKEY_DISABLE_EXECUTE   0x4
+#define PKEY_ACCESS_MASK   (PKEY_DISABLE_ACCESS |\
+   PKEY_DISABLE_WRITE  |\
+   PKEY_DISABLE_EXECUTE)
+
+static inline u64 pkey_to_vmflag_bits(u16 pkey)
+{
+   if (!pkey_inited)
+   return 0x0UL;
+
+   return (((pkey & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) |
+   ((pkey & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |
+   ((pkey & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |
+   ((pkey & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) |
+   ((pkey & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL));
+}
+
 #define arch_max_pkey()  pkeys_total
 #define AMR_RD_BIT 0x1UL
 #define AMR_WR_BIT 0x2UL
-- 
1.7.1
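
Since VM_PKEY_BIT0..4 are the contiguous high VMA flag bits (bit 32 upward,
per VM_HIGH_ARCH_BIT_0 in include/linux/mm.h), the scatter above is
equivalent to shifting the 5-bit key into place. A tiny standalone model
(illustrative only):

/* Tiny model of pkey_to_vmflag_bits()/vma_pkey() (illustrative only). */
#include <stdint.h>
#include <stdio.h>

#define VM_PKEY_SHIFT	32	/* VM_HIGH_ARCH_BIT_0 in include/linux/mm.h */

static uint64_t model_pkey_to_vmflag_bits(uint16_t pkey)
{
	/* the five key bits land contiguously starting at VM_PKEY_SHIFT */
	return (uint64_t)(pkey & 0x1f) << VM_PKEY_SHIFT;
}

static uint16_t model_vma_pkey(uint64_t vm_flags)
{
	return (vm_flags >> VM_PKEY_SHIFT) & 0x1f;
}

int main(void)
{
	uint64_t flags = model_pkey_to_vmflag_bits(5);

	printf("pkey 5 -> vm_flags 0x%llx -> pkey %u\n",
	       (unsigned long long)flags, model_vma_pkey(flags));
	return 0;
}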



[PATCH 13/25] powerpc: implementation for arch_override_mprotect_pkey()

2017-09-08 Thread Ram Pai
Arch-independent code calls arch_override_mprotect_pkey()
to return a pkey that best matches the requested protection.

This patch provides the implementation.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mmu_context.h |5 +++
 arch/powerpc/include/asm/pkeys.h   |   17 ++-
 arch/powerpc/mm/pkeys.c|   47 
 3 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index c705a5d..8e5a87e 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -145,6 +145,11 @@ static inline bool arch_vma_access_permitted(struct 
vm_area_struct *vma,
 #ifndef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
 #define pkey_initialize()
 #define pkey_mm_init(mm)
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+   return 0;
+}
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index f13e913..d2fffef 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -41,6 +41,16 @@ static inline u64 pkey_to_vmflag_bits(u16 pkey)
((pkey & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL));
 }
 
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
+   VM_PKEY_BIT3 | VM_PKEY_BIT4)
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+   if (!pkey_inited)
+   return 0;
+   return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
+}
+
 #define arch_max_pkey()  pkeys_total
 #define AMR_RD_BIT 0x1UL
 #define AMR_WR_BIT 0x2UL
@@ -142,11 +152,14 @@ static inline int execute_only_pkey(struct mm_struct *mm)
return __execute_only_pkey(mm);
 }
 
-
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+   int prot, int pkey);
 static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
int prot, int pkey)
 {
-   return 0;
+   if (!pkey_inited)
+   return 0;
+   return __arch_override_mprotect_pkey(vma, prot, pkey);
 }
 
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 8a24983..fb1a76a 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -245,3 +245,50 @@ int __execute_only_pkey(struct mm_struct *mm)
mm->context.execute_only_pkey = execute_only_pkey;
return execute_only_pkey;
 }
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+   /* Do this check first since the vm_flags should be hot */
+   if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+   return false;
+
+   return (vma_pkey(vma) == vma->vm_mm->context.execute_only_pkey);
+}
+
+/*
+ * This should only be called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
+   int pkey)
+{
+   /*
+* Is this an mprotect_pkey() call?  If so, never
+* override the value that came from the user.
+*/
+   if (pkey != -1)
+   return pkey;
+
+   /*
+* If the currently associated pkey is execute-only,
+* but the requested protection requires read or write,
+* move it back to the default pkey.
+*/
+   if (vma_is_pkey_exec_only(vma) &&
+   (prot & (PROT_READ|PROT_WRITE)))
+   return 0;
+
+   /*
+* the requested protection is execute-only. Hence
+* lets use a execute-only pkey.
+*/
+   if (prot == PROT_EXEC) {
+   pkey = execute_only_pkey(vma->vm_mm);
+   if (pkey > 0)
+   return pkey;
+   }
+
+   /*
+* nothing to override.
+*/
+   return vma_pkey(vma);
+}
-- 
1.7.1
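
A condensed standalone model of the override policy above (illustrative; the
vma state is reduced to plain parameters and the failure path of
execute_only_pkey() is ignored):

/* Condensed model of __arch_override_mprotect_pkey() (illustrative only). */
#include <stdio.h>

#define PROT_READ  0x1
#define PROT_WRITE 0x2
#define PROT_EXEC  0x4

static int model_override(int explicit_pkey, int prot,
			  int vma_is_exec_only, int vma_pkey, int exec_only_pkey)
{
	if (explicit_pkey != -1)
		return explicit_pkey;	/* pkey_mprotect(): honour the caller */
	if (vma_is_exec_only && (prot & (PROT_READ | PROT_WRITE)))
		return 0;		/* fall back to the default key */
	if (prot == PROT_EXEC)
		return exec_only_pkey;	/* plain mprotect(PROT_EXEC) */
	return vma_pkey;		/* keep whatever the vma already had */
}

int main(void)
{
	printf("%d\n", model_override(-1, PROT_EXEC, 0, 3, 7));	/* 7 */
	printf("%d\n", model_override(-1, PROT_READ, 1, 7, 7));	/* 0 */
	printf("%d\n", model_override(5, PROT_READ, 0, 3, 7));	/* 5 */
	return 0;
}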



[PATCH 14/25] powerpc: map vma key-protection bits to pte key bits.

2017-09-08 Thread Ram Pai
Map the key protection bits of the vma to the pkey bits in
the PTE.

The PTE bits used for pkeys are 3, 4, 5, 6 and 57. The first
four bits are the same four bits that were freed up initially
in this patch series. remember? :-) Without those four bits
this patch wouldn't be possible.

BUT, on 4K kernels, bits 3 and 4 could not be freed up. remember?
Hence we have to be satisfied with 5, 6 and 57.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   25 -
 arch/powerpc/include/asm/mman.h  |8 
 arch/powerpc/include/asm/pkeys.h |   12 
 3 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 73ed52c..5935d4e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -38,6 +38,7 @@
 #define _RPAGE_RSV20x0800UL
 #define _RPAGE_RSV30x0400UL
 #define _RPAGE_RSV40x0200UL
+#define _RPAGE_RSV50x00040UL
 
 #define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
 #define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
@@ -57,6 +58,25 @@
 /* Max physical address bit as per radix table */
 #define _RPAGE_PA_MAX  57
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_PPC_64K_PAGES
+#define H_PAGE_PKEY_BIT0   _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1   _RPAGE_RSV2
+#else /* CONFIG_PPC_64K_PAGES */
+#define H_PAGE_PKEY_BIT0   0 /* _RPAGE_RSV1 is not available */
+#define H_PAGE_PKEY_BIT1   0 /* _RPAGE_RSV2 is not available */
+#endif /* CONFIG_PPC_64K_PAGES */
+#define H_PAGE_PKEY_BIT2   _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3   _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4   _RPAGE_RSV5
+#else /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0   0
+#define H_PAGE_PKEY_BIT1   0
+#define H_PAGE_PKEY_BIT2   0
+#define H_PAGE_PKEY_BIT3   0
+#define H_PAGE_PKEY_BIT4   0
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 /*
  * Max physical address bit we will use for now.
  *
@@ -120,13 +140,16 @@
 #define _PAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
 _PAGE_ACCESSED | _PAGE_SPECIAL | _PAGE_PTE |   \
 _PAGE_SOFT_DIRTY)
+
+#define H_PAGE_PKEY  (H_PAGE_PKEY_BIT0 | H_PAGE_PKEY_BIT1 | H_PAGE_PKEY_BIT2 | \
+   H_PAGE_PKEY_BIT3 | H_PAGE_PKEY_BIT4)
 /*
  * Mask of bits returned by pte_pgprot()
  */
 #define PAGE_PROT_BITS  (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
 H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
 _PAGE_READ | _PAGE_WRITE |  _PAGE_DIRTY | _PAGE_EXEC | \
-_PAGE_SOFT_DIRTY)
+_PAGE_SOFT_DIRTY | H_PAGE_PKEY)
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 067eec2..3f7220f 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -32,12 +32,20 @@ static inline unsigned long arch_calc_vm_prot_bits(unsigned 
long prot,
 }
 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
+
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (vm_flags & VM_SAO) ?
+   __pgprot(_PAGE_SAO | vmflag_to_page_pkey_bits(vm_flags)) :
+   __pgprot(0 | vmflag_to_page_pkey_bits(vm_flags));
+#else
return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
+#endif
 }
 #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
+
 static inline bool arch_validate_prot(unsigned long prot)
 {
if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index d2fffef..0d2488a 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -41,6 +41,18 @@ static inline u64 pkey_to_vmflag_bits(u16 pkey)
((pkey & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL));
 }
 
+static inline u64 vmflag_to_page_pkey_bits(u64 vm_flags)
+{
+   if (!pkey_inited)
+   return 0x0UL;
+
+   return (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL));
+}
+
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PK

[PATCH 15/25] powerpc: sys_pkey_mprotect() system call

2017-09-08 Thread Ram Pai
This patch provides the ability for a process to
associate a pkey with an address range.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/systbl.h  |1 +
 arch/powerpc/include/asm/unistd.h  |4 +---
 arch/powerpc/include/uapi/asm/unistd.h |1 +
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 22dd776..b33b551 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@
 SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
+SYSCALL(pkey_mprotect)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index e0273bc..daf1ba9 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,12 +12,10 @@
 #include 
 
 
-#define NR_syscalls	386
+#define NR_syscalls	387
 
 #define __NR__exit __NR_exit
 
-#define __IGNORE_pkey_mprotect
-
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 7993a07..71ae45e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_statx 383
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
+#define __NR_pkey_mprotect 386
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
1.7.1
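
With pkey_alloc(), pkey_free() and pkey_mprotect() all wired up, the
end-to-end userspace flow looks roughly like the sketch below (raw syscall
numbers from this series; error handling trimmed):

/* End-to-end sketch (raw syscall numbers from this series, powerpc). */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#ifndef __NR_pkey_alloc
#define __NR_pkey_alloc		384
#define __NR_pkey_free		385
#define __NR_pkey_mprotect	386
#endif
#define PKEY_DISABLE_WRITE	0x2

int main(void)
{
	long ps = sysconf(_SC_PAGESIZE);
	long pkey = syscall(__NR_pkey_alloc, 0UL,
			    (unsigned long)PKEY_DISABLE_WRITE);
	char *p = mmap(NULL, ps, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (pkey < 0 || p == MAP_FAILED)
		return 1;

	/* tag the mapping with the key; writes are now denied for this thread */
	syscall(__NR_pkey_mprotect, p, (unsigned long)ps,
		(unsigned long)(PROT_READ | PROT_WRITE), (unsigned long)pkey);

	printf("page %p protected by pkey %ld\n", (void *)p, pkey);
	/* p[0] = 1;  would raise SIGSEGV with si_pkey == pkey */

	syscall(__NR_pkey_free, (unsigned long)pkey);
	return 0;
}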



[PATCH 16/25] powerpc: Program HPTE key protection bits

2017-09-08 Thread Ram Pai
Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |5 +
 arch/powerpc/include/asm/mmu_context.h|6 ++
 arch/powerpc/include/asm/pkeys.h  |   13 +
 arch/powerpc/mm/hash_utils_64.c   |1 +
 4 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 508275b..2e22357 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
 #define HPTE_R_PP0 ASM_CONST(0x8000)
 #define HPTE_R_TS  ASM_CONST(0x4000)
 #define HPTE_R_KEY_HI  ASM_CONST(0x3000)
+#define HPTE_R_KEY_BIT0ASM_CONST(0x2000)
+#define HPTE_R_KEY_BIT1ASM_CONST(0x1000)
 #define HPTE_R_RPN_SHIFT   12
 #define HPTE_R_RPN ASM_CONST(0x0000)
 #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000)
@@ -104,6 +106,9 @@
 #define HPTE_R_C   ASM_CONST(0x0080)
 #define HPTE_R_R   ASM_CONST(0x0100)
 #define HPTE_R_KEY_LO  ASM_CONST(0x0e00)
+#define HPTE_R_KEY_BIT2ASM_CONST(0x0800)
+#define HPTE_R_KEY_BIT3ASM_CONST(0x0400)
+#define HPTE_R_KEY_BIT4ASM_CONST(0x0200)
 #define HPTE_R_KEY (HPTE_R_KEY_LO | HPTE_R_KEY_HI)
 
 #define HPTE_V_1TB_SEG ASM_CONST(0x4000)
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 8e5a87e..04e9221 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -150,6 +150,12 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 {
return 0;
 }
+
+static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
+{
+   return 0x0UL;
+}
+
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0d2488a..cd3924c 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -67,6 +67,19 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 #define AMR_RD_BIT 0x1UL
 #define AMR_WR_BIT 0x2UL
 #define IAMR_EX_BIT 0x1UL
+
+static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
+{
+   if (!pkey_inited)
+   return 0x0UL;
+
+   return (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL));
+}
+
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
VM_PKEY_BIT3 | VM_PKEY_BIT4)
 #define AMR_BITS_PER_PKEY 2
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 67f62b5..a739a2d 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -232,6 +232,7 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags)
 */
rflags |= HPTE_R_M;
 
+   rflags |= pte_to_hpte_pkey_bits(pteflags);
return rflags;
 }
 
-- 
1.7.1



[PATCH 17/25] powerpc: helper to validate key-access permissions of a pte

2017-09-08 Thread Ram Pai
Add a helper function that checks whether read/write/execute is allowed
on the pte.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |4 +++
 arch/powerpc/include/asm/pkeys.h |   12 +++
 arch/powerpc/mm/pkeys.c  |   28 ++
 3 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5935d4e..bd244b3 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -492,6 +492,10 @@ static inline void write_uamor(u64 value)
mtspr(SPRN_UAMOR, value);
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+extern bool arch_pte_access_permitted(u64 pte, bool write, bool execute);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
   unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index cd3924c..50522a0 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -80,6 +80,18 @@ static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL));
 }
 
+static inline u16 pte_to_pkey_bits(u64 pteflags)
+{
+   if (!pkey_inited)
+   return 0x0UL;
+
+   return (((pteflags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0UL) |
+   ((pteflags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0UL));
+}
+
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
VM_PKEY_BIT3 | VM_PKEY_BIT4)
 #define AMR_BITS_PER_PKEY 2
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index fb1a76a..24589d9 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -292,3 +292,31 @@ int __arch_override_mprotect_pkey(struct vm_area_struct 
*vma, int prot,
 */
return vma_pkey(vma);
 }
+
+static bool pkey_access_permitted(int pkey, bool write, bool execute)
+{
+   int pkey_shift;
+   u64 amr;
+
+   if (!pkey)
+   return true;
+
+   pkey_shift = pkeyshift(pkey);
+   if (!(read_uamor() & (0x3UL << pkey_shift)))
+   return true;
+
+   if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
+   return true;
+
+   amr = read_amr(); /* delay reading amr until absolutely needed */
+   return ((!write && !(amr & (AMR_RD_BIT << pkey_shift))) ||
+   (write &&  !(amr & (AMR_WR_BIT << pkey_shift;
+}
+
+bool arch_pte_access_permitted(u64 pte, bool write, bool execute)
+{
+   if (!pkey_inited)
+   return true;
+   return pkey_access_permitted(pte_to_pkey_bits(pte),
+   write, execute);
+}
-- 
1.7.1
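
A standalone model of the permission check above (illustrative only: the
AMR/IAMR/UAMOR SPRs are replaced by plain variables and the execute path is
simplified):

/* Standalone model of pkey_access_permitted() (illustrative only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define AMR_RD_BIT	0x1ULL
#define AMR_WR_BIT	0x2ULL
#define IAMR_EX_BIT	0x1ULL
#define pkeyshift(pkey)	(64 - (((pkey) + 1) * 2))

static uint64_t amr, iamr, uamor;	/* stand-ins for the real SPRs */

static bool model_access_permitted(int pkey, bool write, bool execute)
{
	int shift = pkeyshift(pkey);

	if (!pkey)
		return true;			/* key 0 is never restricted */
	if (!(uamor & (0x3ULL << shift)))
		return true;			/* key not enabled in hardware */
	if (execute)
		return !(iamr & (IAMR_EX_BIT << shift));
	return write ? !(amr & (AMR_WR_BIT << shift))
		     : !(amr & (AMR_RD_BIT << shift));
}

int main(void)
{
	int pkey = 2;

	uamor |= 0x3ULL << pkeyshift(pkey);		/* enable key 2 */
	amr   |= AMR_WR_BIT << pkeyshift(pkey);		/* deny writes  */

	printf("read  allowed: %d\n", model_access_permitted(pkey, false, false));
	printf("write allowed: %d\n", model_access_permitted(pkey, true, false));
	return 0;
}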



[PATCH 18/25] powerpc: check key protection for user page access

2017-09-08 Thread Ram Pai
Make sure that the kernel does not access user pages without
checking their key-protection.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index bd244b3..d22bb4d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -494,6 +494,20 @@ static inline void write_uamor(u64 value)
 
 #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
 extern bool arch_pte_access_permitted(u64 pte, bool write, bool execute);
+
+#define pte_access_permitted(pte, write) \
+   (pte_present(pte) && \
+((!(write) || pte_write(pte)) && \
+ arch_pte_access_permitted(pte_val(pte), !!write, 0)))
+
+/*
+ * We store key in pmd for huge tlb pages. So need
+ * to check for key protection.
+ */
+#define pmd_access_permitted(pmd, write) \
+   (pmd_present(pmd) && \
+((!(write) || pmd_write(pmd)) && \
+ arch_pte_access_permitted(pmd_val(pmd), !!write, 0)))
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-- 
1.7.1



[PATCH 19/25] powerpc: implementation for arch_vma_access_permitted()

2017-09-08 Thread Ram Pai
This patch provides the implementation for
arch_vma_access_permitted(). It returns true if the
requested access is allowed by the pkey associated with the
vma.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mmu_context.h |5 +++-
 arch/powerpc/mm/pkeys.c|   43 
 2 files changed, 47 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 04e9221..9a56355 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -135,6 +135,10 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+   bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
 {
@@ -142,7 +146,6 @@ static inline bool arch_vma_access_permitted(struct 
vm_area_struct *vma,
return true;
 }
 
-#ifndef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
 #define pkey_initialize()
 #define pkey_mm_init(mm)
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 24589d9..21c3b42 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -320,3 +320,46 @@ bool arch_pte_access_permitted(u64 pte, bool write, bool 
execute)
return pkey_access_permitted(pte_to_pkey_bits(pte),
write, execute);
 }
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to AMR/IAMR for other
+ * processes or any way to tell *which * AMR/IAMR in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+   if (!current->mm)
+   return true;
+   /*
+* if the VMA is from another process, then AMR/IAMR has no
+* relevance and should not be enforced.
+*/
+   if (current->mm != vma->vm_mm)
+   return true;
+
+   return false;
+}
+
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+   bool write, bool execute, bool foreign)
+{
+   int pkey;
+
+   if (!pkey_inited)
+   return true;
+
+   /* allow access if the VMA is not one from this process */
+   if (foreign || vma_is_foreign(vma))
+   return true;
+
+   pkey = vma_pkey(vma);
+
+   if (!pkey)
+   return true;
+
+   return pkey_access_permitted(pkey, write, execute);
+}
-- 
1.7.1



[PATCH 20/25] powerpc: Handle exceptions caused by pkey violation

2017-09-08 Thread Ram Pai
Handle data and instruction exceptions caused by memory
protection keys.

The CPU will detect the key fault if the HPTE is already
programmed with the key.

However, if the HPTE is not hashed yet, a key fault will not
be detected by the hardware. The software will detect the
pkey violation in such a case.

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/fault.c |   37 -
 1 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 4797d08..a16bc43 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -145,6 +145,23 @@ static noinline int bad_area(struct pt_regs *regs, 
unsigned long address)
return __bad_area(regs, address, SEGV_MAPERR);
 }
 
+static int bad_page_fault_exception(struct pt_regs *regs, unsigned long 
address,
+   int si_code)
+{
+   int sig = SIGBUS;
+   int code = BUS_OBJERR;
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (si_code & DSISR_KEYFAULT) {
+   sig = SIGSEGV;
+   code = SEGV_PKUERR;
+   }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+   _exception(sig, regs, code, address);
+   return 0;
+}
+
 static int do_sigbus(struct pt_regs *regs, unsigned long address,
 unsigned int fault)
 {
@@ -391,11 +408,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
return 0;
 
if (unlikely(page_fault_is_bad(error_code))) {
-   if (is_user) {
-   _exception(SIGBUS, regs, BUS_OBJERR, address);
-   return 0;
-   }
-   return SIGBUS;
+   if (!is_user)
+   return SIGBUS;
+   return bad_page_fault_exception(regs, address, error_code);
}
 
/* Additional sanity check(s) */
@@ -492,6 +507,18 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (unlikely(access_error(is_write, is_exec, vma)))
return bad_area(regs, address);
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+   is_exec, 0))
+   return __bad_area(regs, address, SEGV_PKUERR);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
+   /* handle_mm_fault() needs to know if it's an instruction access
+* fault.
+*/
+   if (is_exec)
+   flags |= FAULT_FLAG_INSTRUCTION;
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
-- 
1.7.1



[PATCH 21/25] powerpc: introduce get_pte_pkey() helper

2017-09-08 Thread Ram Pai
The get_pte_pkey() helper returns the pkey associated with
an address for a given mm_struct.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |5 +
 arch/powerpc/mm/hash_utils_64.c   |   24 
 2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 2e22357..8716031 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -451,6 +451,11 @@ extern int hash_page(unsigned long ea, unsigned long 
access, unsigned long trap,
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long 
vsid,
 pte_t *ptep, unsigned long trap, unsigned long flags,
 int ssize, unsigned int shift, unsigned int mmu_psize);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+u16 get_pte_pkey(struct mm_struct *mm, unsigned long address);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern int __hash_page_thp(unsigned long ea, unsigned long access,
   unsigned long vsid, pmd_t *pmdp, unsigned long trap,
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index a739a2d..5917d45 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1572,6 +1572,30 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+/*
+ * return the protection key associated with the given address
+ * and the mm_struct.
+ */
+u16 get_pte_pkey(struct mm_struct *mm, unsigned long address)
+{
+   pte_t *ptep;
+   u16 pkey = 0;
+   unsigned long flags;
+
+   if (!mm || !mm->pgd)
+   return 0;
+
+   local_irq_save(flags);
+   ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
+   if (ptep)
+   pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
+   local_irq_restore(flags);
+
+   return pkey;
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 static inline void tm_flush_hash_page(int local)
 {
-- 
1.7.1



[PATCH 22/25] powerpc: capture the violated protection key on fault

2017-09-08 Thread Ram Pai
Capture the protection key that got violated in the paca.
This value will later be used to inform the signal
handler.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/paca.h   |3 +++
 arch/powerpc/kernel/asm-offsets.c |5 +
 arch/powerpc/mm/fault.c   |   11 ++-
 3 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 04b60af..51c89c1 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -97,6 +97,9 @@ struct paca_struct {
struct dtl_entry *dispatch_log_end;
 #endif /* CONFIG_PPC_STD_MMU_64 */
u64 dscr_default;   /* per-CPU default DSCR */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   u16 paca_pkey;  /* exception causing pkey */
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
 #ifdef CONFIG_PPC_STD_MMU_64
/*
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 8cfb20e..361f0d4 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -241,6 +241,11 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   OFFSET(PACA_PKEY, paca_struct, paca_pkey);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a16bc43..ad31f6e 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -153,6 +153,7 @@ static int bad_page_fault_exception(struct pt_regs *regs, 
unsigned long address,
 
 #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
if (si_code & DSISR_KEYFAULT) {
+   get_paca()->paca_pkey = get_pte_pkey(current->mm, address);
sig = SIGSEGV;
code = SEGV_PKUERR;
}
@@ -509,8 +510,16 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
 #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
-   is_exec, 0))
+   is_exec, 0)) {
+   /*
+* The pgd-pud...pmd-pte tree may not have been fully set up.
+* Hence we cannot walk the tree to locate the pte and thus
+* the key. Hence let's use vma_pkey() to get the key; instead
+* of get_pte_pkey().
+*/
+   get_paca()->paca_pkey = vma_pkey(vma);
return __bad_area(regs, address, SEGV_PKUERR);
+   }
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
 
-- 
1.7.1



[PATCH 23/25] powerpc: Deliver SEGV signal on pkey violation

2017-09-08 Thread Ram Pai
The value of the pkey, whose protection got violated,
is made available in the si_pkey field of the siginfo structure.

Also keep the thread's pkey-register fields up to date.

Signed-off-by: Ram Pai 
---
 arch/powerpc/kernel/traps.c |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index ec74e20..f2a310d 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -265,6 +266,15 @@ void user_single_step_siginfo(struct task_struct *tsk,
info->si_addr = (void __user *)regs->nip;
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long 
addr)
+{
+   if (info->si_signo != SIGSEGV || si_code != SEGV_PKUERR)
+   return;
+   info->si_pkey = get_paca()->paca_pkey;
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
 {
siginfo_t info;
@@ -292,6 +302,18 @@ void _exception(int signr, struct pt_regs *regs, int code, 
unsigned long addr)
info.si_signo = signr;
info.si_code = code;
info.si_addr = (void __user *) addr;
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /*
+* Update the thread's pkey-related fields; core-dump
+* handlers and other subsystems depend on those values.
+*/
+   thread_pkey_regs_save(¤t->thread);
+   /* update the violated-key value */
+   fill_sig_info_pkey(code, &info, addr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
force_sig_info(signr, &info, current);
 }
 
-- 
1.7.1



[PATCH 24/25] powerpc/ptrace: Add memory protection key regset

2017-09-08 Thread Ram Pai
From: Thiago Jung Bauermann 

The AMR/IAMR/UAMOR are part of the program context.
Allow them to be accessed via ptrace and through core files.
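
For illustration only (not part of this patch), a debugger could fetch the
three registers through the new regset roughly as below. The 3 x u64 layout
(AMR, IAMR, UAMOR) follows ELF_NPKEY and pkey_get() in the patch below; the
0x110 value for NT_PPC_PKEY is an assumption for headers that predate the
(truncated) elf.h hunk.

#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef NT_PPC_PKEY
#define NT_PPC_PKEY 0x110	/* assumption; take the value from the patched elf.h */
#endif

int read_pkey_regs(pid_t pid)
{
	uint64_t regs[3];	/* AMR, IAMR, UAMOR, in that order */
	struct iovec iov = { .iov_base = regs, .iov_len = sizeof(regs) };

	if (ptrace(PTRACE_GETREGSET, pid, NT_PPC_PKEY, &iov) < 0) {
		perror("PTRACE_GETREGSET(NT_PPC_PKEY)");
		return -1;
	}
	printf("AMR=%#llx IAMR=%#llx UAMOR=%#llx\n",
	       (unsigned long long)regs[0], (unsigned long long)regs[1],
	       (unsigned long long)regs[2]);
	return 0;
}

Writing the AMR back goes through PTRACE_SETREGSET with the same note type;
as pkey_set() below shows, only the bits allowed by UAMOR take effect.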

Signed-off-by: Ram Pai 
Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/pkeys.h|5 +++
 arch/powerpc/include/uapi/asm/elf.h |1 +
 arch/powerpc/kernel/ptrace.c|   66 +++
 include/uapi/linux/elf.h|1 +
 4 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 50522a0..a0111de 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -209,6 +209,11 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
+static inline bool arch_pkeys_enabled(void)
+{
+   return pkey_inited;
+}
+
 static inline void pkey_mm_init(struct mm_struct *mm)
 {
if (!pkey_inited)
diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h
index b2c6fdd..923e6d5 100644
--- a/arch/powerpc/include/uapi/asm/elf.h
+++ b/arch/powerpc/include/uapi/asm/elf.h
@@ -96,6 +96,7 @@
 #define ELF_NTMSPRREG  3   /* include tfhar, tfiar, texasr */
 #define ELF_NEBB   3   /* includes ebbrr, ebbhr, bescr */
 #define ELF_NPMU   5   /* includes siar, sdar, sier, mmcr2, mmcr0 */
+#define ELF_NPKEY  3   /* includes amr, iamr, uamor */
 
 typedef unsigned long elf_greg_t64;
 typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG];
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 07cd22e..6a9d3ec 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define CREATE_TRACE_POINTS
@@ -1775,6 +1776,61 @@ static int pmu_set(struct task_struct *target,
return ret;
 }
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+static int pkey_active(struct task_struct *target,
+  const struct user_regset *regset)
+{
+   if (!arch_pkeys_enabled())
+   return -ENODEV;
+
+   return regset->n;
+}
+
+static int pkey_get(struct task_struct *target,
+   const struct user_regset *regset,
+   unsigned int pos, unsigned int count,
+   void *kbuf, void __user *ubuf)
+{
+   BUILD_BUG_ON(TSO(amr) + sizeof(unsigned long) != TSO(iamr));
+   BUILD_BUG_ON(TSO(iamr) + sizeof(unsigned long) != TSO(uamor));
+
+   if (!arch_pkeys_enabled())
+   return -ENODEV;
+
+   return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+  &target->thread.amr, 0,
+  ELF_NPKEY * sizeof(unsigned long));
+}
+
+static int pkey_set(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+   u64 new_amr;
+   int ret;
+
+   if (!arch_pkeys_enabled())
+   return -ENODEV;
+
+   /* Only the AMR can be set from userspace */
+   if (pos != 0 || count != sizeof(new_amr))
+   return -EINVAL;
+
+   ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
+&new_amr, 0, sizeof(new_amr));
+   if (ret)
+   return ret;
+
+   /* UAMOR determines which bits of the AMR can be set from userspace. */
+   target->thread.amr = (new_amr & target->thread.uamor) |
+   (target->thread.amr & ~target->thread.uamor);
+
+   return 0;
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 /*
  * These are our native regset flavors.
  */
@@ -1809,6 +1865,9 @@ enum powerpc_regset {
REGSET_EBB, /* EBB registers */
REGSET_PMR, /* Performance Monitor Registers */
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   REGSET_PKEY,/* AMR register */
+#endif
 };
 
 static const struct user_regset native_regsets[] = {
@@ -1914,6 +1973,13 @@ enum powerpc_regset {
.active = pmu_active, .get = pmu_get, .set = pmu_set
},
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   [REGSET_PKEY] = {
+   .core_note_type = NT_PPC_PKEY, .n = ELF_NPKEY,
+   .size = sizeof(u64), .align = sizeof(u64),
+   .active = pkey_active, .get = pkey_get, .set = pkey_set
+   },
+#endif
 };
 
 static const struct user_regset_view user_ppc_native_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index b5280db..0708516 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -395,6 +395,7 @@
 #define NT_PPC_TM_CTAR 0x10d   /* TM checkpointed Target Address Register */
 #define NT_PPC_TM_CPPR 0x10e   /* TM checkpointed Program Priority Register */

[PATCH 25/25] powerpc: Enable pkey subsystem

2017-09-08 Thread Ram Pai
PAPR defines the 'ibm,processor-storage-keys' property, which exports
two values: the first is the number of data-access keys and the second
is the number of instruction-access keys. Although this hints that a
key can only be either data-access or instruction-access, that is not
the case in reality: any key can be of either kind. This patch adds
the two values together and uses the sum as the total number of keys
available to us.

Non-PAPR platforms do not define this property in the device tree yet.
Here, we hardcode the CPUs that support pkeys by consulting PowerISA 3.0.
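
A rough sketch, not the patch's actual prom.c change (the diff below is
truncated before reaching it), of how the two cells of the property could be
read from the flattened device tree and summed, since any key can serve
either purpose. The callback name and where the count is stored are
assumptions; it would be wired up through of_scan_flat_dt() during early
boot like the other prom.c scanners.

static int __init scan_storage_keys_sketch(unsigned long node,
					   const char *uname, int depth,
					   void *data)
{
	const __be32 *prop;
	int len;

	prop = of_get_flat_dt_prop(node, "ibm,processor-storage-keys", &len);
	if (!prop || len < 2 * (int)sizeof(__be32))
		return 0;

	/* cell 0: data-access keys, cell 1: instruction-access keys */
	*(u32 *)data = be32_to_cpu(prop[0]) + be32_to_cpu(prop[1]);
	return 1;	/* stop scanning */
}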

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/cputable.h|   15 ++-
 arch/powerpc/include/asm/mmu_context.h |1 +
 arch/powerpc/include/asm/pkeys.h   |   21 +
 arch/powerpc/kernel/prom.c |   19 +++
 arch/powerpc/mm/pkeys.c|   19 ++-
 5 files changed, 65 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index a9bf921..31ed1d2 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -214,7 +214,9 @@ enum {
 #define CPU_FTR_DAWR   LONG_ASM_CONST(0x0400)
 #define CPU_FTR_DABRX  LONG_ASM_CONST(0x0800)
 #define CPU_FTR_PMAO_BUG   LONG_ASM_CONST(0x1000)
+#define CPU_FTR_PKEY   LONG_ASM_CONST(0x2000)
 #define CPU_FTR_POWER9_DD1 LONG_ASM_CONST(0x4000)
+#define CPU_FTR_PKEY_EXECUTE   LONG_ASM_CONST(0x8000)
 
 #ifndef __ASSEMBLY__
 
@@ -435,7 +437,8 @@ enum {
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | CPU_FTR_PURR | \
-   CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_DABRX)
+   CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_DABRX | \
+   CPU_FTR_PKEY)
 #define CPU_FTRS_POWER6 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -443,7 +446,7 @@ enum {
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
CPU_FTR_DSCR | CPU_FTR_UNALIGNED_LD_STD | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_CFAR | \
-   CPU_FTR_DABRX)
+   CPU_FTR_DABRX | CPU_FTR_PKEY)
 #define CPU_FTRS_POWER7 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -452,7 +455,7 @@ enum {
CPU_FTR_DSCR | CPU_FTR_SAO  | CPU_FTR_ASYM_SMT | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | \
-   CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX)
+   CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX | CPU_FTR_PKEY)
 #define CPU_FTRS_POWER8 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -462,7 +465,8 @@ enum {
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
-   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP)
+   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_PKEY |\
+   CPU_FTR_PKEY_EXECUTE)
 #define CPU_FTRS_POWER8E (CPU_FTRS_POWER8 | CPU_FTR_PMAO_BUG)
 #define CPU_FTRS_POWER8_DD1 (CPU_FTRS_POWER8 & ~CPU_FTR_DBELL)
 #define CPU_FTRS_POWER9 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
@@ -474,7 +478,8 @@ enum {
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
-   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300)
+   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | \
+   CPU_FTR_PKEY | CPU_FTR_PKEY_EXECUTE)
 #define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \
 (~CPU_FTR_SAO))
 #define CPU_FTRS_CELL  (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 9a56355..98ac713 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -148,6 +148,7 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 
 #define pkey_initialize()
 #define pkey_mm_init(mm)
+#define pkey_mmu_values(total_data, total_execute)
 
 static inline int vma_pkey(struct vm_area_struct *vma)
 {
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index a0111de..baac435 100644
---