Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Tue 11-07-17 12:32:57, Ram Pai wrote:
> On Tue, Jul 11, 2017 at 04:52:46PM +0200, Michal Hocko wrote:
> > On Wed 05-07-17 14:21:37, Ram Pai wrote:
> > > Memory protection keys enable applications to protect its
> > > address space from inadvertent access or corruption from
> > > itself.
> > >
> > > The overall idea:
> > >
> > > A process allocates a key and associates it with
> > > an address range within its address space.
> > > The process then can dynamically set read/write
> > > permissions on the key without involving the
> > > kernel. Any code that violates the permissions
> > > of the address space, as defined by its associated
> > > key, will receive a segmentation fault.
> > >
> > > This patch series enables the feature on PPC64 HPTE
> > > platform.
> > >
> > > ISA3.0 section 5.7.13 describes the detailed specifications.
> >
> > Could you describe the high-level design of this feature in the cover
> > letter.
>
> Yes it can be hard to understand without the big picture. I will
> provide the high level design and the rationale behind the patch split
> towards the end. Also I will have it in the cover letter for my next
> revision of the patchset.

Thanks!

> > I have tried to get some idea from the patchset but it was
> > really far from trivial. Patches are not very well split up (many
> > helpers are added without their users etc..).
>
> I see your point. Earlier, I had the patches split such a way that the
> users of the helpers were in the same patch as that of the helper.
> But then comments from others lead to the current split.

It is not my call here, obviously. I cannot review arch specific parts
due to lack of familiarity but it is a general good practice to include
helpers along with their users to make the usage clear. Also, as much as
I like small patches because they are easier to review, having very many
of them can lead to a harder review in the end because you easily lose a
higher level overview.

> > > Testing:
> > > This patch series has passed all the protection key
> > > tests available in the selftests directory.
> > > The tests are updated to work on both x86 and powerpc.
> > >
> > > version v5:
> > > (1) reverted back to the old design -- store the
> > >     key in the pte, instead of bypassing it.
> > >     The v4 design slowed down the hash page path.
> >
> > This surprised me a lot but I couldn't find the respective code. Why do
> > you need to store anything in the pte? My understanding of PKEYs is that
> > the setup and teardown should be very cheap and so no page tables have
> > to be updated. Or do I just misunderstand what you wrote here?
>
> Ideally the MMU looks at the PTE for keys, in order to enforce
> protection. This is the case with x86 and is the case with power9 Radix
> page table. Hence the keys have to be programmed into the PTE.

But x86 doesn't update ptes for PKEYs, that would be just too expensive.
You could use standard mprotect to do the same...

> However with HPT on power, these keys do not necessarily have to be
> programmed into the PTE. We could bypass the Linux Page Table Entry (PTE)
> and instead just program them into the Hash Page Table (HPTE), since
> the MMU does not refer to the PTE but to the HPTE. The last version
> of the patchset attempted to do that. It worked as follows:
>
> a) when an address range is requested to be associated with a key; by the
>    application through key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range.
>
> b) Whenever there is a hash page fault for that address, the fault
>    handler reads the key from the VMA and programs the key into the
>    HPTE. __hash_page() is the function that does that.

What causes the fault here?

> c) Once the hpte is programmed, the MMU can sense key violations and
>    generate key-faults.
>
> The problem is with step (b). This step is really a very critical
> path which is performance sensitive. We don't want to add any delays.
> However if we want to access the key from the vma, we will have to
> hold the vma semaphore, and that is a big NO-NO. As a result, this
> design had to be dropped.
>
> I reverted back to the old design, i.e. the design in the v4 version. In this
> version we do the following:
>
> a) when an address range is requested to be associated with a key; by the
>    application through key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range. Also the kernel programs the key into the Linux PTE corresponding
>    to all the pages associated with the address range.

OK, so how is this any different from the regular mprotect then?

> b) Whenever there is a hash page fault for that address, the fault
>    handler reads the key from the Linux PTE and programs the key into
>    the HPTE.
>
> c) Once the HPTE is programmed, the MMU can sense key violations and
>    generate key-faults.
>
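For readers coming to the thread cold, here is a minimal user-space sketch of the flow described above. It assumes a libc that exposes the pkey_*() wrappers (otherwise the raw syscalls have to be used) and omits error handling; it is illustrative only and not part of the series.

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int pkey = pkey_alloc(0, 0);		/* ask the kernel for a key */

	pkey_mprotect(p, len, PROT_READ | PROT_WRITE, pkey); /* tag the range */
	pkey_set(pkey, PKEY_DISABLE_WRITE);	/* drop write rights, no syscall */
	/* a store through p here would now take a key fault (SIGSEGV) */
	pkey_set(pkey, 0);			/* restore rights, no syscall */
	pkey_free(pkey);
	munmap(p, len);
	return 0;
}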
Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Wed 12-07-17 09:23:37, Michal Hocko wrote: > On Tue 11-07-17 12:32:57, Ram Pai wrote: [...] > > Ideally the MMU looks at the PTE for keys, in order to enforce > > protection. This is the case with x86 and is the case with power9 Radix > > page table. Hence the keys have to be programmed into the PTE. > > But x86 doesn't update ptes for PKEYs, that would be just too expensive. > You could use standard mprotect to do the same... OK, this seems to be a misunderstanding and confusion on my end. do_mprotect_pkey does mprotect_fixup even for the pkey path which is quite surprising to me. I guess my misunderstanding comes from Documentation/x86/protection-keys.txt " Memory Protection Keys provides a mechanism for enforcing page-based protections, but without requiring modification of the page tables when an application changes protection domains. It works by dedicating 4 previously ignored bits in each page table entry to a "protection key", giving 16 possible keys. " So please disregard my previous comments about page tables and sorry about the confusion. -- Michal Hocko SUSE Labs
Re: [PATCH] powerpc: Handle page faults for bad paste and bad AMO
Benjamin Herrenschmidt writes: > On POWER9 and bad paste instruction (targeting the wrong memory > type) or an invalid opcode in an AMO (atomic memory operation) > will result in specific DSISR bits to be set. > > We currently don't understand those bits and thus just "hang" > the process taking constant page faults. > > Additionally in the case of paste, it appears that we don't > always get a valid DAR value when the error is a wrong memory > type. > > So we need to test those errors early in do_page_fault(), > I chose to generate a SIGBUS which is more correct than a SIGSEGV. This is true even for hash right ? If so do we want to update do_hash_page: #ifdef CONFIG_PPC_STD_MMU_64 andis. r0,r4,0xa450/* weird error? */ bne-handle_page_fault /* if not, try to insert a HPTE */ that 0xa450 such that we add these errors and call hand_page_fault in case of these DSISR values ? > > Signed-off-by: Benjamin Herrenschmidt > --- > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 0266c664014a..5dfce2022f74 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -337,6 +337,21 @@ int do_page_fault(struct pt_regs *regs, unsigned long > address, > } > #endif /* CONFIG_PPC_ICSWX */ > > +#ifdef CONFIG_PPC_STD_MMU_64 > + /* > + * These faults indicate a copy/paste on an invalid memory type > + * or an incorrect AMO operation. They have been observed as not > + * properly updating the DAR, so handle them early > + */ > + if (error_code & (DSISR_BAD_COPYPASTE | DSISR_BAD_AMO)) { > + if (user_mode(regs)) > + _exception(SIGBUS, regs, BUS_OBJERR, address); > + else > + rc = SIGBUS; > + goto bail; > + } > +#endif /* CONFIG_PPC_STD_MMU_64 */ > + > if (notify_page_fault(regs)) > goto bail; > > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 7e50e47375d6..71f524f2b109 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -282,6 +282,8 @@ > #define DSISR_UNSUPP_MMU 0x0008 /* Unsupported MMU config */ > #define DSISR_SET_RC 0x0004 /* Failed setting of > R/C bits */ > #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory */ > +#define DSISR_BAD_COPYPASTE 0x0008 /* Copy/Paste on wrong mem > type */ > +#define DSISR_BAD_AMO 0x0004 /* Incorrect AMO opcode > */ > #define SPRN_TBRL0x10C /* Time Base Read Lower Register (user, R/O) */ > #define SPRN_TBRU0x10D /* Time Base Read Upper Register (user, R/O) */ > #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */ -aneesh
[PATCH 0/7] Prepare 8xx for CONFIG_STRICT_KERNEL_RWX
This series makes the pinning of ITLBs optional on the 8xx in order to
allow STRICT_KERNEL_RWX to work properly.

Christophe Leroy (7):
  powerpc/8xx: Ensures RAM mapped with LTLB is seen as block mapped on 8xx.
  powerpc/8xx: Remove macro that checks kernel address
  powerpc/32: Avoid risk of unrecoverable TLBmiss inside entry_32.S
  powerpc/8xx: Make pinning of ITLBs optional
  powerpc/8xx: Do not allow Pinned TLBs with STRICT_KERNEL_RWX or DEBUG_PAGEALLOC
  powerpc/8xx: mark init functions with __init
  powerpc/8xx: Reduce DTLB miss handler by one insn

 arch/powerpc/Kconfig           | 13 +-
 arch/powerpc/kernel/entry_32.S |  7 +++
 arch/powerpc/kernel/head_8xx.S | 96 +-
 arch/powerpc/mm/8xx_mmu.c      | 29 ++---
 4 files changed, 107 insertions(+), 38 deletions(-)

-- 
2.12.0
[PATCH 1/7] powerpc/8xx: Ensures RAM mapped with LTLB is seen as block mapped on 8xx.
On the 8xx, the RAM mapped with LTLBs must be seen as block mapped, just like areas mapped with BATs on standard PPC32. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/8xx_mmu.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/8xx_mmu.c b/arch/powerpc/mm/8xx_mmu.c index f4c6472f2fc4..f3a00cef9c34 100644 --- a/arch/powerpc/mm/8xx_mmu.c +++ b/arch/powerpc/mm/8xx_mmu.c @@ -22,8 +22,11 @@ extern int __map_without_ltlbs; +static unsigned long block_mapped_ram; + /* - * Return PA for this VA if it is in IMMR area, or 0 + * Return PA for this VA if it is in an area mapped with LTLBs. + * Otherwise, returns 0 */ phys_addr_t v_block_mapped(unsigned long va) { @@ -33,11 +36,13 @@ phys_addr_t v_block_mapped(unsigned long va) return 0; if (va >= VIRT_IMMR_BASE && va < VIRT_IMMR_BASE + IMMR_SIZE) return p + va - VIRT_IMMR_BASE; + if (va >= PAGE_OFFSET && va < PAGE_OFFSET + block_mapped_ram) + return __pa(va); return 0; } /* - * Return VA for a given PA or 0 if not mapped + * Return VA for a given PA mapped with LTLBs or 0 if not mapped */ unsigned long p_block_mapped(phys_addr_t pa) { @@ -47,6 +52,8 @@ unsigned long p_block_mapped(phys_addr_t pa) return 0; if (pa >= p && pa < p + IMMR_SIZE) return VIRT_IMMR_BASE + pa - p; + if (pa < block_mapped_ram) + return (unsigned long)__va(pa); return 0; } @@ -133,6 +140,8 @@ unsigned long __init mmu_mapin_ram(unsigned long top) if (mapped) memblock_set_current_limit(mapped); + block_mapped_ram = mapped; + return mapped; } -- 2.12.0
[PATCH 2/7] powerpc/8xx: Remove macro that checks kernel address
The macro to check if an address is a kernel address or not is not used anymore in DTLBmiss handler. It is used in ITLB miss handler and in DTLB error handler. DTLB error handler is not a hot path, it doesn't need such optimisation. In order to simplify a following patch which will rework ITLB miss handler, we remove the macros and reintroduce them inside the handler. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 29 - 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index c032fe8c2d26..02671e33905c 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -50,16 +50,9 @@ mtspr spr, reg #endif -/* Macro to test if an address is a kernel address */ #if CONFIG_TASK_SIZE <= 0x8000 && CONFIG_PAGE_OFFSET >= 0x8000 -#define IS_KERNEL(tmp, addr) \ - andis. tmp, addr, 0x8000 /* Address >= 0x8000 */ -#define BRANCH_UNLESS_KERNEL(label)beq label -#else -#define IS_KERNEL(tmp, addr) \ - rlwinm tmp, addr, 16, 16, 31; \ - cmpli cr0, tmp, PAGE_OFFSET >> 16 -#define BRANCH_UNLESS_KERNEL(label)blt label +/* By simply checking Address >= 0x8000, we know if its a kernel address */ +#define SIMPLE_KERNEL_ADDRESS 1 #endif @@ -347,11 +340,20 @@ InstructionTLBMiss: mfcrr3 #endif #if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) - IS_KERNEL(r11, r10) +#ifdef SIMPLE_KERNEL_ADDRESS + andis. r11, r10, 0x8000/* Address >= 0x8000 */ +#else + rlwinm r11, r10, 16, 0xfff8 + cmpli cr0, r11, PAGE_OFFSET@h +#endif #endif mfspr r11, SPRN_M_TW /* Get level 1 table */ #if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) - BRANCH_UNLESS_KERNEL(3f) +#ifdef SIMPLE_KERNEL_ADDRESS + beq+3f +#else + blt+3f +#endif lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha 3: #endif @@ -705,9 +707,10 @@ FixupDAR:/* Entry point for dcbx workaround. */ mtspr SPRN_SPRG_SCRATCH2, r10 /* fetch instruction from memory. */ mfspr r10, SPRN_SRR0 - IS_KERNEL(r11, r10) + rlwinm r11, r10, 16, 0xfff8 + cmpli cr0, r11, PAGE_OFFSET@h mfspr r11, SPRN_M_TW /* Get level 1 table */ - BRANCH_UNLESS_KERNEL(3f) + blt+3f rlwinm r11, r10, 16, 0xfff8 _ENTRY(FixupDAR_cmp) cmpli cr7, r11, (PAGE_OFFSET + 0x180)@h -- 2.12.0
[PATCH 3/7] powerpc/32: Avoid risk of unrecoverable TLBmiss inside entry_32.S
By default, the 8xx pins an ITLB on the first 8M of memory in order to avoid any ITLB miss on kernel code. However, with some debug functions like DEBUG_PAGEALLOC and DEBUG_RODATA, pinning TLBs is contradictory. In order to avoid any ITLB miss in a critical section without pinning TLBs, we have to ensure that there is no page boundary crossed between the setup of a new value in SRR0/SRR1 and the associated RFI. The functions modifying srr0/srr1 are all located in setup_32.S. They are spread over almost 4kbytes. The patch forces a 12 bits (4kbytes) alignment for those functions. This garanties that the functions remain in a single 4k page. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/entry_32.S | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index 8587059ad848..4e9a359ceff6 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -43,6 +43,13 @@ #define LOAD_MSR_KERNEL(r, x) li r,(x) #endif +/* + * Align to 4k in order to ensure that all functions modyfing srr0/srr1 + * fit into one page in order to not encounter a TLB miss between the + * modification of srr0/srr1 and the associated rfi. + */ + .align 12 + #ifdef CONFIG_BOOKE .globl mcheck_transfer_to_handler mcheck_transfer_to_handler: -- 2.12.0
[PATCH 4/7] powerpc/8xx: Make pinning of ITLBs optional
As stated in a comment in head_8xx.S, today we "Always pin the first 8 MB ITLB to prevent ITLB misses while mucking around with SRR0/SRR1 in asm". This issue has just been cleared by the preceding patch, therefore we can make this pinning optional (on by default) and independent of DATA pinning. This patch also makes pinning of IMMR independent of pinning of DATA. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 10 arch/powerpc/kernel/head_8xx.S | 57 +- arch/powerpc/mm/8xx_mmu.c | 8 +- 3 files changed, 62 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 36f858c37ca7..d09b259d3621 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -1167,10 +1167,20 @@ config PIN_TLB bool "Pinned Kernel TLBs (860 ONLY)" depends on ADVANCED_OPTIONS && 8xx +config PIN_TLB_DATA + bool "Pinned TLB for DATA" + depends on PIN_TLB + default y + config PIN_TLB_IMMR bool "Pinned TLB for IMMR" depends on PIN_TLB default y + +config PIN_TLB_TEXT + bool "Pinned TLB for TEXT" + depends on PIN_TLB + default y endmenu if PPC64 diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 02671e33905c..b889b5812274 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -55,6 +55,15 @@ #define SIMPLE_KERNEL_ADDRESS 1 #endif +/* + * We need an ITLB miss handler for kernel addresses if: + * - Either we have modules + * - Or we have not pinned the first 8M + */ +#if defined(CONFIG_MODULES) || !defined(CONFIG_PIN_TLB_TEXT) || \ +defined(CONFIG_DEBUG_PAGEALLOC) +#define ITLB_MISS_KERNEL 1 +#endif /* * Value for the bits that have fixed value in RPN entries. @@ -318,7 +327,7 @@ SystemCall: #endif InstructionTLBMiss: -#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(CONFIG_8xx_CPU6) || defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mtspr SPRN_SPRG_SCRATCH2, r3 #endif EXCEPTION_PROLOG_0 @@ -336,24 +345,32 @@ InstructionTLBMiss: INVALIDATE_ADJACENT_PAGES_CPU15(r11, r10) /* Only modules will cause ITLB Misses as we always * pin the first 8MB of kernel memory */ -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mfcrr3 #endif -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) -#ifdef SIMPLE_KERNEL_ADDRESS +#ifdef ITLB_MISS_KERNEL +#if defined(SIMPLE_KERNEL_ADDRESS) && defined(CONFIG_PIN_TLB_TEXT) andis. 
r11, r10, 0x8000/* Address >= 0x8000 */ #else rlwinm r11, r10, 16, 0xfff8 cmpli cr0, r11, PAGE_OFFSET@h +#ifndef CONFIG_PIN_TLB_TEXT + /* It is assumed that kernel code fits into the first 8M page */ +_ENTRY(ITLBMiss_cmp) + cmpli cr7, r11, (PAGE_OFFSET + 0x080)@h +#endif #endif #endif mfspr r11, SPRN_M_TW /* Get level 1 table */ -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) -#ifdef SIMPLE_KERNEL_ADDRESS +#ifdef ITLB_MISS_KERNEL +#if defined(SIMPLE_KERNEL_ADDRESS) && defined(CONFIG_PIN_TLB_TEXT) beq+3f #else blt+3f #endif +#ifndef CONFIG_PIN_TLB_TEXT + blt cr7, ITLBMissLinear +#endif lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha 3: #endif @@ -371,7 +388,7 @@ InstructionTLBMiss: rlwimi r10, r11, 0, 0, 32 - PAGE_SHIFT - 1 /* Add level 2 base */ lwz r10, 0(r10) /* Get the pte */ 4: -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mtcrr3 #endif /* Insert the APG into the TWC from the Linux PTE. */ @@ -402,7 +419,7 @@ InstructionTLBMiss: MTSPR_CPU6(SPRN_MI_RPN, r10, r3)/* Update TLB entry */ /* Restore registers */ -#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(CONFIG_8xx_CPU6) || defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mfspr r3, SPRN_SPRG_SCRATCH2 #endif EXCEPTION_EPILOG_0 @@ -697,6 +714,22 @@ DTLBMissLinear: EXCEPTION_EPILOG_0 rfi +#ifndef CONFIG_PIN_TLB_TEXT +ITLBMissLinear: + mtcrr3 + /* Set 8M byte page and mark it valid */ + li r11, MI_PS8MEG | MI_SVALID | _PAGE_EXEC + MTSPR_CPU6(SPRN_MI_TWC, r11, r3) + rlwinm r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */ + ori r10, r10, 0xf0 | MI_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \ + _PAGE_PRESENT + MTSPR_CPU6(SPRN_MI_RPN, r10, r11) /* Update TLB entry */ + + mfspr
[PATCH 5/7] powerpc/8xx: Do not allow Pinned TLBs with STRICT_KERNEL_RWX or DEBUG_PAGEALLOC
Pinning TLBs bypasses STRICT_KERNEL_RWX or DEBUG_PAGEALLOC protections so it should only be allowed when those are not selected Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d09b259d3621..28608275d7c0 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -1165,7 +1165,8 @@ config CONSISTENT_SIZE config PIN_TLB bool "Pinned Kernel TLBs (860 ONLY)" - depends on ADVANCED_OPTIONS && 8xx + depends on ADVANCED_OPTIONS && PPC_8xx && \ + !DEBUG_PAGEALLOC && !STRICT_KERNEL_RWX config PIN_TLB_DATA bool "Pinned TLB for DATA" -- 2.12.0
[PATCH 6/7] powerpc/8xx: mark init functions with __init
setup_initial_memory_limit() is only called during init. mmu_patch_cmp_limit() is only called from 8xx_mmu.c Signed-off-by: Christophe Leroy --- arch/powerpc/mm/8xx_mmu.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/8xx_mmu.c b/arch/powerpc/mm/8xx_mmu.c index ab3b10746f36..f29212e40f40 100644 --- a/arch/powerpc/mm/8xx_mmu.c +++ b/arch/powerpc/mm/8xx_mmu.c @@ -87,7 +87,7 @@ void __init MMU_init_hw(void) #endif } -static void mmu_mapin_immr(void) +static void __init mmu_mapin_immr(void) { unsigned long p = PHYS_IMMR_BASE; unsigned long v = VIRT_IMMR_BASE; @@ -107,7 +107,7 @@ extern unsigned int DTLBMiss_cmp, FixupDAR_cmp; extern unsigned int ITLBMiss_cmp; #endif -void mmu_patch_cmp_limit(unsigned int *addr, unsigned long mapped) +static void __init mmu_patch_cmp_limit(unsigned int *addr, unsigned long mapped) { unsigned int instr = *addr; @@ -151,8 +151,8 @@ unsigned long __init mmu_mapin_ram(unsigned long top) return mapped; } -void setup_initial_memory_limit(phys_addr_t first_memblock_base, - phys_addr_t first_memblock_size) +void __init setup_initial_memory_limit(phys_addr_t first_memblock_base, + phys_addr_t first_memblock_size) { /* We don't currently support the first MEMBLOCK not mapping 0 * physical on those processors -- 2.12.0
[PATCH 7/7] powerpc/8xx: Reduce DTLB miss handler by one insn
This reduces the DTLB miss handler hot path (user address path) by one instruction by preserving r10. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index b889b5812274..7365148219fd 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -466,23 +466,23 @@ DataStoreTLBMiss: * kernel page tables. */ mfspr r10, SPRN_MD_EPN - rlwinm r10, r10, 16, 0xfff8 - cmpli cr0, r10, PAGE_OFFSET@h + rlwinm r11, r10, 16, 0xfff8 + cmpli cr0, r11, PAGE_OFFSET@h mfspr r11, SPRN_M_TW /* Get level 1 table */ blt+3f + rlwinm r11, r10, 16, 0xfff8 #ifndef CONFIG_PIN_TLB_IMMR - cmpli cr0, r10, VIRT_IMMR_BASE@h + cmpli cr0, r11, VIRT_IMMR_BASE@h #endif _ENTRY(DTLBMiss_cmp) - cmpli cr7, r10, (PAGE_OFFSET + 0x180)@h - lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha + cmpli cr7, r11, (PAGE_OFFSET + 0x180)@h #ifndef CONFIG_PIN_TLB_IMMR _ENTRY(DTLBMiss_jmp) beq-DTLBMissIMMR #endif blt cr7, DTLBMissLinear + lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha 3: - mfspr r10, SPRN_MD_EPN /* Insert level 1 index */ rlwimi r11, r10, 32 - ((PAGE_SHIFT - 2) << 1), (PAGE_SHIFT - 2) << 1, 29 @@ -703,7 +703,7 @@ DTLBMissLinear: /* Set 8M byte page and mark it valid */ li r11, MD_PS8MEG | MD_SVALID MTSPR_CPU6(SPRN_MD_TWC, r11, r3) - rlwinm r10, r10, 16, 0x0f80/* 8xx supports max 256Mb RAM */ + rlwinm r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */ ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \ _PAGE_PRESENT MTSPR_CPU6(SPRN_MD_RPN, r10, r11) /* Update TLB entry */ -- 2.12.0
Re: [PATCH v5 2/2] powerpc/fadump: update documentation about 'fadump_append=' parameter
Hello, On Wed, 12 Jul 2017 00:00:57 +0530 Hari Bathini wrote: > Hi Michal, > > > Thanks for the review.. > > > On Monday 26 June 2017 05:45 PM, Michal Suchánek wrote: > > Hello, > > > > On Tue, 20 Jun 2017 21:14:08 +0530 > > Hari Bathini wrote: > > > I would prefer documenting over a complex implementation. Actually, I > am considering a simple approach of replacing every occurrence of > "fadump_extra_args=" with "fadump_extra_args " in fadump capture > kernel. The cmdline > >"root=/dev/sda2 ro fadump_extra_args="a b c" crashkernel=512M > fadump_extra_args=d" > > becomes > >"root=/dev/sda2 ro fadump_extra_args "a b c" crashkernel=512M > fadump_extra_args d" which is totally broken > > in fadump capture kernel. This must take care of the pitfalls with > the current approach and also, > doesn't rely on parse_args() which was not designed for this scenario > to start with..? It was designed for parsing arguments. To handle replacing arguments you have to extend it. You need to get more information from it for this case. Best regards Michal
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Sukadev Bhattiprolu writes: > Hi Arnaldo, > > Please pull the JSON files for POWER9 PMU events. > > The following changes since commit 07d306c838c5c30196619baae36107d0615e459b: > > Merge git://www.linux-watchdog.org/linux-watchdog (2017-07-11 09:59:37 > -0700) > > are available in the git repository at: > > https://github.com/sukadev/linux/ p9-json > > for you to fetch changes up to f94a86dbb13e18bd7ad566cabe42df63d3bd1039: > > perf vendor events: Add POWER9 PVRs to mapfile (2017-07-11 19:30:40 -0500) > > > Sukadev Bhattiprolu (2): > perf vendor events: Add POWER9 PMU events > perf vendor events: Add POWER9 PVRs to mapfile I think the PVRs need work. You have: +004e0100,1,power9.json,core +004e,1,power9.json,core The first is P9 DD1, but the second doesn't exist. We have it in the kernel, but with a mask of 0x. From memory the perf code doesn't do any masking or anything fancy, it looks for an exact match. So for starters you should probably drop that one and add 0x004e1200 and 0x004e0200. cheers
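To make the exact-match point concrete, here is a rough stand-alone sketch of the difference between an exact cpuid-string match (how the mapfile lookup behaves, as described above) and the masked compare the kernel's cputable uses. None of this is perf's actual code; the function names and the "%08x" formatting are assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Exact match: the PVR is formatted as hex text and compared verbatim,
 * so a short entry like "004e" can never equal "004e1200". */
static bool exact_match(const char *entry, unsigned int pvr)
{
	char cpuid[9];

	snprintf(cpuid, sizeof(cpuid), "%08x", pvr);
	return strcmp(entry, cpuid) == 0;
}

/* Masked match: one (value, mask) pair covers every revision of a chip. */
static bool masked_match(unsigned int value, unsigned int mask, unsigned int pvr)
{
	return (pvr & mask) == value;
}

int main(void)
{
	unsigned int pvr = 0x004e1200;	/* one of the PVRs suggested above */

	printf("%d %d\n", exact_match("004e", pvr),		/* prints 0 */
	       masked_match(0x004e0000, 0xffff0000, pvr));	/* prints 1 */
	return 0;
}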
Re: [PATCH v5 2/2] powerpc/fadump: update documentation about 'fadump_append=' parameter
On Wednesday 12 July 2017 05:01 PM, msuchanek wrote:
> Hello,
>
> On Wed, 12 Jul 2017 00:00:57 +0530 Hari Bathini wrote:
>> Hi Michal,
>>
>> Thanks for the review..
>>
>> On Monday 26 June 2017 05:45 PM, Michal Suchánek wrote:
>>> Hello,
>>>
>>> On Tue, 20 Jun 2017 21:14:08 +0530 Hari Bathini wrote:
>>
>> I would prefer documenting over a complex implementation. Actually, I
>> am considering a simple approach of replacing every occurrence of
>> "fadump_extra_args=" with "fadump_extra_args " in the fadump capture
>> kernel. The cmdline
>>
>>    "root=/dev/sda2 ro fadump_extra_args="a b c" crashkernel=512M fadump_extra_args=d"
>>
>> becomes
>>
>>    "root=/dev/sda2 ro fadump_extra_args "a b c" crashkernel=512M fadump_extra_args d"
>
> which is totally broken

My bad! I don't know what I was thinking when I expected that to work.. :-[

>> in the fadump capture kernel. This must take care of the pitfalls with
>> the current approach and also, doesn't rely on parse_args() which was
>> not designed for this scenario to start with..?
>
> It was designed for parsing arguments. To handle replacing arguments
> you have to extend it. You need to get more information from it for
> this case.

I will try to work on this. Also, I want to replace in-place without moving
other parameters. I guess I could do that by replacing the leftover length
after processing with spaces..

Thanks
Hari
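A rough sketch of the blanking half of that in-place idea is below; the names are made up, and the quoting rules that parse_args() implements are only approximated here, so this is an illustration rather than a proposed implementation.

#include <string.h>

/* Blank every "fadump_extra_args=..." occurrence in place, padding with
 * spaces so the rest of the command line keeps its layout. */
static void blank_fadump_args(char *cmdline)
{
	const char *key = "fadump_extra_args=";
	char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		char *end = p + strlen(key);

		if (*end == '"') {		/* quoted value: skip to closing quote */
			char *q = strchr(end + 1, '"');

			end = q ? q + 1 : p + strlen(p);
		} else {			/* unquoted value: skip to next space */
			while (*end && *end != ' ')
				end++;
		}
		memset(p, ' ', end - p);	/* other parameters never move */
		p = end;
	}
}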
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: > Sukadev Bhattiprolu writes: > > Please pull the JSON files for POWER9 PMU events. > > perf vendor events: Add POWER9 PMU events > > perf vendor events: Add POWER9 PVRs to mapfile > I think the PVRs need work. > You have: > +004e0100,1,power9.json,core > +004e,1,power9.json,core > The first is P9 DD1, but the second doesn't exist. We have it in the > kernel, but with a mask of 0x. From memory the perf code doesn't > do any masking or anything fancy, it looks for an exact match. > So for starters you should probably drop that one and add 0x004e1200 and > 0x004e0200. Could this check be somehow done automatically? Using whatever is in the kernel sources or some IBM released document? - Arnaldo
[PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock
Virtual time base(vtb) is a register which increases only in guest. Any exit from guest to host will stop the vtb(saved and restored by kvm). But if there is an IO causes guest exits to host, the guest's watchdog (watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock) needs to also include the time elapsed in host. get_vtb is not correct in this case. Also, the TB_OFFSET is well saved and restored by qemu after commit [1]. So we can use get_tb here. [1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1 Signed-off-by: Jia He --- arch/powerpc/kernel/time.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index fe6f3a2..c542dd3 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void) unsigned long long running_clock(void) { /* -* Don't read the VTB as a host since KVM does not switch in host -* timebase into the VTB when it takes a guest off the CPU, reading the -* VTB would result in reading 'last switched out' guest VTB. +* Use get_tb instead of get_vtb for guest since the TB_OFFSET has been +* well saved/restored when qemu does suspend/resume. * * Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it * would be unsafe to rely only on the #ifdef above. */ if (firmware_has_feature(FW_FEATURE_LPAR) && cpu_has_feature(CPU_FTR_ARCH_207S)) - return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; + return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; /* * This is a next best approximation without a VTB. -- 2.9.3
[PATCH] powerpc/hugetlb: fix page rights verification in gup_hugepte()
gup_hugepte() checks if pages are present and readable, and when 'write' is set, also checks if the pages are writable. Initially this was done by checking if _PAGE_PRESENT and _PAGE_READ were set. In addition, _PAGE_WRITE was verified for write accesses. The problem is that we have to handle the three following cases: 1/ The target defines __PAGE_READ and __PAGE_WRITE 2/ The target defines __PAGE_RW 3/ The target defines __PAGE_RO In case 1/, this is obvious In case 2/, __PAGE_READ is defined as 0 and __PAGE_WRITE as __PAGE_RW so it works as well. But in case 3, __PAGE_RW is defined as 0, which means __PAGE_WRITE is 0 and then the test returns true (page writable) in all cases. A first correction was attempted in commit 6b8cb66a6a7cc ("powerpc: Fix usage of _PAGE_RO in hugepage"), but that fix is wrong: instead of checking that the page is writable when write is requested, it checks that the page is NOT writable when write is NOT requested. This patch adds a new pte_read() helper to check whether a page is readable or not. This avoids handling all possible cases in gup_hugepte(). Then gup_hugepte() is modified to use pte_present(), pte_read() and pte_write() instead of the raw flags. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/pgtable.h | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 5 + arch/powerpc/include/asm/nohash/pgtable.h| 1 + arch/powerpc/mm/hugetlbpage.c| 15 +++ 4 files changed, 10 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 7fb755880409..2a67b861e6e1 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -301,6 +301,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, int flags); /* Generic accessors to PTE bits */ static inline int pte_write(pte_t pte) { return !!(pte_val(pte) & _PAGE_RW);} +static inline int pte_read(pte_t pte) { return 1; } static inline int pte_dirty(pte_t pte) { return !!(pte_val(pte) & _PAGE_DIRTY); } static inline int pte_young(pte_t pte) { return !!(pte_val(pte) & _PAGE_ACCESSED); } static inline int pte_special(pte_t pte) { return !!(pte_val(pte) & _PAGE_SPECIAL); } diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index c0737c86a362..83cb73c8c3bb 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -409,6 +409,11 @@ static inline int pte_write(pte_t pte) return __pte_write(pte) || pte_savedwrite(pte); } +static inline int pte_read(pte_t pte) +{ + return !!(pte_raw(pte) & cpu_to_be64(_PAGE_READ)); +} + #define __HAVE_ARCH_PTEP_SET_WRPROTECT static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index e5805ad78e12..17989c3d9a24 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -14,6 +14,7 @@ static inline int pte_write(pte_t pte) { return (pte_val(pte) & (_PAGE_RW | _PAGE_RO)) != _PAGE_RO; } +static inline int pte_read(pte_t pte) { return 1; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 
e1bf5ca397fe..1226932579a0 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -977,7 +977,6 @@ EXPORT_SYMBOL_GPL(__find_linux_pte_or_hugepte); int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { - unsigned long mask; unsigned long pte_end; struct page *head, *page; pte_t pte; @@ -988,18 +987,10 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, end = pte_end; pte = READ_ONCE(*ptep); - mask = _PAGE_PRESENT | _PAGE_READ; - - /* -* On some CPUs like the 8xx, _PAGE_RW hence _PAGE_WRITE is defined -* as 0 and _PAGE_RO has to be set when a page is not writable -*/ - if (write) - mask |= _PAGE_WRITE; - else - mask |= _PAGE_RO; - if ((pte_val(pte) & mask) != mask) + if (!pte_present(pte) || !pte_read(pte)) + return 0; + if (write && !pte_write(pte)) return 0; /* hugepages are never "special" */ -- 2.12.0
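To spell out why case 3/ defeats the old test, here is a small stand-alone illustration; the flag values below are hypothetical, chosen to mimic a _PAGE_RO-only target, and this is not kernel code.

/* On a _PAGE_RO-only target, _PAGE_RW and therefore _PAGE_WRITE are 0. */
#define _PAGE_PRESENT	0x001
#define _PAGE_READ	0x000	/* not defined by the target, falls back to 0 */
#define _PAGE_WRITE	0x000	/* _PAGE_RW is 0, so this is 0 as well */
#define _PAGE_RO	0x002

static int old_gup_check(unsigned long pte, int write)
{
	unsigned long mask = _PAGE_PRESENT | _PAGE_READ;

	if (write)
		mask |= _PAGE_WRITE;	/* ORs in 0: writes are never refused */
	return (pte & mask) == mask;
}

/*
 * old_gup_check(_PAGE_PRESENT | _PAGE_RO, 1) returns 1, i.e. a read-only
 * page passes a write fast-GUP, which is the bug the pte_read()/pte_write()
 * helpers above avoid.
 */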
Re: [PATCH] powerpc/mm: Implemented default_hugepagesz verification for powerpc
Em 2017-07-05 13:03, victora escreveu: Em 2017-07-05 01:26, Aneesh Kumar K.V escreveu: On Tuesday 04 July 2017 01:35 AM, Victor Aoqui wrote: Implemented default hugepage size verification (default_hugepagesz=) in order to allow allocation of defined number of pages (hugepages=) only for supported hugepage sizes. Signed-off-by: Victor Aoqui --- arch/powerpc/mm/hugetlbpage.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index a4f33de..464e72e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -797,6 +797,21 @@ static int __init hugepage_setup_sz(char *str) } __setup("hugepagesz=", hugepage_setup_sz); +static int __init default_hugepage_setup_sz(char *str) +{ +unsigned long long size; + +size = memparse(str, &str); + +if (add_huge_page_size(size) != 0) { +hugetlb_bad_size(); +pr_err("Invalid default huge page size specified(%llu)\n", size); +} + +return 1; +} +__setup("default_hugepagesz=", default_hugepage_setup_sz); isn't that a behavior change in what we have now ? . Right now if size specified is not supported, we fallback to HPAGE_SIZE. Yes, it is. However, is this a correct behavior? If we specify an unsupported value, for example default_hugepagesz=1M and hugepages=1000, 1M will be ignored and 1000 pages of 16M (arch default) will be allocated. This could lead to non-expected out of of memory/performance issue. mm/hugetlb.c if (!size_to_hstate(default_hstate_size)) { default_hstate_size = HPAGE_SIZE; if (!size_to_hstate(default_hstate_size)) hugetlb_add_hstate(HUGETLB_PAGE_ORDER); } + struct kmem_cache *hugepte_cache; static int __init hugetlbpage_init(void) { Even if we want to do this, this should be done in generic code and should not be powerpc specific The verification of supported powerpc hugepage size (hugepagesz=) is being performed on add_huge_page_size(), which is currently defined in arch/powerpc/mm/hugetlbpage.c. I think it makes more sense to implement default_hugepagesz= verification on arch/powerpc, don't you think? -aneesh Hi Aneesh, Did you have time to review the comments above? Thanks Victor
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Arnaldo Carvalho de Melo [a...@kernel.org] wrote: > Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: > > Sukadev Bhattiprolu writes: > > > Please pull the JSON files for POWER9 PMU events. > > > > perf vendor events: Add POWER9 PMU events > > > perf vendor events: Add POWER9 PVRs to mapfile > > > I think the PVRs need work. > > > You have: > > > +004e0100,1,power9.json,core > > +004e,1,power9.json,core > > > The first is P9 DD1, but the second doesn't exist. We have it in the > > kernel, but with a mask of 0x. From memory the perf code doesn't > > do any masking or anything fancy, it looks for an exact match. > > > So for starters you should probably drop that one and add 0x004e1200 and > > 0x004e0200. Yes, thanks. > > Could this check be somehow done automatically? Using whatever is in the > kernel sources or some IBM released document? Yes, had a recent discussion offline with Michael Petlan and Jiri about generalizing/simplifying this. For now, I have added the above PVRs and will look into the broader fix. The following changes since commit 07d306c838c5c30196619baae36107d0615e459b: Merge git://www.linux-watchdog.org/linux-watchdog (2017-07-11 09:59:37 -0700) are available in the git repository at: https://github.com/sukadev/linux/ p9-json for you to fetch changes up to 8870b54739866c6eea23c36a0125af8045fbd572: perf vendor events: Add POWER9 PVRs to mapfile (2017-07-12 11:11:00 -0500) Sukadev Bhattiprolu (2): perf vendor events: Add POWER9 PMU events perf vendor events: Add POWER9 PVRs to mapfile tools/perf/pmu-events/arch/powerpc/mapfile.csv | 3 + .../perf/pmu-events/arch/powerpc/power9/cache.json | 176 + .../arch/powerpc/power9/floating-point.json| 44 ++ .../pmu-events/arch/powerpc/power9/frontend.json | 446 +++ .../pmu-events/arch/powerpc/power9/marked.json | 782 +++ .../pmu-events/arch/powerpc/power9/memory.json | 158 .../perf/pmu-events/arch/powerpc/power9/other.json | 836 + .../pmu-events/arch/powerpc/power9/pipeline.json | 680 + tools/perf/pmu-events/arch/powerpc/power9/pmc.json | 146 .../arch/powerpc/power9/translation.json | 272 +++ 10 files changed, 3543 insertions(+) create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/cache.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/floating-point.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/frontend.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/marked.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/memory.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/other.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/pipeline.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/pmc.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/translation.json
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Em Wed, Jul 12, 2017 at 09:40:08AM -0700, Sukadev Bhattiprolu escreveu: > Arnaldo Carvalho de Melo [a...@kernel.org] wrote: > > Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: > > > So for starters you should probably drop that one and add 0x004e1200 and > > > 0x004e0200. > Yes, thanks. > > Could this check be somehow done automatically? Using whatever is in the > > kernel sources or some IBM released document? > Yes, had a recent discussion offline with Michael Petlan and Jiri about > generalizing/simplifying this. For now, I have added the above PVRs and > will look into the broader fix. > The following changes since commit 07d306c838c5c30196619baae36107d0615e459b: > Merge git://www.linux-watchdog.org/linux-watchdog (2017-07-11 09:59:37 > -0700) > are available in the git repository at: > https://github.com/sukadev/linux/ p9-json Can you please base your branches out ouf tip/perf/core or acme/perf/core when you submit it with the fixes discussed here? thanks, - Arnaldo
Re: [RFC v5 36/38] selftest: PowerPC specific test updates to memory protection keys
On Tue, Jul 11, 2017 at 10:33:09AM -0700, Dave Hansen wrote: > On 07/05/2017 02:22 PM, Ram Pai wrote: > > Abstracted out the arch specific code into the header file, and > > added powerpc specific changes. > > > > a) added 4k-backed hpte, memory allocator, powerpc specific. > > b) added three test case where the key is associated after the page is > > accessed/allocated/mapped. > > c) cleaned up the code to make checkpatch.pl happy > > There's a *lot* of churn here. If it breaks, I'm going to have a heck > of a time figuring out which hunk broke. Is there any way to break this > up into a series of things that we have a chance at bisecting? Just finished breaking down the changes into 20 gradual increments. I have pushed it to my github tree at https://github.com/rampai/memorykeys.git branch is memkey.v6-rc3 See if it works for you. I am sure I would have broken something on x86 since I dont have a x86 platform to test. Let me know, Thanks, RP
Re: [RFC v5 11/38] mm: introduce an additional vma bit for powerpc pkey
On Tue, Jul 11, 2017 at 11:10:46AM -0700, Dave Hansen wrote: > On 07/05/2017 02:21 PM, Ram Pai wrote: > > Currently there are only 4bits in the vma flags to support 16 keys > > on x86. powerpc supports 32 keys, which needs 5bits. This patch > > introduces an addition bit in the vma flags. > > > > Signed-off-by: Ram Pai > > --- > > fs/proc/task_mmu.c |6 +- > > include/linux/mm.h | 18 +- > > 2 files changed, 18 insertions(+), 6 deletions(-) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index f0c8b33..2ddc298 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, > > struct vm_area_struct *vma) > > [ilog2(VM_MERGEABLE)] = "mg", > > [ilog2(VM_UFFD_MISSING)]= "um", > > [ilog2(VM_UFFD_WP)] = "uw", > > -#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS > > +#ifdef CONFIG_ARCH_HAS_PKEYS > > /* These come out via ProtectionKey: */ > > [ilog2(VM_PKEY_BIT0)] = "", > > [ilog2(VM_PKEY_BIT1)] = "", > > [ilog2(VM_PKEY_BIT2)] = "", > > [ilog2(VM_PKEY_BIT3)] = "", > > +#endif /* CONFIG_ARCH_HAS_PKEYS */ > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > > + /* Additional bit in ProtectionKey: */ > > + [ilog2(VM_PKEY_BIT4)] = "", > > #endif > > I'd probably just leave the #ifdef out and eat the byte or whatever of > storage that this costs us on x86. fine with me. > > > }; > > size_t i; > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 7cb17c6..3d35bcc 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table > > *, int, void __user *, > > #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit > > architectures */ > > #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit > > architectures */ > > #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit > > architectures */ > > +#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */ > > Please just copy the above lines. Just copying over makes checkpatch.pl unhappy. It exceeds 80 columns. > > > #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) > > #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) > > #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) > > #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) > > +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) > > #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ > > > > -#if defined(CONFIG_X86) > > -# define VM_PATVM_ARCH_1 /* PAT reserves whole VMA at > > once (x86) */ > > -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) > > +#ifdef CONFIG_ARCH_HAS_PKEYS > > # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 > > -# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit > > value */ > > +# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 > > # define VM_PKEY_BIT1 VM_HIGH_ARCH_1 > > # define VM_PKEY_BIT2 VM_HIGH_ARCH_2 > > # define VM_PKEY_BIT3 VM_HIGH_ARCH_3 > > -#endif > > +#endif /* CONFIG_ARCH_HAS_PKEYS */ > > We have the space here, so can we just say that it's 4-bits on x86 and 5 > on ppc? sure. > > > +#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) > > +# define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* additional key bit used on > > ppc64 */ > > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > > Why bother #ifdef'ing a #define? ok. RP
Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer
On Mon, 2017-06-26 at 20:08 +0200, Frederic Barrat wrote: > + if (desc & OPAL_PCI_P2P_ENABLE) { > + pe_init->p2p_initiator_count++; > + } else { > + if (pe_init->p2p_initiator_count > 0) { > + pe_init->p2p_initiator_count--; > + if (!pe_init->p2p_initiator_count) > + pnv_pci_ioda2_set_bypass(pe_init, true); > + } So you have the initiator refcounting in Linux and the target refcounting in OPAL ... any reason for that ? Cheers, Ben.
Re: [RFC v5 11/38] mm: introduce an additional vma bit for powerpc pkey
On Wed, 2017-07-12 at 15:23 -0700, Ram Pai wrote: > Just copying over makes checkpatch.pl unhappy. It exceeds 80 columns. Which is fine to ignore in a case like that where you remain consistent with the existing code. Ben.
Re: [PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock
On Wed, 2017-07-12 at 23:01 +0800, Jia He wrote: > Virtual time base(vtb) is a register which increases only in guest. > Any exit from guest to host will stop the vtb(saved and restored by kvm). > But if there is an IO causes guest exits to host, the guest's watchdog > (watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock) > needs to also include the time elapsed in host. get_vtb is not correct in > this case. > > Also, the TB_OFFSET is well saved and restored by qemu after commit [1]. > So we can use get_tb here. That completely defeats the purpose here... This was done specifically to exploit the VTB which doesn't count in hypervisor mode. > > [1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1 > > Signed-off-by: Jia He > --- > arch/powerpc/kernel/time.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c > index fe6f3a2..c542dd3 100644 > --- a/arch/powerpc/kernel/time.c > +++ b/arch/powerpc/kernel/time.c > @@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void) > unsigned long long running_clock(void) > { > /* > - * Don't read the VTB as a host since KVM does not switch in host > - * timebase into the VTB when it takes a guest off the CPU, reading the > - * VTB would result in reading 'last switched out' guest VTB. > + * Use get_tb instead of get_vtb for guest since the TB_OFFSET has been > + * well saved/restored when qemu does suspend/resume. >* >* Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it >* would be unsafe to rely only on the #ifdef above. >*/ > if (firmware_has_feature(FW_FEATURE_LPAR) && > cpu_has_feature(CPU_FTR_ARCH_207S)) > - return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << > tb_to_ns_shift; > + return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << > tb_to_ns_shift; > > /* >* This is a next best approximation without a VTB.
Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Wed, 2017-07-12 at 09:23 +0200, Michal Hocko wrote: > > > > > Ideally the MMU looks at the PTE for keys, in order to enforce > > protection. This is the case with x86 and is the case with power9 Radix > > page table. Hence the keys have to be programmed into the PTE. > > But x86 doesn't update ptes for PKEYs, that would be just too expensive. > You could use standard mprotect to do the same... What do you mean ? x86 ends up in mprotect_fixup -> change_protection() which will update the PTEs just the same as we do. Changing the key for a page is a form mprotect. Changing the access permissions for keys is different, for us it's a special register (AMR). I don't understand why you think we are doing any differently than x86 here. > > However with HPT on power, these keys do not necessarily have to be > > programmed into the PTE. We could bypass the Linux Page Table Entry(PTE) > > and instead just program them into the Hash Page Table(HPTE), since > > the MMU does not refer the PTE but refers the HPTE. The last version > > of the page attempted to do that. It worked as follows: > > > > a) when a address range is requested to be associated with a key; by the > >application through key_mprotect() system call, the kernel > >stores that key in the vmas corresponding to that address > >range. > > > > b) Whenever there is a hash page fault for that address, the fault > >handler reads the key from the VMA and programs the key into the > >HPTE. __hash_page() is the function that does that. > > What causes the fault here? The hardware. With the hash MMU, the HW walks a hash table which is effectively a large in-memory TLB extension. When a page isn't found there, a "hash fault" is generated allowing Linux to populate that hash table with the content of the corresponding PTE. > > c) Once the hpte is programmed, the MMU can sense key violations and > >generate key-faults. > > > > The problem is with step (b). This step is really a very critical > > path which is performance sensitive. We dont want to add any delays. > > However if we want to access the key from the vma, we will have to > > hold the vma semaphore, and that is a big NO-NO. As a result, this > > design had to be dropped. > > > > > > > > I reverted back to the old design i.e the design in v4 version. In this > > version we do the following: > > > > a) when a address range is requested to be associated with a key; by the > >application through key_mprotect() system call, the kernel > >stores that key in the vmas corresponding to that address > >range. Also the kernel programs the key into Linux PTE coresponding to > > all the > >pages associated with the address range. > > OK, so how is this any different from the regular mprotect then? It takes the key argument. This is nothing new. This was done for x86 already, we are just re-using the infrastructure. Look at do_mprotect_pkey() in mm/mprotect.c today. It's all the same code, pkey_mprotect() is just mprotect with an added key argument. > > b) Whenever there is a hash page fault for that address, the fault > >handler reads the key from the Linux PTE and programs the key into > >the HPTE. > > > > c) Once the HPTE is programmed, the MMU can sense key violations and > >generate key-faults. > > > > > > Since step (b) in this case has easy access to the Linux PTE, and hence > > to the key, it is fast to access it and program the HPTE. Thus we avoid > > taking any performance hit on this critical path. 
> > > > Hope this explains the rationale, > > > > > > As promised here is the high level design: > > I will read through that later > [...]
Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer
On 12/07/2017 at 17:39, Benjamin Herrenschmidt wrote:
> On Mon, 2017-06-26 at 20:08 +0200, Frederic Barrat wrote:
>> +	if (desc & OPAL_PCI_P2P_ENABLE) {
>> +		pe_init->p2p_initiator_count++;
>> +	} else {
>> +		if (pe_init->p2p_initiator_count > 0) {
>> +			pe_init->p2p_initiator_count--;
>> +			if (!pe_init->p2p_initiator_count)
>> +				pnv_pci_ioda2_set_bypass(pe_init, true);
>> +		}
>
> So you have the initiator refcounting in Linux and the target
> refcounting in OPAL ... any reason for that ?
>
> Cheers,
> Ben.

The initiator refcount is per PE and skiboot doesn't track PEs. Also
when the initiator refcount falls back to 0, we should restore the
default bypass setting for the TVE, and that's more easily done from
Linux, since it knows the start and end address of the memory. So for
those reasons, I don't really have the choice, the initiator refcount
seems better suited in Linux.

The target refcount is per PHB and it just seemed easier to keep it in OPAL.

Does the lack of symmetry bother you?

  Fred
Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer
On Wed, 2017-07-12 at 18:01 -0500, Frederic Barrat wrote: > > So you have the initiator refcounting in Linux and the target > > refcounting in OPAL ... any reason for that ? > > The initiator refcount is per PE and skiboot doesn't track PEs. Also > when the initiator refcount falls back to 0, we should restore the > default bypass setting for the TVE, and that's more easily done from > linux, since it knows the start and end address of the memory. So for > those reasons, I don't really have the choice, the initiator refcount > seems better suited in linux. > > The target refcount is per PHB and it just seemed easier to keep it in opal. > > Does the lack of symmetry bother you? Only mildly. I would have put the whole refcount in Linux... Cheers, Ben.
Re: [PATCH v2] powerpc/powernv: Use darn instr for random_seed on p9
On Tue, Jul 11, 2017 at 7:34 PM, Daniel Axtens wrote: > Hi Matt, > >> Currently ppc_md.get_random_seed uses the powernv_get_random_long function. >> A guest calling this function would have to go through the hypervisor. The >> 'darn' instruction, introduced in POWER9, allows us to bypass this by >> directly obtaining a value from the mmio region. >> >> This patch adds a function for ppc_md.get_random_seed on p9, >> utilising the darn instruction. > > This patch looks pretty good - I'm not set up to test it but I have one > code-style nit: > >> diff --git a/arch/powerpc/platforms/powernv/rng.c >> b/arch/powerpc/platforms/powernv/rng.c >> index 5dcbdea..ab6f411 100644 >> --- a/arch/powerpc/platforms/powernv/rng.c >> +++ b/arch/powerpc/platforms/powernv/rng.c >> @@ -8,6 +8,7 @@ >> */ >> >> #define pr_fmt(fmt) "powernv-rng: " fmt >> +#define DARN_ERR 0xul >> >> #include >> #include >> @@ -16,6 +17,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -67,6 +69,21 @@ int powernv_get_random_real_mode(unsigned long *v) >> return 1; >> } >> >> +int powernv_get_random_darn(unsigned long *v) > > This is only referenced in this file so it should probably be labelled > as 'static'. > That's true, my thinking was to then use the get_random_darn where the get_random_long was being used. Looks like there is only one other caller of it at the moment, kvmppc_h_random (in arch/powerpc/kvm/book3s_hv_builtin.c). We may want to change that to get_random_darn. >> +{ >> + unsigned long val; >> + >> + /* Using DARN with L=1 - conditioned random number */ >> + asm (PPC_DARN(%0, 1)"\n" : "=r"(val) :); >> + >> + if (val == DARN_ERR) >> + return 0; >> + >> + *v = val; >> + >> + return 1; > > I was a bit confused to see 1 representing success - I think I have been > in userspace too long. But I checked against pseries_get_random_long and > it is in fact correct, so good for you! > > An excellent followup patch would be changing the type of this function > to be bool rather than int, but no pressure :) That's probably a good idea! Thanks, Matt > > Regards, > Daniel > >> +} >> + >> int powernv_get_random_long(unsigned long *v) >> { >> struct powernv_rng *rng; >> @@ -136,6 +153,7 @@ static __init int rng_create(struct device_node *dn) >> static __init int rng_init(void) >> { >> struct device_node *dn; >> + unsigned long drn_test; >> int rc; >> >> for_each_compatible_node(dn, NULL, "ibm,power-rng") { >> @@ -150,6 +168,10 @@ static __init int rng_init(void) >> of_platform_device_create(dn, NULL, NULL); >> } >> >> + if (cpu_has_feature(CPU_FTR_ARCH_300) && >> + powernv_get_random_darn(&drn_test)) >> + ppc_md.get_random_seed = powernv_get_random_darn; >> + >> return 0; >> } >> machine_subsys_initcall(powernv, rng_init); >> -- >> 2.9.3
Re: [PATCH v2] powerpc/powernv: Use darn instr for random_seed on p9
Matt Brown writes: > On Tue, Jul 11, 2017 at 7:34 PM, Daniel Axtens wrote: >>> @@ -67,6 +69,21 @@ int powernv_get_random_real_mode(unsigned long *v) >>> return 1; >>> } >>> >>> +int powernv_get_random_darn(unsigned long *v) >> >> This is only referenced in this file so it should probably be labelled >> as 'static'. >> > > That's true, my thinking was to then use the get_random_darn where the > get_random_long was being used. > Looks like there is only one other caller of it at the moment, > kvmppc_h_random (in arch/powerpc/kvm/book3s_hv_builtin.c). > We may want to change that to get_random_darn. It needs to use the right one depending on whether it's running on P8 or P9. It should really just use ppc_md.get_random_seed(), except that it can be called in real mode, which means it needs to be more careful. So as usual it's complicated :) cheers
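A hedged kernel-side sketch of the kind of dispatch described above follows; the wrapper name and the real_mode flag are assumptions, and only cpu_has_feature() and the powernv_* helpers quoted earlier in the thread come from the existing code and patch.

/* Pick the cheapest safe source: DARN is an instruction, so it needs no
 * MMIO mapping and can be used from real mode on P9; otherwise fall back
 * to the existing device-backed paths. */
static int powernv_get_random(unsigned long *v, bool real_mode)
{
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		return powernv_get_random_darn(v);
	if (real_mode)
		return powernv_get_random_real_mode(v);
	return powernv_get_random_long(v);
}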
[PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
This patch adds emulation of the cmpb instruction, enabling xmon to emulate this instruction. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 24 1 file changed, 24 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 33117f8..f3e9ba8 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -596,6 +596,23 @@ static nokprobe_inline void do_cmp_unsigned(struct pt_regs *regs, unsigned long regs->ccr = (regs->ccr & ~(0xf << shift)) | (crval << shift); } +static nokprobe_inline void do_cmpb(struct pt_regs *regs, unsigned long v1, + unsigned long v2, int rd) +{ + unsigned long out_val, mask; + int i; + + out_val = 0; + for (i = 0; i < 8; i++) { + mask = 0xff << (i * 8); + + if ((v1 & mask) == (v2 & mask)) + out_val |= mask; + } + + regs->gpr[rd] = out_val; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1049,6 +1066,13 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, do_cmp_unsigned(regs, val, val2, rd >> 2); goto instr_done; + case 19173952: /* cmpb */ + val = regs->gpr[rd]; + val2 = regs->gpr[rb]; + + do_cmpb(regs, val, val2, ra); + goto instr_done; + /* * Arithmetic instructions */ -- 2.9.3
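A small user-space reference for the byte-compare semantics being emulated may help when testing the patch (it is not part of the patch; note the 0xffUL literal so the shift by 56 stays in unsigned long).

#include <stdio.h>

static unsigned long cmpb_ref(unsigned long a, unsigned long b)
{
	unsigned long out = 0;
	int i;

	for (i = 0; i < 8; i++) {
		unsigned long mask = 0xffUL << (i * 8);

		if ((a & mask) == (b & mask))
			out |= mask;	/* matching byte lane becomes all ones */
	}
	return out;
}

int main(void)
{
	/* bytes that match between the two operands become 0xff in the result */
	printf("%016lx\n", cmpb_ref(0x1122334455667788UL, 0xff223344ff66ff88UL));
	/* prints 00ffffff00ff00ff */
	return 0;
}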
[PATCH 2/5] powerpc/lib/sstep: Add popcnt instruction emulation
This adds emulations for the popcntb, popcntw, and popcntd instructions. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 39 +++ 1 file changed, 39 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index f3e9ba8..cf69987 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -613,6 +613,30 @@ static nokprobe_inline void do_cmpb(struct pt_regs *regs, unsigned long v1, regs->gpr[rd] = out_val; } +/* + * The size parameter is used to ajust the equivalent popcnt instruction. + * popcntb = 8, popcntw = 32, popcntd = 64 + */ +static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1, + int size, int ra) +{ + unsigned int out_val, mask, n; + int i, j; + + out_val = 0; + + for (i = 0; i < (64 / size); i++) { + n = 0; + for (j = 0; j < size; j++) { + mask = 1 << (j + (i * size)); + if (v1 & mask) + n++; + } + out_val |= n << (i * size); + } + regs->gpr[ra] = out_val; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1234,6 +1258,21 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, regs->gpr[ra] = (signed int) regs->gpr[rd]; goto logical_done; #endif + case 299528:/* popcntb */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 8, ra); + goto logical_done; + + case 17076744: /* popcntw */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 32, ra); + goto logical_done; +#ifdef __powerpc64__ + case 19173896: /* popcntd */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 64, ra); + goto logical_done; +#endif /* * Shift instructions -- 2.9.3
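(For reference, a worked example of the three variants per the ISA, independent of the helper above: popcntb counts set bits within each byte, popcntw within each 32-bit word, popcntd across the whole doubleword.)

static const unsigned long popcnt_in = 0xff0f000000000003UL;
/* popcntb: per-byte counts  -> 0x0804000000000002
 * popcntw: per-word counts  -> 0x0000000c00000002
 * popcntd: whole doubleword -> 0x000000000000000e (14) */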
[PATCH 3/5] powerpc/lib/sstep: Add bpermd instruction emulation
This adds emulation for the bpermd instruction. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index cf69987..603654d 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -637,6 +637,21 @@ static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1, regs->gpr[ra] = out_val; } +static nokprobe_inline void do_bpermd(struct pt_regs *regs, unsigned long v1, + unsigned long v2, int ra) +{ + unsigned int idx, i; + unsigned char perm; + + perm = 0x0; + for (i = 0; i < 8; i++) { + idx = (v1 >> (i * 8)) & 0xff; + if (idx < 64) + perm |= (v2 & (1 << idx)) >> (idx - i); + } + regs->gpr[ra] = 0 | perm; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1274,6 +1289,14 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, goto logical_done; #endif +#ifdef __powerpc64__ + case 2396736: /* bpermd */ + val = regs->gpr[rd]; + val2 = regs->gpr[rb]; + do_bpermd(regs, val, val2, ra); + goto logical_done; +#endif + /* * Shift instructions */ -- 2.9.3
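(For reference, bpermd's ISA semantics as a worked example, independent of the helper above: each of the eight bytes of RS is an index into RB, counted in IBM bit order where bit 0 is the most significant bit; an index of 64 or more selects 0; the eight selected bits are gathered, MSB first, into the low byte of RA.)

static const unsigned long bpermd_rs = 0x003f003f003f003fUL; /* indices 0,63,0,63,... */
static const unsigned long bpermd_rb = 0x8000000000000001UL; /* bits 0 and 63 set (IBM numbering) */
static const unsigned long bpermd_ra = 0x00000000000000ffUL; /* every selected bit is 1 */
/* with rb = 0x8000000000000000UL only the index-0 picks hit a set bit,
 * so the result would be 0x00000000000000aa */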
[PATCH 4/5] powerpc/lib/sstep: Add prty instruction emulation
This add emulation for the prtyw and prtyd instructions. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 58 +++- 1 file changed, 52 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 603654d..3228783 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -652,6 +652,42 @@ static nokprobe_inline void do_bpermd(struct pt_regs *regs, unsigned long v1, regs->gpr[ra] = 0 | perm; } +static nokprobe_inline void do_prtyw(struct pt_regs *regs, unsigned long v, + int ra) +{ + unsigned long low, high, out; + unsigned int i; + + high = 0; + low = 0; + out = 0; + + for (i = 0; i < 8; i++) { + if (v & (1 << (i * 8))) + (i < 4) ? (low++) : (high++); + } + + if (low % 2) + out |= low; + if (high % 2) + out |= (high << 32); + + regs->gpr[ra] = out; +} + +static nokprobe_inline void do_prtyd(struct pt_regs *regs, unsigned long v, + int ra) +{ + unsigned int count, i; + + count = 0; + for (i = 0; i < 8; i++) { + if (v & (1 << (i * 8))) + count++; + } + regs->gpr[ra] = count % 2; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1278,16 +1314,15 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, do_popcnt(regs, val, 8, ra); goto logical_done; - case 17076744: /* popcntw */ + case 2101768: /* prtyw */ val = regs->gpr[rd]; - do_popcnt(regs, val, 32, ra); + do_prtyw(regs, val, ra); goto logical_done; -#ifdef __powerpc64__ - case 19173896: /* popcntd */ + + case 2134536: /* prtyd */ val = regs->gpr[rd]; - do_popcnt(regs, val, 64, ra); + do_prtyd(regs, val, ra); goto logical_done; -#endif #ifdef __powerpc64__ case 2396736: /* bpermd */ @@ -1297,6 +1332,17 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, goto logical_done; #endif + case 17076744: /* popcntw */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 32, ra); + goto logical_done; +#ifdef __powerpc64__ + case 19173896: /* popcntd */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 64, ra); + goto logical_done; +#endif + /* * Shift instructions */ -- 2.9.3
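(For reference, the ISA semantics as a worked example, independent of the helpers above: prtyw places the parity of the least significant bits of each word's four bytes into the low bit of that word of RA; prtyd does the same over all eight bytes into RA's low bit.)

static const unsigned long prty_in = 0x0101010001010101UL;
/* byte lsbs, high word: 1,1,1,0 (parity 1); low word: 1,1,1,1 (parity 0)
 * prtyw -> 0x0000000100000000
 * prtyd -> 0x0000000000000001 (seven set lsbs in total, which is odd) */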
[PATCH 5/5] powerpc/lib/sstep: Add isel instruction emulation
This add emulation for the isel instruction. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 3228783..bb0e301 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -1297,6 +1297,16 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, regs->gpr[ra] = ~(regs->gpr[rd] & regs->gpr[rb]); goto logical_done; + case 585: /* isel */ + mb = (instr >> 6) & 0x1f; /* bc */ + val = (regs->ccr >> (mb + 32)) & 1; + + if (val) + regs->gpr[rd] = regs->gpr[ra]; + else + regs->gpr[rd] = regs->gpr[rb]; + goto logical_done; + case 922: /* extsh */ regs->gpr[ra] = (signed short) regs->gpr[rd]; goto logical_done; -- 2.9.3
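(For reference, the ISA definition of isel RT,RA,RB,BC written out as a small reference function; illustration only, not a drop-in for the hunk above. Note that BC counts CR bits from the most significant end, so BC=0 is CR0[LT], and that an RA field of 0 means the literal value 0 rather than GPR[0].)

static unsigned long isel_ref(unsigned long cr, int bc,
			      int ra, unsigned long ra_val,
			      unsigned long rb_val)
{
	/* CR bit "bc", counted from the MSB of the 32-bit CR */
	int cond = (cr >> (31 - bc)) & 1;

	/* (RA|0): an RA field of 0 selects the value 0, not GPR[0] */
	return cond ? (ra ? ra_val : 0) : rb_val;
}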
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Arnaldo Carvalho de Melo writes: > Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: >> Sukadev Bhattiprolu writes: >> > Please pull the JSON files for POWER9 PMU events. > >> > perf vendor events: Add POWER9 PMU events >> > perf vendor events: Add POWER9 PVRs to mapfile > >> I think the PVRs need work. > >> You have: > >> +004e0100,1,power9.json,core >> +004e,1,power9.json,core > >> The first is P9 DD1, but the second doesn't exist. We have it in the >> kernel, but with a mask of 0x. From memory the perf code doesn't >> do any masking or anything fancy, it looks for an exact match. > >> So for starters you should probably drop that one and add 0x004e1200 and >> 0x004e0200. > > Could this check be somehow done automatically? Using whatever is in the > kernel sources or some IBM released document? Not really no. The kernel defines the major versions in arch/powerpc/include/asm/reg.h. eg: #define PVR_POWER8E 0x004B #define PVR_POWER8NVL 0x004C #define PVR_POWER8 0x004D #define PVR_POWER9 0x004E But that's only the top part of the value perf is looking for. To match the full value you also need the minor revision. I don't actually have an exhaustive list of all the Power8 revisions that were/will ever be made. And for Power9 I definitely can't give you a full list yet. We do want to be able to match against the full PVR, in case we need a different set of events for a particularly CPU revision. But we could make it easier to maintain by making the search iterative. ie. instead of just looking for an exact match, we could say that the mapfile is sorted, and the first entry that matches will be used. With a mapfile like: 004e0100,1,power9.json,core 004e,1,power9.json,core That would match Power9 DD1 on the first line, but all other Power9 CPUs would hit the second line. Obviously get_cpuid_str() would need to be reworked to be match_cpuid_str() or something. cheers
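(A rough sketch of the iterative, first-match lookup being suggested; illustrative code with made-up structure names, not the existing perf tooling.)

#include <string.h>

struct pvr_map {
	const char *cpuid;	/* e.g. "004e0100" (P9 DD1) or "004e" (any P9) */
	const char *json;	/* e.g. "power9.json" */
};

/* the first entry whose cpuid is a prefix of the runtime PVR string wins,
 * so a sorted mapfile lists specific revisions before the bare "004e" */
static const char *match_cpuid_str(const struct pvr_map *map, int n,
				   const char *pvr)
{
	int i;

	for (i = 0; i < n; i++)
		if (!strncmp(pvr, map[i].cpuid, strlen(map[i].cpuid)))
			return map[i].json;

	return NULL;
}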
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Sukadev Bhattiprolu writes: ...
>
>  tools/perf/pmu-events/arch/powerpc/mapfile.csv      |   3 +
>  .../perf/pmu-events/arch/powerpc/power9/cache.json  | 176 +
>  .../arch/powerpc/power9/floating-point.json         |  44 ++
>  .../pmu-events/arch/powerpc/power9/frontend.json    | 446 +++
>  .../pmu-events/arch/powerpc/power9/marked.json      | 782 +++
>  .../pmu-events/arch/powerpc/power9/memory.json      | 158

In the map file we have "power9.json", but the files are power9/x.json. How does that work? cheers
Re: [PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
On 13/07/17 13:25, Matt Brown wrote: > @@ -1049,6 +1066,13 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, > do_cmp_unsigned(regs, val, val2, rd >> 2); > goto instr_done; > + case 19173952: /* cmpb */ This looks wrong and should never trigger, given that the switch statement is comparing against ((instr >> 1) & 0x3ff). How did you get this value? -- Andrew Donnellan OzLabs, ADL Canberra andrew.donnel...@au1.ibm.com IBM Australia Limited
Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Thu 13-07-17 08:53:52, Benjamin Herrenschmidt wrote: > On Wed, 2017-07-12 at 09:23 +0200, Michal Hocko wrote: > > > > > > > > Ideally the MMU looks at the PTE for keys, in order to enforce > > > protection. This is the case with x86 and is the case with power9 Radix > > > page table. Hence the keys have to be programmed into the PTE. > > > > But x86 doesn't update ptes for PKEYs, that would be just too expensive. > > You could use standard mprotect to do the same... > > What do you mean ? x86 ends up in mprotect_fixup -> change_protection() > which will update the PTEs just the same as we do. > > Changing the key for a page is a form mprotect. Changing the access > permissions for keys is different, for us it's a special register > (AMR). > > I don't understand why you think we are doing any differently than x86 > here. That was a misunderstanding on my side as explained in other reply. > > > However with HPT on power, these keys do not necessarily have to be > > > programmed into the PTE. We could bypass the Linux Page Table Entry(PTE) > > > and instead just program them into the Hash Page Table(HPTE), since > > > the MMU does not refer the PTE but refers the HPTE. The last version > > > of the page attempted to do that. It worked as follows: > > > > > > a) when a address range is requested to be associated with a key; by the > > >application through key_mprotect() system call, the kernel > > >stores that key in the vmas corresponding to that address > > >range. > > > > > > b) Whenever there is a hash page fault for that address, the fault > > >handler reads the key from the VMA and programs the key into the > > >HPTE. __hash_page() is the function that does that. > > > > What causes the fault here? > > The hardware. With the hash MMU, the HW walks a hash table which is > effectively a large in-memory TLB extension. When a page isn't found > there, a "hash fault" is generated allowing Linux to populate that > hash table with the content of the corresponding PTE. Thanks for the clarification -- Michal Hocko SUSE Labs
Re: [PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
On Thu, Jul 13, 2017 at 01:51:30PM +1000, Andrew Donnellan wrote: > On 13/07/17 13:25, Matt Brown wrote: > >@@ -1049,6 +1066,13 @@ int analyse_instr(struct instruction_op *op, struct > >pt_regs *regs, > > do_cmp_unsigned(regs, val, val2, rd >> 2); > > goto instr_done; > > > >+case 19173952: /* cmpb */ > > This looks wrong and should never trigger, given that the switch > statement is comparing against ((instr >> 1) & 0x3ff). > > How did you get this value? The correct number is 508, and 19173952 = 37744*508. How to get 37744 is a mystery though :-) Segher
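(For completeness: cmpb is an X-form instruction with primary opcode 31 and extended opcode 508, and the inner switch already masks with ((instr >> 1) & 0x3ff), so the case simply becomes the following, the body unchanged from the patch.)

	case 508:	/* cmpb */
		val = regs->gpr[rd];
		val2 = regs->gpr[rb];

		do_cmpb(regs, val, val2, ra);
		goto instr_done;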
Re: [PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
On Thu, Jul 13, 2017 at 01:25:44PM +1000, Matt Brown wrote: > +static nokprobe_inline void do_cmpb(struct pt_regs *regs, unsigned long v1, > + unsigned long v2, int rd) > +{ > + unsigned long out_val, mask; > + int i; > + > + out_val = 0; > + for (i = 0; i < 8; i++) { > + mask = 0xff << (i * 8); 0xffUL ? > + > + if ((v1 & mask) == (v2 & mask)) > + out_val |= mask; > + } > + > + regs->gpr[rd] = out_val; > +} Segher
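(The reason the suffix matters: 0xff is a 32-bit int, so once i reaches 4 the shift amount is 32 or more, which is undefined for a 32-bit type and in any case cannot produce a mask covering the upper bytes of the doubleword. Widening the constant fixes the loop.)

		mask = 0xffUL << (i * 8);	/* unsigned long mask reaches all 8 bytes */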
Re: [PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock
Hi Ben I add some printk logs in watchdog_timer_fn in the guest [ 16.025222] get_vtb=8236291881, get_tb=13756711357, get_timestamp=4 [ 20.025624] get_vtb=9745285807, get_tb=15804711283, get_timestamp=7 [ 24.025042] get_vtb=11518119641, get_tb=17852711085, get_timestamp=10 [ 28.024074] get_vtb=13192704319, get_tb=19900711071, get_timestamp=13 [ 32.024086] get_vtb=14856516982, get_tb=21948711066, get_timestamp=16 [ 36.024075] get_vtb=16569127618, get_tb=23996711078, get_timestamp=20 [ 40.024138] get_vtb=17008865823, get_tb=26044718418, get_timestamp=20 [ 44.023993] get_vtb=17020637241, get_tb=28092716383, get_timestamp=20 [ 48.023996] get_vtb=17022857170, get_tb=30140718472, get_timestamp=20 [ 52.023996] get_vtb=17024268541, get_tb=32188718432, get_timestamp=20 [ 56.023996] get_vtb=17036577783, get_tb=34236718077, get_timestamp=20 [ 60.023996] get_vtb=17037829743, get_tb=36284718437, get_timestamp=20 [ 64.023992] get_vtb=17039846747, get_tb=38332716609, get_timestamp=20 [ 68.023991] get_vtb=17041448345, get_tb=40380715903, get_timestamp=20 The get_timestamp(use get_vtb(),unit is second) is slower down compared with printk time. You also can obviously watch the get_vtb increment is slowly less than get_tb increment. Without this patch, I thought there might be some softlockup warnings missed in guest. -Jia On 13/07/2017 6:45 AM, Benjamin Herrenschmidt wrote: On Wed, 2017-07-12 at 23:01 +0800, Jia He wrote: Virtual time base(vtb) is a register which increases only in guest. Any exit from guest to host will stop the vtb(saved and restored by kvm). But if there is an IO causes guest exits to host, the guest's watchdog (watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock) needs to also include the time elapsed in host. get_vtb is not correct in this case. Also, the TB_OFFSET is well saved and restored by qemu after commit [1]. So we can use get_tb here. That completely defeats the purpose here... This was done specifically to exploit the VTB which doesn't count in hypervisor mode. [1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1 Signed-off-by: Jia He --- arch/powerpc/kernel/time.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index fe6f3a2..c542dd3 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void) unsigned long long running_clock(void) { /* -* Don't read the VTB as a host since KVM does not switch in host -* timebase into the VTB when it takes a guest off the CPU, reading the -* VTB would result in reading 'last switched out' guest VTB. +* Use get_tb instead of get_vtb for guest since the TB_OFFSET has been +* well saved/restored when qemu does suspend/resume. * * Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it * would be unsafe to rely only on the #ifdef above. */ if (firmware_has_feature(FW_FEATURE_LPAR) && cpu_has_feature(CPU_FTR_ARCH_207S)) - return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; + return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; /* * This is a next best approximation without a VTB.
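(A back-of-the-envelope reading of the log above, assuming the usual 512 MHz timebase, which shows why get_timestamp() sticks at 20 once the guest starts exiting heavily.)

/* between the 36 s and 40 s prints: */
unsigned long tb_delta  = 26044718418UL - 23996711078UL; /* ~2.048e9 ticks, ~4.0 s  */
unsigned long vtb_delta = 17008865823UL - 16569127618UL; /* ~0.440e9 ticks, ~0.86 s */
/* the timebase tracks wall-clock time, but the VTB only advances while the
 * vCPU actually runs inside the guest, so a VTB-based running_clock() (and
 * hence get_timestamp()) barely moves during this interval */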