[PATCH v5 2/5] powerpc/kprobes: Mark newly allocated probes as RO
With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one W+X page at boot by default. This can be tested with CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the kernel log during boot. powerpc doesn't implement its own alloc() for kprobes like other architectures do, but we couldn't immediately mark RO anyway since we do a memcpy to the page we allocate later. After that, nothing should be allowed to modify the page, and write permissions are removed well before the kprobe is armed. Thus mark newly allocated probes as read-only once it's safe to do so. Signed-off-by: Russell Currey --- arch/powerpc/kernel/kprobes.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 2d27ec4feee4..2610496de7c7 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -24,6 +24,7 @@ #include #include #include +#include DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL; DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); @@ -131,6 +132,8 @@ int arch_prepare_kprobe(struct kprobe *p) (unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t)); } + set_memory_ro((unsigned long)p->ainsn.insn, 1); + p->ainsn.boostable = 0; return ret; } -- 2.23.0
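For readability (the hunk above is hard to follow inline), here is a simplified sketch of the ordering the patch relies on: copy into the newly allocated slot first, flush the icache, then drop write permission using the set_memory_ro() helper added in patch 1/5. This is an illustration with a made-up function name, not the literal arch_prepare_kprobe() body:

#include <linux/kprobes.h>
#include <linux/string.h>
#include <linux/set_memory.h>
#include <asm/cacheflush.h>

/* Illustrative only: mirrors the ordering described above. */
static int example_prepare_insn_slot(struct kprobe *p)
{
	p->ainsn.insn = get_insn_slot();	/* page is still writable here */
	if (!p->ainsn.insn)
		return -ENOMEM;

	/* last write to the page: copy the probed instruction */
	memcpy(p->ainsn.insn, p->addr, sizeof(kprobe_opcode_t));
	flush_icache_range((unsigned long)p->ainsn.insn,
			   (unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));

	/* nothing may modify the slot from here on, well before arming */
	set_memory_ro((unsigned long)p->ainsn.insn, 1);
	return 0;
}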
[PATCH v5 1/5] powerpc/mm: Implement set_memory() routines
The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, and are generally useful primitives to have. This implementation is designed to be completely generic across powerpc's many MMUs. It's possible that this could be optimised to be faster for specific MMUs, but the focus is on having a generic and safe implementation for now. This implementation does not handle cases where the caller is attempting to change the mapping of the page it is executing from, or if another CPU is concurrently using the page being altered. These cases likely shouldn't happen, but a more complex implementation with MMU-specific code could safely handle them, so that is left as a TODO for now. Signed-off-by: Russell Currey --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c| 77 +++ 4 files changed, 111 insertions(+) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3e56c9c2f16e..8f7005f0d097 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -133,6 +133,7 @@ config PPC select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 + select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h new file mode 100644 index ..5230ddb2fefd --- /dev/null +++ b/arch/powerpc/include/asm/set_memory.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_SET_MEMORY_H +#define _ASM_POWERPC_SET_MEMORY_H + +#define SET_MEMORY_RO 1 +#define SET_MEMORY_RW 2 +#define SET_MEMORY_NX 3 +#define SET_MEMORY_X 4 + +int change_memory_attr(unsigned long addr, int numpages, int action); + +static inline int set_memory_ro(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RO); +} + +static inline int set_memory_rw(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RW); +} + +static inline int set_memory_nx(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_NX); +} + +static inline int set_memory_x(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_X); +} + +#endif diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 5e147986400d..d0a0bcbc9289 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o obj-$(CONFIG_PPC_PTDUMP) += ptdump/ obj-$(CONFIG_KASAN)+= kasan/ +obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pageattr.o diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c new file mode 100644 index ..aedd79173a44 --- /dev/null +++ b/arch/powerpc/mm/pageattr.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * MMU-generic set_memory implementation for powerpc + * + * Copyright 2019, IBM Corporation. + */ + +#include +#include + +#include +#include +#include + + +/* + * Updates the attributes of a page in three steps: + * + * 1. invalidate the page table entry + * 2. flush the TLB + * 3. 
install the new entry with the updated attributes + * + * This is unsafe if the caller is attempting to change the mapping of the + * page it is executing from, or if another CPU is concurrently using the + * page being altered. + * + * TODO make the implementation resistant to this. + */ +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) +{ + int action = *((int *)data); + pte_t pte_val; + int ret = 0; + + spin_lock(&init_mm.page_table_lock); + + // invalidate the PTE so it's safe to modify + pte_val = ptep_get_and_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + // modify the PTE bits as desired, then apply + switch (action) { + case SET_MEMORY_RO: + pte_val = pte_wrprotect(pte_val); + break; + case SET_MEMORY_RW: + pte_val = pte_mkwrite(pte_val); + break; + case SET_MEMORY_NX: + pte_val = pte_exprotect(pte_val); + break; + case SET_MEMORY_X: + pte_val = pte_mkexec(pte_val); + break; + default: + WARN_ON(true); + ret = -EINVAL; + goto out; +
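The quoted pageattr.c hunk is truncated just after the switch statement. As a hedged sketch of how the file plausibly continues: change_page_attr() finishes by installing the updated PTE and releasing init_mm.page_table_lock, and the change_memory_attr() entry point declared in set_memory.h applies it across the requested range. The use of apply_to_page_range() here is an assumption for illustration, not a quote of the submitted code; it relies on the file's existing #includes (linux/mm.h etc.).

/*
 * Sketch only (assumes apply_to_page_range() does the per-page walk):
 * change_page_attr() runs once per PTE of the range under init_mm, and
 * change_memory_attr() is the entry point behind set_memory_ro() and
 * friends.
 */
int change_memory_attr(unsigned long addr, int numpages, int action)
{
	unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
	unsigned long size = numpages * PAGE_SIZE;

	if (!numpages)
		return 0;

	return apply_to_page_range(&init_mm, start, size,
				   change_page_attr, &action);
}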
[PATCH v5 3/5] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
Very rudimentary, just echo 1 > [debugfs]/check_wx_pages and check the kernel log. Useful for testing strict module RWX. Updated the Kconfig entry to reflect this. Also fixed a typo. Signed-off-by: Russell Currey --- arch/powerpc/Kconfig.debug | 6 -- arch/powerpc/mm/ptdump/ptdump.c | 21 - 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug index c59920920ddc..dcfe83d4c211 100644 --- a/arch/powerpc/Kconfig.debug +++ b/arch/powerpc/Kconfig.debug @@ -370,7 +370,7 @@ config PPC_PTDUMP If you are unsure, say N. config PPC_DEBUG_WX - bool "Warn on W+X mappings at boot" + bool "Warn on W+X mappings at boot & enable manual checks at runtime" depends on PPC_PTDUMP help Generate a warning if any W+X mappings are found at boot. @@ -384,7 +384,9 @@ config PPC_DEBUG_WX of other unfixed kernel bugs easier. There is no runtime or memory usage effect of this option - once the kernel has booted up - it's a one time check. + once the kernel has booted up, it only automatically checks once. + + Enables the "check_wx_pages" debugfs entry for checking at runtime. If in doubt, say "Y". diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 2f9ddc29c535..b6cba29ae4a0 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -4,7 +4,7 @@ * * This traverses the kernel pagetables and dumps the * information about the used sections of memory to - * /sys/kernel/debug/kernel_pagetables. + * /sys/kernel/debug/kernel_page_tables. * * Derived from the arm64 implementation: * Copyright (c) 2014, The Linux Foundation, Laura Abbott. @@ -409,6 +409,25 @@ void ptdump_check_wx(void) else pr_info("Checked W+X mappings: passed, no W+X pages found\n"); } + +static int check_wx_debugfs_set(void *data, u64 val) +{ + if (val != 1ULL) + return -EINVAL; + + ptdump_check_wx(); + + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(check_wx_fops, NULL, check_wx_debugfs_set, "%llu\n"); + +static int ptdump_check_wx_init(void) +{ + return debugfs_create_file("check_wx_pages", 0200, NULL, + NULL, &check_wx_fops) ? 0 : -ENOMEM; +} +device_initcall(ptdump_check_wx_init); #endif static int ptdump_init(void) -- 2.23.0
[PATCH v5 5/5] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig
skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX enabled, and if you want memory protection for kernel text you'd want it for modules too, so enable STRICT_MODULE_RWX there. Signed-off-by: Russell Currey --- arch/powerpc/configs/skiroot_defconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/configs/skiroot_defconfig b/arch/powerpc/configs/skiroot_defconfig index 1253482a67c0..719d899081b3 100644 --- a/arch/powerpc/configs/skiroot_defconfig +++ b/arch/powerpc/configs/skiroot_defconfig @@ -31,6 +31,7 @@ CONFIG_PERF_EVENTS=y CONFIG_SLAB_FREELIST_HARDENED=y CONFIG_JUMP_LABEL=y CONFIG_STRICT_KERNEL_RWX=y +CONFIG_STRICT_MODULE_RWX=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_SIG=y -- 2.23.0
[PATCH v5 4/5] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
To enable strict module RWX on powerpc, set: CONFIG_STRICT_MODULE_RWX=y You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real security benefit. ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX. This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that makes STRICT_MODULE_RWX *on by default* in configurations where STRICT_KERNEL_RWX is *unavailable*. Since this doesn't make much sense, and module RWX without kernel RWX doesn't make much sense, having the same dependencies as kernel RWX works around this problem. Signed-off-by: Russell Currey --- This means that Daniel's test on book3e64 is pretty useless since we've gone from being unable to turn it *off* to being unable to turn it *on*. I think this is the right course of action for now. arch/powerpc/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8f7005f0d097..52028e27f2d3 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -135,6 +135,7 @@ config PPC select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) + select ARCH_HAS_STRICT_MODULE_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE select ARCH_HAS_UACCESS_MCSAFE if PPC64 -- 2.23.0
[PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
v4 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html v3 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html Changes since v4: [1/5]: Addressed review comments from Michael Ellerman (thanks!) [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid STRICT_MODULE_RWX being *on by default* in cases where STRICT_KERNEL_RWX is *unavailable* [5/5]: split skiroot_defconfig changes out into its own patch The whole Kconfig situation is really weird and confusing, I believe the correct resolution is to change arch/Kconfig but the consequences are so minor that I don't think it's worth it, especially given that I expect powerpc to have mandatory strict RWX Soon(tm). Russell Currey (5): powerpc/mm: Implement set_memory() routines powerpc/kprobes: Mark newly allocated probes as RO powerpc/mm/ptdump: debugfs handler for W+X checks at runtime powerpc: Set ARCH_HAS_STRICT_MODULE_RWX powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig arch/powerpc/Kconfig | 2 + arch/powerpc/Kconfig.debug | 6 +- arch/powerpc/configs/skiroot_defconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/kernel/kprobes.c | 3 + arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c | 77 ++ arch/powerpc/mm/ptdump/ptdump.c| 21 ++- 8 files changed, 140 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c -- 2.23.0
Re: [PATCH 6/6] s390x: Mark archrandom.h functions __must_check
On 29.10.19 14:18, Richard Henderson wrote: > On 10/29/19 8:26 AM, Harald Freudenberger wrote: >> Fine with me, Thanks, reviewed, build and tested. >> You may add my reviewed-by: Harald Freudenberger >> However, will this go into the kernel tree via crypto or s390 subsystem ? > That's an excellent question. > > As an API decision, perhaps going via crypto makes more sense, > but none of the patches are dependent on one another, so they > could go through separate architecture trees. > > It has been a long time since I have done much kernel work; > I'm open to suggestions on the subject. > > > r~ Since the change needs to be done in include/linux/random.h and in parallel with all the arch files in arch/xxx/include/asm/archrandom.h it should go in one shot. I'd suggest to post the patch series to linux-crypto and let Herbert Xu handle this.
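For context, a rough illustration of the annotation the series applies (placeholder bodies, not the actual s390 implementation): each archrandom helper returns a success flag, and __must_check makes the compiler warn whenever a caller discards it.

#include <linux/compiler.h>
#include <linux/types.h>

static inline bool __must_check arch_get_random_long(unsigned long *v)
{
	return false;	/* placeholder body, not the real s390 code */
}

static void example_caller(void)
{
	unsigned long seed;

	/* Discarding the return value now triggers a warning; the value
	 * in 'seed' is only meaningful when the helper returns true. */
	if (!arch_get_random_long(&seed))
		seed = 0;
}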
Re: [PATCH v5 1/5] powerpc/mm: Implement set_memory() routines
Le 30/10/2019 à 08:31, Russell Currey a écrit : The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, and are generally useful primitives to have. This implementation is designed to be completely generic across powerpc's many MMUs. It's possible that this could be optimised to be faster for specific MMUs, but the focus is on having a generic and safe implementation for now. This implementation does not handle cases where the caller is attempting to change the mapping of the page it is executing from, or if another CPU is concurrently using the page being altered. These cases likely shouldn't happen, but a more complex implementation with MMU-specific code could safely handle them, so that is left as a TODO for now. Signed-off-by: Russell Currey --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c| 77 +++ 4 files changed, 111 insertions(+) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3e56c9c2f16e..8f7005f0d097 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -133,6 +133,7 @@ config PPC select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 + select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h new file mode 100644 index ..5230ddb2fefd --- /dev/null +++ b/arch/powerpc/include/asm/set_memory.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_SET_MEMORY_H +#define _ASM_POWERPC_SET_MEMORY_H + +#define SET_MEMORY_RO 1 +#define SET_MEMORY_RW 2 +#define SET_MEMORY_NX 3 +#define SET_MEMORY_X 4 + +int change_memory_attr(unsigned long addr, int numpages, int action); + +static inline int set_memory_ro(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RO); +} + +static inline int set_memory_rw(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RW); +} + +static inline int set_memory_nx(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_NX); +} + +static inline int set_memory_x(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_X); +} + +#endif diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 5e147986400d..d0a0bcbc9289 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o obj-$(CONFIG_PPC_PTDUMP) += ptdump/ obj-$(CONFIG_KASAN) += kasan/ +obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pageattr.o diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c new file mode 100644 index ..aedd79173a44 --- /dev/null +++ b/arch/powerpc/mm/pageattr.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * MMU-generic set_memory implementation for powerpc + * + * Copyright 2019, IBM Corporation. + */ + +#include +#include + +#include +#include +#include + + +/* + * Updates the attributes of a page in three steps: + * + * 1. invalidate the page table entry + * 2. flush the TLB + * 3. 
install the new entry with the updated attributes + * + * This is unsafe if the caller is attempting to change the mapping of the + * page it is executing from, or if another CPU is concurrently using the + * page being altered. + * + * TODO make the implementation resistant to this. + */ +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) I don't like too much the way you are making this function more complex and less readable with local var, goto, etc ... You could just keep the v4 version of change_page_attr(), rename it __change_page_attr(), then add: static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) { int ret; spin_lock(&init_mm.page_table_lock); ret = __change_page_attr(ptep, addr, data); spin_unlock(&init_mm.page_table_lock); return ret; } Christophe +{ + int action = *((int *)data); + pte_t pte_val; + int ret = 0; + + spin_lock(&init_mm.page_table_lock); + + // invalidate the PTE so it's safe to modify + pte_val = ptep_get_and_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + // modify the PTE
Re: [PATCH 2/2] ASoC: fsl_asrc: Add support for imx8qm
On Wed, Oct 30, 2019 at 03:20:35AM +, S.j. Wang wrote: > Hi > > > > > On Tue, Oct 29, 2019 at 05:17:09PM +0800, Shengjiu Wang wrote: > > > There are two asrc module in imx8qm, each module has different clock > > > configuration, and the DMA type is EDMA. > > > > > > So in this patch, we define the new clocks, refine the clock map, and > > > include struct fsl_asrc_soc_data for different soc usage. > > > > > > The EDMA channel is fixed with each dma request, one dma request > > > corresponding to one dma channel. So we need to request dma channel > > > with dma request of asrc module. > > > > > > Signed-off-by: Shengjiu Wang > > > --- > > > sound/soc/fsl/fsl_asrc.c | 91 +--- > > > sound/soc/fsl/fsl_asrc.h | 65 +- > > > sound/soc/fsl/fsl_asrc_dma.c | 39 > > > 3 files changed, 167 insertions(+), 28 deletions(-) > > > > > diff --git a/sound/soc/fsl/fsl_asrc_dma.c > > > b/sound/soc/fsl/fsl_asrc_dma.c index d6146de9acd2..dbb07a486504 > > 100644 > > > --- a/sound/soc/fsl/fsl_asrc_dma.c > > > +++ b/sound/soc/fsl/fsl_asrc_dma.c > > > @@ -199,19 +199,40 @@ static int fsl_asrc_dma_hw_params(struct > > > snd_soc_component *component, > > > > > > /* Get DMA request of Back-End */ > > > tmp_chan = dma_request_slave_channel(dev_be, tx ? "tx" : "rx"); > > > - tmp_data = tmp_chan->private; > > > - pair->dma_data.dma_request = tmp_data->dma_request; > > > - dma_release_channel(tmp_chan); > > > + /* tmp_chan may be NULL for it is already allocated by Back-End */ > > > + if (tmp_chan) { > > > + tmp_data = tmp_chan->private; > > > + if (tmp_data) > > > + pair->dma_data.dma_request = > > > + tmp_data->dma_request; > > > > If this patch is supposed to add a !tmp_chan case for EDMA, we probably > > shouldn't mute the !tmp_data case because dma_request will be NULL, > > although the code previously didn't have a check either. I mean we might > > need to error-out for !tmp_chan. Or... > > is this intentional? > > > > Yes, intentional. May be we can change to > > if (!asrc_priv->soc->use_edma) { > /* Get DMA request of Back-End */ > tmp_chan = dma_request_slave_channel(dev_be, tx ? "tx" : > "rx"); > tmp_data = tmp_chan->private; > pair->dma_data.dma_request = tmp_data->dma_request; > dma_release_channel(tmp_chan); > > /* Get DMA request of Front-End */ > tmp_chan = fsl_asrc_get_dma_channel(pair, dir); > tmp_data = tmp_chan->private; > pair->dma_data.dma_request2 = tmp_data->dma_request; > pair->dma_data.peripheral_type = tmp_data->peripheral_type; > pair->dma_data.priority = tmp_data->priority; > dma_release_channel(tmp_chan); > } Oh...now I understand..yea, I think this would be better. Would you please change it in v2? I am fine with other places, so may add: Acked-by: Nicolin Chen Thanks
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
Le 30/10/2019 à 08:31, Russell Currey a écrit : v4 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html v3 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html Changes since v4: [1/5]: Addressed review comments from Michael Ellerman (thanks!) [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid STRICT_MODULE_RWX being *on by default* in cases where STRICT_KERNEL_RWX is *unavailable* [5/5]: split skiroot_defconfig changes out into its own patch The whole Kconfig situation is really weird and confusing, I believe the correct resolution is to change arch/Kconfig but the consequences are so minor that I don't think it's worth it, especially given that I expect powerpc to have mandatory strict RWX Soon(tm).

I'm not sure strict RWX can be made mandatory due to the impact it has on some subarches:

- On the 8xx, unless all areas are 8Mbytes aligned, there is a significant overhead on TLB misses. And aligning everything to 8M is a waste of RAM, which is not acceptable on systems with very little RAM.

- On hash book3s32, we are able to map the kernel with BATs. With a few alignment constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to provide exec protection on page granularity. Only on 256Mbytes segments. So for modules, we have to have the vmspace X. It is also not possible to have a kernel area RO. Only user areas can be made RO.

Christophe

Russell Currey (5): powerpc/mm: Implement set_memory() routines powerpc/kprobes: Mark newly allocated probes as RO powerpc/mm/ptdump: debugfs handler for W+X checks at runtime powerpc: Set ARCH_HAS_STRICT_MODULE_RWX powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig arch/powerpc/Kconfig | 2 + arch/powerpc/Kconfig.debug | 6 +- arch/powerpc/configs/skiroot_defconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/kernel/kprobes.c | 3 + arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c | 77 ++ arch/powerpc/mm/ptdump/ptdump.c| 21 ++- 8 files changed, 140 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c
[PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
When passing the return value of dev_to_node() to cpumask_of_node() without checking if the device's node id is NUMA_NO_NODE, there is global-out-of-bounds detected by KASAN. >From the discussion [1], NUMA_NO_NODE really means no node affinity, which also means all cpus should be usable. So the cpumask_of_node() should always return all cpus online when user passes the node id as NUMA_NO_NODE, just like similar semantic that page allocator handles NUMA_NO_NODE. But we cannot really copy the page allocator logic. Simply because the page allocator doesn't enforce the near node affinity. It just picks it up as a preferred node but then it is free to fallback to any other numa node. This is not the case here and node_to_cpumask_map will only restrict to the particular node's cpus which would have really non deterministic behavior depending on where the code is executed. So in fact we really want to return cpu_online_mask for NUMA_NO_NODE. Also there is a debugging version of node_to_cpumask_map() for x86 and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). [1] https://lkml.org/lkml/2019/9/11/66 Signed-off-by: Yunsheng Lin Suggested-by: Michal Hocko Acked-by: Michal Hocko Acked-by: Paul Burton # MIPS bits --- V7: replace -1 with NUMA_NO_NODE for mips ip27 as suggested by Paul. V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a little controversial, may need deeper investigation, and rebased on the latest linux-next. V5: Drop unsigned "fix" change for x86/arm64, and change comment log according to Michal's comment. V4: Have all these changes in a single patch. V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask for NUMA_NO_NODE case, and change the commit log to better justify the change. V2: make the node id checking change to other arches too. --- arch/arm64/include/asm/numa.h| 3 +++ arch/arm64/mm/numa.c | 3 +++ arch/mips/include/asm/mach-ip27/topology.h | 2 +- arch/mips/include/asm/mach-loongson64/topology.h | 4 +++- arch/s390/include/asm/topology.h | 3 +++ arch/x86/include/asm/topology.h | 3 +++ arch/x86/mm/numa.c | 3 +++ 7 files changed, 19 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h index 626ad01..c8a4b31 100644 --- a/arch/arm64/include/asm/numa.h +++ b/arch/arm64/include/asm/numa.h @@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node); /* Returns a pointer to the cpumask of CPUs on Node 'node'. */ static inline const struct cpumask *cpumask_of_node(int node) { + if (node == NUMA_NO_NODE) + return cpu_online_mask; + return node_to_cpumask_map[node]; } #endif diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index 4decf16..5ae7eea 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map); */ const struct cpumask *cpumask_of_node(int node) { + if (node == NUMA_NO_NODE) + return cpu_online_mask; + if (WARN_ON(node >= nr_node_ids)) return cpu_none_mask; diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h index 965f079..db293cf 100644 --- a/arch/mips/include/asm/mach-ip27/topology.h +++ b/arch/mips/include/asm/mach-ip27/topology.h @@ -15,7 +15,7 @@ struct cpuinfo_ip27 { extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS]; #define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid) -#define cpumask_of_node(node) ((node) == -1 ? \ +#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ? 
\ cpu_all_mask : \ &hub_data(node)->h_cpus) struct pci_bus; diff --git a/arch/mips/include/asm/mach-loongson64/topology.h b/arch/mips/include/asm/mach-loongson64/topology.h index 7ff819a..e78daa6 100644 --- a/arch/mips/include/asm/mach-loongson64/topology.h +++ b/arch/mips/include/asm/mach-loongson64/topology.h @@ -5,7 +5,9 @@ #ifdef CONFIG_NUMA #define cpu_to_node(cpu) (cpu_logical_map(cpu) >> 2) -#define cpumask_of_node(node) (&__node_data[(node)]->cpumask) +#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ? \ +cpu_online_mask : \ +&__node_data[(node)]->cpumask) struct pci_bus; extern int pcibus_to_node(struct pci_bus *); diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h index cca406f..1bd2e73 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -78,6 +78,9 @@ static inline int cpu_to_node(int cpu) #define cpumask_of_node cpumask_of_node static inline const struct cpumask
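A hypothetical caller pattern (not code from the patch) showing what the change protects against: a device with no firmware-assigned node reports NUMA_NO_NODE, and using that value to index node_to_cpumask_map[] is the out-of-bounds access that KASAN flags. With this patch, cpumask_of_node(NUMA_NO_NODE) returns cpu_online_mask instead.

#include <linux/device.h>
#include <linux/topology.h>
#include <linux/cpumask.h>

static void example_show_device_cpus(struct device *dev)
{
	int nid = dev_to_node(dev);	/* may legitimately be NUMA_NO_NODE */
	const struct cpumask *mask = cpumask_of_node(nid);

	/* Before this patch, nid == NUMA_NO_NODE (-1) indexed the map out
	 * of bounds on several architectures. */
	pr_info("device on node %d, usable CPUs %*pbl\n",
		nid, cpumask_pr_args(mask));
}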
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > When passing the return value of dev_to_node() to cpumask_of_node() > without checking if the device's node id is NUMA_NO_NODE, there is > global-out-of-bounds detected by KASAN. > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > which also means all cpus should be usable. So the cpumask_of_node() > should always return all cpus online when user passes the node id as > NUMA_NO_NODE, just like similar semantic that page allocator handles > NUMA_NO_NODE. > > But we cannot really copy the page allocator logic. Simply because the > page allocator doesn't enforce the near node affinity. It just picks it > up as a preferred node but then it is free to fallback to any other numa > node. This is not the case here and node_to_cpumask_map will only restrict > to the particular node's cpus which would have really non deterministic > behavior depending on where the code is executed. So in fact we really > want to return cpu_online_mask for NUMA_NO_NODE. > > Also there is a debugging version of node_to_cpumask_map() for x86 and > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > [1] https://lkml.org/lkml/2019/9/11/66 > Signed-off-by: Yunsheng Lin > Suggested-by: Michal Hocko > Acked-by: Michal Hocko > Acked-by: Paul Burton # MIPS bits Still: Nacked-by: Peter Zijlstra (Intel)
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed 30-10-19 17:34:28, Yunsheng Lin wrote: > When passing the return value of dev_to_node() to cpumask_of_node() > without checking if the device's node id is NUMA_NO_NODE, there is > global-out-of-bounds detected by KASAN. > > >From the discussion [1], NUMA_NO_NODE really means no node affinity, > which also means all cpus should be usable. So the cpumask_of_node() > should always return all cpus online when user passes the node id as > NUMA_NO_NODE, just like similar semantic that page allocator handles > NUMA_NO_NODE. > > But we cannot really copy the page allocator logic. Simply because the > page allocator doesn't enforce the near node affinity. It just picks it > up as a preferred node but then it is free to fallback to any other numa > node. This is not the case here and node_to_cpumask_map will only restrict > to the particular node's cpus which would have really non deterministic > behavior depending on where the code is executed. So in fact we really > want to return cpu_online_mask for NUMA_NO_NODE. > > Also there is a debugging version of node_to_cpumask_map() for x86 and > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > [1] https://lkml.org/lkml/2019/9/11/66 Please do not use lkml.org links. They tend to break quite often. Use http://lkml.kernel.org/r/$msg_id or lore.kernel.org > Signed-off-by: Yunsheng Lin > Suggested-by: Michal Hocko > Acked-by: Michal Hocko > Acked-by: Paul Burton # MIPS bits > --- > V7: replace -1 with NUMA_NO_NODE for mips ip27 as suggested by Paul. > V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a > little controversial, may need deeper investigation, and rebased > on the latest linux-next. > V5: Drop unsigned "fix" change for x86/arm64, and change comment log > according to Michal's comment. > V4: Have all these changes in a single patch. > V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask > for NUMA_NO_NODE case, and change the commit log to better justify > the change. > V2: make the node id checking change to other arches too. > --- > arch/arm64/include/asm/numa.h| 3 +++ > arch/arm64/mm/numa.c | 3 +++ > arch/mips/include/asm/mach-ip27/topology.h | 2 +- > arch/mips/include/asm/mach-loongson64/topology.h | 4 +++- > arch/s390/include/asm/topology.h | 3 +++ > arch/x86/include/asm/topology.h | 3 +++ > arch/x86/mm/numa.c | 3 +++ > 7 files changed, 19 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h > index 626ad01..c8a4b31 100644 > --- a/arch/arm64/include/asm/numa.h > +++ b/arch/arm64/include/asm/numa.h > @@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node); > /* Returns a pointer to the cpumask of CPUs on Node 'node'. 
*/ > static inline const struct cpumask *cpumask_of_node(int node) > { > + if (node == NUMA_NO_NODE) > + return cpu_online_mask; > + > return node_to_cpumask_map[node]; > } > #endif > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > index 4decf16..5ae7eea 100644 > --- a/arch/arm64/mm/numa.c > +++ b/arch/arm64/mm/numa.c > @@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map); > */ > const struct cpumask *cpumask_of_node(int node) > { > + if (node == NUMA_NO_NODE) > + return cpu_online_mask; > + > if (WARN_ON(node >= nr_node_ids)) > return cpu_none_mask; > > diff --git a/arch/mips/include/asm/mach-ip27/topology.h > b/arch/mips/include/asm/mach-ip27/topology.h > index 965f079..db293cf 100644 > --- a/arch/mips/include/asm/mach-ip27/topology.h > +++ b/arch/mips/include/asm/mach-ip27/topology.h > @@ -15,7 +15,7 @@ struct cpuinfo_ip27 { > extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS]; > > #define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid) > -#define cpumask_of_node(node)((node) == -1 ? > \ > +#define cpumask_of_node(node)((node) == NUMA_NO_NODE ? > \ >cpu_all_mask : \ >&hub_data(node)->h_cpus) > struct pci_bus; > diff --git a/arch/mips/include/asm/mach-loongson64/topology.h > b/arch/mips/include/asm/mach-loongson64/topology.h > index 7ff819a..e78daa6 100644 > --- a/arch/mips/include/asm/mach-loongson64/topology.h > +++ b/arch/mips/include/asm/mach-loongson64/topology.h > @@ -5,7 +5,9 @@ > #ifdef CONFIG_NUMA > > #define cpu_to_node(cpu) (cpu_logical_map(cpu) >> 2) > -#define cpumask_of_node(node)(&__node_data[(node)]->cpumask) > +#define cpumask_of_node(node)((node) == NUMA_NO_NODE ? > \ > + cpu_online_mask : \ > + &__node_data[(node)]->cpumask) >
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed 30-10-19 11:14:49, Peter Zijlstra wrote: > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > > When passing the return value of dev_to_node() to cpumask_of_node() > > without checking if the device's node id is NUMA_NO_NODE, there is > > global-out-of-bounds detected by KASAN. > > > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > > which also means all cpus should be usable. So the cpumask_of_node() > > should always return all cpus online when user passes the node id as > > NUMA_NO_NODE, just like similar semantic that page allocator handles > > NUMA_NO_NODE. > > > > But we cannot really copy the page allocator logic. Simply because the > > page allocator doesn't enforce the near node affinity. It just picks it > > up as a preferred node but then it is free to fallback to any other numa > > node. This is not the case here and node_to_cpumask_map will only restrict > > to the particular node's cpus which would have really non deterministic > > behavior depending on where the code is executed. So in fact we really > > want to return cpu_online_mask for NUMA_NO_NODE. > > > > Also there is a debugging version of node_to_cpumask_map() for x86 and > > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > > > [1] https://lkml.org/lkml/2019/9/11/66 > > Signed-off-by: Yunsheng Lin > > Suggested-by: Michal Hocko > > Acked-by: Michal Hocko > > Acked-by: Paul Burton # MIPS bits > > Still: > > Nacked-by: Peter Zijlstra (Intel) Do you have any other proposal that doesn't make any wild guesses about which node to use instead of the undefined one? -- Michal Hocko SUSE Labs
Re: [PATCH v2 07/23] soc: fsl: qe: merge qe_ic.h into qe_ic.c
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : The local qe_ic.h header is only used by qe_ic.c, so merge its contents into the .c file. This is preparation for moving the driver to drivers/irqchip/. It also avoids confusion between this header and the one at include/soc/fsl/qe/qe_ic.h, which is included from a number of places (qe_ic.c among others). Signed-off-by: Rasmus Villemoes --- drivers/soc/fsl/qe/qe_ic.c | 91 ++- drivers/soc/fsl/qe/qe_ic.h | 108 - 2 files changed, 90 insertions(+), 109 deletions(-) delete mode 100644 drivers/soc/fsl/qe/qe_ic.h diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index d420492b4c23..7b1870d2866a 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -26,7 +26,96 @@ #include #include -#include "qe_ic.h" +#define NR_QE_IC_INTS 64 + +/* QE IC registers offset */ +#define QEIC_CICR 0x00 +#define QEIC_CIVEC 0x04 +#define QEIC_CRIPNR0x08 +#define QEIC_CIPNR 0x0c +#define QEIC_CIPXCC0x10 +#define QEIC_CIPYCC0x14 +#define QEIC_CIPWCC0x18 +#define QEIC_CIPZCC0x1c +#define QEIC_CIMR 0x20 +#define QEIC_CRIMR 0x24 +#define QEIC_CICNR 0x28 +#define QEIC_CIPRTA0x30 +#define QEIC_CIPRTB0x34 +#define QEIC_CRICR 0x3c +#define QEIC_CHIVEC0x60 + +/* Interrupt priority registers */ +#define CIPCC_SHIFT_PRI0 29 +#define CIPCC_SHIFT_PRI1 26 +#define CIPCC_SHIFT_PRI2 23 +#define CIPCC_SHIFT_PRI3 20 +#define CIPCC_SHIFT_PRI4 13 +#define CIPCC_SHIFT_PRI5 10 +#define CIPCC_SHIFT_PRI6 7 +#define CIPCC_SHIFT_PRI7 4 I think you should drop all unused consts and only keep the ones that are used. + +/* CICR priority modes */ +#define CICR_GWCC 0x0004 +#define CICR_GXCC 0x0002 +#define CICR_GYCC 0x0001 +#define CICR_GZCC 0x0008 +#define CICR_GRTA 0x0020 +#define CICR_GRTB 0x0040 +#define CICR_HPIT_SHIFT8 +#define CICR_HPIT_MASK 0x0300 +#define CICR_HP_SHIFT 24 +#define CICR_HP_MASK 0x3f00 + +/* CICNR */ +#define CICNR_WCC1T_SHIFT 20 +#define CICNR_ZCC1T_SHIFT 28 +#define CICNR_YCC1T_SHIFT 12 +#define CICNR_XCC1T_SHIFT 4 Same here + +/* CRICR */ +#define CRICR_RTA1T_SHIFT 20 +#define CRICR_RTB1T_SHIFT 28 Same + +/* Signal indicator */ +#define SIGNAL_MASK3 +#define SIGNAL_HIGH2 +#define SIGNAL_LOW 0 Only SIGNAL_HIGH seems to be used. Christophe + +struct qe_ic { + /* Control registers offset */ + u32 __iomem *regs; + + /* The remapper for this QEIC */ + struct irq_domain *irqhost; + + /* The "linux" controller struct */ + struct irq_chip hc_irq; + + /* VIRQ numbers of QE high/low irqs */ + unsigned int virq_high; + unsigned int virq_low; +}; + +/* + * QE interrupt controller internal structure + */ +struct qe_ic_info { + /* Location of this source at the QIMR register */ + u32 mask; + + /* Mask register offset */ + u32 mask_reg; + + /* +* For grouped interrupts sources - the interrupt code as +* appears at the group priority register +*/ + u8 pri_code; + + /* Group priority register offset */ + u32 pri_reg; +}; static DEFINE_RAW_SPINLOCK(qe_ic_lock); diff --git a/drivers/soc/fsl/qe/qe_ic.h b/drivers/soc/fsl/qe/qe_ic.h deleted file mode 100644 index 29b4d768e4a8.. --- a/drivers/soc/fsl/qe/qe_ic.h +++ /dev/null @@ -1,108 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-or-later */ -/* - * drivers/soc/fsl/qe/qe_ic.h - * - * QUICC ENGINE Interrupt Controller Header - * - * Copyright (C) 2006 Freescale Semiconductor, Inc. All rights reserved. 
- * - * Author: Li Yang - * Based on code from Shlomi Gridish - */ -#ifndef _POWERPC_SYSDEV_QE_IC_H -#define _POWERPC_SYSDEV_QE_IC_H - -#include - -#define NR_QE_IC_INTS 64 - -/* QE IC registers offset */ -#define QEIC_CICR 0x00 -#define QEIC_CIVEC 0x04 -#define QEIC_CRIPNR0x08 -#define QEIC_CIPNR 0x0c -#define QEIC_CIPXCC0x10 -#define QEIC_CIPYCC0x14 -#define QEIC_CIPWCC0x18 -#define QEIC_CIPZCC0x1c -#define QEIC_CIMR 0x20 -#define QEIC_CRIMR 0x24 -#define QEIC_CICNR 0x28 -#define QEIC_CIPRTA0x30 -#define QEIC_CIPRTB0x34 -#define QEIC_CRICR 0x3c -#define QEIC_CHIVEC0x60 - -/* Interrupt priority registers */ -#define CIPCC_SHIFT_PRI0 29 -#define CIPCC_SHIFT_PRI1 26 -#define CIPCC_SHIFT_PRI2 23 -#define
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed, Oct 30, 2019 at 11:22:29AM +0100, Michal Hocko wrote: > On Wed 30-10-19 11:14:49, Peter Zijlstra wrote: > > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > > > When passing the return value of dev_to_node() to cpumask_of_node() > > > without checking if the device's node id is NUMA_NO_NODE, there is > > > global-out-of-bounds detected by KASAN. > > > > > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > > > which also means all cpus should be usable. So the cpumask_of_node() > > > should always return all cpus online when user passes the node id as > > > NUMA_NO_NODE, just like similar semantic that page allocator handles > > > NUMA_NO_NODE. > > > > > > But we cannot really copy the page allocator logic. Simply because the > > > page allocator doesn't enforce the near node affinity. It just picks it > > > up as a preferred node but then it is free to fallback to any other numa > > > node. This is not the case here and node_to_cpumask_map will only restrict > > > to the particular node's cpus which would have really non deterministic > > > behavior depending on where the code is executed. So in fact we really > > > want to return cpu_online_mask for NUMA_NO_NODE. > > > > > > Also there is a debugging version of node_to_cpumask_map() for x86 and > > > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > > > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > > > > > [1] https://lkml.org/lkml/2019/9/11/66 > > > Signed-off-by: Yunsheng Lin > > > Suggested-by: Michal Hocko > > > Acked-by: Michal Hocko > > > Acked-by: Paul Burton # MIPS bits > > > > Still: > > > > Nacked-by: Peter Zijlstra (Intel) > > Do you have any other proposal that doesn't make any wild guesses about > which node to use instead of the undefined one? It only makes 'wild' guesses when the BIOS is shit and it complains about that. Or do you like you BIOS broken?
Re: [PATCH v3] drm/radeon: Fix EEH during kexec
Hi Kyle, KyleMahlkuch writes: > From: Kyle Mahlkuch > > During kexec some adapters hit an EEH since they are not properly > shut down in the radeon_pci_shutdown() function. Adding > radeon_suspend_kms() fixes this issue. > Enabled only on PPC because this patch causes issues on some other > boards. Which adapters hit the issues? And do we know why they're not shut down correctly in radeon_pci_shutdown()? That seems like the root cause no? > diff --git a/drivers/gpu/drm/radeon/radeon_drv.c > b/drivers/gpu/drm/radeon/radeon_drv.c > index 9e55076..4528f4d 100644 > --- a/drivers/gpu/drm/radeon/radeon_drv.c > +++ b/drivers/gpu/drm/radeon/radeon_drv.c > @@ -379,11 +379,25 @@ static int radeon_pci_probe(struct pci_dev *pdev, > static void > radeon_pci_shutdown(struct pci_dev *pdev) > { > +#ifdef CONFIG_PPC64 > + struct drm_device *ddev = pci_get_drvdata(pdev); > +#endif This local serves no real purpose and could be avoided, which would also avoid this ifdef. > /* if we are running in a VM, make sure the device >* torn down properly on reboot/shutdown >*/ > if (radeon_device_is_virtual()) > radeon_pci_remove(pdev); > + > +#ifdef CONFIG_PPC64 > + /* Some adapters need to be suspended before a AFAIK drm uses normal kernel comment style, so this should be: /* * Some adapters need to be suspended before a > + * shutdown occurs in order to prevent an error > + * during kexec. > + * Make this power specific becauase it breaks > + * some non-power boards. > + */ > + radeon_suspend_kms(ddev, true, true, false); ie, instead do: radeon_suspend_kms(pci_get_drvdata(pdev), true, true, false); > +#endif > } > > static int radeon_pmops_suspend(struct device *dev) > -- > 1.8.3.1 cheers
Re: [PATCH v2 09/23] soc: fsl: qe: move qe_ic_cascade_* functions to qe_ic.c
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : These functions are only ever called through a function pointer, and therefore it makes no sense for them to be "static inline" - gcc has no choice but to emit a copy in each translation unit that takes the address of one of these (currently various platform code under arch/powerpc/). So move them into qe_ic.c and leave ordinary extern declarations in the header file. What is the point in moving fonctions that you will drop in the next patch (qe_ic_cascade_low_ipic() and qe_ic_cascade_high_ipic()) Only move the ones that will remain. Christophe Signed-off-by: Rasmus Villemoes --- drivers/soc/fsl/qe/qe_ic.c | 58 +++ include/soc/fsl/qe/qe_ic.h | 62 +++--- 2 files changed, 63 insertions(+), 57 deletions(-) diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index 7b1870d2866a..a847b2672e90 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -402,6 +402,64 @@ unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic) return irq_linear_revmap(qe_ic->irqhost, irq); } +void qe_ic_cascade_low_ipic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); +} + +void qe_ic_cascade_high_ipic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); +} + +void qe_ic_cascade_low_mpic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); + struct irq_chip *chip = irq_desc_get_chip(desc); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); + + chip->irq_eoi(&desc->irq_data); +} + +void qe_ic_cascade_high_mpic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); + struct irq_chip *chip = irq_desc_get_chip(desc); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); + + chip->irq_eoi(&desc->irq_data); +} + +void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq; + struct irq_chip *chip = irq_desc_get_chip(desc); + + cascade_irq = qe_ic_get_high_irq(qe_ic); + if (cascade_irq == NO_IRQ) + cascade_irq = qe_ic_get_low_irq(qe_ic); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); + + chip->irq_eoi(&desc->irq_data); +} + void __init qe_ic_init(struct device_node *node, unsigned int flags, void (*low_handler)(struct irq_desc *desc), void (*high_handler)(struct irq_desc *desc)) diff --git a/include/soc/fsl/qe/qe_ic.h b/include/soc/fsl/qe/qe_ic.h index 714a9b890d8d..f3492eb13052 100644 --- a/include/soc/fsl/qe/qe_ic.h +++ b/include/soc/fsl/qe/qe_ic.h @@ -74,62 +74,10 @@ void qe_ic_set_highest_priority(unsigned int virq, int high); int qe_ic_set_priority(unsigned int virq, unsigned int priority); int qe_ic_set_high_priority(unsigned int virq, unsigned int priority, int high); -static inline void qe_ic_cascade_low_ipic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); -} - -static inline void qe_ic_cascade_high_ipic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - 
unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); -} - -static inline void qe_ic_cascade_low_mpic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); - struct irq_chip *chip = irq_desc_get_chip(desc); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); - - chip->irq_eoi(&desc->irq_data); -} - -static inline void qe_ic_cascade_high_mpic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); - struct irq_chip *chip = irq_desc_get_chip(desc); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); - - chip->irq_eoi(&desc->irq_data); -} - -static inline void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_ge
Re: [PATCH v2 11/23] soc: fsl: qe: rename qe_ic_cascade_low_mpic -> qe_ic_cascade_low
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : The qe_ic_cascade_{low,high}_mpic functions are now used as handlers both when the interrupt parent is mpic as well as ipic, so remove the _mpic suffix. Here you are modifying code that you drop in patch 14. That's pointless. You should perform name change after patch 14. Christophe Signed-off-by: Rasmus Villemoes --- arch/powerpc/platforms/83xx/misc.c| 2 +- arch/powerpc/platforms/85xx/corenet_generic.c | 4 ++-- arch/powerpc/platforms/85xx/mpc85xx_mds.c | 4 ++-- arch/powerpc/platforms/85xx/mpc85xx_rdb.c | 4 ++-- arch/powerpc/platforms/85xx/twr_p102x.c | 4 ++-- drivers/soc/fsl/qe/qe_ic.c| 4 ++-- include/soc/fsl/qe/qe_ic.h| 4 ++-- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/83xx/misc.c b/arch/powerpc/platforms/83xx/misc.c index 779791c0570f..835d082218ae 100644 --- a/arch/powerpc/platforms/83xx/misc.c +++ b/arch/powerpc/platforms/83xx/misc.c @@ -100,7 +100,7 @@ void __init mpc83xx_qe_init_IRQ(void) if (!np) return; } - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, qe_ic_cascade_high); of_node_put(np); } diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c b/arch/powerpc/platforms/85xx/corenet_generic.c index 7ee2c6628f64..2ed9e84ca03a 100644 --- a/arch/powerpc/platforms/85xx/corenet_generic.c +++ b/arch/powerpc/platforms/85xx/corenet_generic.c @@ -50,8 +50,8 @@ void __init corenet_gen_pic_init(void) np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic"); if (np) { - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); of_node_put(np); } } diff --git a/arch/powerpc/platforms/85xx/mpc85xx_mds.c b/arch/powerpc/platforms/85xx/mpc85xx_mds.c index 5ca254256c47..24211a1787b2 100644 --- a/arch/powerpc/platforms/85xx/mpc85xx_mds.c +++ b/arch/powerpc/platforms/85xx/mpc85xx_mds.c @@ -288,8 +288,8 @@ static void __init mpc85xx_mds_qeic_init(void) } if (machine_is(p1021_mds)) - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); else qe_ic_init(np, 0, qe_ic_cascade_muxed_mpic, NULL); of_node_put(np); diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c index d3c540ee558f..093867879081 100644 --- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c +++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c @@ -66,8 +66,8 @@ void __init mpc85xx_rdb_pic_init(void) #ifdef CONFIG_QUICC_ENGINE np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic"); if (np) { - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); of_node_put(np); } else diff --git a/arch/powerpc/platforms/85xx/twr_p102x.c b/arch/powerpc/platforms/85xx/twr_p102x.c index 720b0c0f03ba..2e0fb23854c0 100644 --- a/arch/powerpc/platforms/85xx/twr_p102x.c +++ b/arch/powerpc/platforms/85xx/twr_p102x.c @@ -45,8 +45,8 @@ static void __init twr_p1025_pic_init(void) #ifdef CONFIG_QUICC_ENGINE np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic"); if (np) { - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); of_node_put(np); } else pr_err("Could not find qe-ic node\n"); diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index 0ff802816c0c..f3659c312e13 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -402,7 +402,7 @@ 
unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic) return irq_linear_revmap(qe_ic->irqhost, irq); } -void qe_ic_cascade_low_mpic(struct irq_desc *desc) +void qe_ic_cascade_low(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); @@ -415,7 +415,7 @@ void qe_ic_cascade_low_mpic(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } -void qe_ic_cascade_high_mpic(struct irq_desc *desc) +void qe_ic_cascade_high(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); diff --git a/include/soc/fsl/qe/q
Re: [PATCH v2 17/23] soc: fsl: qe: make qe_ic_cascade_* static
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : Now that the references from arch/powerpc/ are gone, these are only referenced from inside qe_ic.c, so make them static. Why do that in two steps ? I think patch 9 could remain until here, and then you could squash patch 9 and patch 17 together here. Christophe Signed-off-by: Rasmus Villemoes --- drivers/soc/fsl/qe/qe_ic.c | 6 +++--- include/soc/fsl/qe/qe_ic.h | 4 2 files changed, 3 insertions(+), 7 deletions(-) diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index 545eb67094d1..e20f1205c0df 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -402,7 +402,7 @@ unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic) return irq_linear_revmap(qe_ic->irqhost, irq); } -void qe_ic_cascade_low(struct irq_desc *desc) +static void qe_ic_cascade_low(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); @@ -415,7 +415,7 @@ void qe_ic_cascade_low(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } -void qe_ic_cascade_high(struct irq_desc *desc) +static void qe_ic_cascade_high(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); @@ -428,7 +428,7 @@ void qe_ic_cascade_high(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } -void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) +static void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq; diff --git a/include/soc/fsl/qe/qe_ic.h b/include/soc/fsl/qe/qe_ic.h index 8ec21a3bd859..43e4ce95c6a0 100644 --- a/include/soc/fsl/qe/qe_ic.h +++ b/include/soc/fsl/qe/qe_ic.h @@ -67,8 +67,4 @@ void qe_ic_set_highest_priority(unsigned int virq, int high); int qe_ic_set_priority(unsigned int virq, unsigned int priority); int qe_ic_set_high_priority(unsigned int virq, unsigned int priority, int high); -void qe_ic_cascade_low(struct irq_desc *desc); -void qe_ic_cascade_high(struct irq_desc *desc); -void qe_ic_cascade_muxed_mpic(struct irq_desc *desc); - #endif /* _ASM_POWERPC_QE_IC_H */
Re: [PATCH v2 19/23] net: ethernet: freescale: make UCC_GETH explicitly depend on PPC32
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : Currently, QUICC_ENGINE depends on PPC32, so this in itself does not change anything. In order to allow removing the PPC32 dependency from QUICC_ENGINE and avoid allmodconfig build failures, add this explicit dependency. Signed-off-by: Rasmus Villemoes --- drivers/net/ethernet/freescale/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/freescale/Kconfig b/drivers/net/ethernet/freescale/Kconfig index 6a7e8993119f..97d27c7740d4 100644 --- a/drivers/net/ethernet/freescale/Kconfig +++ b/drivers/net/ethernet/freescale/Kconfig @@ -75,6 +75,7 @@ config FSL_XGMAC_MDIO config UCC_GETH tristate "Freescale QE Gigabit Ethernet" depends on QUICC_ENGINE + depends on PPC32 I think it would be more obvious to have: depends on QUICC_ENGINE && PPC32 Christophe select FSL_PQ_MDIO select PHYLIB ---help---
Re: [PATCH v2 20/23] serial: make SERIAL_QE depend on PPC32
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : Currently SERIAL_QE depends on QUICC_ENGINE, which in turn depends on PPC32, so this doesn't add any extra dependency. However, the QUICC Engine IP block also exists on some arm boards, so this serves as preparation for removing the PPC32 dependency from QUICC_ENGINE and build the QE support in drivers/soc/fsl/qe, while preventing allmodconfig/randconfig failures due to SERIAL_QE not being supported yet. Signed-off-by: Rasmus Villemoes --- drivers/tty/serial/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/tty/serial/Kconfig b/drivers/tty/serial/Kconfig index 67a9eb3f94ce..78246f535809 100644 --- a/drivers/tty/serial/Kconfig +++ b/drivers/tty/serial/Kconfig @@ -1056,6 +1056,7 @@ config SERIAL_LANTIQ config SERIAL_QE tristate "Freescale QUICC Engine serial port support" depends on QUICC_ENGINE + depends on PPC32 Same, would be more obvious as depends on QUICC_ENGINE && PPC32 Christophe select SERIAL_CORE select FW_LOADER help
[PATCH] powerpc: Add build-time check of ptrace PT_xx defines
As part of the uapi we export a lot of PT_xx defines for each register in struct pt_regs. These are expressed as an index from gpr[0], in units of unsigned long. Currently there's nothing tying the values of those defines to the actual layout of the struct. But we *don't* want to change the uapi defines to derive the PT_xx values based on the layout of the struct, those values are ABI and must never change. Instead we want to do the reverse, make sure that the layout of the struct never changes vs the PT_xx defines. So add build time checks of that. This probably seems paranoid, but at least once in the past someone has sent a patch that would have broken the ABI if it hadn't been spotted. Although it probably would have been detected via testing, it's preferable to just quash any issues at the source. Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/ptrace.c | 63 1 file changed, 63 insertions(+) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92febf5f44..abd888bada03 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -3398,4 +3398,67 @@ void __init pt_regs_check(void) offsetof(struct user_pt_regs, result)); BUILD_BUG_ON(sizeof(struct user_pt_regs) > sizeof(struct pt_regs)); + + // Now check that the pt_regs offsets match the uapi #defines + #define CHECK_REG(_pt, _reg) \ + BUILD_BUG_ON(_pt != (offsetof(struct user_pt_regs, _reg) / \ +sizeof(unsigned long))); + + CHECK_REG(PT_R0, gpr[0]); + CHECK_REG(PT_R1, gpr[1]); + CHECK_REG(PT_R2, gpr[2]); + CHECK_REG(PT_R3, gpr[3]); + CHECK_REG(PT_R4, gpr[4]); + CHECK_REG(PT_R5, gpr[5]); + CHECK_REG(PT_R6, gpr[6]); + CHECK_REG(PT_R7, gpr[7]); + CHECK_REG(PT_R8, gpr[8]); + CHECK_REG(PT_R9, gpr[9]); + CHECK_REG(PT_R10, gpr[10]); + CHECK_REG(PT_R11, gpr[11]); + CHECK_REG(PT_R12, gpr[12]); + CHECK_REG(PT_R13, gpr[13]); + CHECK_REG(PT_R14, gpr[14]); + CHECK_REG(PT_R15, gpr[15]); + CHECK_REG(PT_R16, gpr[16]); + CHECK_REG(PT_R17, gpr[17]); + CHECK_REG(PT_R18, gpr[18]); + CHECK_REG(PT_R19, gpr[19]); + CHECK_REG(PT_R20, gpr[20]); + CHECK_REG(PT_R21, gpr[21]); + CHECK_REG(PT_R22, gpr[22]); + CHECK_REG(PT_R23, gpr[23]); + CHECK_REG(PT_R24, gpr[24]); + CHECK_REG(PT_R25, gpr[25]); + CHECK_REG(PT_R26, gpr[26]); + CHECK_REG(PT_R27, gpr[27]); + CHECK_REG(PT_R28, gpr[28]); + CHECK_REG(PT_R29, gpr[29]); + CHECK_REG(PT_R30, gpr[30]); + CHECK_REG(PT_R31, gpr[31]); + CHECK_REG(PT_NIP, nip); + CHECK_REG(PT_MSR, msr); + CHECK_REG(PT_ORIG_R3, orig_gpr3); + CHECK_REG(PT_CTR, ctr); + CHECK_REG(PT_LNK, link); + CHECK_REG(PT_XER, xer); + CHECK_REG(PT_CCR, ccr); +#ifdef CONFIG_PPC64 + CHECK_REG(PT_SOFTE, softe); +#else + CHECK_REG(PT_MQ, mq); +#endif + CHECK_REG(PT_TRAP, trap); + CHECK_REG(PT_DAR, dar); + CHECK_REG(PT_DSISR, dsisr); + CHECK_REG(PT_RESULT, result); + #undef CHECK_REG + + BUILD_BUG_ON(PT_REGS_COUNT != sizeof(struct user_pt_regs) / sizeof(unsigned long)); + + /* +* PT_DSCR isn't a real reg, but it's important that it doesn't overlap the +* real registers. +*/ + BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long)); } -- 2.21.0
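For background on why the PT_xx values are ABI, a hedged userspace-side sketch (not part of the patch): tracers address registers through PTRACE_PEEKUSER at byte offset PT_xx * sizeof(long), so the pt_regs layout must keep matching those indices forever.

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <asm/ptrace.h>		/* the uapi header exporting PT_NIP, PT_R1, ... */

/* Assumes 'pid' is an already-attached, stopped tracee. */
long peek_nip(pid_t pid)
{
	return ptrace(PTRACE_PEEKUSER, pid,
		      (void *)(PT_NIP * sizeof(unsigned long)), NULL);
}

int main(void)
{
	/* usage sketch only; a real tracer would fork and attach first */
	printf("PT_NIP byte offset: %zu\n", PT_NIP * sizeof(unsigned long));
	return 0;
}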
Re: [PATCH] powerpc: Add build-time check of ptrace PT_xx defines
Le 30/10/2019 à 12:12, Michael Ellerman a écrit : As part of the uapi we export a lot of PT_xx defines for each register in struct pt_regs. These are expressed as an index from gpr[0], in units of unsigned long. Currently there's nothing tying the values of those defines to the actual layout of the struct. But we *don't* want to change the uapi defines to derive the PT_xx values based on the layout of the struct, those values are ABI and must never change. Instead we want to do the reverse, make sure that the layout of the struct never changes vs the PT_xx defines. So add build time checks of that. This probably seems paranoid, but at least once in the past someone has sent a patch that would have broken the ABI if it hadn't been spotted. Although it probably would have been detected via testing, it's preferable to just quash any issues at the source. While you are playing with pt_regs_check(), could you also take the patch from Mathieu https://patchwork.ozlabs.org/patch/1009816/ ? Christophe Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/ptrace.c | 63 1 file changed, 63 insertions(+) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92febf5f44..abd888bada03 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -3398,4 +3398,67 @@ void __init pt_regs_check(void) offsetof(struct user_pt_regs, result)); BUILD_BUG_ON(sizeof(struct user_pt_regs) > sizeof(struct pt_regs)); + + // Now check that the pt_regs offsets match the uapi #defines + #define CHECK_REG(_pt, _reg) \ + BUILD_BUG_ON(_pt != (offsetof(struct user_pt_regs, _reg) / \ +sizeof(unsigned long))); + + CHECK_REG(PT_R0, gpr[0]); + CHECK_REG(PT_R1, gpr[1]); + CHECK_REG(PT_R2, gpr[2]); + CHECK_REG(PT_R3, gpr[3]); + CHECK_REG(PT_R4, gpr[4]); + CHECK_REG(PT_R5, gpr[5]); + CHECK_REG(PT_R6, gpr[6]); + CHECK_REG(PT_R7, gpr[7]); + CHECK_REG(PT_R8, gpr[8]); + CHECK_REG(PT_R9, gpr[9]); + CHECK_REG(PT_R10, gpr[10]); + CHECK_REG(PT_R11, gpr[11]); + CHECK_REG(PT_R12, gpr[12]); + CHECK_REG(PT_R13, gpr[13]); + CHECK_REG(PT_R14, gpr[14]); + CHECK_REG(PT_R15, gpr[15]); + CHECK_REG(PT_R16, gpr[16]); + CHECK_REG(PT_R17, gpr[17]); + CHECK_REG(PT_R18, gpr[18]); + CHECK_REG(PT_R19, gpr[19]); + CHECK_REG(PT_R20, gpr[20]); + CHECK_REG(PT_R21, gpr[21]); + CHECK_REG(PT_R22, gpr[22]); + CHECK_REG(PT_R23, gpr[23]); + CHECK_REG(PT_R24, gpr[24]); + CHECK_REG(PT_R25, gpr[25]); + CHECK_REG(PT_R26, gpr[26]); + CHECK_REG(PT_R27, gpr[27]); + CHECK_REG(PT_R28, gpr[28]); + CHECK_REG(PT_R29, gpr[29]); + CHECK_REG(PT_R30, gpr[30]); + CHECK_REG(PT_R31, gpr[31]); + CHECK_REG(PT_NIP, nip); + CHECK_REG(PT_MSR, msr); + CHECK_REG(PT_ORIG_R3, orig_gpr3); + CHECK_REG(PT_CTR, ctr); + CHECK_REG(PT_LNK, link); + CHECK_REG(PT_XER, xer); + CHECK_REG(PT_CCR, ccr); +#ifdef CONFIG_PPC64 + CHECK_REG(PT_SOFTE, softe); +#else + CHECK_REG(PT_MQ, mq); +#endif + CHECK_REG(PT_TRAP, trap); + CHECK_REG(PT_DAR, dar); + CHECK_REG(PT_DSISR, dsisr); + CHECK_REG(PT_RESULT, result); + #undef CHECK_REG + + BUILD_BUG_ON(PT_REGS_COUNT != sizeof(struct user_pt_regs) / sizeof(unsigned long)); + + /* +* PT_DSCR isn't a real reg, but it's important that it doesn't overlap the +* real registers. +*/ + BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long)); }
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed 30-10-19 11:28:00, Peter Zijlstra wrote: > On Wed, Oct 30, 2019 at 11:22:29AM +0100, Michal Hocko wrote: > > On Wed 30-10-19 11:14:49, Peter Zijlstra wrote: > > > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > > > > When passing the return value of dev_to_node() to cpumask_of_node() > > > > without checking if the device's node id is NUMA_NO_NODE, there is > > > > global-out-of-bounds detected by KASAN. > > > > > > > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > > > > which also means all cpus should be usable. So the cpumask_of_node() > > > > should always return all cpus online when user passes the node id as > > > > NUMA_NO_NODE, just like similar semantic that page allocator handles > > > > NUMA_NO_NODE. > > > > > > > > But we cannot really copy the page allocator logic. Simply because the > > > > page allocator doesn't enforce the near node affinity. It just picks it > > > > up as a preferred node but then it is free to fallback to any other numa > > > > node. This is not the case here and node_to_cpumask_map will only > > > > restrict > > > > to the particular node's cpus which would have really non deterministic > > > > behavior depending on where the code is executed. So in fact we really > > > > want to return cpu_online_mask for NUMA_NO_NODE. > > > > > > > > Also there is a debugging version of node_to_cpumask_map() for x86 and > > > > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, > > > > this > > > > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > > > > > > > [1] https://lkml.org/lkml/2019/9/11/66 > > > > Signed-off-by: Yunsheng Lin > > > > Suggested-by: Michal Hocko > > > > Acked-by: Michal Hocko > > > > Acked-by: Paul Burton # MIPS bits > > > > > > Still: > > > > > > Nacked-by: Peter Zijlstra (Intel) > > > > Do you have any other proposal that doesn't make any wild guesses about > > which node to use instead of the undefined one? > > It only makes 'wild' guesses when the BIOS is shit and it complains > about that. I really do not see how this is any better than simply using the online cpu mask in the same "broken" situation. We are effectively talking about a suboptimal path for suboptimal setups. I haven't heard any actual technical argument why cpu_online_mask is any worse than adding some sort of failover guessing which node to use as a replacement. I completely get your point about complaining loudly about broken BIOS/fw. It seems we just disagree about where we should work around those issues, because as of now we simply generate semi-random behavior due to an uninitialized memory access. > Or do you like you BIOS broken? I do not see anything like that in my response nor in my previous communication. Moreover a patch to warn about this should be on the way to get merged AFAIK. -- Michal Hocko SUSE Labs
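For readers following the thread: the behaviour Michal is arguing for amounts to a single special case in the lookup helper. A minimal sketch, assuming the architecture keeps a node_to_cpumask_map[] array the way x86 and arm64 do; this illustrates the semantics only and is not the exact hunk from Yunsheng's patch:

static inline const struct cpumask *cpumask_of_node(int node)
{
        if (node == NUMA_NO_NODE)
                return cpu_online_mask;         /* no affinity: any online CPU will do */

        return node_to_cpumask_map[node];
}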
Re: [PATCH 5/6] powerpc: Mark archrandom.h functions __must_check
Richard Henderson writes: > We cannot use the pointer output without validating the > success of the random read. You _can_, but you must not. > Signed-off-by: Richard Henderson > --- > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Michael Ellerman > --- > arch/powerpc/include/asm/archrandom.h | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) Acked-by: Michael Ellerman cheers > diff --git a/arch/powerpc/include/asm/archrandom.h > b/arch/powerpc/include/asm/archrandom.h > index f8a887c8b7f8..ee214b153a71 100644 > --- a/arch/powerpc/include/asm/archrandom.h > +++ b/arch/powerpc/include/asm/archrandom.h > @@ -6,17 +6,17 @@ > > #include > > -static inline bool arch_get_random_long(unsigned long *v) > +static inline bool __must_check arch_get_random_long(unsigned long *v) > { > return false; > } > > -static inline bool arch_get_random_int(unsigned int *v) > +static inline bool __must_check arch_get_random_int(unsigned int *v) > { > return false; > } > > -static inline bool arch_get_random_seed_long(unsigned long *v) > +static inline bool __must_check arch_get_random_seed_long(unsigned long *v) > { > if (ppc_md.get_random_seed) > return ppc_md.get_random_seed(v); > @@ -24,7 +24,7 @@ static inline bool arch_get_random_seed_long(unsigned long > *v) > return false; > } > > -static inline bool arch_get_random_seed_int(unsigned int *v) > +static inline bool __must_check arch_get_random_seed_int(unsigned int *v) > { > unsigned long val; > bool rc; > -- > 2.17.1
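The practical effect of __must_check is that callers can no longer silently ignore the result, so consuming code ends up shaped like the sketch below; the fallback to get_random_long() is only an illustration and is not taken from this series:

        unsigned long seed;

        /* With __must_check, dropping the return value triggers a compiler warning. */
        if (!arch_get_random_seed_long(&seed))
                seed = get_random_long();       /* no hardware entropy; the output buffer must not be used */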
[PATCH V2 1/2] ASoC: dt-bindings: fsl_asrc: add compatible string for imx8qm
In order to support the two asrc modules in imx8qm, we need to add the compatible strings "fsl,imx8qm-asrc0" and "fsl,imx8qm-asrc1". Signed-off-by: Shengjiu Wang --- Documentation/devicetree/bindings/sound/fsl,asrc.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/sound/fsl,asrc.txt b/Documentation/devicetree/bindings/sound/fsl,asrc.txt index 1d4d9f938689..cd2bd3daa7e1 100644 --- a/Documentation/devicetree/bindings/sound/fsl,asrc.txt +++ b/Documentation/devicetree/bindings/sound/fsl,asrc.txt @@ -8,7 +8,8 @@ three substreams within totally 10 channels. Required properties: - - compatible : Contains "fsl,imx35-asrc" or "fsl,imx53-asrc". + - compatible : Contains "fsl,imx35-asrc", "fsl,imx53-asrc", + "fsl,imx8qm-asrc0" or "fsl,imx8qm-asrc1". - reg: Offset and length of the register set for the device. -- 2.21.0
[PATCH V2 2/2] ASoC: fsl_asrc: Add support for imx8qm
There are two asrc module in imx8qm, each module has different clock configuration, and the DMA type is EDMA. So in this patch, we define the new clocks, refine the clock map, and include struct fsl_asrc_soc_data for different soc usage. The EDMA channel is fixed with each dma request, one dma request corresponding to one dma channel. So we need to request dma channel with dma request of asrc module. Signed-off-by: Shengjiu Wang Acked-by: Nicolin Chen --- Changes in v2 - use !use_edma to wrap code in fsl_asrc_dma - add Acked-by: Nicolin Chen sound/soc/fsl/fsl_asrc.c | 91 +--- sound/soc/fsl/fsl_asrc.h | 65 +- sound/soc/fsl/fsl_asrc_dma.c | 42 +++-- 3 files changed, 166 insertions(+), 32 deletions(-) diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c index a3cfceea7d2f..9dc5b5a93fb0 100644 --- a/sound/soc/fsl/fsl_asrc.c +++ b/sound/soc/fsl/fsl_asrc.c @@ -43,24 +43,55 @@ static struct snd_pcm_hw_constraint_list fsl_asrc_rate_constraints = { */ static unsigned char input_clk_map_imx35[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, }; static unsigned char output_clk_map_imx35[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, }; /* i.MX53 uses the same map for input and output */ static unsigned char input_clk_map_imx53[] = { /* 0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf */ 0x0, 0x1, 0x2, 0x7, 0x4, 0x5, 0x6, 0x3, 0x8, 0x9, 0xa, 0xb, 0xc, 0xf, 0xe, 0xd, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; static unsigned char output_clk_map_imx53[] = { /* 0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf */ 0x8, 0x9, 0xa, 0x7, 0xc, 0x5, 0x6, 0xb, 0x0, 0x1, 0x2, 0x3, 0x4, 0xf, 0xe, 0xd, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; -static unsigned char *clk_map[2]; +/* i.MX8QM uses the same map for input and output */ +static unsigned char input_clk_map_imx8_0[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; + +static unsigned char output_clk_map_imx8_0[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; + +static unsigned char input_clk_map_imx8_1[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0xb, 0xc, 0xf, 0xf, 0xd, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, + 0x4, 0x5, 0x6, 0xf, 0x8, 0x9, 0xa, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; + +static unsigned char output_clk_map_imx8_1[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0xb, 0xc, 0xf, 0xf, 0xd, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, + 0x4, 0x5, 0x6, 0xf, 0x8, 0x9, 0xa, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; /** * Select the pre-processing and post-processing options @@ -353,8 +384,8 @@ static int 
fsl_asrc_config_pair(struct fsl_asrc_pair *pair, bool use_ideal_rate) } /* Validate input and output clock sources */ - clk_index[IN] = clk_map[IN][config->inclk]; - clk_index[OUT] = clk_map[OUT][config->outclk]; + clk_index[IN] = asrc_priv->soc->inclk_map[config->inclk]; + clk_index[OUT] = asrc_priv->soc->outclk_map[config->outclk]; /* We only have output clock for ideal ratio mode */ clk = asrc_priv->asrck_clk[clk_index[ideal ? OUT : IN]]; @@ -398,13 +429,13 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair, bool use_ideal_rate) /* Set the channel number */ channels = config->channel_num; - if (asrc_priv->channel_bits < 4) + if (asrc_priv->soc->channel_bits < 4) channels /= 2; /* Update channels for current pair */ regmap_update_bits(asrc_priv->regmap, REG_ASRCNCR, - ASRCNCR_ANCi_MASK(index, asrc_priv->channel_bits), - ASRCNCR_ANCi(index, channels, asrc_priv->channel_bits)); + ASRCNCR_ANCi_M
Re: [PATCH] powerpc/tools: Don't quote $objdump in scripts
Segher Boessenkool writes: > On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote: >> Some of our scripts are passed $objdump and then call it as >> "$objdump". This doesn't work if it contains spaces because we're >> using ccache, for example you get errors such as: >> >> ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No >> such file or directory >> ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache >> ppc64le-objdump: No such file or directory >> >> Fix it by not quoting the string when we expand it, allowing the shell >> to do the right thing for us. > > This breaks things for people with spaces in their paths. Spaces in their what? Who does that? :) Also we don't support it: $ pwd $ /home/michael/foo bar $ make clean Makefile:147: *** source directory cannot contain spaces or colons. Stop. cheers
Re: [PATCH] powerpc/configs: Rename foo_basic_defconfig to foo_base.config
On Tue, 2019-05-28 at 08:16:14 UTC, Michael Ellerman wrote: > We have several "defconfigs" that are not actually full defconfigs; > they are just a base set of options which are then merged with other > fragments to produce a working defconfig. > > The most obvious example is corenet_basic_defconfig which only > contains one symbol CONFIG_CORENET_GENERIC=y. But there is also > mpc85xx_base_defconfig which doesn't actually enable CONFIG_PPC_85xx. > > To avoid confusion, rename these config fragments to "foo_base.config" > to make it clearer that they are not full defconfigs. > > Reported-by: Christophe Leroy > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/58b12eb28e34d3dd8a2d6743c26bf941ca1fbf37 cheers
Re: [PATCH v4] powerpc/setup_64: fix -Wempty-body warnings
On Mon, 2019-07-15 at 18:32:32 UTC, Qian Cai wrote: > At the beginning of setup_64.c, it has, > > #ifdef DEBUG > #define DBG(fmt...) udbg_printf(fmt) > #else > #define DBG(fmt...) > #endif > > where DBG() could be compiled away, and generate warnings, > > arch/powerpc/kernel/setup_64.c: In function 'initialize_cache_info': > arch/powerpc/kernel/setup_64.c:579:49: warning: suggest braces around > empty body in an 'if' statement [-Wempty-body] > DBG("Argh, can't find dcache properties !\n"); > ^ > arch/powerpc/kernel/setup_64.c:582:49: warning: suggest braces around > empty body in an 'if' statement [-Wempty-body] > DBG("Argh, can't find icache properties !\n"); > > Fix it by using the suggestions from Michael: > > "Neither of those sites should use DBG(), that's not really early boot > code, they should just use pr_warn(). > > And the other uses of DBG() in initialize_cache_info() should just be > removed. > > In smp_release_cpus() the entry/exit DBG's should just be removed, and > the spinning_secondaries line should just be pr_debug(). > > That would just leave the two calls in early_setup(). If we taught > udbg_printf() to return early when udbg_putc is NULL, then we could just > call udbg_printf() unconditionally and get rid of the DBG macro > entirely." > > Suggested-by: Michael Ellerman > Signed-off-by: Qian Cai Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/3b9176e9a874a848afa7eb2f6943639eb18b7a17 cheers
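A minimal sketch of the last idea quoted above, i.e. having udbg_printf() bail out when no early console has been registered; the buffer handling here is an assumption for illustration, not the actual udbg.c implementation:

void udbg_printf(const char *fmt, ...)
{
        char buf[256];          /* assumed scratch size, not the real udbg buffer */
        va_list args;

        if (!udbg_putc)         /* no early console registered yet: quietly do nothing */
                return;

        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);

        udbg_puts(buf);
}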
Re: [PATCH] powerpc/configs: Add debug config fragment
On Thu, 2019-08-01 at 04:58:55 UTC, Andrew Donnellan wrote: > Add a debug config fragment that we can use to put useful debug options > into. > > Currently we only define a target for powernv[_be]_debug_defconfig, and the > only option included is to enable debugfs SCOM access. > > Suggested-by: Michael Ellerman > Signed-off-by: Andrew Donnellan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/c1bc6f93f95970f917caaac544a374862e84df52 cheers
Re: [PATCH v7 1/2] powerpc/xmon: Allow listing and clearing breakpoints in read-only mode
On Sat, 2019-09-07 at 06:11:23 UTC, "Christopher M. Riedl" wrote: > Read-only mode should not prevent listing and clearing any active > breakpoints. > > Tested-by: Daniel Axtens > Reviewed-by: Daniel Axtens > Signed-off-by: Christopher M. Riedl Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/96664dee5cf1815777286227b09884b4f019727f cheers
Re: [PATCH v2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
On Tue, 2019-09-17 at 12:38:51 UTC, "Aneesh Kumar K.V" wrote: > With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap") > pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we > can map the memmap area using 16MB hugepage size and that can cover > a memory range of 16G. > > While enabling new pmem namespaces, since memory is added in sub-section > chunks, > before creating a new memmap mapping, kernel should check whether there is an > existing memmap mapping covering the new pmem namespace. Currently, this is > validated by checking whether the section covering the range is already > initialized or not. Considering there can be multiple namespaces in the same > section this can result in wrong validation. Update this to check for > sub-sections in the range. This is done by checking for all pfns in the range > we > are mapping. > > We could optimize this by checking only just one pfn in each sub-section. But > since this is not fast-path we keep this simple. > > Signed-off-by: Aneesh Kumar K.V Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/5f5d6e40a01e70b731df843d8b5a61b4b28b19d9 cheers
Re: [PATCH] powerpc/pkeys: remove unused pkey_allows_readwrite
On Tue, 2019-09-17 at 15:22:30 UTC, Qian Cai wrote: > pkey_allows_readwrite() was first introduced in the commit 5586cf61e108 > ("powerpc: introduce execute-only pkey"), but the usage was removed > entirely in the commit a4fcc877d4e1 ("powerpc/pkeys: Preallocate > execute-only key"). > > Found by the "-Wunused-function" compiler warning flag. > > Fixes: a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key") > Signed-off-by: Qian Cai Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/29674a1c71be710f8418468aa6a8addd6d1aba1c cheers
Re: [PATCH] powerpc/configs: add FADump awareness to skiroot_defconfig
On Wed, 2019-10-09 at 14:04:29 UTC, Hari Bathini wrote: > FADump is supported on PowerNV platform. To fulfill this support, the > petitboot kernel must be FADump aware. Enable config PRESERVE_FA_DUMP > to make the petitboot kernel FADump aware. > > Signed-off-by: Hari Bathini Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/aaa351504449c4babb80753c72920e4b25fbd8a9 cheers
Re: [PATCH] powerpc: make syntax for FADump config options in kernel/Makefile readable
On Wed, 2019-10-09 at 15:27:20 UTC, Hari Bathini wrote: > arch/powerpc/kernel/fadump.c file needs to be compiled in if 'config > FA_DUMP' or 'config PRESERVE_FA_DUMP' is set. The current syntax > achieves that but looks a bit odd. Fix it for better readability. > > Signed-off-by: Hari Bathini Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/cd1d55f16d48d97d681d9534170ce712ac1d09e7 cheers
Re: [PATCH] selftests/powerpc: Reduce sigfuz runtime to ~60s
On Sun, 2019-10-13 at 23:46:43 UTC, Michael Ellerman wrote: > The defaults for the sigfuz test is to run for 4000 iterations, but > that can take quite a while and the test harness may kill the test. > Reduce the number of iterations to 600, which gives a runtime of > roughly 1 minute on a Power8 system. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/4f5c5b76cc00ccf5be89a2b9883feba3baf6eb2e cheers
Re: [PATCH] powerpc/pseries: Mark accumulate_stolen_time() as notrace
On Thu, 2019-10-24 at 05:59:32 UTC, Michael Ellerman wrote: > accumulate_stolen_time() is called prior to interrupt state being > reconciled, which can trip the warning in arch_local_irq_restore(): > > WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 > .arch_local_irq_restore+0x9c/0x130 > ... > NIP .arch_local_irq_restore+0x9c/0x130 > LR .rb_start_commit+0x38/0x80 > Call Trace: > .ring_buffer_lock_reserve+0xe4/0x620 > .trace_function+0x44/0x210 > .function_trace_call+0x148/0x170 > .ftrace_ops_no_ops+0x180/0x1d0 > ftrace_call+0x4/0x8 > .accumulate_stolen_time+0x1c/0xb0 > decrementer_common+0x124/0x160 > > For now just mark it as notrace. We may change the ordering to call it > after interrupt state has been reconciled, but that is a larger > change. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/eb8e20f89093b64f48975c74ccb114e6775cee22 cheers
Re: [PATCH v2 1/3] powerpc/pseries: Don't opencode HPTE_V_BOLTED
On Thu, 2019-10-24 at 09:35:40 UTC, "Aneesh Kumar K.V" wrote: > No functional change in this patch. > > Signed-off-by: Aneesh Kumar K.V Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/82ce028ad26dd075b06285ef61a854a564d910fb cheers
Re: [PATCH v3] powerpc/powernv: Add queue mechanism for early messages
On Mon, 2018-05-21 at 02:04:38 UTC, Deb McLemore wrote: > The problem being solved is that when issuing a BMC soft poweroff during IPL, > the poweroff was being lost, so the machine would not power off. > > Opal messages were being received before the opal-power code > registered its notifiers. > > Alternatives discussed (option #3 was chosen): > > 1 - Have opal_message_init() explicitly call opal_power_control_init() > before it dequeues any OPAL messages (i.e. before we register the > opal-msg IRQ handler). > > 2 - Introduce the concept of critical message types: when we register > handlers we track which message types have a registered handler, > then defer the opal-msg IRQ registration until we have a handler > registered for all the critical types. > > 3 - Buffer messages: if we receive a message and do not yet > have a handler for that type, store the message and replay it when > a handler for that type is registered. > > Signed-off-by: Deb McLemore Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a9336ddf448b1cba3080195cec2287af3907236c cheers
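Option #3 is the classic "queue until a consumer appears" pattern. A rough, hypothetical sketch of the idea follows; handler_registered() and call_handlers() are made-up helpers standing in for whatever dispatch the real opal-msg code uses:

struct pending_msg {
        struct list_head list;
        struct opal_msg msg;
};

static LIST_HEAD(msg_pending_list);

/* Interrupt path: deliver if someone is listening, otherwise remember the message. */
static void deliver_or_queue(struct opal_msg *msg, u32 type)
{
        struct pending_msg *p;

        if (handler_registered(type)) {         /* hypothetical lookup */
                call_handlers(type, msg);       /* hypothetical dispatch */
                return;
        }

        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        if (!p)
                return;                         /* message dropped, as before the patch */

        p->msg = *msg;
        list_add_tail(&p->list, &msg_pending_list);
}

/* Registration path: replay anything that arrived before this handler existed. */
static void replay_pending(u32 type)
{
        struct pending_msg *p, *tmp;

        list_for_each_entry_safe(p, tmp, &msg_pending_list, list) {
                if (be32_to_cpu(p->msg.msg_type) != type)
                        continue;
                call_handlers(type, &p->msg);   /* hypothetical dispatch */
                list_del(&p->list);
                kfree(p);
        }
}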
Re: [PATCH] powerpc/prom_init: Undo relocation before entering secure mode
On Wed, 2019-09-11 at 16:34:33 UTC, Thiago Jung Bauermann wrote: > The ultravisor will do an integrity check of the kernel image but we > relocated it so the check will fail. Restore the original image by > relocating it back to the kernel virtual base address. > > This works because during build vmlinux is linked with an expected virtual > runtime address of KERNELBASE. > > Fixes: 6a9c930bd775 ("powerpc/prom_init: Add the ESM call to prom_init") > Signed-off-by: Thiago Jung Bauermann Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/05d9a952832cb206a32e3705eff6edebdb2207e7 cheers
Re: [PATCH] powernv/eeh: Fix oops when probing cxl devices
On Wed, 2019-10-16 at 16:28:33 UTC, Frederic Barrat wrote: > Recent cleanup in the way EEH support is added to a device causes a > kernel oops when the cxl driver probes a device and creates virtual > devices discovered on the FPGA: > > BUG: Kernel NULL pointer dereference at 0x00a0 > Faulting instruction address: 0xc0048070 > Oops: Kernel access of bad area, sig: 7 [#1] > ... > NIP [c0048070] eeh_add_device_late.part.9+0x50/0x1e0 > LR [c004805c] eeh_add_device_late.part.9+0x3c/0x1e0 > Call Trace: > [c000200e43983900] [c079e250] _dev_info+0x5c/0x6c (unreliable) > [c000200e43983980] [c00d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0 > [c000200e439839f0] [c00606d0] pcibios_bus_add_device+0x40/0x60 > [c000200e43983a10] [c06aa3a0] pci_bus_add_device+0x30/0x100 > [c000200e43983a80] [c06aa4d4] pci_bus_add_devices+0x64/0xd0 > [c000200e43983ac0] [c0081c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl] > [c000200e43983b00] [c0081c4242ac] cxl_probe+0x504/0x5b0 [cxl] > [c000200e43983bb0] [c06bba1c] local_pci_probe+0x6c/0x110 > [c000200e43983c30] [c0159278] work_for_cpu_fn+0x38/0x60 > > The root cause is that those cxl virtual devices don't have a > representation in the device tree and therefore no associated pci_dn > structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and > we oops. > > We never had explicit support for EEH for those virtual > devices. Instead, EEH events are reported to the (real) pci device and > handled by the cxl driver. Which can then forward to the virtual > devices and handle dependencies. The fact that we try adding EEH > support for the virtual devices is new and a side-effect of the recent > cleanup. > > This patch fixes it by skipping adding EEH support on powernv for > devices which don't have a pci_dn structure. > > The cxl driver doesn't create virtual devices on pseries so this patch > doesn't fix it there intentionally. > > Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug") > Signed-off-by: Frederic Barrat Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/a8a30219ba78b1abb92091102b632f8e9bbdbf03 cheers
Re: [PATCH] powerpc/powernv: Fix CPU idle to be called with IRQs disabled
On Tue, 2019-10-22 at 11:58:14 UTC, Nicholas Piggin wrote: > Commit e78a7614f3876 ("idle: Prevent late-arriving interrupts from > disrupting offline") changes arch_cpu_idle_dead to be called with > interrupts disabled, which triggers the WARN in pnv_smp_cpu_kill_self. > > Fix this by fixing up irq_happened after hard disabling, rather than > requiring that there are no pending interrupts, similarly to what was done > until commit 2525db04d1cc5 ("powerpc/powernv: Simplify lazy IRQ > handling in CPU offline"). > > Fixes: e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting > offline") > Reported-by: Paul Mackerras > Signed-off-by: Nicholas Piggin Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/7d6475051fb3d9339c5c760ed9883bc0a9048b21 cheers
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
> On Oct 30, 2019, at 6:28 AM, Peter Zijlstra wrote: > > It only makes 'wild' guesses when the BIOS is shit and it complains > about that. > > Or do you like you BIOS broken? Agree. It is garbage in, garbage out. No need to complicate the existing code further.
Re: [PATCH v2 17/23] soc: fsl: qe: make qe_ic_cascade_* static
On 30/10/2019 11.50, Christophe Leroy wrote: > > > Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : >> Now that the references from arch/powerpc/ are gone, these are only >> referenced from inside qe_ic.c, so make them static. > > Why do that in two steps ? > I think patch 9 could remain until here, and then you could squash patch > 9 and patch 17 together here. Agreed, I should rearrange stuff to avoid touching the same lines again and again. Thanks, Rasmus
Re: [RFC PATCH 00/27] current interrupt series plus scv syscall
Hello, On Wed, Oct 02, 2019 at 01:13:52PM +1000, Nicholas Piggin wrote: > Michal Suchánek's on September 24, 2019 7:33 pm: > > Hello, > > > > can you mark the individual patches with RFC rather than the whole > > series? > > Hey, thanks for the reviews. I'll resend all but the last two patches > soon for merge in the next window. Will you resend these or should I cherry-pick the part I need for the !COMPAT? Thanks Michal
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
Andrey Ryabinin writes: > On 10/29/19 7:20 AM, Daniel Axtens wrote: >> In the case where KASAN directly allocates memory to back vmalloc >> space, don't map the early shadow page over it. >> >> We prepopulate pgds/p4ds for the range that would otherwise be empty. >> This is required to get it synced to hardware on boot, allowing the >> lower levels of the page tables to be filled dynamically. >> >> Acked-by: Dmitry Vyukov >> Signed-off-by: Daniel Axtens >> >> --- > >> +static void __init kasan_shallow_populate_pgds(void *start, void *end) >> +{ >> +unsigned long addr, next; >> +pgd_t *pgd; >> +void *p; >> +int nid = early_pfn_to_nid((unsigned long)start); > > This doesn't make sense. start is not even a pfn. With linear mapping > we try to identify nid to have the shadow on the same node as memory. But > in this case we don't have memory or the corresponding shadow (yet), > we only install pgd/p4d. > I guess we could just use NUMA_NO_NODE. Ah wow, that's quite the clanger on my part. There are a couple of other invocations of early_pfn_to_nid in that file that use an address directly, but at least they reference actual memory. I'll send a separate patch to fix those up. > The rest looks ok, so with that fixed: > > Reviewed-by: Andrey Ryabinin Thanks heaps! I've fixed up the nit you identifed in the first patch, and I agree that the last patch probably isn't needed. I'll respin the series shortly. Regards, Daniel
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
On 10/30/19 4:50 PM, Daniel Axtens wrote: > Andrey Ryabinin writes: > >> On 10/29/19 7:20 AM, Daniel Axtens wrote: >>> In the case where KASAN directly allocates memory to back vmalloc >>> space, don't map the early shadow page over it. >>> >>> We prepopulate pgds/p4ds for the range that would otherwise be empty. >>> This is required to get it synced to hardware on boot, allowing the >>> lower levels of the page tables to be filled dynamically. >>> >>> Acked-by: Dmitry Vyukov >>> Signed-off-by: Daniel Axtens >>> >>> --- >> >>> +static void __init kasan_shallow_populate_pgds(void *start, void *end) >>> +{ >>> + unsigned long addr, next; >>> + pgd_t *pgd; >>> + void *p; >>> + int nid = early_pfn_to_nid((unsigned long)start); >> >> This doesn't make sense. start is not even a pfn. With linear mapping >> we try to identify nid to have the shadow on the same node as memory. But >> in this case we don't have memory or the corresponding shadow (yet), >> we only install pgd/p4d. >> I guess we could just use NUMA_NO_NODE. > > Ah wow, that's quite the clanger on my part. > > There are a couple of other invocations of early_pfn_to_nid in that file > that use an address directly, but at least they reference actual memory. > I'll send a separate patch to fix those up. I see only one incorrect, in kasan_init(): early_pfn_to_nid(__pa(_stext)) It should be wrapped with PFN_DOWN(). Other usages in map_range() seems to be correct, range->start,end is pfns. > >> The rest looks ok, so with that fixed: >> >> Reviewed-by: Andrey Ryabinin > > Thanks heaps! I've fixed up the nit you identifed in the first patch, > and I agree that the last patch probably isn't needed. I'll respin the > series shortly. > Hold on a sec, just spotted another thing to fix. > @@ -352,9 +397,24 @@ void __init kasan_init(void) > shadow_cpu_entry_end = (void *)round_up( > (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); > > + /* > + * If we're in full vmalloc mode, don't back vmalloc space with early > + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to > + * the global table and we can populate the lower levels on demand. > + */ > +#ifdef CONFIG_KASAN_VMALLOC > + kasan_shallow_populate_pgds( > + kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), This should be VMALLOC_START, there is no point to allocate pgds for the hole between linear mapping and vmalloc, just waste of memory. It make sense to map early shadow for that hole, because if code dereferences address in that hole we will see the page fault on that address instead of fault on the shadow. So something like this might work: kasan_populate_early_shadow( kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), kasan_mem_to_shadow((void *)VMALLOC_START)); if (IS_ENABLED(CONFIG_KASAN_VMALLOC) kasan_shallow_populate_pgds(kasan_mem_to_shadow(VMALLOC_START), kasan_mem_to_shadow((void *)VMALLOC_END)) else kasan_populate_early_shadow(kasan_mem_to_shadow(VMALLOC_START), kasan_mem_to_shadow((void *)VMALLOC_END)); kasan_populate_early_shadow( kasan_mem_to_shadow((void *)VMALLOC_END + 1), shadow_cpu_entry_begin);
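Restating Andrey's suggestion with the missing closing parenthesis and the call arguments spelled out; this is purely a tidied copy of the snippet above, untested:

        kasan_populate_early_shadow(
                kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
                kasan_mem_to_shadow((void *)VMALLOC_START));

        if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
                kasan_shallow_populate_pgds(
                        kasan_mem_to_shadow((void *)VMALLOC_START),
                        kasan_mem_to_shadow((void *)VMALLOC_END));
        else
                kasan_populate_early_shadow(
                        kasan_mem_to_shadow((void *)VMALLOC_START),
                        kasan_mem_to_shadow((void *)VMALLOC_END));

        kasan_populate_early_shadow(
                kasan_mem_to_shadow((void *)VMALLOC_END + 1),
                shadow_cpu_entry_begin);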
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
Andrey Ryabinin writes: > On 10/30/19 4:50 PM, Daniel Axtens wrote: >> Andrey Ryabinin writes: >> >>> On 10/29/19 7:20 AM, Daniel Axtens wrote: In the case where KASAN directly allocates memory to back vmalloc space, don't map the early shadow page over it. We prepopulate pgds/p4ds for the range that would otherwise be empty. This is required to get it synced to hardware on boot, allowing the lower levels of the page tables to be filled dynamically. Acked-by: Dmitry Vyukov Signed-off-by: Daniel Axtens --- >>> +static void __init kasan_shallow_populate_pgds(void *start, void *end) +{ + unsigned long addr, next; + pgd_t *pgd; + void *p; + int nid = early_pfn_to_nid((unsigned long)start); >>> >>> This doesn't make sense. start is not even a pfn. With linear mapping >>> we try to identify nid to have the shadow on the same node as memory. But >>> in this case we don't have memory or the corresponding shadow (yet), >>> we only install pgd/p4d. >>> I guess we could just use NUMA_NO_NODE. >> >> Ah wow, that's quite the clanger on my part. >> >> There are a couple of other invocations of early_pfn_to_nid in that file >> that use an address directly, but at least they reference actual memory. >> I'll send a separate patch to fix those up. > > I see only one incorrect, in kasan_init(): early_pfn_to_nid(__pa(_stext)) > It should be wrapped with PFN_DOWN(). > Other usages in map_range() seems to be correct, range->start,end is pfns. > Oh, right, I didn't realise map_range was already using pfns. > >> >>> The rest looks ok, so with that fixed: >>> >>> Reviewed-by: Andrey Ryabinin >> >> Thanks heaps! I've fixed up the nit you identifed in the first patch, >> and I agree that the last patch probably isn't needed. I'll respin the >> series shortly. >> > > Hold on a sec, just spotted another thing to fix. > >> @@ -352,9 +397,24 @@ void __init kasan_init(void) >> shadow_cpu_entry_end = (void *)round_up( >> (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); >> >> +/* >> + * If we're in full vmalloc mode, don't back vmalloc space with early >> + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to >> + * the global table and we can populate the lower levels on demand. >> + */ >> +#ifdef CONFIG_KASAN_VMALLOC >> +kasan_shallow_populate_pgds( >> +kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), > > This should be VMALLOC_START, there is no point to allocate pgds for the hole > between linear mapping > and vmalloc, just waste of memory. It make sense to map early shadow for that > hole, because if code > dereferences address in that hole we will see the page fault on that address > instead of fault on the shadow. > > So something like this might work: > > kasan_populate_early_shadow( > kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), > kasan_mem_to_shadow((void *)VMALLOC_START)); > > if (IS_ENABLED(CONFIG_KASAN_VMALLOC) > kasan_shallow_populate_pgds(kasan_mem_to_shadow(VMALLOC_START), > kasan_mem_to_shadow((void *)VMALLOC_END)) > else > kasan_populate_early_shadow(kasan_mem_to_shadow(VMALLOC_START), > kasan_mem_to_shadow((void *)VMALLOC_END)); > > kasan_populate_early_shadow( > kasan_mem_to_shadow((void *)VMALLOC_END + 1), > shadow_cpu_entry_begin); Sounds good. It's getting late for me so I'll change and test that and send a respin tomorrow my time. Regards, Daniel
Re: [PATCH v10 1/5] kasan: support backing vmalloc space with real shadow memory
Hello, Daniel > > @@ -1294,14 +1299,19 @@ static bool __purge_vmap_area_lazy(unsigned long > start, unsigned long end) > spin_lock(&free_vmap_area_lock); > llist_for_each_entry_safe(va, n_va, valist, purge_list) { > unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; > + unsigned long orig_start = va->va_start; > + unsigned long orig_end = va->va_end; > > /* >* Finally insert or merge lazily-freed area. It is >* detached and there is no need to "unlink" it from >* anything. >*/ > - merge_or_add_vmap_area(va, > - &free_vmap_area_root, &free_vmap_area_list); > + va = merge_or_add_vmap_area(va, &free_vmap_area_root, > + &free_vmap_area_list); > + > + kasan_release_vmalloc(orig_start, orig_end, > + va->va_start, va->va_end); > I have some questions here. I have not analyzed kasan_release_vmalloc() logic in detail, sorry if I miss something. __purge_vmap_area_lazy() deals with a big address space, so it does not only free vmalloc addresses here; basically it can be any address from 1 up to ULONG_MAX, whereas vmalloc space spans from VMALLOC_START to VMALLOC_END: 1) Should it be checked that only vmalloc addresses are freed, or do you handle it somewhere else? if (is_vmalloc_addr(va->va_start)) kasan_release_vmalloc(...) 2) Have you run any benchmarking just to see how much overhead it adds? I am asking because it probably makes sense to add those figures to the changelog (commit message). For example you can run: sudo ./test_vmalloc.sh performance and sudo ./test_vmalloc.sh sequential_test_order=1 Thanks! -- Vlad Rezki
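For reference, the guard asked about in 1) would presumably sit in __purge_vmap_area_lazy() and look like the sketch below; note that is_vmalloc_addr() takes a pointer, so the unsigned long address needs a cast. This only illustrates the question, it is not a claim about how the series resolves it:

        if (is_vmalloc_addr((void *)orig_start))
                kasan_release_vmalloc(orig_start, orig_end,
                                      va->va_start, va->va_end);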
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On 10/23/19 8:47 PM, Nayna Jain wrote: Hi Nayna, process_buffer_measurement() is limited to measuring the kexec boot command line. This patch makes process_buffer_measurement() more generic, allowing it to measure other types of buffer data (e.g. blacklisted binary hashes or key hashes). Now that process_buffer_measurement() is being made generic to measure any buffer, it would be good to add a tag to indicate what type of buffer is being measured. For example, if the buffer is kexec command line the log could look like: "kexec_cmdline: " Similarly, if the buffer is blacklisted binary hash: "blacklist hash: ". If the buffer is key hash: ": key data". This would greatly help the consumer of the IMA log to know the type of data represented in each IMA log entry. thanks, -lakshmi
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On Wed, 2019-10-30 at 08:22 -0700, Lakshmi Ramasubramanian wrote: > On 10/23/19 8:47 PM, Nayna Jain wrote: > > Hi Nayna, > > > process_buffer_measurement() is limited to measuring the kexec boot > > command line. This patch makes process_buffer_measurement() more > > generic, allowing it to measure other types of buffer data (e.g. > > blacklisted binary hashes or key hashes). > > Now that process_buffer_measurement() is being made generic to measure > any buffer, it would be good to add a tag to indicate what type of > buffer is being measured. > > For example, if the buffer is kexec command line the log could look like: > > "kexec_cmdline: " > > Similarly, if the buffer is blacklisted binary hash: > > "blacklist hash: ". > > If the buffer is key hash: > > ": key data". > > This would greatly help the consumer of the IMA log to know the type of > data represented in each IMA log entry. Both the existing kexec command line and the new blacklist buffer measurement pass that information in the eventname. The [PATCH 7/8] "ima: check against blacklisted hashes for files with modsig" patch description includes an example. Mimi
[PATCH 06/11] powerpc/powernv: Remove intermediate variable
Trim the pointless temporary variable. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index b78b5e81f941..319152d30bc3 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1854,9 +1854,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, (pe->device_count == 1 || !pe->pbus) && phb->model == PNV_PHB_MODEL_PHB3) { /* Configure the bypass mode */ - s64 rc = pnv_pci_ioda_dma_64bit_bypass(pe); - if (rc) + if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; + /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); -- 1.8.3.1
[PATCH 04/11] powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass()
Rework of pnv_pci_ioda_iommu_bypass_supported() dropped a call to pnv_npu_try_dma_set_bypass(). Reintroduce this call, so that the DMA bypass configuration of a GPU device is propagated to its corresponding NPU devices. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 8849218187d7..70e834635971 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1833,14 +1833,13 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, struct pnv_phb *phb = hose->private_data; struct pci_dn *pdn = pci_get_pdn(pdev); struct pnv_ioda_pe *pe; + bool bypass; if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) return false; pe = &phb->ioda.pe_array[pdn->pe_number]; - - if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask)) - return true; + bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); /* * If the device can't set the TCE bypass bit but still wants @@ -1848,7 +1847,8 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, * bypass the 32-bit region and be usable for 64-bit DMAs. * The device needs to be able to address all of this space. */ - if (dma_mask >> 32 && + if (!bypass && + dma_mask >> 32 && dma_mask > (memory_hotplug_max() + (1ULL << 32)) && /* pe->pdev should be set if it's a single device, pe->pbus if not */ (pe->device_count == 1 || !pe->pbus) && @@ -1859,10 +1859,14 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); - return true; + + bypass = true; } - return false; + /* Update peer npu devices */ + pnv_npu_try_dma_set_bypass(pdev, dma_mask); + + return bypass; } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) -- 1.8.3.1
[PATCH 05/11] powerpc/powernv: Return failure for some uses of dma_set_mask()
Rework of pnv_pci_ioda_dma_set_mask() effectively reverted commit 253fd51e2f53 ("powerpc/powernv/pci: Return failure for some uses of dma_set_mask()"). Reintroduce the desired behavior that an unfulfilled request for a DMA mask between 32 and 64 bits will return error instead of silently falling back to 32 bits. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/kernel/dma-iommu.c | 19 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c index e486d1d78de2..e1693760654b 100644 --- a/arch/powerpc/kernel/dma-iommu.c +++ b/arch/powerpc/kernel/dma-iommu.c @@ -122,10 +122,21 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask) { struct iommu_table *tbl = get_iommu_table_base(dev); - if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) { - dev->archdata.iommu_bypass = true; - dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n"); - return 1; + if (dev_is_pci(dev)) { + if (dma_iommu_bypass_supported(dev, mask)) { + dev->archdata.iommu_bypass = true; + dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n"); + return 1; + } + + if (mask >> 32) { + dev_err(dev, "Warning: IOMMU dma bypass not supported: mask 0x%016llx\n", + mask); + + /* A 64-bit request falls back to default ops */ + if (mask != DMA_BIT_MASK(64)) + return 0; + } } if (!tbl) { diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 70e834635971..b78b5e81f941 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1863,8 +1863,12 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, bypass = true; } - /* Update peer npu devices */ - pnv_npu_try_dma_set_bypass(pdev, dma_mask); + /* +* Update peer npu devices. We also do this for the special case where +* a 64-bit dma mask can't be fulfilled and falls back to default. +*/ + if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64)) + pnv_npu_try_dma_set_bypass(pdev, dma_mask); return bypass; } -- 1.8.3.1
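From a driver's point of view, the restored behaviour is that an unsatisfiable mask is reported rather than quietly narrowed. A hypothetical caller (the device, mask values and fallback policy are made up for illustration) would look like:

        /* Ask for 48-bit DMA; on an affected PHB3 only 32-bit or full 64-bit bypass works. */
        rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
        if (rc) {
                /* With this patch the request fails instead of silently using 32-bit DMA. */
                rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
                if (rc)
                        return rc;
        }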
[PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by separating the part of the function that determines if bypass is supported from the part that actually attempts to configure it. Move the latter to a controller-specific dma_set_mask() callback. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/Kconfig| 1 + arch/powerpc/platforms/powernv/pci-ioda.c | 30 -- 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 938803eab0ad..6e6e27841764 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -17,6 +17,7 @@ config PPC_POWERNV select PPC_DOORBELL select MMU_NOTIFIER select FORCE_SMP + select ARCH_HAS_DMA_SET_MASK default y config OPAL_PRD diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 57e6a43d9a3a..5291464930ed 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1854,32 +1854,33 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); - bool bypass; if (WARN_ON(!pe)) return false; - bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); + return pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask) || + pnv_phb3_iommu_bypass_supported(pe, dma_mask); +} + +static void pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 mask) +{ + struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); + + if (!pe) + return; - if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) { + if (!pnv_ioda_pe_iommu_bypass_supported(pe, mask) && + pnv_phb3_iommu_bypass_supported(pe, mask)) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) - return false; + return; /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); - - bypass = true; } - /* -* Update peer npu devices. We also do this for the special case where -* a 64-bit dma mask can't be fulfilled and falls back to default. -*/ - if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64)) - pnv_npu_try_dma_set_bypass(pdev, dma_mask); - - return bypass; + /* Update peer npu devices */ + pnv_npu_try_dma_set_bypass(pdev, mask); } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) @@ -3612,6 +3613,7 @@ static void pnv_pci_ioda_shutdown(struct pci_controller *hose) static const struct pci_controller_ops pnv_pci_ioda_controller_ops = { .dma_dev_setup = pnv_pci_dma_dev_setup, .dma_bus_setup = pnv_pci_dma_bus_setup, + .dma_set_mask = pnv_pci_ioda_dma_set_mask, .iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported, .setup_msi_irqs = pnv_setup_msi_irqs, .teardown_msi_irqs = pnv_teardown_msi_irqs, -- 1.8.3.1
[PATCH 08/11] powerpc/powernv: Replace open coded pnv_ioda_get_pe()s
Collapse several open coded instances of pnv_ioda_get_pe(). Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 22 +- arch/powerpc/platforms/powernv/pci-ioda.c | 10 +++--- 2 files changed, 8 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index a77ce7d71634..0010b21d45b8 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -97,24 +97,17 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index) static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, struct pci_dev **gpdev) { - struct pnv_phb *phb; - struct pci_controller *hose; struct pci_dev *pdev; struct pnv_ioda_pe *pe; - struct pci_dn *pdn; pdev = pnv_pci_get_gpu_dev(npe->pdev); if (!pdev) return NULL; - pdn = pci_get_pdn(pdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + pe = pnv_ioda_get_pe(pdev); + if (WARN_ON(!pe)) return NULL; - hose = pci_bus_to_host(pdev->bus); - phb = hose->private_data; - pe = &phb->ioda.pe_array[pdn->pe_number]; - if (gpdev) *gpdev = pdev; @@ -261,9 +254,6 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); - struct pnv_phb *phb; - struct pci_dn *pdn; - struct pnv_ioda_pe *npe; struct pci_dev *npdev; bool bypass; int i = 0; @@ -275,12 +265,10 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) { - pdn = pci_get_pdn(npdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) - return; + struct pnv_ioda_pe *npe = pnv_ioda_get_pe(npdev); - phb = pci_bus_to_host(npdev->bus)->private_data; - npe = &phb->ioda.pe_array[pdn->pe_number]; + if (WARN_ON(!npe)) + return; if (bypass) { dev_info(&npdev->dev, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 319152d30bc3..6b932cfc0deb 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1829,16 +1829,12 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe) static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { - struct pci_controller *hose = pci_bus_to_host(pdev->bus); - struct pnv_phb *phb = hose->private_data; - struct pci_dn *pdn = pci_get_pdn(pdev); - struct pnv_ioda_pe *pe; + struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); bool bypass; - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + if (WARN_ON(!pe)) return false; - pe = &phb->ioda.pe_array[pdn->pe_number]; bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); /* @@ -1852,7 +1848,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, dma_mask > (memory_hotplug_max() + (1ULL << 32)) && /* pe->pdev should be set if it's a single device, pe->pbus if not */ (pe->device_count == 1 || !pe->pbus) && - phb->model == PNV_PHB_MODEL_PHB3) { + pe->phb->model == PNV_PHB_MODEL_PHB3) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; -- 1.8.3.1
[PATCH 01/11] Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function"
Revert commit b4d37a7b6934 ("powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function") so that this function can be reintegrated. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/platforms/powernv/npu-dma.c | 99 1 file changed, 99 insertions(+) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index b95b9e3c4c98..5a8313654033 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -193,6 +193,105 @@ static long pnv_npu_unset_window(struct iommu_table_group *table_group, int num) return 0; } +/* + * Enables 32 bit DMA on NPU. + */ +static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) +{ + struct pci_dev *gpdev; + struct pnv_ioda_pe *gpe; + int64_t rc; + + /* +* Find the assoicated PCI devices and get the dma window +* information from there. +*/ + if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV)) + return; + + gpe = get_gpu_pci_dev_and_pe(npe, &gpdev); + if (!gpe) + return; + + rc = pnv_npu_set_window(&npe->table_group, 0, + gpe->table_group.tables[0]); + + /* +* NVLink devices use the same TCE table configuration as +* their parent device so drivers shouldn't be doing DMA +* operations directly on these devices. +*/ + set_dma_ops(&npe->pdev->dev, &dma_dummy_ops); +} + +/* + * Enables bypass mode on the NPU. The NPU only supports one + * window per link, so bypass needs to be explicitly enabled or + * disabled. Unlike for a PHB3 bypass and non-bypass modes can't be + * active at the same time. + */ +static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) +{ + struct pnv_phb *phb = npe->phb; + int64_t rc = 0; + phys_addr_t top = memblock_end_of_DRAM(); + + if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev) + return -EINVAL; + + rc = pnv_npu_unset_window(&npe->table_group, 0); + if (rc != OPAL_SUCCESS) + return rc; + + /* Enable the bypass window */ + + top = roundup_pow_of_two(top); + dev_info(&npe->pdev->dev, "Enabling bypass for PE %x\n", + npe->pe_number); + rc = opal_pci_map_pe_dma_window_real(phb->opal_id, + npe->pe_number, npe->pe_number, + 0 /* bypass base */, top); + + if (rc == OPAL_SUCCESS) + pnv_pci_ioda2_tce_invalidate_entire(phb, false); + + return rc; +} + +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) +{ + int i; + struct pnv_phb *phb; + struct pci_dn *pdn; + struct pnv_ioda_pe *npe; + struct pci_dev *npdev; + + for (i = 0; ; ++i) { + npdev = pnv_pci_get_npu_dev(gpdev, i); + + if (!npdev) + break; + + pdn = pci_get_pdn(npdev); + if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + return; + + phb = pci_bus_to_host(npdev->bus)->private_data; + + /* We only do bypass if it's enabled on the linked device */ + npe = &phb->ioda.pe_array[pdn->pe_number]; + + if (bypass) { + dev_info(&npdev->dev, + "Using 64-bit DMA iommu bypass\n"); + pnv_npu_dma_set_bypass(npe); + } else { + dev_info(&npdev->dev, "Using 32-bit DMA via iommu\n"); + pnv_npu_dma_set_32(npe); + } + } +} + /* Switch ownership from platform code to external user (e.g. VFIO) */ static void pnv_npu_take_ownership(struct iommu_table_group *table_group) { -- 1.8.3.1
[PATCH 07/11] powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop
Write this loop more compactly to improve readability. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index a6b8c7ad36e4..a77ce7d71634 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -261,12 +261,12 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); - int i; struct pnv_phb *phb; struct pci_dn *pdn; struct pnv_ioda_pe *npe; struct pci_dev *npdev; bool bypass; + int i = 0; if (!gpe) return; @@ -274,12 +274,7 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) /* We only do bypass if it's enabled on the linked device */ bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); - for (i = 0; ; ++i) { - npdev = pnv_pci_get_npu_dev(gpdev, i); - - if (!npdev) - break; - + while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) { pdn = pci_get_pdn(npdev); if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) return; -- 1.8.3.1
[PATCH 10/11] powerpc/powernv: Add pnv_phb3_iommu_bypass_supported()
Move this code to its own function for reuse. As a side benefit, rearrange the comments and spread things out for readability. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 37 +-- 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 6b932cfc0deb..57e6a43d9a3a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1826,6 +1826,30 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe) return -EIO; } +/* + * If the device can't set the TCE bypass bit but still wants + * to access 4GB or more, on PHB3 we can reconfigure TVE#0 to + * bypass the 32-bit region and be usable for 64-bit DMAs. + */ +static bool pnv_phb3_iommu_bypass_supported(struct pnv_ioda_pe *pe, u64 mask) +{ + if (pe->phb->model != PNV_PHB_MODEL_PHB3) + return false; + + /* pe->pdev should be set if it's a single device, pe->pbus if not */ + if (pe->pbus && pe->device_count != 1) + return false; + + if (!(mask >> 32)) + return false; + + /* The device needs to be able to address all of this space. */ + if (mask <= memory_hotplug_max() + (1ULL << 32)) + return false; + + return true; +} + static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { @@ -1837,18 +1861,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); - /* -* If the device can't set the TCE bypass bit but still wants -* to access 4GB or more, on PHB3 we can reconfigure TVE#0 to -* bypass the 32-bit region and be usable for 64-bit DMAs. -* The device needs to be able to address all of this space. -*/ - if (!bypass && - dma_mask >> 32 && - dma_mask > (memory_hotplug_max() + (1ULL << 32)) && - /* pe->pdev should be set if it's a single device, pe->pbus if not */ - (pe->device_count == 1 || !pe->pbus) && - pe->phb->model == PNV_PHB_MODEL_PHB3) { + if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; -- 1.8.3.1
[PATCH 02/11] powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported()
This little calculation will be needed in other places. Move it to a convenience function. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++- arch/powerpc/platforms/powernv/pci.h | 8 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index c28d0d9b7ee0..8849218187d7 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1838,11 +1838,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; pe = &phb->ioda.pe_array[pdn->pe_number]; - if (pe->tce_bypass_enabled) { - u64 top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1; - if (dma_mask >= top) - return true; - } + + if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask)) + return true; /* * If the device can't set the TCE bypass bit but still wants diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index f914f0b14e4e..41f7dec3aee5 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -4,6 +4,7 @@ #include /* for __printf */ #include +#include #include #include @@ -247,4 +248,11 @@ extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl, void *tce_mem, u64 tce_size, u64 dma_offset, unsigned int page_shift); +static inline bool pnv_ioda_pe_iommu_bypass_supported(struct pnv_ioda_pe *pe, + u64 mask) +{ + return pe->tce_bypass_enabled && + mask >= pe->tce_bypass_base + memblock_end_of_DRAM() - 1; +} + #endif /* __POWERNV_PCI_H */ -- 1.8.3.1
[PATCH 09/11] Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods"
Bring back the pci controller based hook in dma_set_mask(), as it will have a user again. This reverts commit 662acad4067a ("powerpc/pci: remove the dma_set_mask pci_controller ops methods"). The callback signature has been adjusted with void return to fit its caller. Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/include/asm/pci-bridge.h | 2 ++ arch/powerpc/kernel/dma-mask.c| 9 + 2 files changed, 11 insertions(+) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index ea6ec65970ef..8512dcd053c5 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -43,6 +43,8 @@ struct pci_controller_ops { void(*teardown_msi_irqs)(struct pci_dev *pdev); #endif + void(*dma_set_mask)(struct pci_dev *pdev, u64 dma_mask); + void(*shutdown)(struct pci_controller *hose); }; diff --git a/arch/powerpc/kernel/dma-mask.c b/arch/powerpc/kernel/dma-mask.c index ffbbbc432612..35b5fd1b03a6 100644 --- a/arch/powerpc/kernel/dma-mask.c +++ b/arch/powerpc/kernel/dma-mask.c @@ -2,11 +2,20 @@ #include #include +#include #include void arch_dma_set_mask(struct device *dev, u64 dma_mask) { if (ppc_md.dma_set_mask) ppc_md.dma_set_mask(dev, dma_mask); + + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + struct pci_controller *phb = pci_bus_to_host(pdev->bus); + + if (phb->controller_ops.dma_set_mask) + phb->controller_ops.dma_set_mask(pdev, dma_mask); + } } EXPORT_SYMBOL(arch_dma_set_mask); -- 1.8.3.1
[PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
With recent kernels, TCE tables for NPU devices are no longer being configured. That task was performed by pnv_npu_try_dma_set_bypass(), a function that got swept away in recent overhauling of dma code. Patches 1-4 here bring the lost function back and reintegrate it with the updated generic iommu bypass infrastructure. Patch 5 fixes a regression in behavior when a requested dma mask can not be fulfilled. Patches 6-8 are cleanup. I put these later in the set because they aren't bisectable until after the restored code is wired back in. Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems wrong for a boolean *_supported() function to have side effects. They reintroduce a pci controller based dma_set_mask() hook. If that's undesirable, these last three patches can be dropped. Reza Arbab (11): Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function" powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported() powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass() powerpc/powernv: Return failure for some uses of dma_set_mask() powerpc/powernv: Remove intermediate variable powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop powerpc/powernv: Replace open coded pnv_ioda_get_pe()s Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods" powerpc/powernv: Add pnv_phb3_iommu_bypass_supported() powerpc/powernv: Add pnv_pci_ioda_dma_set_mask() arch/powerpc/include/asm/pci-bridge.h | 2 + arch/powerpc/kernel/dma-iommu.c | 19 -- arch/powerpc/kernel/dma-mask.c| 9 +++ arch/powerpc/platforms/powernv/Kconfig| 1 + arch/powerpc/platforms/powernv/npu-dma.c | 106 +++--- arch/powerpc/platforms/powernv/pci-ioda.c | 71 arch/powerpc/platforms/powernv/pci.h | 10 ++- 7 files changed, 177 insertions(+), 41 deletions(-) -- 1.8.3.1
[PATCH 03/11] powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument
To enable simpler calling code, change this function to find the value of bypass instead of taking it as an argument. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 12 +--- arch/powerpc/platforms/powernv/pci.h | 2 +- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 5a8313654033..a6b8c7ad36e4 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -258,13 +258,21 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) return rc; } -void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { + struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); int i; struct pnv_phb *phb; struct pci_dn *pdn; struct pnv_ioda_pe *npe; struct pci_dev *npdev; + bool bypass; + + if (!gpe) + return; + + /* We only do bypass if it's enabled on the linked device */ + bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); for (i = 0; ; ++i) { npdev = pnv_pci_get_npu_dev(gpdev, i); @@ -277,8 +285,6 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) return; phb = pci_bus_to_host(npdev->bus)->private_data; - - /* We only do bypass if it's enabled on the linked device */ npe = &phb->ioda.pe_array[pdn->pe_number]; if (bypass) { diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index 41f7dec3aee5..21db0f4cfb11 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -211,7 +211,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) /* Nvlink functions */ -extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass); +extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask); extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm); extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe); extern struct iommu_table_group *pnv_try_setup_npu_table_group( -- 1.8.3.1
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 11:59:49AM -0500, Reza Arbab wrote: > With recent kernels, TCE tables for NPU devices are no longer being > configured. That task was performed by pnv_npu_try_dma_set_bypass(), a > function that got swept away in recent overhauling of dma code. > > Patches 1-4 here bring the lost function back and reintegrate it with > the updated generic iommu bypass infrastructure. > > Patch 5 fixes a regression in behavior when a requested dma mask can not > be fulfilled. > > Patches 6-8 are cleanup. I put these later in the set because they > aren't bisectable until after the restored code is wired back in. > > Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems > wrong for a boolean *_supported() function to have side effects. They > reintroduce a pci controller based dma_set_mask() hook. If that's > undesirable, these last three patches can be dropped. How do you even use this code? Nothing in the kernel even calls dma_set_mask for NPU devices, as we only suport vfio pass through.
Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote: > Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by > separating the part of the function that determines if bypass is > supported from the part that actually attempts to configure it. > > Move the latter to a controller-specific dma_set_mask() callback. Nak, the dma_set_mask overrides are going away. But as said in the reply to the cover letter I don't even see how you could end up calling this code.
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: > How do you even use this code? Nothing in the kernel even calls > dma_set_mask for NPU devices, as we only suport vfio pass through. You use it by calling dma_set_mask() for the *GPU* device. The purpose of pnv_npu_try_dma_set_bypass() is to then propagate the same bypass configuration to all the NPU devices associated with that GPU. -- Reza Arbab
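To make the call path concrete: the trigger being described is the GPU driver's ordinary DMA mask setup. A minimal, hypothetical sketch follows (no in-kernel driver doing this is named in the thread; the probe function and device are illustrative, and only dma_set_mask_and_coherent() is the standard DMA API call assumed here):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical GPU driver probe path (sketch only). On powernv, setting
 * the GPU's DMA mask is what would let platform code such as
 * pnv_npu_try_dma_set_bypass() propagate the same bypass (or 32-bit
 * window) configuration to the NPU links attached to that GPU.
 */
static int example_gpu_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
		return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));

	return 0;
}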
Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
On Wed, Oct 30, 2019 at 06:55:18PM +0100, Christoph Hellwig wrote: > On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote: >> Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by >> separating the part of the function that determines if bypass is >> supported from the part that actually attempts to configure it. >> >> Move the latter to a controller-specific dma_set_mask() callback. > > Nak, the dma_set_mask overrides are going away. But as said in the reply to > the cover letter I don't even see how you could end up calling this code. Okay. As mentioned in the cover letter these last few patches could be omitted if that's the case. -- Reza Arbab
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote: > On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: >> How do you even use this code? Nothing in the kernel even calls >> dma_set_mask for NPU devices, as we only suport vfio pass through. > > You use it by calling dma_set_mask() for the *GPU* device. The purpose of > pnv_npu_try_dma_set_bypass() is to then propagate the same bypass > configuration to all the NPU devices associated with that GPU. Which in-kernel driver, which PCI ID?
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote: > > > Le 30/10/2019 à 08:31, Russell Currey a écrit : > > v4 cover letter: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html > > v3 cover letter: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html > > > > Changes since v4: > > [1/5]: Addressed review comments from Michael Ellerman (thanks!) > > [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on > >ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid > >STRICT_MODULE_RWX being *on by default* in cases where > >STRICT_KERNEL_RWX is *unavailable* > > [5/5]: split skiroot_defconfig changes out into its own patch > > > > The whole Kconfig situation is really weird and confusing, I believe the > > correct resolution is to change arch/Kconfig but the consequences are so > > minor that I don't think it's worth it, especially given that I expect > > powerpc to have mandatory strict RWX Soon(tm). > > I'm not such strict RWX can be made mandatory due to the impact it has on > some subarches: > - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant > overhead on TLB misses. And Aligning everthing to 8M is a waste of RAM which > is not acceptable on systems having very few RAM. > - On hash book3s32, we are able to map the kernel BATs. With a few alignment > constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to > provide exec protection on page granularity. Only on 256Mbytes segments. So > for modules, we have to have the vmspace X. It is also not possible to have > a kernel area RO. Only user areas can be made RO. As I understand it, the idea was for it to be mandatory (or at least default-on) only for the subarches where it wasn't totally insane to accomplish. :) (I'm not familiar with all the details on the subarchs, but it sounded like the more modern systems would be the targets for this?) -- Kees Cook
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 07:13:59PM +0100, Christoph Hellwig wrote: > On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote: >> On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: >>> How do you even use this code? Nothing in the kernel even calls >>> dma_set_mask for NPU devices, as we only suport vfio pass through. >> >> You use it by calling dma_set_mask() for the *GPU* device. The purpose of >> pnv_npu_try_dma_set_bypass() is to then propagate the same bypass >> configuration to all the NPU devices associated with that GPU. > > Which in-kernel driver, which PCI ID? Aha, it's this again. Didn't catch your meaning at first. Point taken. -- Reza Arbab
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 01:32:01PM -0500, Reza Arbab wrote: > Aha, it's this again. Didn't catch your meaning at first. Point taken. It's not _me_. It's that you (plural) keep ignoring how Linux development works.
Re: [PATCH v2] powerpc/imc: Don't create debugfs files for cpu-less nodes
On Tue, 2019-07-23 at 16:57 +0530, Anju T Sudhakar wrote: > Hi Qian, > > On 7/16/19 12:11 AM, Qian Cai wrote: > > On Thu, 2019-07-11 at 14:53 +1000, Michael Ellerman wrote: > > > Hi Maddy, > > > > > > Madhavan Srinivasan writes: > > > > diff --git a/arch/powerpc/platforms/powernv/opal-imc.c > > > > b/arch/powerpc/platforms/powernv/opal-imc.c > > > > index 186109bdd41b..e04b20625cb9 100644 > > > > --- a/arch/powerpc/platforms/powernv/opal-imc.c > > > > +++ b/arch/powerpc/platforms/powernv/opal-imc.c > > > > @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct > > > > device_node > > > > *node, > > > > if (of_property_read_u32(node, "cb_offset", &cb_offset)) > > > > cb_offset = IMC_CNTL_BLK_OFFSET; > > > > > > > > - for_each_node(nid) { > > > > - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; > > > > + while (ptr->vbase != NULL) { > > > > > > This means you'll bail out as soon as you find a node with no vbase, but > > > it's possible we could have a CPU-less node intermingled with other > > > nodes. > > > > > > So I think you want to keep the for loop, but continue if you see a NULL > > > vbase? > > > > Not sure if this will also takes care of some of those messages during the > > boot > > on today's linux-next even without this patch. > > > > > > [ 18.077780][T1] debugfs: Directory 'imc' with parent 'powerpc' > > already > > present! > > > > > > This is introduced by a recent commit: c33d442328f55 (debugfs: make > error message a bit more verbose). > > So basically, the debugfs imc_* file is created per node, and is created > by the first nest unit which is > > being registered. For the subsequent nest units, debugfs_create_dir() > will just return since the imc_* file already > > exist. > > The commit "c33d442328f55 (debugfs: make error message a bit more > verbose)", prints > > a message if the debugfs file already exists in debugfs_create_dir(). > That is why we are encountering these > > messages now. > > > This patch (i.e, powerpc/imc: Dont create debugfs files for cpu-less > nodes) will address the initial issue, i.e > > "numa crash while reading imc_* debugfs files for cpu less nodes", and > will not address these debugfs messages. > > > But yeah this is a good catch. We can have some checks to avoid these > debugfs messages. Anju, do you still plan to fix those "Directory 'imc' with parent 'powerpc' already present!" warnings as they are still there in the latest linux-next? > > > Hi Michael, > > Do we need to have a separate patch to address these debugfs messages, > or can we address the same > > in the next version of this patch itself? > > > Thanks, > > Anju > > > >
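A minimal sketch of the kind of check being discussed for silencing the duplicate-directory messages, assuming the goal is simply to stop calling debugfs_create_dir() once per nest unit (the names and structure here are illustrative, not the actual opal-imc code):

#include <linux/debugfs.h>

/*
 * Sketch only: cache the "imc" directory and create it a single time,
 * instead of calling debugfs_create_dir() for every nest unit, which is
 * what makes newer kernels print "Directory 'imc' with parent 'powerpc'
 * already present!". The parent dentry is whatever the arch already uses
 * as its debugfs root.
 */
static struct dentry *imc_debugfs_dir;

static struct dentry *imc_get_debugfs_dir(struct dentry *parent)
{
	if (!imc_debugfs_dir)
		imc_debugfs_dir = debugfs_create_dir("imc", parent);

	return imc_debugfs_dir;
}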
Re: [PATCH v4 0/4] Implement STRICT_MODULE_RWX for powerpc
On Wed, Oct 30, 2019 at 11:16:22AM +1100, Michael Ellerman wrote: > Kees Cook writes: > > On Mon, Oct 14, 2019 at 04:13:16PM +1100, Russell Currey wrote: > >> v3 cover letter here: > >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html > >> > >> Only minimal changes since then: > >> > >> - patch 2/4 commit message update thanks to Andrew Donnellan > >> - patch 3/4 made neater thanks to Christophe Leroy > >> - patch 3/4 updated Kconfig description thanks to Daniel Axtens > > > > I continue to be excited about this work. :) Is there anything holding > > it back from landing in linux-next? > > I had some concerns, which I stupidly posted in reply to v3: > > https://lore.kernel.org/linuxppc-dev/87pnio5fva@mpe.ellerman.id.au/ Ah-ha! Thanks; I missed that. :) -- Kees Cook
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On 30/10/2019 at 19:30, Kees Cook wrote: > On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote: >> Le 30/10/2019 à 08:31, Russell Currey a écrit : >>> v4 cover letter: >>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html >>> v3 cover letter: >>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html >>> >>> Changes since v4: >>> [1/5]: Addressed review comments from Michael Ellerman (thanks!) >>> [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on >>>ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid >>>STRICT_MODULE_RWX being *on by default* in cases where >>>STRICT_KERNEL_RWX is *unavailable* >>> [5/5]: split skiroot_defconfig changes out into its own patch >>> >>> The whole Kconfig situation is really weird and confusing, I believe the >>> correct resolution is to change arch/Kconfig but the consequences are so >>> minor that I don't think it's worth it, especially given that I expect >>> powerpc to have mandatory strict RWX Soon(tm). >> >> I'm not such strict RWX can be made mandatory due to the impact it has on >> some subarches: >> - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant >> overhead on TLB misses. And Aligning everthing to 8M is a waste of RAM which >> is not acceptable on systems having very few RAM. >> - On hash book3s32, we are able to map the kernel BATs. With a few alignment >> constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to >> provide exec protection on page granularity. Only on 256Mbytes segments. So >> for modules, we have to have the vmspace X. It is also not possible to have >> a kernel area RO. Only user areas can be made RO. > > As I understand it, the idea was for it to be mandatory (or at least > default-on) only for the subarches where it wasn't totally insane to > accomplish. :) (I'm not familiar with all the details on the subarchs, but > it sounded like the more modern systems would be the targets for this?) Yes I guess so. I was just afraid by "I expect powerpc to have mandatory strict RWX" By the way, we have an open issue #223 namely 'Make strict kernel RWX the default on 64-bit', so no worry as 32-bit is aside for now. Christophe
Section mismatch warnings on powerpc
Still see those: WARNING: vmlinux.o(.text+0x2d04): Section mismatch in reference from the variable __boot_from_prom to the function .init.text:prom_init() The function __boot_from_prom() references the function __init prom_init(). This is often because __boot_from_prom lacks a __init annotation or the annotation of prom_init is wrong. WARNING: vmlinux.o(.text+0x2ec8): Section mismatch in reference from the variable start_here_common to the function .init.text:start_kernel() The function start_here_common() references the function __init start_kernel(). This is often because start_here_common lacks a __init annotation or the annotation of start_kernel is wrong. There is a patch around: http://patchwork.ozlabs.org/patch/895442/ Is it still waiting for Michael to come up with some better names?
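For context, the pattern modpost is warning about can be shown with a small, hedged C illustration (the symbols above live in assembly, so this is not the actual fix, only the general shape of the problem):

#include <linux/init.h>

static int __init my_early_setup(void)
{
	/* Lives in .init.text and is discarded after boot. */
	return 0;
}

/*
 * A plain .text symbol referencing an __init symbol is what triggers the
 * section mismatch warning, since .init.text is freed once boot is done.
 * Annotating the caller __init, or marking it __ref when the reference is
 * known to happen only during boot, resolves it.
 */
static int __ref boot_time_only_caller(void)
{
	return my_early_setup();
}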
[PATCH 00/12] Convert cpu_up/down to device_online/offline
Using cpu_up/down directly to bring cpus online/offline loses synchronization with sysfs and could suffer from a race similar to what is described in commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM"). cpu_up/down seem to be more of a internal implementation detail for the cpu subsystem to use to boot up cpus, perform suspend/resume and low level hotplug operations. Users outside of the cpu subsystem would be better using the device core API to bring a cpu online/offline which is the interface used to hotplug memory and other system devices. Several users have already migrated to use the device core API, this series converts the remaining users and hides cpu_up/down from internal users at the end. I still need to update the documentation to remove references to cpu_up/down and advocate for device_online/offline instead if this series will make its way through. I noticed this problem while working on a hack to disable offlining a particular CPU but noticed that setting the offline_disabled attribute in the device struct isn't enough because users can easily bypass the device core. While my hack isn't a valid use case but it did highlight the inconsistency in the way cpus are being onlined/offlined and this attempt hopefully improves on this. The first 6 patches fixes arch users. The next 5 patches fixes generic code users. Particularly creating a new special exported API for the device core to use instead of cpu_up/down. Maybe we can do something more restrictive than that. The last patch removes cpu_up/down from cpu.h and unexport the functions. In some cases where the use of cpu_up/down seemed legitimate, I encapsulated the logic in a higher level - special purposed function; and converted the code to use that instead. I did run the rcu torture, lock torture and psci checker tests and no problem was noticed. I did perform build tests on all arch affected except for parisc. Hopefully I got the CC list right for all the patches. Apologies in advance if some people were omitted from some patches but they should have been CCed. CC: Armijn Hemel CC: Benjamin Herrenschmidt CC: Bjorn Helgaas CC: Borislav Petkov CC: Boris Ostrovsky CC: Catalin Marinas CC: Christophe Leroy CC: Daniel Lezcano CC: Davidlohr Bueso CC: "David S. Miller" CC: Eiichi Tsukata CC: Enrico Weigelt CC: Fenghua Yu CC: Greg Kroah-Hartman CC: Helge Deller CC: "H. Peter Anvin" CC: Ingo Molnar CC: "James E.J. Bottomley" CC: James Morse CC: Jiri Kosina CC: Josh Poimboeuf CC: Josh Triplett CC: Juergen Gross CC: Lorenzo Pieralisi CC: Mark Rutland CC: Michael Ellerman CC: Nadav Amit CC: Nicholas Piggin CC: "Paul E. McKenney" CC: Paul Mackerras CC: Pavankumar Kondeti CC: "Peter Zijlstra (Intel)" CC: "Rafael J. 
Wysocki" CC: Ram Pai CC: Richard Fontana CC: Sakari Ailus CC: Stefano Stabellini CC: Steve Capper CC: Thiago Jung Bauermann CC: Thomas Gleixner CC: Tony Luck CC: Will Deacon CC: Zhenzhong Duan CC: linux-arm-ker...@lists.infradead.org CC: linux-i...@vger.kernel.org CC: linux-ker...@vger.kernel.org CC: linux-par...@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: sparcli...@vger.kernel.org CC: x...@kernel.org CC: xen-de...@lists.xenproject.org Qais Yousef (12): arm64: hibernate.c: create a new function to handle cpu_up(sleep_cpu) x86: Replace cpu_up/down with devcie_online/offline powerpc: Replace cpu_up/down with device_online/offline ia64: Replace cpu_down with freeze_secondary_cpus sparc: Replace cpu_up/down with device_online/offline parisc: Replace cpu_up/down with device_online/offline driver: base: cpu: export device_online/offline driver: xen: Replace cpu_up/down with device_online/offline firmware: psci: Replace cpu_up/down with device_online/offline torture: Replace cpu_up/down with device_online/offline smp: Create a new function to bringup nonboot cpus online cpu: Hide cpu_up/down arch/arm64/kernel/hibernate.c | 13 +++ arch/ia64/kernel/process.c | 8 +--- arch/parisc/kernel/processor.c | 4 +- arch/powerpc/kernel/machine_kexec_64.c | 4 +- arch/sparc/kernel/ds.c | 8 +++- arch/x86/kernel/topology.c | 4 +- arch/x86/mm/mmio-mod.c | 8 +++- arch/x86/xen/smp.c | 4 +- drivers/base/core.c| 4 ++ drivers/base/cpu.c | 4 +- drivers/firmware/psci/psci_checker.c | 6 ++- drivers/xen/cpu_hotplug.c | 2 +- include/linux/cpu.h| 6 ++- kernel/cpu.c | 53 -- kernel/smp.c | 9 + kernel/torture.c | 15 ++-- 16 files changed, 106 insertions(+), 46 deletions(-) -- 2.17.1
[PATCH 03/12] powerpc: Replace cpu_up/down with device_online/offline
The core device API performs extra housekeeping bits that are missing from directly calling cpu_up/down. See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") for an example description of what might go wrong. This also prepares to make cpu_up/down a private interface for anything but the cpu subsystem. Signed-off-by: Qais Yousef CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Enrico Weigelt CC: Ram Pai CC: Nicholas Piggin CC: Thiago Jung Bauermann CC: Christophe Leroy CC: Thomas Gleixner CC: linuxppc-dev@lists.ozlabs.org CC: linux-ker...@vger.kernel.org --- arch/powerpc/kernel/machine_kexec_64.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c index 04a7cba58eff..ebf8cc7acc4d 100644 --- a/arch/powerpc/kernel/machine_kexec_64.c +++ b/arch/powerpc/kernel/machine_kexec_64.c @@ -208,13 +208,15 @@ static void wake_offline_cpus(void) { int cpu = 0; + lock_device_hotplug(); for_each_present_cpu(cpu) { if (!cpu_online(cpu)) { printk(KERN_INFO "kexec: Waking offline cpu %d.\n", cpu); - WARN_ON(cpu_up(cpu)); + WARN_ON(device_online(get_cpu_device(cpu))); } } + unlock_device_hotplug(); } static void kexec_prepare_cpus(void) -- 2.17.1
Re: [PATCH RFC 1/5] dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
On Mon, Oct 14, 2019 at 08:31:03PM +0200, Nicolas Saenz Julienne wrote: > Some architectures, notably ARM, are interested in tweaking this > depending on their runtime DMA addressing limitations. > > Signed-off-by: Nicolas Saenz Julienne Do you want me to pick this up for the 5.5 dma-mapping tree, or do you want me to wait for the rest to settle?
Re: [PATCH v3 3/8] powerpc: Fix vDSO clock_getres()
Hi, [This is an automated email] This commit has been processed because it contains a "Fixes:" tag, fixing commit: a7f290dad32ee [PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel. The bot has tested the following trees: v5.3.8, v4.19.81, v4.14.151, v4.9.198, v4.4.198. v5.3.8: Build OK! v4.19.81: Build OK! v4.14.151: Failed to apply! Possible dependencies: 5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE") b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038") v4.9.198: Failed to apply! Possible dependencies: 4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c") 5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE") 5d451a87e5ebb ("powerpc/64: Retrieve number of L1 cache sets from device-tree") 7c5b06cadf274 ("KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9") 83677f551e0a6 ("KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9") 902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task") b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038") bd067f83b0840 ("powerpc/64: Fix naming of cache block vs. cache line") e2827fe5c1566 ("powerpc/64: Clean up ppc64_caches using a struct per cache") e9cf1e085647b ("KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs") f4c51f841d2ac ("KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests") v4.4.198: Failed to apply! Possible dependencies: 153086644fd1f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI") 3eb5d5888dc68 ("powerpc: Add ppc_strict_facility_enable boot option") 4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c") 579e633e764e6 ("powerpc: create flush_all_to_thread()") 5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE") 70fe3d980f5f1 ("powerpc: Restore FPU/VEC/VSX if previously used") 85baa095497f3 ("powerpc/livepatch: Add live patching support on ppc64le") 902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task") b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038") bf76f73c5f655 ("powerpc: enable UBSAN support") c208505900b23 ("powerpc: create giveup_all()") d1e1cf2e38def ("powerpc: clean up asm/switch_to.h") dc4fbba11e466 ("powerpc: Create disable_kernel_{fp,altivec,vsx,spe}()") f17c4e01e906c ("powerpc/module: Mark module stubs with a magic value") NOTE: The patch will not be queued to stable trees until it is upstream. How should we proceed with this patch? -- Thanks, Sasha
[PATCH 01/19] mm/gup: pass flags arg to __gup_device_* functions
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions. Also placate checkpatch.pl by shortening a nearby line. Cc: Kirill A. Shutemov Signed-off-by: John Hubbard --- mm/gup.c | 28 ++-- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 8f236a335ae9..85caf76b3012 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static int __gup_device_huge(unsigned long pfn, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { int nr_start = *nr; struct dev_pagemap *pgmap = NULL; @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, } static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { unsigned long fault_pfn; int nr_start = *nr; fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr)) + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, } static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { unsigned long fault_pfn; int nr_start = *nr; fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr)) + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; if (unlikely(pud_val(orig) != pud_val(*pudp))) { @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, } #else static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { BUILD_BUG(); return 0; } static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { BUILD_BUG(); return 0; @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr); + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags, +pages, nr); } refs = 0; @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, } static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { struct page *head, *page; int refs; @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr); + return __gup_device_huge_pud(orig, pudp, addr, end, flags, +pages, nr); } refs = 0; -- 2.23.0
[PATCH 00/19] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM
Hi, This applies cleanly to linux-next and mmotm, and also to linux.git if linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB case") is first applied there. This provides tracking of dma-pinned pages. This is a prerequisite to solving the larger problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :) A new internal gup flag, FOLL_PIN is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst. I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned". In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) put_user_page() Because there are interdependencies with FOLL_LONGTERM, a similar conversion as for FOLL_PIN, was applied. The change was from this: get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET) put_page() to this: pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM) put_user_page() Patch summary: * Patches 1-4: refactoring and preparatory cleanup, independent fixes (Patch 4: V4L2-core bug fix (can be separately applied)) * Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional changes yet * Patches 6-11: Convert existing put_user_page() callers, to use the new pin*() * Patch 12: Activate tracking of FOLL_PIN pages. * Patches 13-15: convert FOLL_LONGTERM callers * Patches: 16-17: gup_benchmark and run_vmtests support * Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag * Patch 19: Documentation/vm/pin_user_pages.rst Testing: * I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each. Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me. Also, my runtime testing for the call sites so far is very weak: * io_uring: Some directed tests from liburing exercise this, and they pass. * process_vm_access.c: A small directed test passes. * gup_benchmark: the enhanced version hits the new gup.c code, and passes. * infiniband (still only have crude "IB pingpong" working, on a good day: it's not exercising my conversions at runtime...) * VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet) * powerpc: it compiles... * drm/via: compiles... * goldfish: compiles... * net/xdp: compiles... * media/v4l2: compiles... Next: * Get the block/bio_vec sites converted to use pin_user_pages(). * Work with Ira and Dave Chinner to weave this together with the layout lease stuff. 
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ John Hubbard (19): mm/gup: pass flags arg to __gup_device_* functions mm/gup: factor out duplicate code from four routines goldish_pipe: rename local pin_user_pages() routine media/v4l2-core: set pages dirty upon releasing DMA buffers mm/gup: introduce pin_user_pages*() and FOLL_PIN goldish_pipe: convert to pin_user_pages() and put_user_page() infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() drm/via: set FOLL_PIN via pin_user_pages_fast() fs/io_uring: set FOLL_PIN via pin_user_pages() net/xdp: set FOLL_PIN via pin_user_pages() mm/gup: track FOLL_PIN pages media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversio
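As a concrete illustration of the opt-in conversion described in this cover letter, a simple call site changes roughly as in the sketch below. It is based on the goldfish and process_vm_access conversions later in the series; the function itself is hypothetical and assumes the series is applied, since pin_user_pages_fast() is introduced by it (put_user_pages_dirty_lock() already exists).

#include <linux/mm.h>

/*
 * Hypothetical call site, sketching the conversion:
 *   get_user_pages_fast() + put_page() loop
 * becomes
 *   pin_user_pages_fast() + put_user_pages_dirty_lock().
 */
static int example_dma_prep(unsigned long start, int nr_pages,
			    struct page **pages)
{
	int pinned;

	pinned = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
	if (pinned <= 0)
		return pinned;

	/* ... hand the pages to the device and wait for DMA to complete ... */

	/* Pages obtained via FOLL_PIN must be released this way: */
	put_user_pages_dirty_lock(pages, pinned, true);

	return 0;
}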
[PATCH 02/19] mm/gup: factor out duplicate code from four routines
There are four locations in gup.c that have a fair amount of code duplication. This means that changing one requires making the same changes in four places, not to mention reading the same code four times, and wondering if there are subtle differences. Factor out the common code into static functions, thus reducing the overall line count and the code's complexity. Also, take the opportunity to slightly improve the efficiency of the error cases, by doing a mass subtraction of the refcount, surrounded by get_page()/put_page(). Also, further simplify (slightly), by waiting until the the successful end of each routine, to increment *nr. Cc: Christoph Hellwig Cc: Aneesh Kumar K.V Signed-off-by: John Hubbard --- mm/gup.c | 113 ++- 1 file changed, 46 insertions(+), 67 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 85caf76b3012..8fb0d9cdfaf5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1969,6 +1969,35 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, } #endif +static int __record_subpages(struct page *page, unsigned long addr, +unsigned long end, struct page **pages, int nr) +{ + int nr_recorded_pages = 0; + + do { + pages[nr] = page; + nr++; + page++; + nr_recorded_pages++; + } while (addr += PAGE_SIZE, addr != end); + return nr_recorded_pages; +} + +static void __remove_refs_from_head(struct page *page, int refs) +{ + /* Do a get_page() first, in case refs == page->_refcount */ + get_page(page); + page_ref_sub(page, refs); + put_page(page); +} + +static int __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr) +{ + *nr += nr_recorded_pages; + SetPageReferenced(head); + return 1; +} + #ifdef CONFIG_ARCH_HAS_HUGEPD static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, unsigned long sz) @@ -1998,34 +2027,19 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, /* hugepages are never "special" */ VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - refs = 0; head = pte_page(pte); - page = head + ((addr & (sz-1)) >> PAGE_SHIFT); - do { - VM_BUG_ON(compound_head(page) != head); - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = __record_subpages(page, addr, end, pages, *nr); head = try_get_compound_head(head, refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pte_val(pte) != pte_val(*ptep))) { - /* Could be optimized better */ - *nr -= refs; - while (refs--) - put_page(head); + __remove_refs_from_head(head, refs); return 0; } - - SetPageReferenced(head); - return 1; + return __huge_pt_done(head, refs, nr); } static int gup_huge_pd(hugepd_t hugepd, unsigned long addr, @@ -2071,30 +2085,18 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, pages, nr); } - refs = 0; page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = __record_subpages(page, addr, end, pages, *nr); head = try_get_compound_head(pmd_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - *nr -= refs; - while (refs--) - put_page(head); + __remove_refs_from_head(head, refs); return 0; } - - SetPageReferenced(head); - return 1; + return __huge_pt_done(head, refs, nr); } static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, @@ -2114,30 +2116,18 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, pages, nr); } - refs = 0; page = pud_page(orig) + 
((addr & ~PUD_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = __record_subpages(page, addr, end, pages, *nr); head = try_get_compound_head(pud_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pud_val(orig) != pud
[PATCH 03/19] goldish_pipe: rename local pin_user_pages() routine
1. Avoid naming conflicts: rename local static function from "pin_user_pages()" to "pin_goldfish_pages()". An upcoming patch will introduce a global pin_user_pages() function. Signed-off-by: John Hubbard --- drivers/platform/goldfish/goldfish_pipe.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index cef0133aa47a..7ed2a21a0bac 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status) } } -static int pin_user_pages(unsigned long first_page, - unsigned long last_page, - unsigned int last_page_size, - int is_write, - struct page *pages[MAX_BUFFERS_PER_COMMAND], - unsigned int *iter_last_page_size) +static int pin_goldfish_pages(unsigned long first_page, + unsigned long last_page, + unsigned int last_page_size, + int is_write, + struct page *pages[MAX_BUFFERS_PER_COMMAND], + unsigned int *iter_last_page_size) { int ret; int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1; @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, if (mutex_lock_interruptible(&pipe->lock)) return -ERESTARTSYS; - pages_count = pin_user_pages(first_page, last_page, -last_page_size, is_write, -pipe->pages, &iter_last_page_size); + pages_count = pin_goldfish_pages(first_page, last_page, +last_page_size, is_write, +pipe->pages, &iter_last_page_size); if (pages_count < 0) { mutex_unlock(&pipe->lock); return pages_count; -- 2.23.0
[PATCH 04/19] media/v4l2-core: set pages dirty upon releasing DMA buffers
After DMA is complete, and the device and CPU caches are synchronized, it's still required to mark the CPU pages as dirty, if the data was coming from the device. However, this driver was just issuing a bare put_page() call, without any set_page_dirty*() call. Fix the problem, by calling set_page_dirty_lock() if the CPU pages were potentially receiving data from the device. Cc: Mauro Carvalho Chehab Signed-off-by: John Hubbard --- drivers/media/v4l2-core/videobuf-dma-sg.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 66a6c6c236a7..28262190c3ab 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) + for (i = 0; i < dma->nr_pages; i++) { + if (dma->direction == DMA_FROM_DEVICE) + set_page_dirty_lock(dma->pages[i]); put_page(dma->pages[i]); + } kfree(dma->pages); dma->pages = NULL; } -- 2.23.0
[PATCH 06/19] goldish_pipe: convert to pin_user_pages() and put_user_page()
1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages(). 2. As required by pin_user_pages(), release these pages via put_user_page(). In this case, do so via put_user_pages_dirty_lock(). That has the side effect of calling set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] Another side effect is that the release code is simplified because the page[] loop is now in gup.c instead of here, so just delete the local release_user_pages() entirely, and call put_user_pages_dirty_lock() directly, instead. [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Signed-off-by: John Hubbard --- drivers/platform/goldfish/goldfish_pipe.c | 17 +++-- 1 file changed, 3 insertions(+), 14 deletions(-) diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index 7ed2a21a0bac..635a8bc1b480 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page, *iter_last_page_size = last_page_size; } - ret = get_user_pages_fast(first_page, requested_pages, + ret = pin_user_pages_fast(first_page, requested_pages, !is_write ? FOLL_WRITE : 0, pages); if (ret <= 0) @@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page, return ret; } -static void release_user_pages(struct page **pages, int pages_count, - int is_write, s32 consumed_size) -{ - int i; - - for (i = 0; i < pages_count; i++) { - if (!is_write && consumed_size > 0) - set_page_dirty(pages[i]); - put_page(pages[i]); - } -} - /* Populate the call parameters, merging adjacent pages together */ static void populate_rw_params(struct page **pages, int pages_count, @@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, *consumed_size = pipe->command_buffer->rw_params.consumed_size; - release_user_pages(pipe->pages, pages_count, is_write, *consumed_size); + put_user_pages_dirty_lock(pipe->pages, pages_count, + !is_write && *consumed_size > 0); mutex_unlock(&pipe->lock); return 0; -- 2.23.0
[PATCH 07/19] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
Convert infiniband to use the new wrapper calls, and stop explicitly setting FOLL_LONGTERM at the call sites. The new pin_longterm_*() calls replace get_user_pages*() calls, and set both FOLL_LONGTERM and a new FOLL_PIN flag. The FOLL_PIN flag requires that the caller must return the pages via put_user_page*() calls, but infiniband was already doing that as part of an earlier commit. Signed-off-by: John Hubbard --- drivers/infiniband/core/umem.c | 5 ++--- drivers/infiniband/core/umem_odp.c | 10 +- drivers/infiniband/hw/hfi1/user_pages.c | 4 ++-- drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-- drivers/infiniband/hw/qib/qib_user_pages.c | 8 drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +- drivers/infiniband/hw/usnic/usnic_uiom.c| 9 - drivers/infiniband/sw/siw/siw_mem.c | 5 ++--- 8 files changed, 21 insertions(+), 25 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 24244a2f68cc..c5a78d3e674b 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr, while (npages) { down_read(&mm->mmap_sem); - ret = get_user_pages(cur_base, + ret = pin_longterm_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), -gup_flags | FOLL_LONGTERM, -page_list, NULL); +gup_flags, page_list, NULL); if (ret < 0) { up_read(&mm->mmap_sem); goto umem_release; diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 163ff7ba92b7..a38b67b83db5 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -534,7 +534,7 @@ static int ib_umem_odp_map_dma_single_page( } else if (umem_odp->page_list[page_index] == page) { umem_odp->dma_list[page_index] |= access_mask; } else { - pr_err("error: got different pages in IB device and from get_user_pages. IB device page: %p, gup page: %p\n", + pr_err("error: got different pages in IB device and from pin_longterm_pages. IB device page: %p, gup page: %p\n", umem_odp->page_list[page_index], page); /* Better remove the mapping now, to prevent any further * damage. */ @@ -639,11 +639,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, /* * Note: this might result in redundent page getting. We can * avoid this by checking dma_list to be 0 before calling -* get_user_pages. However, this make the code much more -* complex (and doesn't gain us much performance in most use -* cases). +* pin_longterm_pages. However, this makes the code much +* more complex (and doesn't gain us much performance in most +* use cases). */ - npages = get_user_pages_remote(owning_process, owning_mm, + npages = pin_longterm_pages_remote(owning_process, owning_mm, user_virt, gup_num_pages, flags, local_page_list, NULL, NULL); up_read(&owning_mm->mmap_sem); diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c index 469acb961fbd..9b55b0a73e29 100644 --- a/drivers/infiniband/hw/hfi1/user_pages.c +++ b/drivers/infiniband/hw/hfi1/user_pages.c @@ -104,9 +104,9 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np bool writable, struct page **pages) { int ret; - unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0); + unsigned int gup_flags = (writable ? 
FOLL_WRITE : 0); - ret = get_user_pages_fast(vaddr, npages, gup_flags, pages); + ret = pin_longterm_pages_fast(vaddr, npages, gup_flags, pages); if (ret < 0) return ret; diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index edccfd6e178f..beec7e4b8a96 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, goto out; } - ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, - FOLL_WRITE | FOLL_LONGTERM, pages); + ret = pin_longterm_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages); if (ret < 0)
[PATCH 05/19] mm/gup: introduce pin_user_pages*() and FOLL_PIN
Introduce pin_user_pages*() variations of get_user_pages*() calls, and also pin_longterm_pages*() variations. These variants all set FOLL_PIN, which is also introduced, and basically documented. (An upcoming patch provides more extensive documentation.) The second set (pin_longterm*) also sets FOLL_LONGTERM: pin_user_pages() pin_user_pages_remote() pin_user_pages_fast() pin_longterm_pages() pin_longterm_pages_remote() pin_longterm_pages_fast() All pages that are pinned via the above calls, must be unpinned via put_user_page(). The underlying rules are: * These are gup-internal flags, so the call sites should not directly set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with assertions, for the new FOLL_PIN flag. However, for the pre-existing FOLL_LONGTERM flag, which has some call sites that still directly set FOLL_LONGTERM, there is no assertion yet. * Call sites that want to indicate that they are going to do DirectIO ("DIO") or something with similar characteristics, should call a get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers will: * Start with "pin_user_pages" instead of "get_user_pages". That makes it easy to find and audit the call sites. * Set FOLL_PIN * For pages that are received via FOLL_PIN, those pages must be returned via put_user_page(). Signed-off-by: John Hubbard --- include/linux/mm.h | 53 - mm/gup.c | 284 + 2 files changed, 311 insertions(+), 26 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index cc292273e6ba..62c838a3e6c7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked); +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked); long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas); +long pin_user_pages(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); +long pin_longterm_pages(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); long get_user_pages_locked(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, int *locked); long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, @@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); +int pin_user_pages_fast(unsigned long start, int nr_pages, + unsigned int gup_flags, struct page **pages); +int pin_longterm_pages_fast(unsigned long start, int nr_pages, + unsigned int gup_flags, struct page **pages); int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, @@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long 
address, #define FOLL_ANON 0x8000 /* don't do file mappings */ #define FOLL_LONGTERM 0x1 /* mapping lifetime is indefinite: see below */ #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */ +#define FOLL_PIN 0x4 /* pages must be released via put_user_page() */ /* - * NOTE on FOLL_LONGTERM: + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each + * other. Here is what they mean, and how to use them: * * FOLL_LONGTERM indicates that the page will be held for an indefinite time - * period _often_ under userspace control. This is contrasted with - * iov_iter_get_pages() where usages which are transient. + * period _often_ under userspace control. This is in contrast to + * iov_iter_get_pages(), where usages which are transient. * * FIXME: For pages
[PATCH 08/19] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
Convert process_vm_access to use the new pin_user_pages_remote() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Also, release the pages via put_user_page*(). Also, rename "pages" to "pinned_pages", as this makes for easier reading of process_vm_rw_single_vec(). Signed-off-by: John Hubbard --- mm/process_vm_access.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7bef6c0..fd20ab675b85 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages, if (copy > len) copy = len; - if (vm_write) { + if (vm_write) copied = copy_page_from_iter(page, offset, copy, iter); - set_page_dirty_lock(page); - } else { + else copied = copy_page_to_iter(page, offset, copy, iter); - } + len -= copied; if (copied < copy && iov_iter_count(iter)) return -EFAULT; @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr, flags |= FOLL_WRITE; while (!rc && nr_pages && iov_iter_count(iter)) { - int pages = min(nr_pages, max_pages_per_loop); + int pinned_pages = min(nr_pages, max_pages_per_loop); int locked = 1; size_t bytes; @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr, * current/current->mm */ down_read(&mm->mmap_sem); - pages = get_user_pages_remote(task, mm, pa, pages, flags, - process_pages, NULL, &locked); + pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages, +flags, process_pages, +NULL, &locked); if (locked) up_read(&mm->mmap_sem); - if (pages <= 0) + if (pinned_pages <= 0) return -EFAULT; - bytes = pages * PAGE_SIZE - start_offset; + bytes = pinned_pages * PAGE_SIZE - start_offset; if (bytes > len) bytes = len; @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr, vm_write); len -= bytes; start_offset = 0; - nr_pages -= pages; - pa += pages * PAGE_SIZE; - while (pages) - put_page(process_pages[--pages]); + nr_pages -= pinned_pages; + pa += pinned_pages * PAGE_SIZE; + + /* If vm_write is set, the pages need to be made dirty: */ + put_user_pages_dirty_lock(process_pages, pinned_pages, + vm_write); } return rc; -- 2.23.0
[PATCH 10/19] fs/io_uring: set FOLL_PIN via pin_longterm_pages()
Convert fs/io_uring to use the new pin_longterm_pages() call, which sets FOLL_PIN and FOLL_LONGTERM. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). Signed-off-by: John Hubbard --- fs/io_uring.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index a30c4f622cb3..d3924b1760eb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3431,9 +3431,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, ret = 0; down_read(&current->mm->mmap_sem); - pret = get_user_pages(ubuf, nr_pages, - FOLL_WRITE | FOLL_LONGTERM, - pages, vmas); + pret = pin_longterm_pages(ubuf, nr_pages, FOLL_WRITE, pages, + vmas); if (pret == nr_pages) { /* don't support file backed memory */ for (j = 0; j < nr_pages; j++) { -- 2.23.0
[PATCH 09/19] drm/via: set FOLL_PIN via pin_user_pages_fast()
Convert drm/via to use the new pin_user_pages_fast() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). Signed-off-by: John Hubbard --- drivers/gpu/drm/via/via_dmablit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c index 3db000aacd26..37c5e572993a 100644 --- a/drivers/gpu/drm/via/via_dmablit.c +++ b/drivers/gpu/drm/via/via_dmablit.c @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer) vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages)); if (NULL == vsg->pages) return -ENOMEM; - ret = get_user_pages_fast((unsigned long)xfer->mem_addr, + ret = pin_user_pages_fast((unsigned long)xfer->mem_addr, vsg->num_pages, vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0, vsg->pages); -- 2.23.0
[PATCH 11/19] net/xdp: set FOLL_PIN via pin_longterm_pages()
Convert net/xdp to use the new pin_longterm_pages() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Signed-off-by: John Hubbard --- net/xdp/xdp_umem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 16d5f353163a..4d56dfb1139a 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -285,8 +285,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem) return -ENOMEM; down_read(&current->mm->mmap_sem); - npgs = get_user_pages(umem->address, umem->npgs, - gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL); + npgs = pin_longterm_pages(umem->address, umem->npgs, gup_flags, + &umem->pgs[0], NULL); up_read(&current->mm->mmap_sem); if (npgs != umem->npgs) { -- 2.23.0
[PATCH 12/19] mm/gup: track FOLL_PIN pages
Add tracking of pages that were pinned via FOLL_PIN. As mentioned in the FOLL_PIN documentation, callers who effectively set FOLL_PIN are required to ultimately free such pages via put_user_page(). The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use". Pages that have been pinned via FOLL_PIN are identifiable via a new function call: bool page_dma_pinned(struct page *page); What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1]. This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). This also has a couple of trivial, non-functional change fixes to try_get_compound_head(). That function got moved to the top of the file. This includes the following fix from Ira Weiny: DAX requires detection of a page crossing to a ref count of 1. Fix this for GUP pages by introducing put_devmap_managed_user_page() which accounts for GUP_PIN_COUNTING_BIAS now used by GUP. [1] https://lwn.net/Articles/784574/ "Some slow progress on get_user_pages()" Suggested-by: Jan Kara Suggested-by: Jérôme Glisse Signed-off-by: Ira Weiny Signed-off-by: John Hubbard --- include/linux/mm.h | 80 +++ include/linux/mmzone.h | 2 + include/linux/page_ref.h | 10 ++ mm/gup.c | 213 +++ mm/huge_memory.c | 32 +- mm/hugetlb.c | 28 - mm/memremap.c| 4 +- mm/vmstat.c | 2 + 8 files changed, 300 insertions(+), 71 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 62c838a3e6c7..882fda919c81 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page) #endif #ifdef CONFIG_DEV_PAGEMAP_OPS -void __put_devmap_managed_page(struct page *page); +void __put_devmap_managed_page(struct page *page, int count); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool put_devmap_managed_page(struct page *page) + +static inline bool page_is_devmap_managed(struct page *page) { if (!static_branch_unlikely(&devmap_managed_key)) return false; @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page) switch (page->pgmap->type) { case MEMORY_DEVICE_PRIVATE: case MEMORY_DEVICE_FS_DAX: - __put_devmap_managed_page(page); return true; default: break; @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page) return false; } +static inline bool put_devmap_managed_page(struct page *page) +{ + bool is_devmap = page_is_devmap_managed(page); + + if (is_devmap) { + int count = page_ref_dec_return(page); + + __put_devmap_managed_page(page, count); + } + + return is_devmap; +} + #else /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool put_devmap_managed_page(struct page *page) { @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page) return true; } +__must_check bool user_page_ref_inc(struct page *page); + static inline void put_page(struct page *page) { page = compound_head(page); @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page) __put_page(page); } -/** - * put_user_page() - release a gup-pinned page - * @page:pointer to page to be released +/* + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload + * the page's refcount so that two separate items are tracked: the original page + * reference count, and also a new count of how many get_user_pages() calls were + * made against the page. ("gup-pinned" is another term for the latter). 
+ * + * With this scheme, get_user_pages() becomes special: such pages are marked + * as distinct from normal pages. As such, the new put_user_page() call (and + * its variants) must be used in order to release gup-pinned pages. + * + * Choice of value: * - * Pages that were pinned via get_user_pages*() must be released via - * either put_user_page(), or one of the put_user_pages*() routines - * below. This is so that eventually, pages that are pinned via - * get_user_pages*() can be separately tracked and uniquely handled. In - * particular, interactions with RDMA and filesystems need special - * handling. + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference + * counts with respect to get_user_pages() and put_user_page() becomes simpler, + * due to the fact that adding an even power of two to the page refcount has + * the effect of using only the upper N bits, for the code that counts up using + * the bias value. This means that the lower bits are left for the exclusive + * use of the original code that increments and decrements by one (or at least, + * by much smaller values than the bias value). * -
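A simplified sketch of the biased-refcount scheme described in the comment above (illustrative only; the *_sketch helpers are hypothetical and ignore compound pages, devmap pages, and overflow handling): pinning adds GUP_PIN_COUNTING_BIAS to page->_refcount, and a page is reported as dma-pinned whenever its refcount is at least that large.

#define GUP_PIN_COUNTING_BIAS	1024	/* 2^10, the value chosen in this patch */

static bool page_dma_pinned_sketch(struct page *page)
{
	/* False positives are possible and acceptable; false negatives are not. */
	return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
}

static bool user_page_ref_inc_sketch(struct page *page)
{
	/* Pinning counts up by the bias value instead of by one. */
	page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	return true;
}

static void put_user_page_sketch(struct page *page)
{
	/* Unpinning counts back down by the same bias value. */
	page_ref_sub(page, GUP_PIN_COUNTING_BIAS);
}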
[PATCH 13/19] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN. 2. Because all FOLL_PIN-acquired pages must be released via put_user_page(), also convert the put_page() call over to put_user_pages_dirty_lock(). Cc: Mauro Carvalho Chehab Signed-off-by: John Hubbard --- drivers/media/v4l2-core/videobuf-dma-sg.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 28262190c3ab..9b9c5b37bf59 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", data, size, dma->nr_pages); - err = get_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages, +flags, dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; - dprintk(1, "get_user_pages: err=%d [%d]\n", err, + dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err, dma->nr_pages); return err < 0 ? err : -EINVAL; } @@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) { - if (dma->direction == DMA_FROM_DEVICE) - set_page_dirty_lock(dma->pages[i]); - put_page(dma->pages[i]); - } + put_user_pages_dirty_lock(dma->pages, dma->nr_pages, + dma->direction == DMA_FROM_DEVICE); kfree(dma->pages); dma->pages = NULL; } -- 2.23.0
[PATCH 14/19] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
This also fixes one or two likely bugs. 1. Change vfio from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN. Note that this is a change in behavior, because the get_user_pages_remote() call was not setting FOLL_LONGTERM, but the new pin_longterm_pages_remote() call that replaces it *is* setting FOLL_LONGTERM. It is important to set FOLL_LONGTERM, because the DMA case requires it. Please see the FOLL_PIN documentation in include/linux/mm.h, and Documentation/pin_user_pages.rst for details. 2. Because all FOLL_PIN-acquired pages must be released via put_user_page(), also convert the put_page() call over to put_user_pages(). Note that this effectively changes the code's behavior in vfio_iommu_type1.c: put_pfn(): it now ultimately calls set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Cc: Alex Williamson Signed-off-by: John Hubbard --- drivers/vfio/vfio_iommu_type1.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d864277ea16f..795e13f3ef08 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot) { if (!is_invalid_reserved_pfn(pfn)) { struct page *page = pfn_to_page(pfn); - if (prot & IOMMU_WRITE) - SetPageDirty(page); - put_page(page); + + put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE); return 1; } return 0; @@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, down_read(&mm->mmap_sem); if (mm == current->mm) { - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, -vmas); + ret = pin_longterm_pages(vaddr, 1, flags, page, vmas); } else { - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page, - vmas, NULL); + ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1, + flags, page, vmas, + NULL); /* * The lifetime of a vaddr_get_pfn() page pin is * userspace-controlled. In the fs-dax case this could @@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, */ if (ret > 0 && vma_is_fsdax(vmas[0])) { ret = -EOPNOTSUPP; - put_page(page[0]); + put_user_page(page[0]); } } up_read(&mm->mmap_sem); -- 2.23.0
[PATCH 18/19] mm/gup: remove support for gup(FOLL_LONGTERM)
Now that all other kernel callers of get_user_pages(FOLL_LONGTERM) have been converted to pin_longterm_pages(), lock it down: 1) Add an assertion to get_user_pages(), preventing callers from passing FOLL_LONGTERM (in addition to the existing assertion that prevents FOLL_PIN). 2) Remove the associated GUP_LONGTERM_BENCHMARK test. Signed-off-by: John Hubbard --- mm/gup.c | 8 mm/gup_benchmark.c | 9 + tools/testing/selftests/vm/gup_benchmark.c | 7 ++- 3 files changed, 7 insertions(+), 17 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index e51b3820a995..9a28935a2cb1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1744,11 +1744,11 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, struct vm_area_struct **vmas) { /* -* As detailed above, FOLL_PIN must only be set internally by the -* pin_user_page*() and pin_longterm_*() APIs, never directly by the -* caller, so enforce that with an assertion: +* As detailed above, FOLL_PIN and FOLL_LONGTERM must only be set +* internally by the pin_user_page*() and pin_longterm_*() APIs, never +* directly by the caller, so enforce that with an assertion: */ - if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM))) return -EINVAL; return __gup_longterm_locked(current, current->mm, start, nr_pages, diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c index 2bb0f5df4803..de6941855b7e 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_benchmark.c @@ -6,7 +6,7 @@ #include #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +/* Command 2 has been deleted. */ #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) #define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) #define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) @@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages) switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: for (i = 0; i < nr_pages; i++) put_page(pages[i]); @@ -94,11 +93,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages_fast(addr, nr, gup->flags & 1, pages + i); break; - case GUP_LONGTERM_BENCHMARK: - nr = get_user_pages(addr, nr, - (gup->flags & 1) | FOLL_LONGTERM, - pages + i, NULL); - break; case GUP_BENCHMARK: nr = get_user_pages(addr, nr, gup->flags & 1, pages + i, NULL); @@ -157,7 +151,6 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: case PIN_FAST_BENCHMARK: case PIN_LONGTERM_BENCHMARK: diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c index c5c934c0f402..5ef3cf8f3da5 100644 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ b/tools/testing/selftests/vm/gup_benchmark.c @@ -15,7 +15,7 @@ #define PAGE_SIZE sysconf(_SC_PAGESIZE) #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +/* Command 2 has been deleted. 
*/ #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) /* @@ -46,7 +46,7 @@ int main(int argc, char **argv) char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) { switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -72,9 +72,6 @@ int main(int argc, char **argv) case 'T': thp = 0; break; - case 'L': - cmd = GUP_LONGTERM_BENCHMARK; - break; case 'U': cmd = GUP_BENCHMARK; break; -- 2.23.0
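For contrast with the assertion added above, the wrapper side looks roughly like this (an approximation of what pin_longterm_pages() does in mm/gup.c, not a verbatim copy): the wrapper, never the caller, supplies the internal-only flags and then calls into the gup core via __gup_longterm_locked():

static long pin_longterm_pages_sketch(unsigned long start, unsigned long nr_pages,
				      unsigned int gup_flags, struct page **pages,
				      struct vm_area_struct **vmas)
{
	/* FOLL_PIN and FOLL_GET are mutually exclusive for a given call. */
	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
		return -EINVAL;

	/* The internal-only flags are added here, never at the call site. */
	gup_flags |= FOLL_PIN | FOLL_LONGTERM;

	return __gup_longterm_locked(current, current->mm, start, nr_pages,
				     pages, vmas, gup_flags);
}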
[PATCH 16/19] mm/gup_benchmark: support pin_user_pages() and related calls
Up until now, gup_benchmark supported testing of the following kernel functions: * get_user_pages(): via the '-U' command line option * get_user_pages_longterm(): via the '-L' command line option * get_user_pages_fast(): as the default (no options required) Add test coverage for the new corresponding pin_*() functions: * pin_user_pages(): via the '-c' command line option * pin_longterm_pages(): via the '-b' command line option * pin_user_pages_fast(): via the '-a' command line option Also, add an option for clarity: '-u' for what is now (still) the default choice: get_user_pages_fast(). Also, for the three commands that set FOLL_PIN, verify that the pages really are dma-pinned, via the new page_dma_pinned() routine. Those commands are: PIN_FAST_BENCHMARK : calls pin_user_pages_fast() PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages() PIN_BENCHMARK : calls pin_user_pages() In between the calls to pin_*() and put_user_pages(), check each page: if page_dma_pinned() returns false, then WARN and return. Do this outside of the benchmark timestamps, so that it doesn't affect reported times. Signed-off-by: John Hubbard --- mm/gup_benchmark.c | 74 -- tools/testing/selftests/vm/gup_benchmark.c | 23 ++- 2 files changed, 91 insertions(+), 6 deletions(-) diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c index 7dd602d7f8db..2bb0f5df4803 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_benchmark.c @@ -8,6 +8,9 @@ #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) +#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) +#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) +#define PIN_BENCHMARK _IOWR('g', 6, struct gup_benchmark) struct gup_benchmark { __u64 get_delta_usec; @@ -19,6 +22,44 @@ struct gup_benchmark { __u64 expansion[10];/* For future use */ }; +static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages) +{ + int i; + + switch (cmd) { + case GUP_FAST_BENCHMARK: + case GUP_LONGTERM_BENCHMARK: + case GUP_BENCHMARK: + for (i = 0; i < nr_pages; i++) + put_page(pages[i]); + break; + + case PIN_FAST_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: + case PIN_BENCHMARK: + put_user_pages(pages, nr_pages); + break; + } +} + +static void verify_dma_pinned(int cmd, struct page **pages, + unsigned long nr_pages) +{ + int i; + + switch (cmd) { + case PIN_FAST_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: + case PIN_BENCHMARK: + for (i = 0; i < nr_pages; i++) { + if (WARN(!page_dma_pinned(pages[i]), +"pages[%d] is NOT dma-pinned\n", i)) + break; + } + break; + } +} + static int __gup_benchmark_ioctl(unsigned int cmd, struct gup_benchmark *gup) { @@ -62,6 +103,19 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages(addr, nr, gup->flags & 1, pages + i, NULL); break; + case PIN_FAST_BENCHMARK: + nr = pin_user_pages_fast(addr, nr, gup->flags & 1, +pages + i); + break; + case PIN_LONGTERM_BENCHMARK: + nr = pin_longterm_pages(addr, nr, + (gup->flags & 1), + pages + i, NULL); + break; + case PIN_BENCHMARK: + nr = pin_user_pages(addr, nr, gup->flags & 1, pages + i, + NULL); + break; default: return -1; } @@ -72,15 +126,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd, } end_time = ktime_get(); + /* Shifting the meaning of nr_pages: now it is actual number pinned: */ + nr_pages = i; + gup->get_delta_usec = ktime_us_delta(end_time, start_time); gup->size = addr - gup->addr; + /* +* Take an un-benchmark-timed moment to
verify DMA pinned +* state: print a warning if any non-dma-pinned pages are found: +*/ + verify_dma_pinned(cmd, pages, nr_pages); + start_time = ktime_get(); - for (i = 0; i < nr_pages; i++) { - if (!pages[i]) - break; - put_page(pages[i]); - } + + put_back_pages(cmd, pages, nr_pages); + end_ti
[PATCH 15/19] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
1. Convert from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages(). 2. As required by pin_user_pages(), release these pages via put_user_page(). In this case, do so via put_user_pages_dirty_lock(). That has the side effect of calling set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] 3. Release each page in mem->hpages[] (instead of mem->hpas[]), because that is the array that pin_longterm_pages() filled in. This is more accurate and should be a little safer from a maintenance point of view. [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Signed-off-by: John Hubbard --- arch/powerpc/mm/book3s64/iommu_api.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c index 56cc84520577..69d79cb50d47 100644 --- a/arch/powerpc/mm/book3s64/iommu_api.c +++ b/arch/powerpc/mm/book3s64/iommu_api.c @@ -103,9 +103,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, for (entry = 0; entry < entries; entry += chunk) { unsigned long n = min(entries - entry, chunk); - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n, - FOLL_WRITE | FOLL_LONGTERM, - mem->hpages + entry, NULL); + ret = pin_longterm_pages(ua + (entry << PAGE_SHIFT), n, +FOLL_WRITE, mem->hpages + entry, NULL); if (ret == n) { pinned += n; continue; @@ -167,9 +166,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, return 0; free_exit: - /* free the reference taken */ - for (i = 0; i < pinned; i++) - put_page(mem->hpages[i]); + /* free the references taken */ + put_user_pages(mem->hpages, pinned); vfree(mem->hpas); kfree(mem); @@ -212,10 +210,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem) if (!page) continue; - if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY) - SetPageDirty(page); + put_user_pages_dirty_lock(&mem->hpages[i], 1, + MM_IOMMU_TABLE_GROUP_PAGE_DIRTY); - put_page(page); mem->hpas[i] = 0; } } -- 2.23.0
[PATCH 19/19] Documentation/vm: add pin_user_pages.rst
Document the new pin_user_pages() and related calls and behavior. Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this documentation. (I've reworded it and expanded on it slightly.) Cc: Jonathan Corbet Signed-off-by: John Hubbard --- Documentation/vm/index.rst | 1 + Documentation/vm/pin_user_pages.rst | 213 2 files changed, 214 insertions(+) create mode 100644 Documentation/vm/pin_user_pages.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index e8d943b21cf9..7194efa3554a 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -44,6 +44,7 @@ descriptions of data structures and algorithms. page_migration page_frags page_owner + pin_user_pages remap_file_pages slub split_page_table_lock diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst new file mode 100644 index ..7110bca3f188 --- /dev/null +++ b/Documentation/vm/pin_user_pages.rst @@ -0,0 +1,213 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +pin_user_pages() and related calls + + +.. contents:: :local: + +Overview + + +This document describes the following functions: :: + + pin_user_pages + pin_user_pages_fast + pin_user_pages_remote + + pin_longterm_pages + pin_longterm_pages_fast + pin_longterm_pages_remote + +Basic description of FOLL_PIN += + +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has +significant interactions and interdependencies with FOLL_LONGTERM, so both are +covered here. + +Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither +FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows +the associated wrapper functions (pin_user_pages and others) to set the correct +combination of these flags, and to check for problems as well. + +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, +multiple threads and call sites are free to pin the same struct pages, via both +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the +other, not the struct page(s). + +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN +uses a different reference counting technique. + +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, +FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN. + +Which flags are set by each wrapper +=== + +Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to +whatever flags the caller provides:: + + Function gup flags (FOLL_PIN or FOLL_LONGTERM only) + -- + pin_user_pages FOLL_PIN + pin_user_pages_fast FOLL_PIN + pin_user_pages_remote FOLL_PIN + + pin_longterm_pages FOLL_PIN | FOLL_LONGTERM + pin_longterm_pages_fast FOLL_PIN | FOLL_LONGTERM + pin_longterm_pages_remote FOLL_PIN | FOLL_LONGTERM + +Tracking dma-pinned pages += + +Some of the key design constraints, and solutions, for tracking dma-pinned +pages: + +* An actual reference count, per struct page, is required. This is because + multiple processes may pin and unpin a page. + +* False positives (reporting that a page is dma-pinned, when in fact it is not) + are acceptable, but false negatives are not. + +* struct page may not be increased in size for this, and all fields are already + used. + +* Given the above, we can overload the page->_refcount field by using, sort of, + the upper bits in that field for a dma-pinned count.
"Sort of", means that, + rather than dividing page->_refcount into bit fields, we simple add a medium- + large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to + page->_refcount. This provides fuzzy behavior: if a page has get_page() called + on it 1024 times, then it will appear to have a single dma-pinned count. + And again, that's acceptable. + +This also leads to limitations: there are only 32-10==22 bits available for a +counter that increments 10 bits at a time. + +TODO: for 1GB and larger huge pages, this is cutting it close. That's because +when pin_user_pages() follows such pages, it increments the head page by "1" +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for +pin_user_pages()) for each tail page. So if you have a 1GB huge page: + +* There are 256K (18 bits) worth of 4 KB tail pages. +* There are 22 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, + 10 bits at a time) +* There are 22 - 18 == 4 bits available to count. Except that there aren't, + because you need to allow for a few normal get_page() calls on the head page, + as well. Fortunately, the approach of us