[PATCH v5 2/5] powerpc/kprobes: Mark newly allocated probes as RO
With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one W+X page at boot by default. This can be tested with CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the kernel log during boot. powerpc doesn't implement its own alloc() for kprobes like other architectures do, but we couldn't immediately mark RO anyway since we do a memcpy to the page we allocate later. After that, nothing should be allowed to modify the page, and write permissions are removed well before the kprobe is armed. Thus mark newly allocated probes as read-only once it's safe to do so. Signed-off-by: Russell Currey --- arch/powerpc/kernel/kprobes.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 2d27ec4feee4..2610496de7c7 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -24,6 +24,7 @@ #include #include #include +#include DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL; DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); @@ -131,6 +132,8 @@ int arch_prepare_kprobe(struct kprobe *p) (unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t)); } + set_memory_ro((unsigned long)p->ainsn.insn, 1); + p->ainsn.boostable = 0; return ret; } -- 2.23.0
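For readability (the hunk above is hard to follow inline), here is a simplified sketch of the ordering the patch relies on: copy into the newly allocated slot first, flush the icache, then drop write permission using the set_memory_ro() helper added in patch 1/5. This is an illustration with a made-up function name, not the literal arch_prepare_kprobe() body:

#include <linux/kprobes.h>
#include <linux/string.h>
#include <linux/set_memory.h>
#include <asm/cacheflush.h>

/* Illustrative only: mirrors the ordering described above. */
static int example_prepare_insn_slot(struct kprobe *p)
{
	p->ainsn.insn = get_insn_slot();	/* page is still writable here */
	if (!p->ainsn.insn)
		return -ENOMEM;

	/* last write to the page: copy the probed instruction */
	memcpy(p->ainsn.insn, p->addr, sizeof(kprobe_opcode_t));
	flush_icache_range((unsigned long)p->ainsn.insn,
			   (unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));

	/* nothing may modify the slot from here on, well before arming */
	set_memory_ro((unsigned long)p->ainsn.insn, 1);
	return 0;
}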
[PATCH v5 1/5] powerpc/mm: Implement set_memory() routines
The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, and are generally useful primitives to have. This implementation is designed to be completely generic across powerpc's many MMUs. It's possible that this could be optimised to be faster for specific MMUs, but the focus is on having a generic and safe implementation for now. This implementation does not handle cases where the caller is attempting to change the mapping of the page it is executing from, or if another CPU is concurrently using the page being altered. These cases likely shouldn't happen, but a more complex implementation with MMU-specific code could safely handle them, so that is left as a TODO for now. Signed-off-by: Russell Currey --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c| 77 +++ 4 files changed, 111 insertions(+) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3e56c9c2f16e..8f7005f0d097 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -133,6 +133,7 @@ config PPC select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 + select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h new file mode 100644 index ..5230ddb2fefd --- /dev/null +++ b/arch/powerpc/include/asm/set_memory.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_SET_MEMORY_H +#define _ASM_POWERPC_SET_MEMORY_H + +#define SET_MEMORY_RO 1 +#define SET_MEMORY_RW 2 +#define SET_MEMORY_NX 3 +#define SET_MEMORY_X 4 + +int change_memory_attr(unsigned long addr, int numpages, int action); + +static inline int set_memory_ro(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RO); +} + +static inline int set_memory_rw(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RW); +} + +static inline int set_memory_nx(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_NX); +} + +static inline int set_memory_x(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_X); +} + +#endif diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 5e147986400d..d0a0bcbc9289 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o obj-$(CONFIG_PPC_PTDUMP) += ptdump/ obj-$(CONFIG_KASAN)+= kasan/ +obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pageattr.o diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c new file mode 100644 index ..aedd79173a44 --- /dev/null +++ b/arch/powerpc/mm/pageattr.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * MMU-generic set_memory implementation for powerpc + * + * Copyright 2019, IBM Corporation. + */ + +#include +#include + +#include +#include +#include + + +/* + * Updates the attributes of a page in three steps: + * + * 1. invalidate the page table entry + * 2. flush the TLB + * 3. 
install the new entry with the updated attributes + * + * This is unsafe if the caller is attempting to change the mapping of the + * page it is executing from, or if another CPU is concurrently using the + * page being altered. + * + * TODO make the implementation resistant to this. + */ +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) +{ + int action = *((int *)data); + pte_t pte_val; + int ret = 0; + + spin_lock(&init_mm.page_table_lock); + + // invalidate the PTE so it's safe to modify + pte_val = ptep_get_and_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + // modify the PTE bits as desired, then apply + switch (action) { + case SET_MEMORY_RO: + pte_val = pte_wrprotect(pte_val); + break; + case SET_MEMORY_RW: + pte_val = pte_mkwrite(pte_val); + break; + case SET_MEMORY_NX: + pte_val = pte_exprotect(pte_val); + break; + case SET_MEMORY_X: + pte_val = pte_mkexec(pte_val); + break; + default: + WARN_ON(true); + ret = -EINVAL; + goto out; +
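The quoted pageattr.c hunk is truncated just after the switch statement. As a hedged sketch of how the file plausibly continues: change_page_attr() finishes by installing the updated PTE and releasing init_mm.page_table_lock, and the change_memory_attr() entry point declared in set_memory.h applies it across the requested range. The use of apply_to_page_range() here is an assumption for illustration, not a quote of the submitted code; it relies on the file's existing #includes (linux/mm.h etc.).

/*
 * Sketch only (assumes apply_to_page_range() does the per-page walk):
 * change_page_attr() runs once per PTE of the range under init_mm, and
 * change_memory_attr() is the entry point behind set_memory_ro() and
 * friends.
 */
int change_memory_attr(unsigned long addr, int numpages, int action)
{
	unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
	unsigned long size = numpages * PAGE_SIZE;

	if (!numpages)
		return 0;

	return apply_to_page_range(&init_mm, start, size,
				   change_page_attr, &action);
}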
[PATCH v5 3/5] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
Very rudimentary, just echo 1 > [debugfs]/check_wx_pages and check the kernel log. Useful for testing strict module RWX. Updated the Kconfig entry to reflect this. Also fixed a typo. Signed-off-by: Russell Currey --- arch/powerpc/Kconfig.debug | 6 -- arch/powerpc/mm/ptdump/ptdump.c | 21 - 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug index c59920920ddc..dcfe83d4c211 100644 --- a/arch/powerpc/Kconfig.debug +++ b/arch/powerpc/Kconfig.debug @@ -370,7 +370,7 @@ config PPC_PTDUMP If you are unsure, say N. config PPC_DEBUG_WX - bool "Warn on W+X mappings at boot" + bool "Warn on W+X mappings at boot & enable manual checks at runtime" depends on PPC_PTDUMP help Generate a warning if any W+X mappings are found at boot. @@ -384,7 +384,9 @@ config PPC_DEBUG_WX of other unfixed kernel bugs easier. There is no runtime or memory usage effect of this option - once the kernel has booted up - it's a one time check. + once the kernel has booted up, it only automatically checks once. + + Enables the "check_wx_pages" debugfs entry for checking at runtime. If in doubt, say "Y". diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 2f9ddc29c535..b6cba29ae4a0 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -4,7 +4,7 @@ * * This traverses the kernel pagetables and dumps the * information about the used sections of memory to - * /sys/kernel/debug/kernel_pagetables. + * /sys/kernel/debug/kernel_page_tables. * * Derived from the arm64 implementation: * Copyright (c) 2014, The Linux Foundation, Laura Abbott. @@ -409,6 +409,25 @@ void ptdump_check_wx(void) else pr_info("Checked W+X mappings: passed, no W+X pages found\n"); } + +static int check_wx_debugfs_set(void *data, u64 val) +{ + if (val != 1ULL) + return -EINVAL; + + ptdump_check_wx(); + + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(check_wx_fops, NULL, check_wx_debugfs_set, "%llu\n"); + +static int ptdump_check_wx_init(void) +{ + return debugfs_create_file("check_wx_pages", 0200, NULL, + NULL, &check_wx_fops) ? 0 : -ENOMEM; +} +device_initcall(ptdump_check_wx_init); #endif static int ptdump_init(void) -- 2.23.0
[PATCH v5 5/5] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig
skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX enabled, and if you want memory protection for kernel text you'd want it for modules too, so enable STRICT_MODULE_RWX there. Signed-off-by: Russell Currey --- arch/powerpc/configs/skiroot_defconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/configs/skiroot_defconfig b/arch/powerpc/configs/skiroot_defconfig index 1253482a67c0..719d899081b3 100644 --- a/arch/powerpc/configs/skiroot_defconfig +++ b/arch/powerpc/configs/skiroot_defconfig @@ -31,6 +31,7 @@ CONFIG_PERF_EVENTS=y CONFIG_SLAB_FREELIST_HARDENED=y CONFIG_JUMP_LABEL=y CONFIG_STRICT_KERNEL_RWX=y +CONFIG_STRICT_MODULE_RWX=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_SIG=y -- 2.23.0
[PATCH v5 4/5] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
To enable strict module RWX on powerpc, set: CONFIG_STRICT_MODULE_RWX=y You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real security benefit. ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX. This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that makes STRICT_MODULE_RWX *on by default* in configurations where STRICT_KERNEL_RWX is *unavailable*. Since this doesn't make much sense, and module RWX without kernel RWX doesn't make much sense, having the same dependencies as kernel RWX works around this problem. Signed-off-by: Russell Currey --- This means that Daniel's test on book3e64 is pretty useless since we've gone from being unable to turn it *off* to being unable to turn it *on*. I think this is the right course of action for now. arch/powerpc/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8f7005f0d097..52028e27f2d3 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -135,6 +135,7 @@ config PPC select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) + select ARCH_HAS_STRICT_MODULE_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE select ARCH_HAS_UACCESS_MCSAFE if PPC64 -- 2.23.0
[PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
v4 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html v3 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html Changes since v4: [1/5]: Addressed review comments from Michael Ellerman (thanks!) [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid STRICT_MODULE_RWX being *on by default* in cases where STRICT_KERNEL_RWX is *unavailable* [5/5]: split skiroot_defconfig changes out into its own patch The whole Kconfig situation is really weird and confusing, I believe the correct resolution is to change arch/Kconfig but the consequences are so minor that I don't think it's worth it, especially given that I expect powerpc to have mandatory strict RWX Soon(tm). Russell Currey (5): powerpc/mm: Implement set_memory() routines powerpc/kprobes: Mark newly allocated probes as RO powerpc/mm/ptdump: debugfs handler for W+X checks at runtime powerpc: Set ARCH_HAS_STRICT_MODULE_RWX powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig arch/powerpc/Kconfig | 2 + arch/powerpc/Kconfig.debug | 6 +- arch/powerpc/configs/skiroot_defconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/kernel/kprobes.c | 3 + arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c | 77 ++ arch/powerpc/mm/ptdump/ptdump.c| 21 ++- 8 files changed, 140 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c -- 2.23.0
Re: [PATCH 6/6] s390x: Mark archrandom.h functions __must_check
On 29.10.19 14:18, Richard Henderson wrote: > On 10/29/19 8:26 AM, Harald Freudenberger wrote: >> Fine with me, Thanks, reviewed, build and tested. >> You may add my reviewed-by: Harald Freudenberger >> However, will this go into the kernel tree via crypto or s390 subsystem ? > That's an excellent question. > > As an API decision, perhaps going via crypto makes more sense, > but none of the patches are dependent on one another, so they > could go through separate architecture trees. > > It has been a long time since I have done much kernel work; > I'm open to suggestions on the subject. > > > r~ Since the change needs to be done in include/linux/random.h and in parallel with all the arch files in arch/xxx/include/asm/archrandom.h it should go in one shot. I'd suggest to post the patch series to linux-crypto and let Herbert Xu handle this.
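For context, a rough illustration of the annotation the series applies (placeholder bodies, not the actual s390 implementation): each archrandom helper returns a success flag, and __must_check makes the compiler warn whenever a caller discards it.

#include <linux/compiler.h>
#include <linux/types.h>

static inline bool __must_check arch_get_random_long(unsigned long *v)
{
	return false;	/* placeholder body, not the real s390 code */
}

static void example_caller(void)
{
	unsigned long seed;

	/* Discarding the return value now triggers a warning; the value
	 * in 'seed' is only meaningful when the helper returns true. */
	if (!arch_get_random_long(&seed))
		seed = 0;
}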
Re: [PATCH v5 1/5] powerpc/mm: Implement set_memory() routines
Le 30/10/2019 à 08:31, Russell Currey a écrit : The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX, and are generally useful primitives to have. This implementation is designed to be completely generic across powerpc's many MMUs. It's possible that this could be optimised to be faster for specific MMUs, but the focus is on having a generic and safe implementation for now. This implementation does not handle cases where the caller is attempting to change the mapping of the page it is executing from, or if another CPU is concurrently using the page being altered. These cases likely shouldn't happen, but a more complex implementation with MMU-specific code could safely handle them, so that is left as a TODO for now. Signed-off-by: Russell Currey --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c| 77 +++ 4 files changed, 111 insertions(+) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3e56c9c2f16e..8f7005f0d097 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -133,6 +133,7 @@ config PPC select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64 + select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h new file mode 100644 index ..5230ddb2fefd --- /dev/null +++ b/arch/powerpc/include/asm/set_memory.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_SET_MEMORY_H +#define _ASM_POWERPC_SET_MEMORY_H + +#define SET_MEMORY_RO 1 +#define SET_MEMORY_RW 2 +#define SET_MEMORY_NX 3 +#define SET_MEMORY_X 4 + +int change_memory_attr(unsigned long addr, int numpages, int action); + +static inline int set_memory_ro(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RO); +} + +static inline int set_memory_rw(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_RW); +} + +static inline int set_memory_nx(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_NX); +} + +static inline int set_memory_x(unsigned long addr, int numpages) +{ + return change_memory_attr(addr, numpages, SET_MEMORY_X); +} + +#endif diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 5e147986400d..d0a0bcbc9289 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o obj-$(CONFIG_PPC_PTDUMP) += ptdump/ obj-$(CONFIG_KASAN) += kasan/ +obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pageattr.o diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c new file mode 100644 index ..aedd79173a44 --- /dev/null +++ b/arch/powerpc/mm/pageattr.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * MMU-generic set_memory implementation for powerpc + * + * Copyright 2019, IBM Corporation. + */ + +#include +#include + +#include +#include +#include + + +/* + * Updates the attributes of a page in three steps: + * + * 1. invalidate the page table entry + * 2. flush the TLB + * 3. 
install the new entry with the updated attributes + * + * This is unsafe if the caller is attempting to change the mapping of the + * page it is executing from, or if another CPU is concurrently using the + * page being altered. + * + * TODO make the implementation resistant to this. + */ +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) I don't like too much the way you are making this function more complex and less readable with local var, goto, etc ... You could just keep the v4 version of change_page_attr(), rename it __change_page_attr(), then add: static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) { int ret; spin_lock(&init_mm.page_table_lock); ret = __change_page_attr(ptep, addr, data); spin_unlock(&init_mm.page_table_lock); return ret; } Christophe +{ + int action = *((int *)data); + pte_t pte_val; + int ret = 0; + + spin_lock(&init_mm.page_table_lock); + + // invalidate the PTE so it's safe to modify + pte_val = ptep_get_and_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + // modify the PTE
Re: [PATCH 2/2] ASoC: fsl_asrc: Add support for imx8qm
On Wed, Oct 30, 2019 at 03:20:35AM +, S.j. Wang wrote: > Hi > > > > > On Tue, Oct 29, 2019 at 05:17:09PM +0800, Shengjiu Wang wrote: > > > There are two asrc module in imx8qm, each module has different clock > > > configuration, and the DMA type is EDMA. > > > > > > So in this patch, we define the new clocks, refine the clock map, and > > > include struct fsl_asrc_soc_data for different soc usage. > > > > > > The EDMA channel is fixed with each dma request, one dma request > > > corresponding to one dma channel. So we need to request dma channel > > > with dma request of asrc module. > > > > > > Signed-off-by: Shengjiu Wang > > > --- > > > sound/soc/fsl/fsl_asrc.c | 91 +--- > > > sound/soc/fsl/fsl_asrc.h | 65 +- > > > sound/soc/fsl/fsl_asrc_dma.c | 39 > > > 3 files changed, 167 insertions(+), 28 deletions(-) > > > > > diff --git a/sound/soc/fsl/fsl_asrc_dma.c > > > b/sound/soc/fsl/fsl_asrc_dma.c index d6146de9acd2..dbb07a486504 > > 100644 > > > --- a/sound/soc/fsl/fsl_asrc_dma.c > > > +++ b/sound/soc/fsl/fsl_asrc_dma.c > > > @@ -199,19 +199,40 @@ static int fsl_asrc_dma_hw_params(struct > > > snd_soc_component *component, > > > > > > /* Get DMA request of Back-End */ > > > tmp_chan = dma_request_slave_channel(dev_be, tx ? "tx" : "rx"); > > > - tmp_data = tmp_chan->private; > > > - pair->dma_data.dma_request = tmp_data->dma_request; > > > - dma_release_channel(tmp_chan); > > > + /* tmp_chan may be NULL for it is already allocated by Back-End */ > > > + if (tmp_chan) { > > > + tmp_data = tmp_chan->private; > > > + if (tmp_data) > > > + pair->dma_data.dma_request = > > > + tmp_data->dma_request; > > > > If this patch is supposed to add a !tmp_chan case for EDMA, we probably > > shouldn't mute the !tmp_data case because dma_request will be NULL, > > although the code previously didn't have a check either. I mean we might > > need to error-out for !tmp_chan. Or... > > is this intentional? > > > > Yes, intentional. May be we can change to > > if (!asrc_priv->soc->use_edma) { > /* Get DMA request of Back-End */ > tmp_chan = dma_request_slave_channel(dev_be, tx ? "tx" : > "rx"); > tmp_data = tmp_chan->private; > pair->dma_data.dma_request = tmp_data->dma_request; > dma_release_channel(tmp_chan); > > /* Get DMA request of Front-End */ > tmp_chan = fsl_asrc_get_dma_channel(pair, dir); > tmp_data = tmp_chan->private; > pair->dma_data.dma_request2 = tmp_data->dma_request; > pair->dma_data.peripheral_type = tmp_data->peripheral_type; > pair->dma_data.priority = tmp_data->priority; > dma_release_channel(tmp_chan); > } Oh...now I understand..yea, I think this would be better. Would you please change it in v2? I am fine with other places, so may add: Acked-by: Nicolin Chen Thanks
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
Le 30/10/2019 à 08:31, Russell Currey a écrit : v4 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html v3 cover letter: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html Changes since v4: [1/5]: Addressed review comments from Michael Ellerman (thanks!) [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid STRICT_MODULE_RWX being *on by default* in cases where STRICT_KERNEL_RWX is *unavailable* [5/5]: split skiroot_defconfig changes out into its own patch The whole Kconfig situation is really weird and confusing, I believe the correct resolution is to change arch/Kconfig but the consequences are so minor that I don't think it's worth it, especially given that I expect powerpc to have mandatory strict RWX Soon(tm).

I'm not sure strict RWX can be made mandatory due to the impact it has on some subarches:

- On the 8xx, unless all areas are 8Mbytes aligned, there is a significant overhead on TLB misses. And aligning everything to 8M is a waste of RAM, which is not acceptable on systems with very little RAM.

- On hash book3s32, we are able to map the kernel with BATs. With a few alignment constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to provide exec protection on page granularity. Only on 256Mbytes segments. So for modules, we have to have the vmspace X. It is also not possible to have a kernel area RO. Only user areas can be made RO.

Christophe

Russell Currey (5): powerpc/mm: Implement set_memory() routines powerpc/kprobes: Mark newly allocated probes as RO powerpc/mm/ptdump: debugfs handler for W+X checks at runtime powerpc: Set ARCH_HAS_STRICT_MODULE_RWX powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig arch/powerpc/Kconfig | 2 + arch/powerpc/Kconfig.debug | 6 +- arch/powerpc/configs/skiroot_defconfig | 1 + arch/powerpc/include/asm/set_memory.h | 32 +++ arch/powerpc/kernel/kprobes.c | 3 + arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/pageattr.c | 77 ++ arch/powerpc/mm/ptdump/ptdump.c| 21 ++- 8 files changed, 140 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/set_memory.h create mode 100644 arch/powerpc/mm/pageattr.c
[PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
When passing the return value of dev_to_node() to cpumask_of_node() without checking if the device's node id is NUMA_NO_NODE, there is global-out-of-bounds detected by KASAN. >From the discussion [1], NUMA_NO_NODE really means no node affinity, which also means all cpus should be usable. So the cpumask_of_node() should always return all cpus online when user passes the node id as NUMA_NO_NODE, just like similar semantic that page allocator handles NUMA_NO_NODE. But we cannot really copy the page allocator logic. Simply because the page allocator doesn't enforce the near node affinity. It just picks it up as a preferred node but then it is free to fallback to any other numa node. This is not the case here and node_to_cpumask_map will only restrict to the particular node's cpus which would have really non deterministic behavior depending on where the code is executed. So in fact we really want to return cpu_online_mask for NUMA_NO_NODE. Also there is a debugging version of node_to_cpumask_map() for x86 and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). [1] https://lkml.org/lkml/2019/9/11/66 Signed-off-by: Yunsheng Lin Suggested-by: Michal Hocko Acked-by: Michal Hocko Acked-by: Paul Burton # MIPS bits --- V7: replace -1 with NUMA_NO_NODE for mips ip27 as suggested by Paul. V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a little controversial, may need deeper investigation, and rebased on the latest linux-next. V5: Drop unsigned "fix" change for x86/arm64, and change comment log according to Michal's comment. V4: Have all these changes in a single patch. V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask for NUMA_NO_NODE case, and change the commit log to better justify the change. V2: make the node id checking change to other arches too. --- arch/arm64/include/asm/numa.h| 3 +++ arch/arm64/mm/numa.c | 3 +++ arch/mips/include/asm/mach-ip27/topology.h | 2 +- arch/mips/include/asm/mach-loongson64/topology.h | 4 +++- arch/s390/include/asm/topology.h | 3 +++ arch/x86/include/asm/topology.h | 3 +++ arch/x86/mm/numa.c | 3 +++ 7 files changed, 19 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h index 626ad01..c8a4b31 100644 --- a/arch/arm64/include/asm/numa.h +++ b/arch/arm64/include/asm/numa.h @@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node); /* Returns a pointer to the cpumask of CPUs on Node 'node'. */ static inline const struct cpumask *cpumask_of_node(int node) { + if (node == NUMA_NO_NODE) + return cpu_online_mask; + return node_to_cpumask_map[node]; } #endif diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index 4decf16..5ae7eea 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map); */ const struct cpumask *cpumask_of_node(int node) { + if (node == NUMA_NO_NODE) + return cpu_online_mask; + if (WARN_ON(node >= nr_node_ids)) return cpu_none_mask; diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h index 965f079..db293cf 100644 --- a/arch/mips/include/asm/mach-ip27/topology.h +++ b/arch/mips/include/asm/mach-ip27/topology.h @@ -15,7 +15,7 @@ struct cpuinfo_ip27 { extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS]; #define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid) -#define cpumask_of_node(node) ((node) == -1 ? \ +#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ? 
\ cpu_all_mask : \ &hub_data(node)->h_cpus) struct pci_bus; diff --git a/arch/mips/include/asm/mach-loongson64/topology.h b/arch/mips/include/asm/mach-loongson64/topology.h index 7ff819a..e78daa6 100644 --- a/arch/mips/include/asm/mach-loongson64/topology.h +++ b/arch/mips/include/asm/mach-loongson64/topology.h @@ -5,7 +5,9 @@ #ifdef CONFIG_NUMA #define cpu_to_node(cpu) (cpu_logical_map(cpu) >> 2) -#define cpumask_of_node(node) (&__node_data[(node)]->cpumask) +#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ? \ +cpu_online_mask : \ +&__node_data[(node)]->cpumask) struct pci_bus; extern int pcibus_to_node(struct pci_bus *); diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h index cca406f..1bd2e73 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -78,6 +78,9 @@ static inline int cpu_to_node(int cpu) #define cpumask_of_node cpumask_of_node static inline const struct cpumask
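A hypothetical caller pattern (not code from the patch) showing what the change protects against: a device with no firmware-assigned node reports NUMA_NO_NODE, and using that value to index node_to_cpumask_map[] is the out-of-bounds access that KASAN flags. With this patch, cpumask_of_node(NUMA_NO_NODE) returns cpu_online_mask instead.

#include <linux/device.h>
#include <linux/topology.h>
#include <linux/cpumask.h>

static void example_show_device_cpus(struct device *dev)
{
	int nid = dev_to_node(dev);	/* may legitimately be NUMA_NO_NODE */
	const struct cpumask *mask = cpumask_of_node(nid);

	/* Before this patch, nid == NUMA_NO_NODE (-1) indexed the map out
	 * of bounds on several architectures. */
	pr_info("device on node %d, usable CPUs %*pbl\n",
		nid, cpumask_pr_args(mask));
}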
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > When passing the return value of dev_to_node() to cpumask_of_node() > without checking if the device's node id is NUMA_NO_NODE, there is > global-out-of-bounds detected by KASAN. > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > which also means all cpus should be usable. So the cpumask_of_node() > should always return all cpus online when user passes the node id as > NUMA_NO_NODE, just like similar semantic that page allocator handles > NUMA_NO_NODE. > > But we cannot really copy the page allocator logic. Simply because the > page allocator doesn't enforce the near node affinity. It just picks it > up as a preferred node but then it is free to fallback to any other numa > node. This is not the case here and node_to_cpumask_map will only restrict > to the particular node's cpus which would have really non deterministic > behavior depending on where the code is executed. So in fact we really > want to return cpu_online_mask for NUMA_NO_NODE. > > Also there is a debugging version of node_to_cpumask_map() for x86 and > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > [1] https://lkml.org/lkml/2019/9/11/66 > Signed-off-by: Yunsheng Lin > Suggested-by: Michal Hocko > Acked-by: Michal Hocko > Acked-by: Paul Burton # MIPS bits Still: Nacked-by: Peter Zijlstra (Intel)
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed 30-10-19 17:34:28, Yunsheng Lin wrote: > When passing the return value of dev_to_node() to cpumask_of_node() > without checking if the device's node id is NUMA_NO_NODE, there is > global-out-of-bounds detected by KASAN. > > >From the discussion [1], NUMA_NO_NODE really means no node affinity, > which also means all cpus should be usable. So the cpumask_of_node() > should always return all cpus online when user passes the node id as > NUMA_NO_NODE, just like similar semantic that page allocator handles > NUMA_NO_NODE. > > But we cannot really copy the page allocator logic. Simply because the > page allocator doesn't enforce the near node affinity. It just picks it > up as a preferred node but then it is free to fallback to any other numa > node. This is not the case here and node_to_cpumask_map will only restrict > to the particular node's cpus which would have really non deterministic > behavior depending on where the code is executed. So in fact we really > want to return cpu_online_mask for NUMA_NO_NODE. > > Also there is a debugging version of node_to_cpumask_map() for x86 and > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > [1] https://lkml.org/lkml/2019/9/11/66 Please do not use lkml.org links. They tend to break quite often. Use http://lkml.kernel.org/r/$msg_id or lore.kernel.org > Signed-off-by: Yunsheng Lin > Suggested-by: Michal Hocko > Acked-by: Michal Hocko > Acked-by: Paul Burton # MIPS bits > --- > V7: replace -1 with NUMA_NO_NODE for mips ip27 as suggested by Paul. > V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a > little controversial, may need deeper investigation, and rebased > on the latest linux-next. > V5: Drop unsigned "fix" change for x86/arm64, and change comment log > according to Michal's comment. > V4: Have all these changes in a single patch. > V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask > for NUMA_NO_NODE case, and change the commit log to better justify > the change. > V2: make the node id checking change to other arches too. > --- > arch/arm64/include/asm/numa.h| 3 +++ > arch/arm64/mm/numa.c | 3 +++ > arch/mips/include/asm/mach-ip27/topology.h | 2 +- > arch/mips/include/asm/mach-loongson64/topology.h | 4 +++- > arch/s390/include/asm/topology.h | 3 +++ > arch/x86/include/asm/topology.h | 3 +++ > arch/x86/mm/numa.c | 3 +++ > 7 files changed, 19 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h > index 626ad01..c8a4b31 100644 > --- a/arch/arm64/include/asm/numa.h > +++ b/arch/arm64/include/asm/numa.h > @@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node); > /* Returns a pointer to the cpumask of CPUs on Node 'node'. 
*/ > static inline const struct cpumask *cpumask_of_node(int node) > { > + if (node == NUMA_NO_NODE) > + return cpu_online_mask; > + > return node_to_cpumask_map[node]; > } > #endif > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > index 4decf16..5ae7eea 100644 > --- a/arch/arm64/mm/numa.c > +++ b/arch/arm64/mm/numa.c > @@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map); > */ > const struct cpumask *cpumask_of_node(int node) > { > + if (node == NUMA_NO_NODE) > + return cpu_online_mask; > + > if (WARN_ON(node >= nr_node_ids)) > return cpu_none_mask; > > diff --git a/arch/mips/include/asm/mach-ip27/topology.h > b/arch/mips/include/asm/mach-ip27/topology.h > index 965f079..db293cf 100644 > --- a/arch/mips/include/asm/mach-ip27/topology.h > +++ b/arch/mips/include/asm/mach-ip27/topology.h > @@ -15,7 +15,7 @@ struct cpuinfo_ip27 { > extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS]; > > #define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid) > -#define cpumask_of_node(node)((node) == -1 ? > \ > +#define cpumask_of_node(node)((node) == NUMA_NO_NODE ? > \ >cpu_all_mask : \ >&hub_data(node)->h_cpus) > struct pci_bus; > diff --git a/arch/mips/include/asm/mach-loongson64/topology.h > b/arch/mips/include/asm/mach-loongson64/topology.h > index 7ff819a..e78daa6 100644 > --- a/arch/mips/include/asm/mach-loongson64/topology.h > +++ b/arch/mips/include/asm/mach-loongson64/topology.h > @@ -5,7 +5,9 @@ > #ifdef CONFIG_NUMA > > #define cpu_to_node(cpu) (cpu_logical_map(cpu) >> 2) > -#define cpumask_of_node(node)(&__node_data[(node)]->cpumask) > +#define cpumask_of_node(node)((node) == NUMA_NO_NODE ? > \ > + cpu_online_mask : \ > + &__node_data[(node)]->cpumask) >
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed 30-10-19 11:14:49, Peter Zijlstra wrote: > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > > When passing the return value of dev_to_node() to cpumask_of_node() > > without checking if the device's node id is NUMA_NO_NODE, there is > > global-out-of-bounds detected by KASAN. > > > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > > which also means all cpus should be usable. So the cpumask_of_node() > > should always return all cpus online when user passes the node id as > > NUMA_NO_NODE, just like similar semantic that page allocator handles > > NUMA_NO_NODE. > > > > But we cannot really copy the page allocator logic. Simply because the > > page allocator doesn't enforce the near node affinity. It just picks it > > up as a preferred node but then it is free to fallback to any other numa > > node. This is not the case here and node_to_cpumask_map will only restrict > > to the particular node's cpus which would have really non deterministic > > behavior depending on where the code is executed. So in fact we really > > want to return cpu_online_mask for NUMA_NO_NODE. > > > > Also there is a debugging version of node_to_cpumask_map() for x86 and > > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > > > [1] https://lkml.org/lkml/2019/9/11/66 > > Signed-off-by: Yunsheng Lin > > Suggested-by: Michal Hocko > > Acked-by: Michal Hocko > > Acked-by: Paul Burton # MIPS bits > > Still: > > Nacked-by: Peter Zijlstra (Intel) Do you have any other proposal that doesn't make any wild guesses about which node to use instead of the undefined one? -- Michal Hocko SUSE Labs
Re: [PATCH v2 07/23] soc: fsl: qe: merge qe_ic.h into qe_ic.c
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : The local qe_ic.h header is only used by qe_ic.c, so merge its contents into the .c file. This is preparation for moving the driver to drivers/irqchip/. It also avoids confusion between this header and the one at include/soc/fsl/qe/qe_ic.h, which is included from a number of places (qe_ic.c among others). Signed-off-by: Rasmus Villemoes --- drivers/soc/fsl/qe/qe_ic.c | 91 ++- drivers/soc/fsl/qe/qe_ic.h | 108 - 2 files changed, 90 insertions(+), 109 deletions(-) delete mode 100644 drivers/soc/fsl/qe/qe_ic.h diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index d420492b4c23..7b1870d2866a 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -26,7 +26,96 @@ #include #include -#include "qe_ic.h" +#define NR_QE_IC_INTS 64 + +/* QE IC registers offset */ +#define QEIC_CICR 0x00 +#define QEIC_CIVEC 0x04 +#define QEIC_CRIPNR0x08 +#define QEIC_CIPNR 0x0c +#define QEIC_CIPXCC0x10 +#define QEIC_CIPYCC0x14 +#define QEIC_CIPWCC0x18 +#define QEIC_CIPZCC0x1c +#define QEIC_CIMR 0x20 +#define QEIC_CRIMR 0x24 +#define QEIC_CICNR 0x28 +#define QEIC_CIPRTA0x30 +#define QEIC_CIPRTB0x34 +#define QEIC_CRICR 0x3c +#define QEIC_CHIVEC0x60 + +/* Interrupt priority registers */ +#define CIPCC_SHIFT_PRI0 29 +#define CIPCC_SHIFT_PRI1 26 +#define CIPCC_SHIFT_PRI2 23 +#define CIPCC_SHIFT_PRI3 20 +#define CIPCC_SHIFT_PRI4 13 +#define CIPCC_SHIFT_PRI5 10 +#define CIPCC_SHIFT_PRI6 7 +#define CIPCC_SHIFT_PRI7 4 I think you should drop all unused consts and only keep the ones that are used. + +/* CICR priority modes */ +#define CICR_GWCC 0x0004 +#define CICR_GXCC 0x0002 +#define CICR_GYCC 0x0001 +#define CICR_GZCC 0x0008 +#define CICR_GRTA 0x0020 +#define CICR_GRTB 0x0040 +#define CICR_HPIT_SHIFT8 +#define CICR_HPIT_MASK 0x0300 +#define CICR_HP_SHIFT 24 +#define CICR_HP_MASK 0x3f00 + +/* CICNR */ +#define CICNR_WCC1T_SHIFT 20 +#define CICNR_ZCC1T_SHIFT 28 +#define CICNR_YCC1T_SHIFT 12 +#define CICNR_XCC1T_SHIFT 4 Same here + +/* CRICR */ +#define CRICR_RTA1T_SHIFT 20 +#define CRICR_RTB1T_SHIFT 28 Same + +/* Signal indicator */ +#define SIGNAL_MASK3 +#define SIGNAL_HIGH2 +#define SIGNAL_LOW 0 Only SIGNAL_HIGH seems to be used. Christophe + +struct qe_ic { + /* Control registers offset */ + u32 __iomem *regs; + + /* The remapper for this QEIC */ + struct irq_domain *irqhost; + + /* The "linux" controller struct */ + struct irq_chip hc_irq; + + /* VIRQ numbers of QE high/low irqs */ + unsigned int virq_high; + unsigned int virq_low; +}; + +/* + * QE interrupt controller internal structure + */ +struct qe_ic_info { + /* Location of this source at the QIMR register */ + u32 mask; + + /* Mask register offset */ + u32 mask_reg; + + /* +* For grouped interrupts sources - the interrupt code as +* appears at the group priority register +*/ + u8 pri_code; + + /* Group priority register offset */ + u32 pri_reg; +}; static DEFINE_RAW_SPINLOCK(qe_ic_lock); diff --git a/drivers/soc/fsl/qe/qe_ic.h b/drivers/soc/fsl/qe/qe_ic.h deleted file mode 100644 index 29b4d768e4a8.. --- a/drivers/soc/fsl/qe/qe_ic.h +++ /dev/null @@ -1,108 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-or-later */ -/* - * drivers/soc/fsl/qe/qe_ic.h - * - * QUICC ENGINE Interrupt Controller Header - * - * Copyright (C) 2006 Freescale Semiconductor, Inc. All rights reserved. 
- * - * Author: Li Yang - * Based on code from Shlomi Gridish - */ -#ifndef _POWERPC_SYSDEV_QE_IC_H -#define _POWERPC_SYSDEV_QE_IC_H - -#include - -#define NR_QE_IC_INTS 64 - -/* QE IC registers offset */ -#define QEIC_CICR 0x00 -#define QEIC_CIVEC 0x04 -#define QEIC_CRIPNR0x08 -#define QEIC_CIPNR 0x0c -#define QEIC_CIPXCC0x10 -#define QEIC_CIPYCC0x14 -#define QEIC_CIPWCC0x18 -#define QEIC_CIPZCC0x1c -#define QEIC_CIMR 0x20 -#define QEIC_CRIMR 0x24 -#define QEIC_CICNR 0x28 -#define QEIC_CIPRTA0x30 -#define QEIC_CIPRTB0x34 -#define QEIC_CRICR 0x3c -#define QEIC_CHIVEC0x60 - -/* Interrupt priority registers */ -#define CIPCC_SHIFT_PRI0 29 -#define CIPCC_SHIFT_PRI1 26 -#define CIPCC_SHIFT_PRI2 23 -#define
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed, Oct 30, 2019 at 11:22:29AM +0100, Michal Hocko wrote: > On Wed 30-10-19 11:14:49, Peter Zijlstra wrote: > > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > > > When passing the return value of dev_to_node() to cpumask_of_node() > > > without checking if the device's node id is NUMA_NO_NODE, there is > > > global-out-of-bounds detected by KASAN. > > > > > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > > > which also means all cpus should be usable. So the cpumask_of_node() > > > should always return all cpus online when user passes the node id as > > > NUMA_NO_NODE, just like similar semantic that page allocator handles > > > NUMA_NO_NODE. > > > > > > But we cannot really copy the page allocator logic. Simply because the > > > page allocator doesn't enforce the near node affinity. It just picks it > > > up as a preferred node but then it is free to fallback to any other numa > > > node. This is not the case here and node_to_cpumask_map will only restrict > > > to the particular node's cpus which would have really non deterministic > > > behavior depending on where the code is executed. So in fact we really > > > want to return cpu_online_mask for NUMA_NO_NODE. > > > > > > Also there is a debugging version of node_to_cpumask_map() for x86 and > > > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this > > > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > > > > > [1] https://lkml.org/lkml/2019/9/11/66 > > > Signed-off-by: Yunsheng Lin > > > Suggested-by: Michal Hocko > > > Acked-by: Michal Hocko > > > Acked-by: Paul Burton # MIPS bits > > > > Still: > > > > Nacked-by: Peter Zijlstra (Intel) > > Do you have any other proposal that doesn't make any wild guesses about > which node to use instead of the undefined one? It only makes 'wild' guesses when the BIOS is shit and it complains about that. Or do you like you BIOS broken?
Re: [PATCH v3] drm/radeon: Fix EEH during kexec
Hi Kyle, KyleMahlkuch writes: > From: Kyle Mahlkuch > > During kexec some adapters hit an EEH since they are not properly > shut down in the radeon_pci_shutdown() function. Adding > radeon_suspend_kms() fixes this issue. > Enabled only on PPC because this patch causes issues on some other > boards. Which adapters hit the issues? And do we know why they're not shut down correctly in radeon_pci_shutdown()? That seems like the root cause no? > diff --git a/drivers/gpu/drm/radeon/radeon_drv.c > b/drivers/gpu/drm/radeon/radeon_drv.c > index 9e55076..4528f4d 100644 > --- a/drivers/gpu/drm/radeon/radeon_drv.c > +++ b/drivers/gpu/drm/radeon/radeon_drv.c > @@ -379,11 +379,25 @@ static int radeon_pci_probe(struct pci_dev *pdev, > static void > radeon_pci_shutdown(struct pci_dev *pdev) > { > +#ifdef CONFIG_PPC64 > + struct drm_device *ddev = pci_get_drvdata(pdev); > +#endif This local serves no real purpose and could be avoided, which would also avoid this ifdef. > /* if we are running in a VM, make sure the device >* torn down properly on reboot/shutdown >*/ > if (radeon_device_is_virtual()) > radeon_pci_remove(pdev); > + > +#ifdef CONFIG_PPC64 > + /* Some adapters need to be suspended before a AFAIK drm uses normal kernel comment style, so this should be: /* * Some adapters need to be suspended before a > + * shutdown occurs in order to prevent an error > + * during kexec. > + * Make this power specific becauase it breaks > + * some non-power boards. > + */ > + radeon_suspend_kms(ddev, true, true, false); ie, instead do: radeon_suspend_kms(pci_get_drvdata(pdev), true, true, false); > +#endif > } > > static int radeon_pmops_suspend(struct device *dev) > -- > 1.8.3.1 cheers
Re: [PATCH v2 09/23] soc: fsl: qe: move qe_ic_cascade_* functions to qe_ic.c
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : These functions are only ever called through a function pointer, and therefore it makes no sense for them to be "static inline" - gcc has no choice but to emit a copy in each translation unit that takes the address of one of these (currently various platform code under arch/powerpc/). So move them into qe_ic.c and leave ordinary extern declarations in the header file. What is the point in moving fonctions that you will drop in the next patch (qe_ic_cascade_low_ipic() and qe_ic_cascade_high_ipic()) Only move the ones that will remain. Christophe Signed-off-by: Rasmus Villemoes --- drivers/soc/fsl/qe/qe_ic.c | 58 +++ include/soc/fsl/qe/qe_ic.h | 62 +++--- 2 files changed, 63 insertions(+), 57 deletions(-) diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index 7b1870d2866a..a847b2672e90 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -402,6 +402,64 @@ unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic) return irq_linear_revmap(qe_ic->irqhost, irq); } +void qe_ic_cascade_low_ipic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); +} + +void qe_ic_cascade_high_ipic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); +} + +void qe_ic_cascade_low_mpic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); + struct irq_chip *chip = irq_desc_get_chip(desc); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); + + chip->irq_eoi(&desc->irq_data); +} + +void qe_ic_cascade_high_mpic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); + struct irq_chip *chip = irq_desc_get_chip(desc); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); + + chip->irq_eoi(&desc->irq_data); +} + +void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) +{ + struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); + unsigned int cascade_irq; + struct irq_chip *chip = irq_desc_get_chip(desc); + + cascade_irq = qe_ic_get_high_irq(qe_ic); + if (cascade_irq == NO_IRQ) + cascade_irq = qe_ic_get_low_irq(qe_ic); + + if (cascade_irq != NO_IRQ) + generic_handle_irq(cascade_irq); + + chip->irq_eoi(&desc->irq_data); +} + void __init qe_ic_init(struct device_node *node, unsigned int flags, void (*low_handler)(struct irq_desc *desc), void (*high_handler)(struct irq_desc *desc)) diff --git a/include/soc/fsl/qe/qe_ic.h b/include/soc/fsl/qe/qe_ic.h index 714a9b890d8d..f3492eb13052 100644 --- a/include/soc/fsl/qe/qe_ic.h +++ b/include/soc/fsl/qe/qe_ic.h @@ -74,62 +74,10 @@ void qe_ic_set_highest_priority(unsigned int virq, int high); int qe_ic_set_priority(unsigned int virq, unsigned int priority); int qe_ic_set_high_priority(unsigned int virq, unsigned int priority, int high); -static inline void qe_ic_cascade_low_ipic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); -} - -static inline void qe_ic_cascade_high_ipic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - 
unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); -} - -static inline void qe_ic_cascade_low_mpic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); - struct irq_chip *chip = irq_desc_get_chip(desc); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); - - chip->irq_eoi(&desc->irq_data); -} - -static inline void qe_ic_cascade_high_mpic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); - unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); - struct irq_chip *chip = irq_desc_get_chip(desc); - - if (cascade_irq != NO_IRQ) - generic_handle_irq(cascade_irq); - - chip->irq_eoi(&desc->irq_data); -} - -static inline void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) -{ - struct qe_ic *qe_ic = irq_desc_ge
Re: [PATCH v2 11/23] soc: fsl: qe: rename qe_ic_cascade_low_mpic -> qe_ic_cascade_low
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : The qe_ic_cascade_{low,high}_mpic functions are now used as handlers both when the interrupt parent is mpic as well as ipic, so remove the _mpic suffix. Here you are modifying code that you drop in patch 14. That's pointless. You should perform name change after patch 14. Christophe Signed-off-by: Rasmus Villemoes --- arch/powerpc/platforms/83xx/misc.c| 2 +- arch/powerpc/platforms/85xx/corenet_generic.c | 4 ++-- arch/powerpc/platforms/85xx/mpc85xx_mds.c | 4 ++-- arch/powerpc/platforms/85xx/mpc85xx_rdb.c | 4 ++-- arch/powerpc/platforms/85xx/twr_p102x.c | 4 ++-- drivers/soc/fsl/qe/qe_ic.c| 4 ++-- include/soc/fsl/qe/qe_ic.h| 4 ++-- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/83xx/misc.c b/arch/powerpc/platforms/83xx/misc.c index 779791c0570f..835d082218ae 100644 --- a/arch/powerpc/platforms/83xx/misc.c +++ b/arch/powerpc/platforms/83xx/misc.c @@ -100,7 +100,7 @@ void __init mpc83xx_qe_init_IRQ(void) if (!np) return; } - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, qe_ic_cascade_high); of_node_put(np); } diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c b/arch/powerpc/platforms/85xx/corenet_generic.c index 7ee2c6628f64..2ed9e84ca03a 100644 --- a/arch/powerpc/platforms/85xx/corenet_generic.c +++ b/arch/powerpc/platforms/85xx/corenet_generic.c @@ -50,8 +50,8 @@ void __init corenet_gen_pic_init(void) np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic"); if (np) { - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); of_node_put(np); } } diff --git a/arch/powerpc/platforms/85xx/mpc85xx_mds.c b/arch/powerpc/platforms/85xx/mpc85xx_mds.c index 5ca254256c47..24211a1787b2 100644 --- a/arch/powerpc/platforms/85xx/mpc85xx_mds.c +++ b/arch/powerpc/platforms/85xx/mpc85xx_mds.c @@ -288,8 +288,8 @@ static void __init mpc85xx_mds_qeic_init(void) } if (machine_is(p1021_mds)) - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); else qe_ic_init(np, 0, qe_ic_cascade_muxed_mpic, NULL); of_node_put(np); diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c index d3c540ee558f..093867879081 100644 --- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c +++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c @@ -66,8 +66,8 @@ void __init mpc85xx_rdb_pic_init(void) #ifdef CONFIG_QUICC_ENGINE np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic"); if (np) { - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); of_node_put(np); } else diff --git a/arch/powerpc/platforms/85xx/twr_p102x.c b/arch/powerpc/platforms/85xx/twr_p102x.c index 720b0c0f03ba..2e0fb23854c0 100644 --- a/arch/powerpc/platforms/85xx/twr_p102x.c +++ b/arch/powerpc/platforms/85xx/twr_p102x.c @@ -45,8 +45,8 @@ static void __init twr_p1025_pic_init(void) #ifdef CONFIG_QUICC_ENGINE np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic"); if (np) { - qe_ic_init(np, 0, qe_ic_cascade_low_mpic, - qe_ic_cascade_high_mpic); + qe_ic_init(np, 0, qe_ic_cascade_low, + qe_ic_cascade_high); of_node_put(np); } else pr_err("Could not find qe-ic node\n"); diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index 0ff802816c0c..f3659c312e13 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -402,7 +402,7 @@ 
unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic) return irq_linear_revmap(qe_ic->irqhost, irq); } -void qe_ic_cascade_low_mpic(struct irq_desc *desc) +void qe_ic_cascade_low(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); @@ -415,7 +415,7 @@ void qe_ic_cascade_low_mpic(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } -void qe_ic_cascade_high_mpic(struct irq_desc *desc) +void qe_ic_cascade_high(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); diff --git a/include/soc/fsl/qe/q
Re: [PATCH v2 17/23] soc: fsl: qe: make qe_ic_cascade_* static
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : Now that the references from arch/powerpc/ are gone, these are only referenced from inside qe_ic.c, so make them static. Why do that in two steps ? I think patch 9 could remain until here, and then you could squash patch 9 and patch 17 together here. Christophe Signed-off-by: Rasmus Villemoes --- drivers/soc/fsl/qe/qe_ic.c | 6 +++--- include/soc/fsl/qe/qe_ic.h | 4 2 files changed, 3 insertions(+), 7 deletions(-) diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c index 545eb67094d1..e20f1205c0df 100644 --- a/drivers/soc/fsl/qe/qe_ic.c +++ b/drivers/soc/fsl/qe/qe_ic.c @@ -402,7 +402,7 @@ unsigned int qe_ic_get_high_irq(struct qe_ic *qe_ic) return irq_linear_revmap(qe_ic->irqhost, irq); } -void qe_ic_cascade_low(struct irq_desc *desc) +static void qe_ic_cascade_low(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_low_irq(qe_ic); @@ -415,7 +415,7 @@ void qe_ic_cascade_low(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } -void qe_ic_cascade_high(struct irq_desc *desc) +static void qe_ic_cascade_high(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq = qe_ic_get_high_irq(qe_ic); @@ -428,7 +428,7 @@ void qe_ic_cascade_high(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } -void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) +static void qe_ic_cascade_muxed_mpic(struct irq_desc *desc) { struct qe_ic *qe_ic = irq_desc_get_handler_data(desc); unsigned int cascade_irq; diff --git a/include/soc/fsl/qe/qe_ic.h b/include/soc/fsl/qe/qe_ic.h index 8ec21a3bd859..43e4ce95c6a0 100644 --- a/include/soc/fsl/qe/qe_ic.h +++ b/include/soc/fsl/qe/qe_ic.h @@ -67,8 +67,4 @@ void qe_ic_set_highest_priority(unsigned int virq, int high); int qe_ic_set_priority(unsigned int virq, unsigned int priority); int qe_ic_set_high_priority(unsigned int virq, unsigned int priority, int high); -void qe_ic_cascade_low(struct irq_desc *desc); -void qe_ic_cascade_high(struct irq_desc *desc); -void qe_ic_cascade_muxed_mpic(struct irq_desc *desc); - #endif /* _ASM_POWERPC_QE_IC_H */
Re: [PATCH v2 19/23] net: ethernet: freescale: make UCC_GETH explicitly depend on PPC32
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : Currently, QUICC_ENGINE depends on PPC32, so this in itself does not change anything. In order to allow removing the PPC32 dependency from QUICC_ENGINE and avoid allmodconfig build failures, add this explicit dependency. Signed-off-by: Rasmus Villemoes --- drivers/net/ethernet/freescale/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/freescale/Kconfig b/drivers/net/ethernet/freescale/Kconfig index 6a7e8993119f..97d27c7740d4 100644 --- a/drivers/net/ethernet/freescale/Kconfig +++ b/drivers/net/ethernet/freescale/Kconfig @@ -75,6 +75,7 @@ config FSL_XGMAC_MDIO config UCC_GETH tristate "Freescale QE Gigabit Ethernet" depends on QUICC_ENGINE + depends on PPC32 I think it would be more obvious to have: depends on QUICC_ENGINE && PPC32 Christophe select FSL_PQ_MDIO select PHYLIB ---help---
Re: [PATCH v2 20/23] serial: make SERIAL_QE depend on PPC32
Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : Currently SERIAL_QE depends on QUICC_ENGINE, which in turn depends on PPC32, so this doesn't add any extra dependency. However, the QUICC Engine IP block also exists on some arm boards, so this serves as preparation for removing the PPC32 dependency from QUICC_ENGINE and build the QE support in drivers/soc/fsl/qe, while preventing allmodconfig/randconfig failures due to SERIAL_QE not being supported yet. Signed-off-by: Rasmus Villemoes --- drivers/tty/serial/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/tty/serial/Kconfig b/drivers/tty/serial/Kconfig index 67a9eb3f94ce..78246f535809 100644 --- a/drivers/tty/serial/Kconfig +++ b/drivers/tty/serial/Kconfig @@ -1056,6 +1056,7 @@ config SERIAL_LANTIQ config SERIAL_QE tristate "Freescale QUICC Engine serial port support" depends on QUICC_ENGINE + depends on PPC32 Same, would be more obvious as depends on QUICC_ENGINE && PPC32 Christophe select SERIAL_CORE select FW_LOADER help
[PATCH] powerpc: Add build-time check of ptrace PT_xx defines
As part of the uapi we export a lot of PT_xx defines for each register in struct pt_regs. These are expressed as an index from gpr[0], in units of unsigned long. Currently there's nothing tying the values of those defines to the actual layout of the struct. But we *don't* want to change the uapi defines to derive the PT_xx values based on the layout of the struct, those values are ABI and must never change. Instead we want to do the reverse, make sure that the layout of the struct never changes vs the PT_xx defines. So add build time checks of that. This probably seems paranoid, but at least once in the past someone has sent a patch that would have broken the ABI if it hadn't been spotted. Although it probably would have been detected via testing, it's preferable to just quash any issues at the source. Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/ptrace.c | 63 1 file changed, 63 insertions(+) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92febf5f44..abd888bada03 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -3398,4 +3398,67 @@ void __init pt_regs_check(void) offsetof(struct user_pt_regs, result)); BUILD_BUG_ON(sizeof(struct user_pt_regs) > sizeof(struct pt_regs)); + + // Now check that the pt_regs offsets match the uapi #defines + #define CHECK_REG(_pt, _reg) \ + BUILD_BUG_ON(_pt != (offsetof(struct user_pt_regs, _reg) / \ +sizeof(unsigned long))); + + CHECK_REG(PT_R0, gpr[0]); + CHECK_REG(PT_R1, gpr[1]); + CHECK_REG(PT_R2, gpr[2]); + CHECK_REG(PT_R3, gpr[3]); + CHECK_REG(PT_R4, gpr[4]); + CHECK_REG(PT_R5, gpr[5]); + CHECK_REG(PT_R6, gpr[6]); + CHECK_REG(PT_R7, gpr[7]); + CHECK_REG(PT_R8, gpr[8]); + CHECK_REG(PT_R9, gpr[9]); + CHECK_REG(PT_R10, gpr[10]); + CHECK_REG(PT_R11, gpr[11]); + CHECK_REG(PT_R12, gpr[12]); + CHECK_REG(PT_R13, gpr[13]); + CHECK_REG(PT_R14, gpr[14]); + CHECK_REG(PT_R15, gpr[15]); + CHECK_REG(PT_R16, gpr[16]); + CHECK_REG(PT_R17, gpr[17]); + CHECK_REG(PT_R18, gpr[18]); + CHECK_REG(PT_R19, gpr[19]); + CHECK_REG(PT_R20, gpr[20]); + CHECK_REG(PT_R21, gpr[21]); + CHECK_REG(PT_R22, gpr[22]); + CHECK_REG(PT_R23, gpr[23]); + CHECK_REG(PT_R24, gpr[24]); + CHECK_REG(PT_R25, gpr[25]); + CHECK_REG(PT_R26, gpr[26]); + CHECK_REG(PT_R27, gpr[27]); + CHECK_REG(PT_R28, gpr[28]); + CHECK_REG(PT_R29, gpr[29]); + CHECK_REG(PT_R30, gpr[30]); + CHECK_REG(PT_R31, gpr[31]); + CHECK_REG(PT_NIP, nip); + CHECK_REG(PT_MSR, msr); + CHECK_REG(PT_ORIG_R3, orig_gpr3); + CHECK_REG(PT_CTR, ctr); + CHECK_REG(PT_LNK, link); + CHECK_REG(PT_XER, xer); + CHECK_REG(PT_CCR, ccr); +#ifdef CONFIG_PPC64 + CHECK_REG(PT_SOFTE, softe); +#else + CHECK_REG(PT_MQ, mq); +#endif + CHECK_REG(PT_TRAP, trap); + CHECK_REG(PT_DAR, dar); + CHECK_REG(PT_DSISR, dsisr); + CHECK_REG(PT_RESULT, result); + #undef CHECK_REG + + BUILD_BUG_ON(PT_REGS_COUNT != sizeof(struct user_pt_regs) / sizeof(unsigned long)); + + /* +* PT_DSCR isn't a real reg, but it's important that it doesn't overlap the +* real registers. +*/ + BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long)); } -- 2.21.0
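For background on why the PT_xx values are ABI, a hedged userspace-side sketch (not part of the patch): tracers address registers through PTRACE_PEEKUSER at byte offset PT_xx * sizeof(long), so the pt_regs layout must keep matching those indices forever.

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <asm/ptrace.h>		/* the uapi header exporting PT_NIP, PT_R1, ... */

/* Assumes 'pid' is an already-attached, stopped tracee. */
long peek_nip(pid_t pid)
{
	return ptrace(PTRACE_PEEKUSER, pid,
		      (void *)(PT_NIP * sizeof(unsigned long)), NULL);
}

int main(void)
{
	/* usage sketch only; a real tracer would fork and attach first */
	printf("PT_NIP byte offset: %zu\n", PT_NIP * sizeof(unsigned long));
	return 0;
}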
Re: [PATCH] powerpc: Add build-time check of ptrace PT_xx defines
Le 30/10/2019 à 12:12, Michael Ellerman a écrit : As part of the uapi we export a lot of PT_xx defines for each register in struct pt_regs. These are expressed as an index from gpr[0], in units of unsigned long. Currently there's nothing tying the values of those defines to the actual layout of the struct. But we *don't* want to change the uapi defines to derive the PT_xx values based on the layout of the struct, those values are ABI and must never change. Instead we want to do the reverse, make sure that the layout of the struct never changes vs the PT_xx defines. So add build time checks of that. This probably seems paranoid, but at least once in the past someone has sent a patch that would have broken the ABI if it hadn't been spotted. Although it probably would have been detected via testing, it's preferable to just quash any issues at the source. While you are playing with pt_regs_check(), could you also take the patch from Mathieu https://patchwork.ozlabs.org/patch/1009816/ ? Christophe Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/ptrace.c | 63 1 file changed, 63 insertions(+) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92febf5f44..abd888bada03 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -3398,4 +3398,67 @@ void __init pt_regs_check(void) offsetof(struct user_pt_regs, result)); BUILD_BUG_ON(sizeof(struct user_pt_regs) > sizeof(struct pt_regs)); + + // Now check that the pt_regs offsets match the uapi #defines + #define CHECK_REG(_pt, _reg) \ + BUILD_BUG_ON(_pt != (offsetof(struct user_pt_regs, _reg) / \ +sizeof(unsigned long))); + + CHECK_REG(PT_R0, gpr[0]); + CHECK_REG(PT_R1, gpr[1]); + CHECK_REG(PT_R2, gpr[2]); + CHECK_REG(PT_R3, gpr[3]); + CHECK_REG(PT_R4, gpr[4]); + CHECK_REG(PT_R5, gpr[5]); + CHECK_REG(PT_R6, gpr[6]); + CHECK_REG(PT_R7, gpr[7]); + CHECK_REG(PT_R8, gpr[8]); + CHECK_REG(PT_R9, gpr[9]); + CHECK_REG(PT_R10, gpr[10]); + CHECK_REG(PT_R11, gpr[11]); + CHECK_REG(PT_R12, gpr[12]); + CHECK_REG(PT_R13, gpr[13]); + CHECK_REG(PT_R14, gpr[14]); + CHECK_REG(PT_R15, gpr[15]); + CHECK_REG(PT_R16, gpr[16]); + CHECK_REG(PT_R17, gpr[17]); + CHECK_REG(PT_R18, gpr[18]); + CHECK_REG(PT_R19, gpr[19]); + CHECK_REG(PT_R20, gpr[20]); + CHECK_REG(PT_R21, gpr[21]); + CHECK_REG(PT_R22, gpr[22]); + CHECK_REG(PT_R23, gpr[23]); + CHECK_REG(PT_R24, gpr[24]); + CHECK_REG(PT_R25, gpr[25]); + CHECK_REG(PT_R26, gpr[26]); + CHECK_REG(PT_R27, gpr[27]); + CHECK_REG(PT_R28, gpr[28]); + CHECK_REG(PT_R29, gpr[29]); + CHECK_REG(PT_R30, gpr[30]); + CHECK_REG(PT_R31, gpr[31]); + CHECK_REG(PT_NIP, nip); + CHECK_REG(PT_MSR, msr); + CHECK_REG(PT_ORIG_R3, orig_gpr3); + CHECK_REG(PT_CTR, ctr); + CHECK_REG(PT_LNK, link); + CHECK_REG(PT_XER, xer); + CHECK_REG(PT_CCR, ccr); +#ifdef CONFIG_PPC64 + CHECK_REG(PT_SOFTE, softe); +#else + CHECK_REG(PT_MQ, mq); +#endif + CHECK_REG(PT_TRAP, trap); + CHECK_REG(PT_DAR, dar); + CHECK_REG(PT_DSISR, dsisr); + CHECK_REG(PT_RESULT, result); + #undef CHECK_REG + + BUILD_BUG_ON(PT_REGS_COUNT != sizeof(struct user_pt_regs) / sizeof(unsigned long)); + + /* +* PT_DSCR isn't a real reg, but it's important that it doesn't overlap the +* real registers. +*/ + BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long)); }
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On Wed 30-10-19 11:28:00, Peter Zijlstra wrote: > On Wed, Oct 30, 2019 at 11:22:29AM +0100, Michal Hocko wrote: > > On Wed 30-10-19 11:14:49, Peter Zijlstra wrote: > > > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: > > > > When passing the return value of dev_to_node() to cpumask_of_node() > > > > without checking if the device's node id is NUMA_NO_NODE, there is > > > > global-out-of-bounds detected by KASAN. > > > > > > > > From the discussion [1], NUMA_NO_NODE really means no node affinity, > > > > which also means all cpus should be usable. So the cpumask_of_node() > > > > should always return all cpus online when user passes the node id as > > > > NUMA_NO_NODE, just like similar semantic that page allocator handles > > > > NUMA_NO_NODE. > > > > > > > > But we cannot really copy the page allocator logic. Simply because the > > > > page allocator doesn't enforce the near node affinity. It just picks it > > > > up as a preferred node but then it is free to fallback to any other numa > > > > node. This is not the case here and node_to_cpumask_map will only > > > > restrict > > > > to the particular node's cpus which would have really non deterministic > > > > behavior depending on where the code is executed. So in fact we really > > > > want to return cpu_online_mask for NUMA_NO_NODE. > > > > > > > > Also there is a debugging version of node_to_cpumask_map() for x86 and > > > > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, > > > > this > > > > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). > > > > > > > > [1] https://lkml.org/lkml/2019/9/11/66 > > > > Signed-off-by: Yunsheng Lin > > > > Suggested-by: Michal Hocko > > > > Acked-by: Michal Hocko > > > > Acked-by: Paul Burton # MIPS bits > > > > > > Still: > > > > > > Nacked-by: Peter Zijlstra (Intel) > > > > Do you have any other proposal that doesn't make any wild guesses about > > which node to use instead of the undefined one? > > It only makes 'wild' guesses when the BIOS is shit and it complains > about that. I really do not see how this is any better than simply using the online cpu mask in the same "broken" situation. We are effectively talking about a suboptimal path for suboptimal setups. I haven't heard any actual technical argument why cpu_online_mask is any worse than adding some sort of failover guessing which node to use as a replacement. I completely get your point about complaining loudly about broken BIOS/fw. It seems we just disagree about where we should work around those issues, because as of now we simply generate semi-random behavior due to an uninitialized memory access. > Or do you like you BIOS broken? I do not see anything like that in my response nor in my previous communication. Moreover a patch to warn about this should be on the way to get merged AFAIK. -- Michal Hocko SUSE Labs
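For readers following the thread: the behaviour Michal is arguing for amounts to a single special case in the lookup helper. A minimal sketch, assuming the architecture keeps a node_to_cpumask_map[] array the way x86 and arm64 do; this illustrates the semantics only and is not the exact hunk from Yunsheng's patch:

static inline const struct cpumask *cpumask_of_node(int node)
{
        if (node == NUMA_NO_NODE)
                return cpu_online_mask;         /* no affinity: any online CPU will do */

        return node_to_cpumask_map[node];
}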
Re: [PATCH 5/6] powerpc: Mark archrandom.h functions __must_check
Richard Henderson writes: > We cannot use the pointer output without validating the > success of the random read. You _can_, but you must not. > Signed-off-by: Richard Henderson > --- > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Michael Ellerman > --- > arch/powerpc/include/asm/archrandom.h | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) Acked-by: Michael Ellerman cheers > diff --git a/arch/powerpc/include/asm/archrandom.h > b/arch/powerpc/include/asm/archrandom.h > index f8a887c8b7f8..ee214b153a71 100644 > --- a/arch/powerpc/include/asm/archrandom.h > +++ b/arch/powerpc/include/asm/archrandom.h > @@ -6,17 +6,17 @@ > > #include > > -static inline bool arch_get_random_long(unsigned long *v) > +static inline bool __must_check arch_get_random_long(unsigned long *v) > { > return false; > } > > -static inline bool arch_get_random_int(unsigned int *v) > +static inline bool __must_check arch_get_random_int(unsigned int *v) > { > return false; > } > > -static inline bool arch_get_random_seed_long(unsigned long *v) > +static inline bool __must_check arch_get_random_seed_long(unsigned long *v) > { > if (ppc_md.get_random_seed) > return ppc_md.get_random_seed(v); > @@ -24,7 +24,7 @@ static inline bool arch_get_random_seed_long(unsigned long > *v) > return false; > } > > -static inline bool arch_get_random_seed_int(unsigned int *v) > +static inline bool __must_check arch_get_random_seed_int(unsigned int *v) > { > unsigned long val; > bool rc; > -- > 2.17.1
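The practical effect of __must_check is that callers can no longer silently ignore the result, so consuming code ends up shaped like the sketch below; the fallback to get_random_long() is only an illustration and is not taken from this series:

        unsigned long seed;

        /* With __must_check, dropping the return value triggers a compiler warning. */
        if (!arch_get_random_seed_long(&seed))
                seed = get_random_long();       /* no hardware entropy; the output buffer must not be used */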
[PATCH V2 1/2] ASoC: dt-bindings: fsl_asrc: add compatible string for imx8qm
In order to support the two asrc modules in imx8qm, we need to add the compatible strings "fsl,imx8qm-asrc0" and "fsl,imx8qm-asrc1". Signed-off-by: Shengjiu Wang --- Documentation/devicetree/bindings/sound/fsl,asrc.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/sound/fsl,asrc.txt b/Documentation/devicetree/bindings/sound/fsl,asrc.txt index 1d4d9f938689..cd2bd3daa7e1 100644 --- a/Documentation/devicetree/bindings/sound/fsl,asrc.txt +++ b/Documentation/devicetree/bindings/sound/fsl,asrc.txt @@ -8,7 +8,8 @@ three substreams within totally 10 channels. Required properties: - - compatible : Contains "fsl,imx35-asrc" or "fsl,imx53-asrc". + - compatible : Contains "fsl,imx35-asrc", "fsl,imx53-asrc", + "fsl,imx8qm-asrc0" or "fsl,imx8qm-asrc1". - reg: Offset and length of the register set for the device. -- 2.21.0
[PATCH V2 2/2] ASoC: fsl_asrc: Add support for imx8qm
There are two asrc module in imx8qm, each module has different clock configuration, and the DMA type is EDMA. So in this patch, we define the new clocks, refine the clock map, and include struct fsl_asrc_soc_data for different soc usage. The EDMA channel is fixed with each dma request, one dma request corresponding to one dma channel. So we need to request dma channel with dma request of asrc module. Signed-off-by: Shengjiu Wang Acked-by: Nicolin Chen --- Changes in v2 - use !use_edma to wrap code in fsl_asrc_dma - add Acked-by: Nicolin Chen sound/soc/fsl/fsl_asrc.c | 91 +--- sound/soc/fsl/fsl_asrc.h | 65 +- sound/soc/fsl/fsl_asrc_dma.c | 42 +++-- 3 files changed, 166 insertions(+), 32 deletions(-) diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c index a3cfceea7d2f..9dc5b5a93fb0 100644 --- a/sound/soc/fsl/fsl_asrc.c +++ b/sound/soc/fsl/fsl_asrc.c @@ -43,24 +43,55 @@ static struct snd_pcm_hw_constraint_list fsl_asrc_rate_constraints = { */ static unsigned char input_clk_map_imx35[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, }; static unsigned char output_clk_map_imx35[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, + 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, }; /* i.MX53 uses the same map for input and output */ static unsigned char input_clk_map_imx53[] = { /* 0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf */ 0x0, 0x1, 0x2, 0x7, 0x4, 0x5, 0x6, 0x3, 0x8, 0x9, 0xa, 0xb, 0xc, 0xf, 0xe, 0xd, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; static unsigned char output_clk_map_imx53[] = { /* 0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf */ 0x8, 0x9, 0xa, 0x7, 0xc, 0x5, 0x6, 0xb, 0x0, 0x1, 0x2, 0x3, 0x4, 0xf, 0xe, 0xd, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, + 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; -static unsigned char *clk_map[2]; +/* i.MX8QM uses the same map for input and output */ +static unsigned char input_clk_map_imx8_0[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; + +static unsigned char output_clk_map_imx8_0[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, + 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; + +static unsigned char input_clk_map_imx8_1[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0xb, 0xc, 0xf, 0xf, 0xd, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, + 0x4, 0x5, 0x6, 0xf, 0x8, 0x9, 0xa, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; + +static unsigned char output_clk_map_imx8_1[] = { + 0xf, 0xf, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x0, + 0x0, 0x1, 0x2, 0x3, 0xb, 0xc, 0xf, 0xf, 0xd, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, + 0x4, 0x5, 0x6, 0xf, 0x8, 0x9, 0xa, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, +}; /** * Select the pre-processing and post-processing options @@ -353,8 +384,8 @@ static int 
fsl_asrc_config_pair(struct fsl_asrc_pair *pair, bool use_ideal_rate) } /* Validate input and output clock sources */ - clk_index[IN] = clk_map[IN][config->inclk]; - clk_index[OUT] = clk_map[OUT][config->outclk]; + clk_index[IN] = asrc_priv->soc->inclk_map[config->inclk]; + clk_index[OUT] = asrc_priv->soc->outclk_map[config->outclk]; /* We only have output clock for ideal ratio mode */ clk = asrc_priv->asrck_clk[clk_index[ideal ? OUT : IN]]; @@ -398,13 +429,13 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair, bool use_ideal_rate) /* Set the channel number */ channels = config->channel_num; - if (asrc_priv->channel_bits < 4) + if (asrc_priv->soc->channel_bits < 4) channels /= 2; /* Update channels for current pair */ regmap_update_bits(asrc_priv->regmap, REG_ASRCNCR, - ASRCNCR_ANCi_MASK(index, asrc_priv->channel_bits), - ASRCNCR_ANCi(index, channels, asrc_priv->channel_bits)); + ASRCNCR_ANCi_M
Re: [PATCH] powerpc/tools: Don't quote $objdump in scripts
Segher Boessenkool writes: > On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote: >> Some of our scripts are passed $objdump and then call it as >> "$objdump". This doesn't work if it contains spaces because we're >> using ccache, for example you get errors such as: >> >> ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No >> such file or directory >> ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache >> ppc64le-objdump: No such file or directory >> >> Fix it by not quoting the string when we expand it, allowing the shell >> to do the right thing for us. > > This breaks things for people with spaces in their paths. Spaces in their what? Who does that? :) Also we don't support it: $ pwd $ /home/michael/foo bar $ make clean Makefile:147: *** source directory cannot contain spaces or colons. Stop. cheers
Re: [PATCH] powerpc/configs: Rename foo_basic_defconfig to foo_base.config
On Tue, 2019-05-28 at 08:16:14 UTC, Michael Ellerman wrote: > We have several "defconfigs" that are not actually full defconfigs; > they are just a base set of options which are then merged with other > fragments to produce a working defconfig. > > The most obvious example is corenet_basic_defconfig which only > contains one symbol CONFIG_CORENET_GENERIC=y. But there is also > mpc85xx_base_defconfig which doesn't actually enable CONFIG_PPC_85xx. > > To avoid confusion, rename these config fragments to "foo_base.config" > to make it clearer that they are not full defconfigs. > > Reported-by: Christophe Leroy > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/58b12eb28e34d3dd8a2d6743c26bf941ca1fbf37 cheers
Re: [PATCH v4] powerpc/setup_64: fix -Wempty-body warnings
On Mon, 2019-07-15 at 18:32:32 UTC, Qian Cai wrote: > At the beginning of setup_64.c, it has, > > #ifdef DEBUG > #define DBG(fmt...) udbg_printf(fmt) > #else > #define DBG(fmt...) > #endif > > where DBG() could be compiled away, and generate warnings, > > arch/powerpc/kernel/setup_64.c: In function 'initialize_cache_info': > arch/powerpc/kernel/setup_64.c:579:49: warning: suggest braces around > empty body in an 'if' statement [-Wempty-body] > DBG("Argh, can't find dcache properties !\n"); > ^ > arch/powerpc/kernel/setup_64.c:582:49: warning: suggest braces around > empty body in an 'if' statement [-Wempty-body] > DBG("Argh, can't find icache properties !\n"); > > Fix it by using the suggestions from Michael: > > "Neither of those sites should use DBG(), that's not really early boot > code, they should just use pr_warn(). > > And the other uses of DBG() in initialize_cache_info() should just be > removed. > > In smp_release_cpus() the entry/exit DBG's should just be removed, and > the spinning_secondaries line should just be pr_debug(). > > That would just leave the two calls in early_setup(). If we taught > udbg_printf() to return early when udbg_putc is NULL, then we could just > call udbg_printf() unconditionally and get rid of the DBG macro > entirely." > > Suggested-by: Michael Ellerman > Signed-off-by: Qian Cai Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/3b9176e9a874a848afa7eb2f6943639eb18b7a17 cheers
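A minimal sketch of the last idea quoted above, i.e. having udbg_printf() bail out when no early console has been registered; the buffer handling here is an assumption for illustration, not the actual udbg.c implementation:

void udbg_printf(const char *fmt, ...)
{
        char buf[256];          /* assumed scratch size, not the real udbg buffer */
        va_list args;

        if (!udbg_putc)         /* no early console registered yet: quietly do nothing */
                return;

        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);

        udbg_puts(buf);
}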
Re: [PATCH] powerpc/configs: Add debug config fragment
On Thu, 2019-08-01 at 04:58:55 UTC, Andrew Donnellan wrote: > Add a debug config fragment that we can use to put useful debug options > into. > > Currently we only define a target for powernv[_be]_debug_defconfig, and the > only option included is to enable debugfs SCOM access. > > Suggested-by: Michael Ellerman > Signed-off-by: Andrew Donnellan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/c1bc6f93f95970f917caaac544a374862e84df52 cheers
Re: [PATCH v7 1/2] powerpc/xmon: Allow listing and clearing breakpoints in read-only mode
On Sat, 2019-09-07 at 06:11:23 UTC, "Christopher M. Riedl" wrote: > Read-only mode should not prevent listing and clearing any active > breakpoints. > > Tested-by: Daniel Axtens > Reviewed-by: Daniel Axtens > Signed-off-by: Christopher M. Riedl Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/96664dee5cf1815777286227b09884b4f019727f cheers
Re: [PATCH v2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
On Tue, 2019-09-17 at 12:38:51 UTC, "Aneesh Kumar K.V" wrote: > With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap") > pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we > can map the memmap area using 16MB hugepage size and that can cover > a memory range of 16G. > > While enabling new pmem namespaces, since memory is added in sub-section > chunks, > before creating a new memmap mapping, kernel should check whether there is an > existing memmap mapping covering the new pmem namespace. Currently, this is > validated by checking whether the section covering the range is already > initialized or not. Considering there can be multiple namespaces in the same > section this can result in wrong validation. Update this to check for > sub-sections in the range. This is done by checking for all pfns in the range > we > are mapping. > > We could optimize this by checking only just one pfn in each sub-section. But > since this is not fast-path we keep this simple. > > Signed-off-by: Aneesh Kumar K.V Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/5f5d6e40a01e70b731df843d8b5a61b4b28b19d9 cheers
Re: [PATCH] powerpc/pkeys: remove unused pkey_allows_readwrite
On Tue, 2019-09-17 at 15:22:30 UTC, Qian Cai wrote: > pkey_allows_readwrite() was first introduced in the commit 5586cf61e108 > ("powerpc: introduce execute-only pkey"), but the usage was removed > entirely in the commit a4fcc877d4e1 ("powerpc/pkeys: Preallocate > execute-only key"). > > Found by the "-Wunused-function" compiler warning flag. > > Fixes: a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key") > Signed-off-by: Qian Cai Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/29674a1c71be710f8418468aa6a8addd6d1aba1c cheers
Re: [PATCH] powerpc/configs: add FADump awareness to skiroot_defconfig
On Wed, 2019-10-09 at 14:04:29 UTC, Hari Bathini wrote: > FADump is supported on PowerNV platform. To fulfill this support, the > petitboot kernel must be FADump aware. Enable config PRESERVE_FA_DUMP > to make the petitboot kernel FADump aware. > > Signed-off-by: Hari Bathini Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/aaa351504449c4babb80753c72920e4b25fbd8a9 cheers
Re: [PATCH] powerpc: make syntax for FADump config options in kernel/Makefile readable
On Wed, 2019-10-09 at 15:27:20 UTC, Hari Bathini wrote: > arch/powerpc/kernel/fadump.c file needs to be compiled in if 'config > FA_DUMP' or 'config PRESERVE_FA_DUMP' is set. The current syntax > achieves that but looks a bit odd. Fix it for better readability. > > Signed-off-by: Hari Bathini Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/cd1d55f16d48d97d681d9534170ce712ac1d09e7 cheers
Re: [PATCH] selftests/powerpc: Reduce sigfuz runtime to ~60s
On Sun, 2019-10-13 at 23:46:43 UTC, Michael Ellerman wrote: > The defaults for the sigfuz test is to run for 4000 iterations, but > that can take quite a while and the test harness may kill the test. > Reduce the number of iterations to 600, which gives a runtime of > roughly 1 minute on a Power8 system. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/4f5c5b76cc00ccf5be89a2b9883feba3baf6eb2e cheers
Re: [PATCH] powerpc/pseries: Mark accumulate_stolen_time() as notrace
On Thu, 2019-10-24 at 05:59:32 UTC, Michael Ellerman wrote: > accumulate_stolen_time() is called prior to interrupt state being > reconciled, which can trip the warning in arch_local_irq_restore(): > > WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 > .arch_local_irq_restore+0x9c/0x130 > ... > NIP .arch_local_irq_restore+0x9c/0x130 > LR .rb_start_commit+0x38/0x80 > Call Trace: > .ring_buffer_lock_reserve+0xe4/0x620 > .trace_function+0x44/0x210 > .function_trace_call+0x148/0x170 > .ftrace_ops_no_ops+0x180/0x1d0 > ftrace_call+0x4/0x8 > .accumulate_stolen_time+0x1c/0xb0 > decrementer_common+0x124/0x160 > > For now just mark it as notrace. We may change the ordering to call it > after interrupt state has been reconciled, but that is a larger > change. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/eb8e20f89093b64f48975c74ccb114e6775cee22 cheers
Re: [PATCH v2 1/3] powerpc/pseries: Don't opencode HPTE_V_BOLTED
On Thu, 2019-10-24 at 09:35:40 UTC, "Aneesh Kumar K.V" wrote: > No functional change in this patch. > > Signed-off-by: Aneesh Kumar K.V Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/82ce028ad26dd075b06285ef61a854a564d910fb cheers
Re: [PATCH v3] powerpc/powernv: Add queue mechanism for early messages
On Mon, 2018-05-21 at 02:04:38 UTC, Deb McLemore wrote: > The problem being solved is that when issuing a BMC soft poweroff during IPL, > the poweroff was being lost, so the machine would not power off. > > Opal messages were being received before the opal-power code > registered its notifiers. > > Alternatives discussed (option #3 was chosen): > > 1 - Have opal_message_init() explicitly call opal_power_control_init() > before it dequeues any OPAL messages (i.e. before we register the > opal-msg IRQ handler). > > 2 - Introduce the concept of critical message types: when we register > handlers we track which message types have a registered handler, > then defer the opal-msg IRQ registration until we have a handler > registered for all the critical types. > > 3 - Buffer messages: if we receive a message and do not yet > have a handler for that type, store the message and replay it when > a handler for that type is registered. > > Signed-off-by: Deb McLemore Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a9336ddf448b1cba3080195cec2287af3907236c cheers
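Option #3 is the classic "queue until a consumer appears" pattern. A rough, hypothetical sketch of the idea follows; handler_registered() and call_handlers() are made-up helpers standing in for whatever dispatch the real opal-msg code uses:

struct pending_msg {
        struct list_head list;
        struct opal_msg msg;
};

static LIST_HEAD(msg_pending_list);

/* Interrupt path: deliver if someone is listening, otherwise remember the message. */
static void deliver_or_queue(struct opal_msg *msg, u32 type)
{
        struct pending_msg *p;

        if (handler_registered(type)) {         /* hypothetical lookup */
                call_handlers(type, msg);       /* hypothetical dispatch */
                return;
        }

        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        if (!p)
                return;                         /* message dropped, as before the patch */

        p->msg = *msg;
        list_add_tail(&p->list, &msg_pending_list);
}

/* Registration path: replay anything that arrived before this handler existed. */
static void replay_pending(u32 type)
{
        struct pending_msg *p, *tmp;

        list_for_each_entry_safe(p, tmp, &msg_pending_list, list) {
                if (be32_to_cpu(p->msg.msg_type) != type)
                        continue;
                call_handlers(type, &p->msg);   /* hypothetical dispatch */
                list_del(&p->list);
                kfree(p);
        }
}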
Re: [PATCH] powerpc/prom_init: Undo relocation before entering secure mode
On Wed, 2019-09-11 at 16:34:33 UTC, Thiago Jung Bauermann wrote: > The ultravisor will do an integrity check of the kernel image but we > relocated it so the check will fail. Restore the original image by > relocating it back to the kernel virtual base address. > > This works because during build vmlinux is linked with an expected virtual > runtime address of KERNELBASE. > > Fixes: 6a9c930bd775 ("powerpc/prom_init: Add the ESM call to prom_init") > Signed-off-by: Thiago Jung Bauermann Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/05d9a952832cb206a32e3705eff6edebdb2207e7 cheers
Re: [PATCH] powernv/eeh: Fix oops when probing cxl devices
On Wed, 2019-10-16 at 16:28:33 UTC, Frederic Barrat wrote: > Recent cleanup in the way EEH support is added to a device causes a > kernel oops when the cxl driver probes a device and creates virtual > devices discovered on the FPGA: > > BUG: Kernel NULL pointer dereference at 0x00a0 > Faulting instruction address: 0xc0048070 > Oops: Kernel access of bad area, sig: 7 [#1] > ... > NIP [c0048070] eeh_add_device_late.part.9+0x50/0x1e0 > LR [c004805c] eeh_add_device_late.part.9+0x3c/0x1e0 > Call Trace: > [c000200e43983900] [c079e250] _dev_info+0x5c/0x6c (unreliable) > [c000200e43983980] [c00d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0 > [c000200e439839f0] [c00606d0] pcibios_bus_add_device+0x40/0x60 > [c000200e43983a10] [c06aa3a0] pci_bus_add_device+0x30/0x100 > [c000200e43983a80] [c06aa4d4] pci_bus_add_devices+0x64/0xd0 > [c000200e43983ac0] [c0081c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl] > [c000200e43983b00] [c0081c4242ac] cxl_probe+0x504/0x5b0 [cxl] > [c000200e43983bb0] [c06bba1c] local_pci_probe+0x6c/0x110 > [c000200e43983c30] [c0159278] work_for_cpu_fn+0x38/0x60 > > The root cause is that those cxl virtual devices don't have a > representation in the device tree and therefore no associated pci_dn > structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and > we oops. > > We never had explicit support for EEH for those virtual > devices. Instead, EEH events are reported to the (real) pci device and > handled by the cxl driver. Which can then forward to the virtual > devices and handle dependencies. The fact that we try adding EEH > support for the virtual devices is new and a side-effect of the recent > cleanup. > > This patch fixes it by skipping adding EEH support on powernv for > devices which don't have a pci_dn structure. > > The cxl driver doesn't create virtual devices on pseries so this patch > doesn't fix it there intentionally. > > Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug") > Signed-off-by: Frederic Barrat Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/a8a30219ba78b1abb92091102b632f8e9bbdbf03 cheers
Re: [PATCH] powerpc/powernv: Fix CPU idle to be called with IRQs disabled
On Tue, 2019-10-22 at 11:58:14 UTC, Nicholas Piggin wrote: > Commit e78a7614f3876 ("idle: Prevent late-arriving interrupts from > disrupting offline") changes arch_cpu_idle_dead to be called with > interrupts disabled, which triggers the WARN in pnv_smp_cpu_kill_self. > > Fix this by fixing up irq_happened after hard disabling, rather than > requiring that there are no pending interrupts, similarly to what was done > until commit 2525db04d1cc5 ("powerpc/powernv: Simplify lazy IRQ > handling in CPU offline"). > > Fixes: e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting > offline") > Reported-by: Paul Mackerras > Signed-off-by: Nicholas Piggin Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/7d6475051fb3d9339c5c760ed9883bc0a9048b21 cheers
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
> On Oct 30, 2019, at 6:28 AM, Peter Zijlstra wrote: > > It only makes 'wild' guesses when the BIOS is shit and it complains > about that. > > Or do you like you BIOS broken? Agree. It is garbage in, garbage out. No need to complicate the existing code further.
Re: [PATCH v2 17/23] soc: fsl: qe: make qe_ic_cascade_* static
On 30/10/2019 11.50, Christophe Leroy wrote: > > > Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit : >> Now that the references from arch/powerpc/ are gone, these are only >> referenced from inside qe_ic.c, so make them static. > > Why do that in two steps ? > I think patch 9 could remain until here, and then you could squash patch > 9 and patch 17 together here. Agreed, I should rearrange stuff to avoid touching the same lines again and again. Thanks, Rasmus
Re: [RFC PATCH 00/27] current interrupt series plus scv syscall
Hello, On Wed, Oct 02, 2019 at 01:13:52PM +1000, Nicholas Piggin wrote: > Michal Suchánek's on September 24, 2019 7:33 pm: > > Hello, > > > > can you mark the individual patches with RFC rather than the whole > > series? > > Hey, thanks for the reviews. I'll resend all but the last two patches > soon for merge in the next window. Will you resend these or should I cherry-pick the part I need for the !COMPAT? Thanks Michal
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
Andrey Ryabinin writes: > On 10/29/19 7:20 AM, Daniel Axtens wrote: >> In the case where KASAN directly allocates memory to back vmalloc >> space, don't map the early shadow page over it. >> >> We prepopulate pgds/p4ds for the range that would otherwise be empty. >> This is required to get it synced to hardware on boot, allowing the >> lower levels of the page tables to be filled dynamically. >> >> Acked-by: Dmitry Vyukov >> Signed-off-by: Daniel Axtens >> >> --- > >> +static void __init kasan_shallow_populate_pgds(void *start, void *end) >> +{ >> +unsigned long addr, next; >> +pgd_t *pgd; >> +void *p; >> +int nid = early_pfn_to_nid((unsigned long)start); > > This doesn't make sense. start is not even a pfn. With linear mapping > we try to identify nid to have the shadow on the same node as memory. But > in this case we don't have memory or the corresponding shadow (yet), > we only install pgd/p4d. > I guess we could just use NUMA_NO_NODE. Ah wow, that's quite the clanger on my part. There are a couple of other invocations of early_pfn_to_nid in that file that use an address directly, but at least they reference actual memory. I'll send a separate patch to fix those up. > The rest looks ok, so with that fixed: > > Reviewed-by: Andrey Ryabinin Thanks heaps! I've fixed up the nit you identifed in the first patch, and I agree that the last patch probably isn't needed. I'll respin the series shortly. Regards, Daniel
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
On 10/30/19 4:50 PM, Daniel Axtens wrote: > Andrey Ryabinin writes: > >> On 10/29/19 7:20 AM, Daniel Axtens wrote: >>> In the case where KASAN directly allocates memory to back vmalloc >>> space, don't map the early shadow page over it. >>> >>> We prepopulate pgds/p4ds for the range that would otherwise be empty. >>> This is required to get it synced to hardware on boot, allowing the >>> lower levels of the page tables to be filled dynamically. >>> >>> Acked-by: Dmitry Vyukov >>> Signed-off-by: Daniel Axtens >>> >>> --- >> >>> +static void __init kasan_shallow_populate_pgds(void *start, void *end) >>> +{ >>> + unsigned long addr, next; >>> + pgd_t *pgd; >>> + void *p; >>> + int nid = early_pfn_to_nid((unsigned long)start); >> >> This doesn't make sense. start is not even a pfn. With linear mapping >> we try to identify nid to have the shadow on the same node as memory. But >> in this case we don't have memory or the corresponding shadow (yet), >> we only install pgd/p4d. >> I guess we could just use NUMA_NO_NODE. > > Ah wow, that's quite the clanger on my part. > > There are a couple of other invocations of early_pfn_to_nid in that file > that use an address directly, but at least they reference actual memory. > I'll send a separate patch to fix those up. I see only one incorrect, in kasan_init(): early_pfn_to_nid(__pa(_stext)) It should be wrapped with PFN_DOWN(). Other usages in map_range() seems to be correct, range->start,end is pfns. > >> The rest looks ok, so with that fixed: >> >> Reviewed-by: Andrey Ryabinin > > Thanks heaps! I've fixed up the nit you identifed in the first patch, > and I agree that the last patch probably isn't needed. I'll respin the > series shortly. > Hold on a sec, just spotted another thing to fix. > @@ -352,9 +397,24 @@ void __init kasan_init(void) > shadow_cpu_entry_end = (void *)round_up( > (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); > > + /* > + * If we're in full vmalloc mode, don't back vmalloc space with early > + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to > + * the global table and we can populate the lower levels on demand. > + */ > +#ifdef CONFIG_KASAN_VMALLOC > + kasan_shallow_populate_pgds( > + kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), This should be VMALLOC_START, there is no point to allocate pgds for the hole between linear mapping and vmalloc, just waste of memory. It make sense to map early shadow for that hole, because if code dereferences address in that hole we will see the page fault on that address instead of fault on the shadow. So something like this might work: kasan_populate_early_shadow( kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), kasan_mem_to_shadow((void *)VMALLOC_START)); if (IS_ENABLED(CONFIG_KASAN_VMALLOC) kasan_shallow_populate_pgds(kasan_mem_to_shadow(VMALLOC_START), kasan_mem_to_shadow((void *)VMALLOC_END)) else kasan_populate_early_shadow(kasan_mem_to_shadow(VMALLOC_START), kasan_mem_to_shadow((void *)VMALLOC_END)); kasan_populate_early_shadow( kasan_mem_to_shadow((void *)VMALLOC_END + 1), shadow_cpu_entry_begin);
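Restating Andrey's suggestion with the missing closing parenthesis and the call arguments spelled out; this is purely a tidied copy of the snippet above, untested:

        kasan_populate_early_shadow(
                kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
                kasan_mem_to_shadow((void *)VMALLOC_START));

        if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
                kasan_shallow_populate_pgds(
                        kasan_mem_to_shadow((void *)VMALLOC_START),
                        kasan_mem_to_shadow((void *)VMALLOC_END));
        else
                kasan_populate_early_shadow(
                        kasan_mem_to_shadow((void *)VMALLOC_START),
                        kasan_mem_to_shadow((void *)VMALLOC_END));

        kasan_populate_early_shadow(
                kasan_mem_to_shadow((void *)VMALLOC_END + 1),
                shadow_cpu_entry_begin);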
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
Andrey Ryabinin writes: > On 10/30/19 4:50 PM, Daniel Axtens wrote: >> Andrey Ryabinin writes: >> >>> On 10/29/19 7:20 AM, Daniel Axtens wrote: In the case where KASAN directly allocates memory to back vmalloc space, don't map the early shadow page over it. We prepopulate pgds/p4ds for the range that would otherwise be empty. This is required to get it synced to hardware on boot, allowing the lower levels of the page tables to be filled dynamically. Acked-by: Dmitry Vyukov Signed-off-by: Daniel Axtens --- >>> +static void __init kasan_shallow_populate_pgds(void *start, void *end) +{ + unsigned long addr, next; + pgd_t *pgd; + void *p; + int nid = early_pfn_to_nid((unsigned long)start); >>> >>> This doesn't make sense. start is not even a pfn. With linear mapping >>> we try to identify nid to have the shadow on the same node as memory. But >>> in this case we don't have memory or the corresponding shadow (yet), >>> we only install pgd/p4d. >>> I guess we could just use NUMA_NO_NODE. >> >> Ah wow, that's quite the clanger on my part. >> >> There are a couple of other invocations of early_pfn_to_nid in that file >> that use an address directly, but at least they reference actual memory. >> I'll send a separate patch to fix those up. > > I see only one incorrect, in kasan_init(): early_pfn_to_nid(__pa(_stext)) > It should be wrapped with PFN_DOWN(). > Other usages in map_range() seems to be correct, range->start,end is pfns. > Oh, right, I didn't realise map_range was already using pfns. > >> >>> The rest looks ok, so with that fixed: >>> >>> Reviewed-by: Andrey Ryabinin >> >> Thanks heaps! I've fixed up the nit you identifed in the first patch, >> and I agree that the last patch probably isn't needed. I'll respin the >> series shortly. >> > > Hold on a sec, just spotted another thing to fix. > >> @@ -352,9 +397,24 @@ void __init kasan_init(void) >> shadow_cpu_entry_end = (void *)round_up( >> (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); >> >> +/* >> + * If we're in full vmalloc mode, don't back vmalloc space with early >> + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to >> + * the global table and we can populate the lower levels on demand. >> + */ >> +#ifdef CONFIG_KASAN_VMALLOC >> +kasan_shallow_populate_pgds( >> +kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), > > This should be VMALLOC_START, there is no point to allocate pgds for the hole > between linear mapping > and vmalloc, just waste of memory. It make sense to map early shadow for that > hole, because if code > dereferences address in that hole we will see the page fault on that address > instead of fault on the shadow. > > So something like this might work: > > kasan_populate_early_shadow( > kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), > kasan_mem_to_shadow((void *)VMALLOC_START)); > > if (IS_ENABLED(CONFIG_KASAN_VMALLOC) > kasan_shallow_populate_pgds(kasan_mem_to_shadow(VMALLOC_START), > kasan_mem_to_shadow((void *)VMALLOC_END)) > else > kasan_populate_early_shadow(kasan_mem_to_shadow(VMALLOC_START), > kasan_mem_to_shadow((void *)VMALLOC_END)); > > kasan_populate_early_shadow( > kasan_mem_to_shadow((void *)VMALLOC_END + 1), > shadow_cpu_entry_begin); Sounds good. It's getting late for me so I'll change and test that and send a respin tomorrow my time. Regards, Daniel
Re: [PATCH v10 1/5] kasan: support backing vmalloc space with real shadow memory
Hello, Daniel > > @@ -1294,14 +1299,19 @@ static bool __purge_vmap_area_lazy(unsigned long > start, unsigned long end) > spin_lock(&free_vmap_area_lock); > llist_for_each_entry_safe(va, n_va, valist, purge_list) { > unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; > + unsigned long orig_start = va->va_start; > + unsigned long orig_end = va->va_end; > > /* >* Finally insert or merge lazily-freed area. It is >* detached and there is no need to "unlink" it from >* anything. >*/ > - merge_or_add_vmap_area(va, > - &free_vmap_area_root, &free_vmap_area_list); > + va = merge_or_add_vmap_area(va, &free_vmap_area_root, > + &free_vmap_area_list); > + > + kasan_release_vmalloc(orig_start, orig_end, > + va->va_start, va->va_end); > I have some questions here. I have not analyzed kasan_release_vmalloc() logic in detail, sorry if I miss something. __purge_vmap_area_lazy() deals with a big address space, so it does not only free vmalloc addresses here; basically it can be any address from 1 up to ULONG_MAX, whereas vmalloc space spans from VMALLOC_START to VMALLOC_END: 1) Should it be checked that only vmalloc addresses are freed, or do you handle it somewhere else? if (is_vmalloc_addr(va->va_start)) kasan_release_vmalloc(...) 2) Have you run any benchmarking just to see how much overhead it adds? I am asking because it probably makes sense to add those figures to the changelog (commit message). For example you can run: sudo ./test_vmalloc.sh performance and sudo ./test_vmalloc.sh sequential_test_order=1 Thanks! -- Vlad Rezki
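For reference, the guard asked about in 1) would presumably sit in __purge_vmap_area_lazy() and look like the sketch below; note that is_vmalloc_addr() takes a pointer, so the unsigned long address needs a cast. This only illustrates the question, it is not a claim about how the series resolves it:

        if (is_vmalloc_addr((void *)orig_start))
                kasan_release_vmalloc(orig_start, orig_end,
                                      va->va_start, va->va_end);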
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On 10/23/19 8:47 PM, Nayna Jain wrote: Hi Nayna, process_buffer_measurement() is limited to measuring the kexec boot command line. This patch makes process_buffer_measurement() more generic, allowing it to measure other types of buffer data (e.g. blacklisted binary hashes or key hashes). Now that process_buffer_measurement() is being made generic to measure any buffer, it would be good to add a tag to indicate what type of buffer is being measured. For example, if the buffer is kexec command line the log could look like: "kexec_cmdline: " Similarly, if the buffer is blacklisted binary hash: "blacklist hash: ". If the buffer is key hash: ": key data". This would greatly help the consumer of the IMA log to know the type of data represented in each IMA log entry. thanks, -lakshmi
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On Wed, 2019-10-30 at 08:22 -0700, Lakshmi Ramasubramanian wrote: > On 10/23/19 8:47 PM, Nayna Jain wrote: > > Hi Nayna, > > > process_buffer_measurement() is limited to measuring the kexec boot > > command line. This patch makes process_buffer_measurement() more > > generic, allowing it to measure other types of buffer data (e.g. > > blacklisted binary hashes or key hashes). > > Now that process_buffer_measurement() is being made generic to measure > any buffer, it would be good to add a tag to indicate what type of > buffer is being measured. > > For example, if the buffer is kexec command line the log could look like: > > "kexec_cmdline: " > > Similarly, if the buffer is blacklisted binary hash: > > "blacklist hash: ". > > If the buffer is key hash: > > ": key data". > > This would greatly help the consumer of the IMA log to know the type of > data represented in each IMA log entry. Both the existing kexec command line and the new blacklist buffer measurement pass that information in the eventname. The [PATCH 7/8] "ima: check against blacklisted hashes for files with modsig" patch description includes an example. Mimi
[PATCH 06/11] powerpc/powernv: Remove intermediate variable
Trim the pointless temporary variable. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index b78b5e81f941..319152d30bc3 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1854,9 +1854,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, (pe->device_count == 1 || !pe->pbus) && phb->model == PNV_PHB_MODEL_PHB3) { /* Configure the bypass mode */ - s64 rc = pnv_pci_ioda_dma_64bit_bypass(pe); - if (rc) + if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; + /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); -- 1.8.3.1
[PATCH 04/11] powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass()
Rework of pnv_pci_ioda_iommu_bypass_supported() dropped a call to pnv_npu_try_dma_set_bypass(). Reintroduce this call, so that the DMA bypass configuration of a GPU device is propagated to its corresponding NPU devices. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 8849218187d7..70e834635971 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1833,14 +1833,13 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, struct pnv_phb *phb = hose->private_data; struct pci_dn *pdn = pci_get_pdn(pdev); struct pnv_ioda_pe *pe; + bool bypass; if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) return false; pe = &phb->ioda.pe_array[pdn->pe_number]; - - if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask)) - return true; + bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); /* * If the device can't set the TCE bypass bit but still wants @@ -1848,7 +1847,8 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, * bypass the 32-bit region and be usable for 64-bit DMAs. * The device needs to be able to address all of this space. */ - if (dma_mask >> 32 && + if (!bypass && + dma_mask >> 32 && dma_mask > (memory_hotplug_max() + (1ULL << 32)) && /* pe->pdev should be set if it's a single device, pe->pbus if not */ (pe->device_count == 1 || !pe->pbus) && @@ -1859,10 +1859,14 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); - return true; + + bypass = true; } - return false; + /* Update peer npu devices */ + pnv_npu_try_dma_set_bypass(pdev, dma_mask); + + return bypass; } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) -- 1.8.3.1
[PATCH 05/11] powerpc/powernv: Return failure for some uses of dma_set_mask()
Rework of pnv_pci_ioda_dma_set_mask() effectively reverted commit 253fd51e2f53 ("powerpc/powernv/pci: Return failure for some uses of dma_set_mask()"). Reintroduce the desired behavior that an unfulfilled request for a DMA mask between 32 and 64 bits will return error instead of silently falling back to 32 bits. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/kernel/dma-iommu.c | 19 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c index e486d1d78de2..e1693760654b 100644 --- a/arch/powerpc/kernel/dma-iommu.c +++ b/arch/powerpc/kernel/dma-iommu.c @@ -122,10 +122,21 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask) { struct iommu_table *tbl = get_iommu_table_base(dev); - if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) { - dev->archdata.iommu_bypass = true; - dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n"); - return 1; + if (dev_is_pci(dev)) { + if (dma_iommu_bypass_supported(dev, mask)) { + dev->archdata.iommu_bypass = true; + dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n"); + return 1; + } + + if (mask >> 32) { + dev_err(dev, "Warning: IOMMU dma bypass not supported: mask 0x%016llx\n", + mask); + + /* A 64-bit request falls back to default ops */ + if (mask != DMA_BIT_MASK(64)) + return 0; + } } if (!tbl) { diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 70e834635971..b78b5e81f941 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1863,8 +1863,12 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, bypass = true; } - /* Update peer npu devices */ - pnv_npu_try_dma_set_bypass(pdev, dma_mask); + /* +* Update peer npu devices. We also do this for the special case where +* a 64-bit dma mask can't be fulfilled and falls back to default. +*/ + if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64)) + pnv_npu_try_dma_set_bypass(pdev, dma_mask); return bypass; } -- 1.8.3.1
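From a driver's point of view, the restored behaviour is that an unsatisfiable mask is reported rather than quietly narrowed. A hypothetical caller (the device, mask values and fallback policy are made up for illustration) would look like:

        /* Ask for 48-bit DMA; on an affected PHB3 only 32-bit or full 64-bit bypass works. */
        rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
        if (rc) {
                /* With this patch the request fails instead of silently using 32-bit DMA. */
                rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
                if (rc)
                        return rc;
        }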
[PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by separating the part of the function that determines if bypass is supported from the part that actually attempts to configure it. Move the latter to a controller-specific dma_set_mask() callback. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/Kconfig| 1 + arch/powerpc/platforms/powernv/pci-ioda.c | 30 -- 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 938803eab0ad..6e6e27841764 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -17,6 +17,7 @@ config PPC_POWERNV select PPC_DOORBELL select MMU_NOTIFIER select FORCE_SMP + select ARCH_HAS_DMA_SET_MASK default y config OPAL_PRD diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 57e6a43d9a3a..5291464930ed 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1854,32 +1854,33 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); - bool bypass; if (WARN_ON(!pe)) return false; - bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); + return pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask) || + pnv_phb3_iommu_bypass_supported(pe, dma_mask); +} + +static void pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 mask) +{ + struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); + + if (!pe) + return; - if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) { + if (!pnv_ioda_pe_iommu_bypass_supported(pe, mask) && + pnv_phb3_iommu_bypass_supported(pe, mask)) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) - return false; + return; /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); - - bypass = true; } - /* -* Update peer npu devices. We also do this for the special case where -* a 64-bit dma mask can't be fulfilled and falls back to default. -*/ - if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64)) - pnv_npu_try_dma_set_bypass(pdev, dma_mask); - - return bypass; + /* Update peer npu devices */ + pnv_npu_try_dma_set_bypass(pdev, mask); } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) @@ -3612,6 +3613,7 @@ static void pnv_pci_ioda_shutdown(struct pci_controller *hose) static const struct pci_controller_ops pnv_pci_ioda_controller_ops = { .dma_dev_setup = pnv_pci_dma_dev_setup, .dma_bus_setup = pnv_pci_dma_bus_setup, + .dma_set_mask = pnv_pci_ioda_dma_set_mask, .iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported, .setup_msi_irqs = pnv_setup_msi_irqs, .teardown_msi_irqs = pnv_teardown_msi_irqs, -- 1.8.3.1
[PATCH 08/11] powerpc/powernv: Replace open coded pnv_ioda_get_pe()s
Collapse several open coded instances of pnv_ioda_get_pe(). Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 22 +- arch/powerpc/platforms/powernv/pci-ioda.c | 10 +++--- 2 files changed, 8 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index a77ce7d71634..0010b21d45b8 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -97,24 +97,17 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index) static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, struct pci_dev **gpdev) { - struct pnv_phb *phb; - struct pci_controller *hose; struct pci_dev *pdev; struct pnv_ioda_pe *pe; - struct pci_dn *pdn; pdev = pnv_pci_get_gpu_dev(npe->pdev); if (!pdev) return NULL; - pdn = pci_get_pdn(pdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + pe = pnv_ioda_get_pe(pdev); + if (WARN_ON(!pe)) return NULL; - hose = pci_bus_to_host(pdev->bus); - phb = hose->private_data; - pe = &phb->ioda.pe_array[pdn->pe_number]; - if (gpdev) *gpdev = pdev; @@ -261,9 +254,6 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); - struct pnv_phb *phb; - struct pci_dn *pdn; - struct pnv_ioda_pe *npe; struct pci_dev *npdev; bool bypass; int i = 0; @@ -275,12 +265,10 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) { - pdn = pci_get_pdn(npdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) - return; + struct pnv_ioda_pe *npe = pnv_ioda_get_pe(npdev); - phb = pci_bus_to_host(npdev->bus)->private_data; - npe = &phb->ioda.pe_array[pdn->pe_number]; + if (WARN_ON(!npe)) + return; if (bypass) { dev_info(&npdev->dev, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 319152d30bc3..6b932cfc0deb 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1829,16 +1829,12 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe) static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { - struct pci_controller *hose = pci_bus_to_host(pdev->bus); - struct pnv_phb *phb = hose->private_data; - struct pci_dn *pdn = pci_get_pdn(pdev); - struct pnv_ioda_pe *pe; + struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); bool bypass; - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + if (WARN_ON(!pe)) return false; - pe = &phb->ioda.pe_array[pdn->pe_number]; bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); /* @@ -1852,7 +1848,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, dma_mask > (memory_hotplug_max() + (1ULL << 32)) && /* pe->pdev should be set if it's a single device, pe->pbus if not */ (pe->device_count == 1 || !pe->pbus) && - phb->model == PNV_PHB_MODEL_PHB3) { + pe->phb->model == PNV_PHB_MODEL_PHB3) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; -- 1.8.3.1
[PATCH 01/11] Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function"
Revert commit b4d37a7b6934 ("powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function") so that this function can be reintegrated. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/platforms/powernv/npu-dma.c | 99 1 file changed, 99 insertions(+) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index b95b9e3c4c98..5a8313654033 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -193,6 +193,105 @@ static long pnv_npu_unset_window(struct iommu_table_group *table_group, int num) return 0; } +/* + * Enables 32 bit DMA on NPU. + */ +static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) +{ + struct pci_dev *gpdev; + struct pnv_ioda_pe *gpe; + int64_t rc; + + /* +* Find the assoicated PCI devices and get the dma window +* information from there. +*/ + if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV)) + return; + + gpe = get_gpu_pci_dev_and_pe(npe, &gpdev); + if (!gpe) + return; + + rc = pnv_npu_set_window(&npe->table_group, 0, + gpe->table_group.tables[0]); + + /* +* NVLink devices use the same TCE table configuration as +* their parent device so drivers shouldn't be doing DMA +* operations directly on these devices. +*/ + set_dma_ops(&npe->pdev->dev, &dma_dummy_ops); +} + +/* + * Enables bypass mode on the NPU. The NPU only supports one + * window per link, so bypass needs to be explicitly enabled or + * disabled. Unlike for a PHB3 bypass and non-bypass modes can't be + * active at the same time. + */ +static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) +{ + struct pnv_phb *phb = npe->phb; + int64_t rc = 0; + phys_addr_t top = memblock_end_of_DRAM(); + + if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev) + return -EINVAL; + + rc = pnv_npu_unset_window(&npe->table_group, 0); + if (rc != OPAL_SUCCESS) + return rc; + + /* Enable the bypass window */ + + top = roundup_pow_of_two(top); + dev_info(&npe->pdev->dev, "Enabling bypass for PE %x\n", + npe->pe_number); + rc = opal_pci_map_pe_dma_window_real(phb->opal_id, + npe->pe_number, npe->pe_number, + 0 /* bypass base */, top); + + if (rc == OPAL_SUCCESS) + pnv_pci_ioda2_tce_invalidate_entire(phb, false); + + return rc; +} + +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) +{ + int i; + struct pnv_phb *phb; + struct pci_dn *pdn; + struct pnv_ioda_pe *npe; + struct pci_dev *npdev; + + for (i = 0; ; ++i) { + npdev = pnv_pci_get_npu_dev(gpdev, i); + + if (!npdev) + break; + + pdn = pci_get_pdn(npdev); + if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + return; + + phb = pci_bus_to_host(npdev->bus)->private_data; + + /* We only do bypass if it's enabled on the linked device */ + npe = &phb->ioda.pe_array[pdn->pe_number]; + + if (bypass) { + dev_info(&npdev->dev, + "Using 64-bit DMA iommu bypass\n"); + pnv_npu_dma_set_bypass(npe); + } else { + dev_info(&npdev->dev, "Using 32-bit DMA via iommu\n"); + pnv_npu_dma_set_32(npe); + } + } +} + /* Switch ownership from platform code to external user (e.g. VFIO) */ static void pnv_npu_take_ownership(struct iommu_table_group *table_group) { -- 1.8.3.1
[PATCH 07/11] powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop
Write this loop more compactly to improve readability. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index a6b8c7ad36e4..a77ce7d71634 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -261,12 +261,12 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); - int i; struct pnv_phb *phb; struct pci_dn *pdn; struct pnv_ioda_pe *npe; struct pci_dev *npdev; bool bypass; + int i = 0; if (!gpe) return; @@ -274,12 +274,7 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) /* We only do bypass if it's enabled on the linked device */ bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); - for (i = 0; ; ++i) { - npdev = pnv_pci_get_npu_dev(gpdev, i); - - if (!npdev) - break; - + while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) { pdn = pci_get_pdn(npdev); if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) return; -- 1.8.3.1
[PATCH 10/11] powerpc/powernv: Add pnv_phb3_iommu_bypass_supported()
Move this code to its own function for reuse. As a side benefit, rearrange the comments and spread things out for readability. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 37 +-- 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 6b932cfc0deb..57e6a43d9a3a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1826,6 +1826,30 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe) return -EIO; } +/* + * If the device can't set the TCE bypass bit but still wants + * to access 4GB or more, on PHB3 we can reconfigure TVE#0 to + * bypass the 32-bit region and be usable for 64-bit DMAs. + */ +static bool pnv_phb3_iommu_bypass_supported(struct pnv_ioda_pe *pe, u64 mask) +{ + if (pe->phb->model != PNV_PHB_MODEL_PHB3) + return false; + + /* pe->pdev should be set if it's a single device, pe->pbus if not */ + if (pe->pbus && pe->device_count != 1) + return false; + + if (!(mask >> 32)) + return false; + + /* The device needs to be able to address all of this space. */ + if (mask <= memory_hotplug_max() + (1ULL << 32)) + return false; + + return true; +} + static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { @@ -1837,18 +1861,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); - /* -* If the device can't set the TCE bypass bit but still wants -* to access 4GB or more, on PHB3 we can reconfigure TVE#0 to -* bypass the 32-bit region and be usable for 64-bit DMAs. -* The device needs to be able to address all of this space. -*/ - if (!bypass && - dma_mask >> 32 && - dma_mask > (memory_hotplug_max() + (1ULL << 32)) && - /* pe->pdev should be set if it's a single device, pe->pbus if not */ - (pe->device_count == 1 || !pe->pbus) && - pe->phb->model == PNV_PHB_MODEL_PHB3) { + if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; -- 1.8.3.1
[PATCH 02/11] powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported()
This little calculation will be needed in other places. Move it to a convenience function. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++- arch/powerpc/platforms/powernv/pci.h | 8 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index c28d0d9b7ee0..8849218187d7 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1838,11 +1838,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; pe = &phb->ioda.pe_array[pdn->pe_number]; - if (pe->tce_bypass_enabled) { - u64 top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1; - if (dma_mask >= top) - return true; - } + + if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask)) + return true; /* * If the device can't set the TCE bypass bit but still wants diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index f914f0b14e4e..41f7dec3aee5 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -4,6 +4,7 @@ #include /* for __printf */ #include +#include #include #include @@ -247,4 +248,11 @@ extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl, void *tce_mem, u64 tce_size, u64 dma_offset, unsigned int page_shift); +static inline bool pnv_ioda_pe_iommu_bypass_supported(struct pnv_ioda_pe *pe, + u64 mask) +{ + return pe->tce_bypass_enabled && + mask >= pe->tce_bypass_base + memblock_end_of_DRAM() - 1; +} + #endif /* __POWERNV_PCI_H */ -- 1.8.3.1
[PATCH 09/11] Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods"
Bring back the pci controller based hook in dma_set_mask(), as it will have a user again. This reverts commit 662acad4067a ("powerpc/pci: remove the dma_set_mask pci_controller ops methods"). The callback signature has been adjusted with void return to fit its caller. Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/include/asm/pci-bridge.h | 2 ++ arch/powerpc/kernel/dma-mask.c| 9 + 2 files changed, 11 insertions(+) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index ea6ec65970ef..8512dcd053c5 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -43,6 +43,8 @@ struct pci_controller_ops { void(*teardown_msi_irqs)(struct pci_dev *pdev); #endif + void(*dma_set_mask)(struct pci_dev *pdev, u64 dma_mask); + void(*shutdown)(struct pci_controller *hose); }; diff --git a/arch/powerpc/kernel/dma-mask.c b/arch/powerpc/kernel/dma-mask.c index ffbbbc432612..35b5fd1b03a6 100644 --- a/arch/powerpc/kernel/dma-mask.c +++ b/arch/powerpc/kernel/dma-mask.c @@ -2,11 +2,20 @@ #include #include +#include #include void arch_dma_set_mask(struct device *dev, u64 dma_mask) { if (ppc_md.dma_set_mask) ppc_md.dma_set_mask(dev, dma_mask); + + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + struct pci_controller *phb = pci_bus_to_host(pdev->bus); + + if (phb->controller_ops.dma_set_mask) + phb->controller_ops.dma_set_mask(pdev, dma_mask); + } } EXPORT_SYMBOL(arch_dma_set_mask); -- 1.8.3.1
[PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
With recent kernels, TCE tables for NPU devices are no longer being configured. That task was performed by pnv_npu_try_dma_set_bypass(), a function that got swept away in recent overhauling of dma code. Patches 1-4 here bring the lost function back and reintegrate it with the updated generic iommu bypass infrastructure. Patch 5 fixes a regression in behavior when a requested dma mask can not be fulfilled. Patches 6-8 are cleanup. I put these later in the set because they aren't bisectable until after the restored code is wired back in. Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems wrong for a boolean *_supported() function to have side effects. They reintroduce a pci controller based dma_set_mask() hook. If that's undesirable, these last three patches can be dropped. Reza Arbab (11): Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function" powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported() powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass() powerpc/powernv: Return failure for some uses of dma_set_mask() powerpc/powernv: Remove intermediate variable powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop powerpc/powernv: Replace open coded pnv_ioda_get_pe()s Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods" powerpc/powernv: Add pnv_phb3_iommu_bypass_supported() powerpc/powernv: Add pnv_pci_ioda_dma_set_mask() arch/powerpc/include/asm/pci-bridge.h | 2 + arch/powerpc/kernel/dma-iommu.c | 19 -- arch/powerpc/kernel/dma-mask.c| 9 +++ arch/powerpc/platforms/powernv/Kconfig| 1 + arch/powerpc/platforms/powernv/npu-dma.c | 106 +++--- arch/powerpc/platforms/powernv/pci-ioda.c | 71 arch/powerpc/platforms/powernv/pci.h | 10 ++- 7 files changed, 177 insertions(+), 41 deletions(-) -- 1.8.3.1
[PATCH 03/11] powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument
To enable simpler calling code, change this function to find the value of bypass instead of taking it as an argument. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 12 +--- arch/powerpc/platforms/powernv/pci.h | 2 +- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 5a8313654033..a6b8c7ad36e4 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -258,13 +258,21 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) return rc; } -void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { + struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); int i; struct pnv_phb *phb; struct pci_dn *pdn; struct pnv_ioda_pe *npe; struct pci_dev *npdev; + bool bypass; + + if (!gpe) + return; + + /* We only do bypass if it's enabled on the linked device */ + bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); for (i = 0; ; ++i) { npdev = pnv_pci_get_npu_dev(gpdev, i); @@ -277,8 +285,6 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) return; phb = pci_bus_to_host(npdev->bus)->private_data; - - /* We only do bypass if it's enabled on the linked device */ npe = &phb->ioda.pe_array[pdn->pe_number]; if (bypass) { diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index 41f7dec3aee5..21db0f4cfb11 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -211,7 +211,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) /* Nvlink functions */ -extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass); +extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask); extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm); extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe); extern struct iommu_table_group *pnv_try_setup_npu_table_group( -- 1.8.3.1
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 11:59:49AM -0500, Reza Arbab wrote: > With recent kernels, TCE tables for NPU devices are no longer being > configured. That task was performed by pnv_npu_try_dma_set_bypass(), a > function that got swept away in recent overhauling of dma code. > > Patches 1-4 here bring the lost function back and reintegrate it with > the updated generic iommu bypass infrastructure. > > Patch 5 fixes a regression in behavior when a requested dma mask can not > be fulfilled. > > Patches 6-8 are cleanup. I put these later in the set because they > aren't bisectable until after the restored code is wired back in. > > Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems > wrong for a boolean *_supported() function to have side effects. They > reintroduce a pci controller based dma_set_mask() hook. If that's > undesirable, these last three patches can be dropped. How do you even use this code? Nothing in the kernel even calls dma_set_mask for NPU devices, as we only suport vfio pass through.
Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote: > Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by > separating the part of the function that determines if bypass is > supported from the part that actually attempts to configure it. > > Move the latter to a controller-specific dma_set_mask() callback. Nak, the dma_set_mask overrides are going away. But as said in the reply to the cover letter I don't even see how you could end up calling this code.
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: > How do you even use this code? Nothing in the kernel even calls > dma_set_mask for NPU devices, as we only suport vfio pass through. You use it by calling dma_set_mask() for the *GPU* device. The purpose of pnv_npu_try_dma_set_bypass() is to then propagate the same bypass configuration to all the NPU devices associated with that GPU. -- Reza Arbab
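To make the call path concrete: the trigger being described is the GPU driver's ordinary DMA mask setup. A minimal, hypothetical sketch follows (no in-kernel driver doing this is named in the thread; the probe function and device are illustrative, and only dma_set_mask_and_coherent() is the standard DMA API call assumed here):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical GPU driver probe path (sketch only). On powernv, setting
 * the GPU's DMA mask is what would let platform code such as
 * pnv_npu_try_dma_set_bypass() propagate the same bypass (or 32-bit
 * window) configuration to the NPU links attached to that GPU.
 */
static int example_gpu_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
		return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));

	return 0;
}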
Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
On Wed, Oct 30, 2019 at 06:55:18PM +0100, Christoph Hellwig wrote: > On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote: >> Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by >> separating the part of the function that determines if bypass is >> supported from the part that actually attempts to configure it. >> >> Move the latter to a controller-specific dma_set_mask() callback. > > Nak, the dma_set_mask overrides are going away. But as said in the reply to > the cover letter I don't even see how you could end up calling this code. Okay. As mentioned in the cover letter these last few patches could be omitted if that's the case. -- Reza Arbab
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote: > On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: >> How do you even use this code? Nothing in the kernel even calls >> dma_set_mask for NPU devices, as we only suport vfio pass through. > > You use it by calling dma_set_mask() for the *GPU* device. The purpose of > pnv_npu_try_dma_set_bypass() is to then propagate the same bypass > configuration to all the NPU devices associated with that GPU. Which in-kernel driver, which PCI ID?
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote: > > > Le 30/10/2019 à 08:31, Russell Currey a écrit : > > v4 cover letter: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html > > v3 cover letter: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html > > > > Changes since v4: > > [1/5]: Addressed review comments from Michael Ellerman (thanks!) > > [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on > >ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid > >STRICT_MODULE_RWX being *on by default* in cases where > >STRICT_KERNEL_RWX is *unavailable* > > [5/5]: split skiroot_defconfig changes out into its own patch > > > > The whole Kconfig situation is really weird and confusing, I believe the > > correct resolution is to change arch/Kconfig but the consequences are so > > minor that I don't think it's worth it, especially given that I expect > > powerpc to have mandatory strict RWX Soon(tm). > > I'm not such strict RWX can be made mandatory due to the impact it has on > some subarches: > - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant > overhead on TLB misses. And Aligning everthing to 8M is a waste of RAM which > is not acceptable on systems having very few RAM. > - On hash book3s32, we are able to map the kernel BATs. With a few alignment > constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to > provide exec protection on page granularity. Only on 256Mbytes segments. So > for modules, we have to have the vmspace X. It is also not possible to have > a kernel area RO. Only user areas can be made RO. As I understand it, the idea was for it to be mandatory (or at least default-on) only for the subarches where it wasn't totally insane to accomplish. :) (I'm not familiar with all the details on the subarchs, but it sounded like the more modern systems would be the targets for this?) -- Kees Cook
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 07:13:59PM +0100, Christoph Hellwig wrote: > On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote: >> On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: >>> How do you even use this code? Nothing in the kernel even calls >>> dma_set_mask for NPU devices, as we only suport vfio pass through. >> >> You use it by calling dma_set_mask() for the *GPU* device. The purpose of >> pnv_npu_try_dma_set_bypass() is to then propagate the same bypass >> configuration to all the NPU devices associated with that GPU. > > Which in-kernel driver, which PCI ID? Aha, it's this again. Didn't catch your meaning at first. Point taken. -- Reza Arbab
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 01:32:01PM -0500, Reza Arbab wrote: > Aha, it's this again. Didn't catch your meaning at first. Point taken. It's not _me_. It's that you (plural) keep ignoring how Linux development works.
Re: [PATCH v2] powerpc/imc: Don't create debugfs files for cpu-less nodes
On Tue, 2019-07-23 at 16:57 +0530, Anju T Sudhakar wrote: > Hi Qian, > > On 7/16/19 12:11 AM, Qian Cai wrote: > > On Thu, 2019-07-11 at 14:53 +1000, Michael Ellerman wrote: > > > Hi Maddy, > > > > > > Madhavan Srinivasan writes: > > > > diff --git a/arch/powerpc/platforms/powernv/opal-imc.c > > > > b/arch/powerpc/platforms/powernv/opal-imc.c > > > > index 186109bdd41b..e04b20625cb9 100644 > > > > --- a/arch/powerpc/platforms/powernv/opal-imc.c > > > > +++ b/arch/powerpc/platforms/powernv/opal-imc.c > > > > @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct > > > > device_node > > > > *node, > > > > if (of_property_read_u32(node, "cb_offset", &cb_offset)) > > > > cb_offset = IMC_CNTL_BLK_OFFSET; > > > > > > > > - for_each_node(nid) { > > > > - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; > > > > + while (ptr->vbase != NULL) { > > > > > > This means you'll bail out as soon as you find a node with no vbase, but > > > it's possible we could have a CPU-less node intermingled with other > > > nodes. > > > > > > So I think you want to keep the for loop, but continue if you see a NULL > > > vbase? > > > > Not sure if this will also takes care of some of those messages during the > > boot > > on today's linux-next even without this patch. > > > > > > [ 18.077780][T1] debugfs: Directory 'imc' with parent 'powerpc' > > already > > present! > > > > > > This is introduced by a recent commit: c33d442328f55 (debugfs: make > error message a bit more verbose). > > So basically, the debugfs imc_* file is created per node, and is created > by the first nest unit which is > > being registered. For the subsequent nest units, debugfs_create_dir() > will just return since the imc_* file already > > exist. > > The commit "c33d442328f55 (debugfs: make error message a bit more > verbose)", prints > > a message if the debugfs file already exists in debugfs_create_dir(). > That is why we are encountering these > > messages now. > > > This patch (i.e, powerpc/imc: Dont create debugfs files for cpu-less > nodes) will address the initial issue, i.e > > "numa crash while reading imc_* debugfs files for cpu less nodes", and > will not address these debugfs messages. > > > But yeah this is a good catch. We can have some checks to avoid these > debugfs messages. Anju, do you still plan to fix those "Directory 'imc' with parent 'powerpc' already present!" warnings as they are still there in the latest linux-next? > > > Hi Michael, > > Do we need to have a separate patch to address these debugfs messages, > or can we address the same > > in the next version of this patch itself? > > > Thanks, > > Anju > > > >
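A minimal sketch of the kind of check being discussed for silencing the duplicate-directory messages, assuming the goal is simply to stop calling debugfs_create_dir() once per nest unit (the names and structure here are illustrative, not the actual opal-imc code):

#include <linux/debugfs.h>

/*
 * Sketch only: cache the "imc" directory and create it a single time,
 * instead of calling debugfs_create_dir() for every nest unit, which is
 * what makes newer kernels print "Directory 'imc' with parent 'powerpc'
 * already present!". The parent dentry is whatever the arch already uses
 * as its debugfs root.
 */
static struct dentry *imc_debugfs_dir;

static struct dentry *imc_get_debugfs_dir(struct dentry *parent)
{
	if (!imc_debugfs_dir)
		imc_debugfs_dir = debugfs_create_dir("imc", parent);

	return imc_debugfs_dir;
}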
Re: [PATCH v4 0/4] Implement STRICT_MODULE_RWX for powerpc
On Wed, Oct 30, 2019 at 11:16:22AM +1100, Michael Ellerman wrote: > Kees Cook writes: > > On Mon, Oct 14, 2019 at 04:13:16PM +1100, Russell Currey wrote: > >> v3 cover letter here: > >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html > >> > >> Only minimal changes since then: > >> > >> - patch 2/4 commit message update thanks to Andrew Donnellan > >> - patch 3/4 made neater thanks to Christophe Leroy > >> - patch 3/4 updated Kconfig description thanks to Daniel Axtens > > > > I continue to be excited about this work. :) Is there anything holding > > it back from landing in linux-next? > > I had some concerns, which I stupidly posted in reply to v3: > > https://lore.kernel.org/linuxppc-dev/87pnio5fva@mpe.ellerman.id.au/ Ah-ha! Thanks; I missed that. :) -- Kees Cook
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On 30/10/2019 at 19:30, Kees Cook wrote: > On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote: >> Le 30/10/2019 à 08:31, Russell Currey a écrit : >>> v4 cover letter: >>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html >>> v3 cover letter: >>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html >>> >>> Changes since v4: >>> [1/5]: Addressed review comments from Michael Ellerman (thanks!) >>> [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on >>>ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid >>>STRICT_MODULE_RWX being *on by default* in cases where >>>STRICT_KERNEL_RWX is *unavailable* >>> [5/5]: split skiroot_defconfig changes out into its own patch >>> >>> The whole Kconfig situation is really weird and confusing, I believe the >>> correct resolution is to change arch/Kconfig but the consequences are so >>> minor that I don't think it's worth it, especially given that I expect >>> powerpc to have mandatory strict RWX Soon(tm). >> >> I'm not such strict RWX can be made mandatory due to the impact it has on >> some subarches: >> - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant >> overhead on TLB misses. And Aligning everthing to 8M is a waste of RAM which >> is not acceptable on systems having very few RAM. >> - On hash book3s32, we are able to map the kernel BATs. With a few alignment >> constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to >> provide exec protection on page granularity. Only on 256Mbytes segments. So >> for modules, we have to have the vmspace X. It is also not possible to have >> a kernel area RO. Only user areas can be made RO. > > As I understand it, the idea was for it to be mandatory (or at least > default-on) only for the subarches where it wasn't totally insane to > accomplish. :) (I'm not familiar with all the details on the subarchs, but > it sounded like the more modern systems would be the targets for this?) Yes I guess so. I was just afraid by "I expect powerpc to have mandatory strict RWX" By the way, we have an open issue #223 namely 'Make strict kernel RWX the default on 64-bit', so no worry as 32-bit is aside for now. Christophe
Section mismatch warnings on powerpc
Still see those: WARNING: vmlinux.o(.text+0x2d04): Section mismatch in reference from the variable __boot_from_prom to the function .init.text:prom_init() The function __boot_from_prom() references the function __init prom_init(). This is often because __boot_from_prom lacks a __init annotation or the annotation of prom_init is wrong. WARNING: vmlinux.o(.text+0x2ec8): Section mismatch in reference from the variable start_here_common to the function .init.text:start_kernel() The function start_here_common() references the function __init start_kernel(). This is often because start_here_common lacks a __init annotation or the annotation of start_kernel is wrong. There is a patch around: http://patchwork.ozlabs.org/patch/895442/ Is it still waiting for Michael to come up with some better names?
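For context, the pattern modpost is warning about can be shown with a small, hedged C illustration (the symbols above live in assembly, so this is not the actual fix, only the general shape of the problem):

#include <linux/init.h>

static int __init my_early_setup(void)
{
	/* Lives in .init.text and is discarded after boot. */
	return 0;
}

/*
 * A plain .text symbol referencing an __init symbol is what triggers the
 * section mismatch warning, since .init.text is freed once boot is done.
 * Annotating the caller __init, or marking it __ref when the reference is
 * known to happen only during boot, resolves it.
 */
static int __ref boot_time_only_caller(void)
{
	return my_early_setup();
}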
[PATCH 00/12] Convert cpu_up/down to device_online/offline
Using cpu_up/down directly to bring cpus online/offline loses synchronization with sysfs and could suffer from a race similar to what is described in commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM"). cpu_up/down seem to be more of a internal implementation detail for the cpu subsystem to use to boot up cpus, perform suspend/resume and low level hotplug operations. Users outside of the cpu subsystem would be better using the device core API to bring a cpu online/offline which is the interface used to hotplug memory and other system devices. Several users have already migrated to use the device core API, this series converts the remaining users and hides cpu_up/down from internal users at the end. I still need to update the documentation to remove references to cpu_up/down and advocate for device_online/offline instead if this series will make its way through. I noticed this problem while working on a hack to disable offlining a particular CPU but noticed that setting the offline_disabled attribute in the device struct isn't enough because users can easily bypass the device core. While my hack isn't a valid use case but it did highlight the inconsistency in the way cpus are being onlined/offlined and this attempt hopefully improves on this. The first 6 patches fixes arch users. The next 5 patches fixes generic code users. Particularly creating a new special exported API for the device core to use instead of cpu_up/down. Maybe we can do something more restrictive than that. The last patch removes cpu_up/down from cpu.h and unexport the functions. In some cases where the use of cpu_up/down seemed legitimate, I encapsulated the logic in a higher level - special purposed function; and converted the code to use that instead. I did run the rcu torture, lock torture and psci checker tests and no problem was noticed. I did perform build tests on all arch affected except for parisc. Hopefully I got the CC list right for all the patches. Apologies in advance if some people were omitted from some patches but they should have been CCed. CC: Armijn Hemel CC: Benjamin Herrenschmidt CC: Bjorn Helgaas CC: Borislav Petkov CC: Boris Ostrovsky CC: Catalin Marinas CC: Christophe Leroy CC: Daniel Lezcano CC: Davidlohr Bueso CC: "David S. Miller" CC: Eiichi Tsukata CC: Enrico Weigelt CC: Fenghua Yu CC: Greg Kroah-Hartman CC: Helge Deller CC: "H. Peter Anvin" CC: Ingo Molnar CC: "James E.J. Bottomley" CC: James Morse CC: Jiri Kosina CC: Josh Poimboeuf CC: Josh Triplett CC: Juergen Gross CC: Lorenzo Pieralisi CC: Mark Rutland CC: Michael Ellerman CC: Nadav Amit CC: Nicholas Piggin CC: "Paul E. McKenney" CC: Paul Mackerras CC: Pavankumar Kondeti CC: "Peter Zijlstra (Intel)" CC: "Rafael J. 
Wysocki" CC: Ram Pai CC: Richard Fontana CC: Sakari Ailus CC: Stefano Stabellini CC: Steve Capper CC: Thiago Jung Bauermann CC: Thomas Gleixner CC: Tony Luck CC: Will Deacon CC: Zhenzhong Duan CC: linux-arm-ker...@lists.infradead.org CC: linux-i...@vger.kernel.org CC: linux-ker...@vger.kernel.org CC: linux-par...@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: sparcli...@vger.kernel.org CC: x...@kernel.org CC: xen-de...@lists.xenproject.org Qais Yousef (12): arm64: hibernate.c: create a new function to handle cpu_up(sleep_cpu) x86: Replace cpu_up/down with devcie_online/offline powerpc: Replace cpu_up/down with device_online/offline ia64: Replace cpu_down with freeze_secondary_cpus sparc: Replace cpu_up/down with device_online/offline parisc: Replace cpu_up/down with device_online/offline driver: base: cpu: export device_online/offline driver: xen: Replace cpu_up/down with device_online/offline firmware: psci: Replace cpu_up/down with device_online/offline torture: Replace cpu_up/down with device_online/offline smp: Create a new function to bringup nonboot cpus online cpu: Hide cpu_up/down arch/arm64/kernel/hibernate.c | 13 +++ arch/ia64/kernel/process.c | 8 +--- arch/parisc/kernel/processor.c | 4 +- arch/powerpc/kernel/machine_kexec_64.c | 4 +- arch/sparc/kernel/ds.c | 8 +++- arch/x86/kernel/topology.c | 4 +- arch/x86/mm/mmio-mod.c | 8 +++- arch/x86/xen/smp.c | 4 +- drivers/base/core.c| 4 ++ drivers/base/cpu.c | 4 +- drivers/firmware/psci/psci_checker.c | 6 ++- drivers/xen/cpu_hotplug.c | 2 +- include/linux/cpu.h| 6 ++- kernel/cpu.c | 53 -- kernel/smp.c | 9 + kernel/torture.c | 15 ++-- 16 files changed, 106 insertions(+), 46 deletions(-) -- 2.17.1
[PATCH 03/12] powerpc: Replace cpu_up/down with device_online/offline
The core device API performs extra housekeeping bits that are missing from directly calling cpu_up/down. See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") for an example description of what might go wrong. This also prepares to make cpu_up/down a private interface for anything but the cpu subsystem. Signed-off-by: Qais Yousef CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Enrico Weigelt CC: Ram Pai CC: Nicholas Piggin CC: Thiago Jung Bauermann CC: Christophe Leroy CC: Thomas Gleixner CC: linuxppc-dev@lists.ozlabs.org CC: linux-ker...@vger.kernel.org --- arch/powerpc/kernel/machine_kexec_64.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c index 04a7cba58eff..ebf8cc7acc4d 100644 --- a/arch/powerpc/kernel/machine_kexec_64.c +++ b/arch/powerpc/kernel/machine_kexec_64.c @@ -208,13 +208,15 @@ static void wake_offline_cpus(void) { int cpu = 0; + lock_device_hotplug(); for_each_present_cpu(cpu) { if (!cpu_online(cpu)) { printk(KERN_INFO "kexec: Waking offline cpu %d.\n", cpu); - WARN_ON(cpu_up(cpu)); + WARN_ON(device_online(get_cpu_device(cpu))); } } + unlock_device_hotplug(); } static void kexec_prepare_cpus(void) -- 2.17.1
Re: [PATCH RFC 1/5] dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
On Mon, Oct 14, 2019 at 08:31:03PM +0200, Nicolas Saenz Julienne wrote: > Some architectures, notably ARM, are interested in tweaking this > depending on their runtime DMA addressing limitations. > > Signed-off-by: Nicolas Saenz Julienne Do you want me to pick this up for the 5.5 dma-mapping tree, or do you want me to wait for the rest to settle?
Re: [PATCH v3 3/8] powerpc: Fix vDSO clock_getres()
Hi, [This is an automated email] This commit has been processed because it contains a "Fixes:" tag, fixing commit: a7f290dad32ee [PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel. The bot has tested the following trees: v5.3.8, v4.19.81, v4.14.151, v4.9.198, v4.4.198. v5.3.8: Build OK! v4.19.81: Build OK! v4.14.151: Failed to apply! Possible dependencies: 5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE") b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038") v4.9.198: Failed to apply! Possible dependencies: 4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c") 5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE") 5d451a87e5ebb ("powerpc/64: Retrieve number of L1 cache sets from device-tree") 7c5b06cadf274 ("KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9") 83677f551e0a6 ("KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9") 902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task") b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038") bd067f83b0840 ("powerpc/64: Fix naming of cache block vs. cache line") e2827fe5c1566 ("powerpc/64: Clean up ppc64_caches using a struct per cache") e9cf1e085647b ("KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs") f4c51f841d2ac ("KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests") v4.4.198: Failed to apply! Possible dependencies: 153086644fd1f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI") 3eb5d5888dc68 ("powerpc: Add ppc_strict_facility_enable boot option") 4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c") 579e633e764e6 ("powerpc: create flush_all_to_thread()") 5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE") 70fe3d980f5f1 ("powerpc: Restore FPU/VEC/VSX if previously used") 85baa095497f3 ("powerpc/livepatch: Add live patching support on ppc64le") 902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task") b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038") bf76f73c5f655 ("powerpc: enable UBSAN support") c208505900b23 ("powerpc: create giveup_all()") d1e1cf2e38def ("powerpc: clean up asm/switch_to.h") dc4fbba11e466 ("powerpc: Create disable_kernel_{fp,altivec,vsx,spe}()") f17c4e01e906c ("powerpc/module: Mark module stubs with a magic value") NOTE: The patch will not be queued to stable trees until it is upstream. How should we proceed with this patch? -- Thanks, Sasha
[PATCH 01/19] mm/gup: pass flags arg to __gup_device_* functions
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions. Also placate checkpatch.pl by shortening a nearby line. Cc: Kirill A. Shutemov Signed-off-by: John Hubbard --- mm/gup.c | 28 ++-- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 8f236a335ae9..85caf76b3012 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static int __gup_device_huge(unsigned long pfn, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { int nr_start = *nr; struct dev_pagemap *pgmap = NULL; @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, } static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { unsigned long fault_pfn; int nr_start = *nr; fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr)) + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, } static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { unsigned long fault_pfn; int nr_start = *nr; fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr)) + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; if (unlikely(pud_val(orig) != pud_val(*pudp))) { @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, } #else static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { BUILD_BUG(); return 0; } static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) +unsigned long end, unsigned int flags, +struct page **pages, int *nr) { BUILD_BUG(); return 0; @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr); + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags, +pages, nr); } refs = 0; @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, } static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { struct page *head, *page; int refs; @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr); + return __gup_device_huge_pud(orig, pudp, addr, end, flags, +pages, nr); } refs = 0; -- 2.23.0
[PATCH 00/19] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM
Hi, This applies cleanly to linux-next and mmotm, and also to linux.git if linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB case") is first applied there. This provides tracking of dma-pinned pages. This is a prerequisite to solving the larger problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :) A new internal gup flag, FOLL_PIN is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst. I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned". In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) put_user_page() Because there are interdependencies with FOLL_LONGTERM, a similar conversion as for FOLL_PIN, was applied. The change was from this: get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET) put_page() to this: pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM) put_user_page() Patch summary: * Patches 1-4: refactoring and preparatory cleanup, independent fixes (Patch 4: V4L2-core bug fix (can be separately applied)) * Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional changes yet * Patches 6-11: Convert existing put_user_page() callers, to use the new pin*() * Patch 12: Activate tracking of FOLL_PIN pages. * Patches 13-15: convert FOLL_LONGTERM callers * Patches: 16-17: gup_benchmark and run_vmtests support * Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag * Patch 19: Documentation/vm/pin_user_pages.rst Testing: * I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each. Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me. Also, my runtime testing for the call sites so far is very weak: * io_uring: Some directed tests from liburing exercise this, and they pass. * process_vm_access.c: A small directed test passes. * gup_benchmark: the enhanced version hits the new gup.c code, and passes. * infiniband (still only have crude "IB pingpong" working, on a good day: it's not exercising my conversions at runtime...) * VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet) * powerpc: it compiles... * drm/via: compiles... * goldfish: compiles... * net/xdp: compiles... * media/v4l2: compiles... Next: * Get the block/bio_vec sites converted to use pin_user_pages(). * Work with Ira and Dave Chinner to weave this together with the layout lease stuff. 
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ John Hubbard (19): mm/gup: pass flags arg to __gup_device_* functions mm/gup: factor out duplicate code from four routines goldish_pipe: rename local pin_user_pages() routine media/v4l2-core: set pages dirty upon releasing DMA buffers mm/gup: introduce pin_user_pages*() and FOLL_PIN goldish_pipe: convert to pin_user_pages() and put_user_page() infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() drm/via: set FOLL_PIN via pin_user_pages_fast() fs/io_uring: set FOLL_PIN via pin_user_pages() net/xdp: set FOLL_PIN via pin_user_pages() mm/gup: track FOLL_PIN pages media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversio
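As a concrete illustration of the opt-in conversion described in this cover letter, a simple call site changes roughly as in the sketch below. It is based on the goldfish and process_vm_access conversions later in the series; the function itself is hypothetical and assumes the series is applied, since pin_user_pages_fast() is introduced by it (put_user_pages_dirty_lock() already exists).

#include <linux/mm.h>

/*
 * Hypothetical call site, sketching the conversion:
 *   get_user_pages_fast() + put_page() loop
 * becomes
 *   pin_user_pages_fast() + put_user_pages_dirty_lock().
 */
static int example_dma_prep(unsigned long start, int nr_pages,
			    struct page **pages)
{
	int pinned;

	pinned = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
	if (pinned <= 0)
		return pinned;

	/* ... hand the pages to the device and wait for DMA to complete ... */

	/* Pages obtained via FOLL_PIN must be released this way: */
	put_user_pages_dirty_lock(pages, pinned, true);

	return 0;
}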
[PATCH 02/19] mm/gup: factor out duplicate code from four routines
There are four locations in gup.c that have a fair amount of code duplication. This means that changing one requires making the same changes in four places, not to mention reading the same code four times, and wondering if there are subtle differences. Factor out the common code into static functions, thus reducing the overall line count and the code's complexity. Also, take the opportunity to slightly improve the efficiency of the error cases, by doing a mass subtraction of the refcount, surrounded by get_page()/put_page(). Also, further simplify (slightly), by waiting until the the successful end of each routine, to increment *nr. Cc: Christoph Hellwig Cc: Aneesh Kumar K.V Signed-off-by: John Hubbard --- mm/gup.c | 113 ++- 1 file changed, 46 insertions(+), 67 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 85caf76b3012..8fb0d9cdfaf5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1969,6 +1969,35 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, } #endif +static int __record_subpages(struct page *page, unsigned long addr, +unsigned long end, struct page **pages, int nr) +{ + int nr_recorded_pages = 0; + + do { + pages[nr] = page; + nr++; + page++; + nr_recorded_pages++; + } while (addr += PAGE_SIZE, addr != end); + return nr_recorded_pages; +} + +static void __remove_refs_from_head(struct page *page, int refs) +{ + /* Do a get_page() first, in case refs == page->_refcount */ + get_page(page); + page_ref_sub(page, refs); + put_page(page); +} + +static int __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr) +{ + *nr += nr_recorded_pages; + SetPageReferenced(head); + return 1; +} + #ifdef CONFIG_ARCH_HAS_HUGEPD static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, unsigned long sz) @@ -1998,34 +2027,19 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, /* hugepages are never "special" */ VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - refs = 0; head = pte_page(pte); - page = head + ((addr & (sz-1)) >> PAGE_SHIFT); - do { - VM_BUG_ON(compound_head(page) != head); - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = __record_subpages(page, addr, end, pages, *nr); head = try_get_compound_head(head, refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pte_val(pte) != pte_val(*ptep))) { - /* Could be optimized better */ - *nr -= refs; - while (refs--) - put_page(head); + __remove_refs_from_head(head, refs); return 0; } - - SetPageReferenced(head); - return 1; + return __huge_pt_done(head, refs, nr); } static int gup_huge_pd(hugepd_t hugepd, unsigned long addr, @@ -2071,30 +2085,18 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, pages, nr); } - refs = 0; page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = __record_subpages(page, addr, end, pages, *nr); head = try_get_compound_head(pmd_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - *nr -= refs; - while (refs--) - put_page(head); + __remove_refs_from_head(head, refs); return 0; } - - SetPageReferenced(head); - return 1; + return __huge_pt_done(head, refs, nr); } static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, @@ -2114,30 +2116,18 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, pages, nr); } - refs = 0; page = pud_page(orig) + 
((addr & ~PUD_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = __record_subpages(page, addr, end, pages, *nr); head = try_get_compound_head(pud_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pud_val(orig) != pud
[PATCH 03/19] goldish_pipe: rename local pin_user_pages() routine
1. Avoid naming conflicts: rename local static function from "pin_user_pages()" to "pin_goldfish_pages()". An upcoming patch will introduce a global pin_user_pages() function. Signed-off-by: John Hubbard --- drivers/platform/goldfish/goldfish_pipe.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index cef0133aa47a..7ed2a21a0bac 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status) } } -static int pin_user_pages(unsigned long first_page, - unsigned long last_page, - unsigned int last_page_size, - int is_write, - struct page *pages[MAX_BUFFERS_PER_COMMAND], - unsigned int *iter_last_page_size) +static int pin_goldfish_pages(unsigned long first_page, + unsigned long last_page, + unsigned int last_page_size, + int is_write, + struct page *pages[MAX_BUFFERS_PER_COMMAND], + unsigned int *iter_last_page_size) { int ret; int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1; @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, if (mutex_lock_interruptible(&pipe->lock)) return -ERESTARTSYS; - pages_count = pin_user_pages(first_page, last_page, -last_page_size, is_write, -pipe->pages, &iter_last_page_size); + pages_count = pin_goldfish_pages(first_page, last_page, +last_page_size, is_write, +pipe->pages, &iter_last_page_size); if (pages_count < 0) { mutex_unlock(&pipe->lock); return pages_count; -- 2.23.0
[PATCH 04/19] media/v4l2-core: set pages dirty upon releasing DMA buffers
After DMA is complete, and the device and CPU caches are synchronized, it's still required to mark the CPU pages as dirty, if the data was coming from the device. However, this driver was just issuing a bare put_page() call, without any set_page_dirty*() call. Fix the problem, by calling set_page_dirty_lock() if the CPU pages were potentially receiving data from the device. Cc: Mauro Carvalho Chehab Signed-off-by: John Hubbard --- drivers/media/v4l2-core/videobuf-dma-sg.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 66a6c6c236a7..28262190c3ab 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) + for (i = 0; i < dma->nr_pages; i++) { + if (dma->direction == DMA_FROM_DEVICE) + set_page_dirty_lock(dma->pages[i]); put_page(dma->pages[i]); + } kfree(dma->pages); dma->pages = NULL; } -- 2.23.0
[PATCH 06/19] goldish_pipe: convert to pin_user_pages() and put_user_page()
1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages(). 2. As required by pin_user_pages(), release these pages via put_user_page(). In this case, do so via put_user_pages_dirty_lock(). That has the side effect of calling set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] Another side effect is that the release code is simplified because the page[] loop is now in gup.c instead of here, so just delete the local release_user_pages() entirely, and call put_user_pages_dirty_lock() directly, instead. [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Signed-off-by: John Hubbard --- drivers/platform/goldfish/goldfish_pipe.c | 17 +++-- 1 file changed, 3 insertions(+), 14 deletions(-) diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index 7ed2a21a0bac..635a8bc1b480 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page, *iter_last_page_size = last_page_size; } - ret = get_user_pages_fast(first_page, requested_pages, + ret = pin_user_pages_fast(first_page, requested_pages, !is_write ? FOLL_WRITE : 0, pages); if (ret <= 0) @@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page, return ret; } -static void release_user_pages(struct page **pages, int pages_count, - int is_write, s32 consumed_size) -{ - int i; - - for (i = 0; i < pages_count; i++) { - if (!is_write && consumed_size > 0) - set_page_dirty(pages[i]); - put_page(pages[i]); - } -} - /* Populate the call parameters, merging adjacent pages together */ static void populate_rw_params(struct page **pages, int pages_count, @@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, *consumed_size = pipe->command_buffer->rw_params.consumed_size; - release_user_pages(pipe->pages, pages_count, is_write, *consumed_size); + put_user_pages_dirty_lock(pipe->pages, pages_count, + !is_write && *consumed_size > 0); mutex_unlock(&pipe->lock); return 0; -- 2.23.0
[PATCH 07/19] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
Convert infiniband to use the new wrapper calls, and stop explicitly setting FOLL_LONGTERM at the call sites. The new pin_longterm_*() calls replace get_user_pages*() calls, and set both FOLL_LONGTERM and a new FOLL_PIN flag. The FOLL_PIN flag requires that the caller must return the pages via put_user_page*() calls, but infiniband was already doing that as part of an earlier commit. Signed-off-by: John Hubbard --- drivers/infiniband/core/umem.c | 5 ++--- drivers/infiniband/core/umem_odp.c | 10 +- drivers/infiniband/hw/hfi1/user_pages.c | 4 ++-- drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-- drivers/infiniband/hw/qib/qib_user_pages.c | 8 drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +- drivers/infiniband/hw/usnic/usnic_uiom.c| 9 - drivers/infiniband/sw/siw/siw_mem.c | 5 ++--- 8 files changed, 21 insertions(+), 25 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 24244a2f68cc..c5a78d3e674b 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr, while (npages) { down_read(&mm->mmap_sem); - ret = get_user_pages(cur_base, + ret = pin_longterm_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), -gup_flags | FOLL_LONGTERM, -page_list, NULL); +gup_flags, page_list, NULL); if (ret < 0) { up_read(&mm->mmap_sem); goto umem_release; diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 163ff7ba92b7..a38b67b83db5 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -534,7 +534,7 @@ static int ib_umem_odp_map_dma_single_page( } else if (umem_odp->page_list[page_index] == page) { umem_odp->dma_list[page_index] |= access_mask; } else { - pr_err("error: got different pages in IB device and from get_user_pages. IB device page: %p, gup page: %p\n", + pr_err("error: got different pages in IB device and from pin_longterm_pages. IB device page: %p, gup page: %p\n", umem_odp->page_list[page_index], page); /* Better remove the mapping now, to prevent any further * damage. */ @@ -639,11 +639,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, /* * Note: this might result in redundent page getting. We can * avoid this by checking dma_list to be 0 before calling -* get_user_pages. However, this make the code much more -* complex (and doesn't gain us much performance in most use -* cases). +* pin_longterm_pages. However, this makes the code much +* more complex (and doesn't gain us much performance in most +* use cases). */ - npages = get_user_pages_remote(owning_process, owning_mm, + npages = pin_longterm_pages_remote(owning_process, owning_mm, user_virt, gup_num_pages, flags, local_page_list, NULL, NULL); up_read(&owning_mm->mmap_sem); diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c index 469acb961fbd..9b55b0a73e29 100644 --- a/drivers/infiniband/hw/hfi1/user_pages.c +++ b/drivers/infiniband/hw/hfi1/user_pages.c @@ -104,9 +104,9 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np bool writable, struct page **pages) { int ret; - unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0); + unsigned int gup_flags = (writable ? 
FOLL_WRITE : 0); - ret = get_user_pages_fast(vaddr, npages, gup_flags, pages); + ret = pin_longterm_pages_fast(vaddr, npages, gup_flags, pages); if (ret < 0) return ret; diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index edccfd6e178f..beec7e4b8a96 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, goto out; } - ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, - FOLL_WRITE | FOLL_LONGTERM, pages); + ret = pin_longterm_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages); if (ret < 0)
[PATCH 05/19] mm/gup: introduce pin_user_pages*() and FOLL_PIN
Introduce pin_user_pages*() variations of get_user_pages*() calls, and also pin_longterm_pages*() variations. These variants all set FOLL_PIN, which is also introduced, and basically documented. (An upcoming patch provides more extensive documentation.) The second set (pin_longterm*) also sets FOLL_LONGTERM: pin_user_pages() pin_user_pages_remote() pin_user_pages_fast() pin_longterm_pages() pin_longterm_pages_remote() pin_longterm_pages_fast() All pages that are pinned via the above calls, must be unpinned via put_user_page(). The underlying rules are: * These are gup-internal flags, so the call sites should not directly set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with assertions, for the new FOLL_PIN flag. However, for the pre-existing FOLL_LONGTERM flag, which has some call sites that still directly set FOLL_LONGTERM, there is no assertion yet. * Call sites that want to indicate that they are going to do DirectIO ("DIO") or something with similar characteristics, should call a get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers will: * Start with "pin_user_pages" instead of "get_user_pages". That makes it easy to find and audit the call sites. * Set FOLL_PIN * For pages that are received via FOLL_PIN, those pages must be returned via put_user_page(). Signed-off-by: John Hubbard --- include/linux/mm.h | 53 - mm/gup.c | 284 + 2 files changed, 311 insertions(+), 26 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index cc292273e6ba..62c838a3e6c7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked); +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked); long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas); +long pin_user_pages(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); +long pin_longterm_pages(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); long get_user_pages_locked(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, int *locked); long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, @@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); +int pin_user_pages_fast(unsigned long start, int nr_pages, + unsigned int gup_flags, struct page **pages); +int pin_longterm_pages_fast(unsigned long start, int nr_pages, + unsigned int gup_flags, struct page **pages); int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, @@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long 
address, #define FOLL_ANON 0x8000 /* don't do file mappings */ #define FOLL_LONGTERM 0x1 /* mapping lifetime is indefinite: see below */ #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */ +#define FOLL_PIN 0x4 /* pages must be released via put_user_page() */ /* - * NOTE on FOLL_LONGTERM: + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each + * other. Here is what they mean, and how to use them: * * FOLL_LONGTERM indicates that the page will be held for an indefinite time - * period _often_ under userspace control. This is contrasted with - * iov_iter_get_pages() where usages which are transient. + * period _often_ under userspace control. This is in contrast to + * iov_iter_get_pages(), where usages which are transient. * * FIXME: For pages
[PATCH 08/19] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
Convert process_vm_access to use the new pin_user_pages_remote() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Also, release the pages via put_user_page*(). Also, rename "pages" to "pinned_pages", as this makes for easier reading of process_vm_rw_single_vec(). Signed-off-by: John Hubbard --- mm/process_vm_access.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7bef6c0..fd20ab675b85 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages, if (copy > len) copy = len; - if (vm_write) { + if (vm_write) copied = copy_page_from_iter(page, offset, copy, iter); - set_page_dirty_lock(page); - } else { + else copied = copy_page_to_iter(page, offset, copy, iter); - } + len -= copied; if (copied < copy && iov_iter_count(iter)) return -EFAULT; @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr, flags |= FOLL_WRITE; while (!rc && nr_pages && iov_iter_count(iter)) { - int pages = min(nr_pages, max_pages_per_loop); + int pinned_pages = min(nr_pages, max_pages_per_loop); int locked = 1; size_t bytes; @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr, * current/current->mm */ down_read(&mm->mmap_sem); - pages = get_user_pages_remote(task, mm, pa, pages, flags, - process_pages, NULL, &locked); + pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages, +flags, process_pages, +NULL, &locked); if (locked) up_read(&mm->mmap_sem); - if (pages <= 0) + if (pinned_pages <= 0) return -EFAULT; - bytes = pages * PAGE_SIZE - start_offset; + bytes = pinned_pages * PAGE_SIZE - start_offset; if (bytes > len) bytes = len; @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr, vm_write); len -= bytes; start_offset = 0; - nr_pages -= pages; - pa += pages * PAGE_SIZE; - while (pages) - put_page(process_pages[--pages]); + nr_pages -= pinned_pages; + pa += pinned_pages * PAGE_SIZE; + + /* If vm_write is set, the pages need to be made dirty: */ + put_user_pages_dirty_lock(process_pages, pinned_pages, + vm_write); } return rc; -- 2.23.0
[PATCH 10/19] fs/io_uring: set FOLL_PIN via pin_longterm_pages()
Convert fs/io_uring to use the new pin_longterm_pages() call, which sets FOLL_PIN and FOLL_LONGTERM. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). Signed-off-by: John Hubbard --- fs/io_uring.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index a30c4f622cb3..d3924b1760eb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3431,9 +3431,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, ret = 0; down_read(&current->mm->mmap_sem); - pret = get_user_pages(ubuf, nr_pages, - FOLL_WRITE | FOLL_LONGTERM, - pages, vmas); + pret = pin_longterm_pages(ubuf, nr_pages, FOLL_WRITE, pages, + vmas); if (pret == nr_pages) { /* don't support file backed memory */ for (j = 0; j < nr_pages; j++) { -- 2.23.0
[PATCH 09/19] drm/via: set FOLL_PIN via pin_user_pages_fast()
Convert drm/via to use the new pin_user_pages_fast() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). Signed-off-by: John Hubbard --- drivers/gpu/drm/via/via_dmablit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c index 3db000aacd26..37c5e572993a 100644 --- a/drivers/gpu/drm/via/via_dmablit.c +++ b/drivers/gpu/drm/via/via_dmablit.c @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer) vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages)); if (NULL == vsg->pages) return -ENOMEM; - ret = get_user_pages_fast((unsigned long)xfer->mem_addr, + ret = pin_user_pages_fast((unsigned long)xfer->mem_addr, vsg->num_pages, vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0, vsg->pages); -- 2.23.0
[PATCH 11/19] net/xdp: set FOLL_PIN via pin_longterm_pages()
Convert net/xdp to use the new pin_longterm_pages() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Signed-off-by: John Hubbard --- net/xdp/xdp_umem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 16d5f353163a..4d56dfb1139a 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -285,8 +285,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem) return -ENOMEM; down_read(&current->mm->mmap_sem); - npgs = get_user_pages(umem->address, umem->npgs, - gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL); + npgs = pin_longterm_pages(umem->address, umem->npgs, gup_flags, + &umem->pgs[0], NULL); up_read(&current->mm->mmap_sem); if (npgs != umem->npgs) { -- 2.23.0
[PATCH 12/19] mm/gup: track FOLL_PIN pages
Add tracking of pages that were pinned via FOLL_PIN. As mentioned in the FOLL_PIN documentation, callers who effectively set FOLL_PIN are required to ultimately free such pages via put_user_page(). The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use". Pages that have been pinned via FOLL_PIN are identifiable via a new function call: bool page_dma_pinned(struct page *page); What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1]. This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). This also has a couple of trivial, non-functional change fixes to try_get_compound_head(). That function got moved to the top of the file. This includes the following fix from Ira Weiny: DAX requires detection of a page crossing to a ref count of 1. Fix this for GUP pages by introducing put_devmap_managed_user_page() which accounts for GUP_PIN_COUNTING_BIAS now used by GUP. [1] https://lwn.net/Articles/784574/ "Some slow progress on get_user_pages()" Suggested-by: Jan Kara Suggested-by: Jérôme Glisse Signed-off-by: Ira Weiny Signed-off-by: John Hubbard --- include/linux/mm.h | 80 +++ include/linux/mmzone.h | 2 + include/linux/page_ref.h | 10 ++ mm/gup.c | 213 +++ mm/huge_memory.c | 32 +- mm/hugetlb.c | 28 - mm/memremap.c| 4 +- mm/vmstat.c | 2 + 8 files changed, 300 insertions(+), 71 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 62c838a3e6c7..882fda919c81 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page) #endif #ifdef CONFIG_DEV_PAGEMAP_OPS -void __put_devmap_managed_page(struct page *page); +void __put_devmap_managed_page(struct page *page, int count); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool put_devmap_managed_page(struct page *page) + +static inline bool page_is_devmap_managed(struct page *page) { if (!static_branch_unlikely(&devmap_managed_key)) return false; @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page) switch (page->pgmap->type) { case MEMORY_DEVICE_PRIVATE: case MEMORY_DEVICE_FS_DAX: - __put_devmap_managed_page(page); return true; default: break; @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page) return false; } +static inline bool put_devmap_managed_page(struct page *page) +{ + bool is_devmap = page_is_devmap_managed(page); + + if (is_devmap) { + int count = page_ref_dec_return(page); + + __put_devmap_managed_page(page, count); + } + + return is_devmap; +} + #else /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool put_devmap_managed_page(struct page *page) { @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page) return true; } +__must_check bool user_page_ref_inc(struct page *page); + static inline void put_page(struct page *page) { page = compound_head(page); @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page) __put_page(page); } -/** - * put_user_page() - release a gup-pinned page - * @page:pointer to page to be released +/* + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload + * the page's refcount so that two separate items are tracked: the original page + * reference count, and also a new count of how many get_user_pages() calls were + * made against the page. ("gup-pinned" is another term for the latter). 
+ * + * With this scheme, get_user_pages() becomes special: such pages are marked + * as distinct from normal pages. As such, the new put_user_page() call (and + * its variants) must be used in order to release gup-pinned pages. + * + * Choice of value: * - * Pages that were pinned via get_user_pages*() must be released via - * either put_user_page(), or one of the put_user_pages*() routines - * below. This is so that eventually, pages that are pinned via - * get_user_pages*() can be separately tracked and uniquely handled. In - * particular, interactions with RDMA and filesystems need special - * handling. + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference + * counts with respect to get_user_pages() and put_user_page() becomes simpler, + * due to the fact that adding an even power of two to the page refcount has + * the effect of using only the upper N bits, for the code that counts up using + * the bias value. This means that the lower bits are left for the exclusive + * use of the original code that increments and decrements by one (or at least, + * by much smaller values than the bias value). * -
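A simplified sketch of the biased-refcount scheme described in the comment above (illustrative only; the *_sketch helpers are hypothetical and ignore compound pages, devmap pages, and overflow handling): pinning adds GUP_PIN_COUNTING_BIAS to page->_refcount, and a page is reported as dma-pinned whenever its refcount is at least that large.

#define GUP_PIN_COUNTING_BIAS	1024	/* 2^10, the value chosen in this patch */

static bool page_dma_pinned_sketch(struct page *page)
{
	/* False positives are possible and acceptable; false negatives are not. */
	return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
}

static bool user_page_ref_inc_sketch(struct page *page)
{
	/* Pinning counts up by the bias value instead of by one. */
	page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	return true;
}

static void put_user_page_sketch(struct page *page)
{
	/* Unpinning counts back down by the same bias value. */
	page_ref_sub(page, GUP_PIN_COUNTING_BIAS);
}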
[PATCH 13/19] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN. 2. Because all FOLL_PIN-acquired pages must be released via put_user_page(), also convert the put_page() call over to put_user_pages_dirty_lock(). Cc: Mauro Carvalho Chehab Signed-off-by: John Hubbard --- drivers/media/v4l2-core/videobuf-dma-sg.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 28262190c3ab..9b9c5b37bf59 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", data, size, dma->nr_pages); - err = get_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages, +flags, dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; - dprintk(1, "get_user_pages: err=%d [%d]\n", err, + dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err, dma->nr_pages); return err < 0 ? err : -EINVAL; } @@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) { - if (dma->direction == DMA_FROM_DEVICE) - set_page_dirty_lock(dma->pages[i]); - put_page(dma->pages[i]); - } + put_user_pages_dirty_lock(dma->pages, dma->nr_pages, + dma->direction == DMA_FROM_DEVICE); kfree(dma->pages); dma->pages = NULL; } -- 2.23.0
[PATCH 14/19] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
This also fixes one or two likely bugs. 1. Change vfio from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN. Note that this is a change in behavior, because the get_user_pages_remote() call was not setting FOLL_LONGTERM, but the new pin_longterm_pages_remote() call that replaces it *is* setting FOLL_LONGTERM. It is important to set FOLL_LONGTERM, because the DMA case requires it. Please see the FOLL_PIN documentation in include/linux/mm.h, and Documentation/pin_user_pages.rst for details. 2. Because all FOLL_PIN-acquired pages must be released via put_user_page(), also convert the put_page() call over to put_user_pages(). Note that this effectively changes the code's behavior in vfio_iommu_type1.c: put_pfn(): it now ultimately calls set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Cc: Alex Williamson Signed-off-by: John Hubbard --- drivers/vfio/vfio_iommu_type1.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d864277ea16f..795e13f3ef08 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot) { if (!is_invalid_reserved_pfn(pfn)) { struct page *page = pfn_to_page(pfn); - if (prot & IOMMU_WRITE) - SetPageDirty(page); - put_page(page); + + put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE); return 1; } return 0; @@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, down_read(&mm->mmap_sem); if (mm == current->mm) { - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, -vmas); + ret = pin_longterm_pages(vaddr, 1, flags, page, vmas); } else { - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page, - vmas, NULL); + ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1, + flags, page, vmas, + NULL); /* * The lifetime of a vaddr_get_pfn() page pin is * userspace-controlled. In the fs-dax case this could @@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, */ if (ret > 0 && vma_is_fsdax(vmas[0])) { ret = -EOPNOTSUPP; - put_page(page[0]); + put_user_page(page[0]); } } up_read(&mm->mmap_sem); -- 2.23.0
[PATCH 18/19] mm/gup: remove support for gup(FOLL_LONGTERM)
Now that all other kernel callers of get_user_pages(FOLL_LONGTERM) have been converted to pin_longterm_pages(), lock it down: 1) Add an assertion to get_user_pages(), preventing callers from passing FOLL_LONGTERM (in addition to the existing assertion that prevents FOLL_PIN). 2) Remove the associated GUP_LONGTERM_BENCHMARK test. Signed-off-by: John Hubbard --- mm/gup.c | 8 mm/gup_benchmark.c | 9 + tools/testing/selftests/vm/gup_benchmark.c | 7 ++- 3 files changed, 7 insertions(+), 17 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index e51b3820a995..9a28935a2cb1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1744,11 +1744,11 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, struct vm_area_struct **vmas) { /* -* As detailed above, FOLL_PIN must only be set internally by the -* pin_user_page*() and pin_longterm_*() APIs, never directly by the -* caller, so enforce that with an assertion: +* As detailed above, FOLL_PIN and FOLL_LONGTERM must only be set +* internally by the pin_user_page*() and pin_longterm_*() APIs, never +* directly by the caller, so enforce that with an assertion: */ - if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM))) return -EINVAL; return __gup_longterm_locked(current, current->mm, start, nr_pages, diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c index 2bb0f5df4803..de6941855b7e 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_benchmark.c @@ -6,7 +6,7 @@ #include #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +/* Command 2 has been deleted. */ #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) #define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) #define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) @@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages) switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: for (i = 0; i < nr_pages; i++) put_page(pages[i]); @@ -94,11 +93,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages_fast(addr, nr, gup->flags & 1, pages + i); break; - case GUP_LONGTERM_BENCHMARK: - nr = get_user_pages(addr, nr, - (gup->flags & 1) | FOLL_LONGTERM, - pages + i, NULL); - break; case GUP_BENCHMARK: nr = get_user_pages(addr, nr, gup->flags & 1, pages + i, NULL); @@ -157,7 +151,6 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: case PIN_FAST_BENCHMARK: case PIN_LONGTERM_BENCHMARK: diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c index c5c934c0f402..5ef3cf8f3da5 100644 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ b/tools/testing/selftests/vm/gup_benchmark.c @@ -15,7 +15,7 @@ #define PAGE_SIZE sysconf(_SC_PAGESIZE) #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +/* Command 2 has been deleted. 
*/ #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) /* @@ -46,7 +46,7 @@ int main(int argc, char **argv) char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) { switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -72,9 +72,6 @@ int main(int argc, char **argv) case 'T': thp = 0; break; - case 'L': - cmd = GUP_LONGTERM_BENCHMARK; - break; case 'U': cmd = GUP_BENCHMARK; break; -- 2.23.0
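For contrast with the assertion added above, the wrapper side looks roughly like this (an approximation of what pin_longterm_pages() does in mm/gup.c, not a verbatim copy): the wrapper, never the caller, supplies the internal-only flags and then calls into the gup core via __gup_longterm_locked():

static long pin_longterm_pages_sketch(unsigned long start, unsigned long nr_pages,
				      unsigned int gup_flags, struct page **pages,
				      struct vm_area_struct **vmas)
{
	/* FOLL_PIN and FOLL_GET are mutually exclusive for a given call. */
	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
		return -EINVAL;

	/* The internal-only flags are added here, never at the call site. */
	gup_flags |= FOLL_PIN | FOLL_LONGTERM;

	return __gup_longterm_locked(current, current->mm, start, nr_pages,
				     pages, vmas, gup_flags);
}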
[PATCH 16/19] mm/gup_benchmark: support pin_user_pages() and related calls
Up until now, gup_benchmark supported testing of the following kernel functions: * get_user_pages(): via the '-U' command line option * get_user_pages_longterm(): via the '-L' command line option * get_user_pages_fast(): as the default (no options required) Add test coverage for the new corresponding pin_*() functions: * pin_user_pages(): via the '-c' command line option * pin_longterm_pages(): via the '-b' command line option * pin_user_pages_fast(): via the '-a' command line option Also, add an option for clarity: '-u' for what is now (still) the default choice: get_user_pages_fast(). Also, for the three commands that set FOLL_PIN, verify that the pages really are dma-pinned, via the new page_dma_pinned() routine. Those commands are: PIN_FAST_BENCHMARK : calls pin_user_pages_fast() PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages() PIN_BENCHMARK : calls pin_user_pages() In between the calls to pin_*() and put_user_pages(), check each page: if page_dma_pinned() returns false, then WARN and return. Do this outside of the benchmark timestamps, so that it doesn't affect reported times. Signed-off-by: John Hubbard --- mm/gup_benchmark.c | 74 -- tools/testing/selftests/vm/gup_benchmark.c | 23 ++- 2 files changed, 91 insertions(+), 6 deletions(-) diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c index 7dd602d7f8db..2bb0f5df4803 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_benchmark.c @@ -8,6 +8,9 @@ #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) +#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) +#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) +#define PIN_BENCHMARK _IOWR('g', 6, struct gup_benchmark) struct gup_benchmark { __u64 get_delta_usec; @@ -19,6 +22,44 @@ struct gup_benchmark { __u64 expansion[10];/* For future use */ }; +static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages) +{ + int i; + + switch (cmd) { + case GUP_FAST_BENCHMARK: + case GUP_LONGTERM_BENCHMARK: + case GUP_BENCHMARK: + for (i = 0; i < nr_pages; i++) + put_page(pages[i]); + break; + + case PIN_FAST_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: + case PIN_BENCHMARK: + put_user_pages(pages, nr_pages); + break; + } +} + +static void verify_dma_pinned(int cmd, struct page **pages, + unsigned long nr_pages) +{ + int i; + + switch (cmd) { + case PIN_FAST_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: + case PIN_BENCHMARK: + for (i = 0; i < nr_pages; i++) { + if (WARN(!page_dma_pinned(pages[i]), +"pages[%d] is NOT dma-pinned\n", i)) + break; + } + break; + } +} + static int __gup_benchmark_ioctl(unsigned int cmd, struct gup_benchmark *gup) { @@ -62,6 +103,19 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages(addr, nr, gup->flags & 1, pages + i, NULL); break; + case PIN_FAST_BENCHMARK: + nr = pin_user_pages_fast(addr, nr, gup->flags & 1, +pages + i); + break; + case PIN_LONGTERM_BENCHMARK: + nr = pin_longterm_pages(addr, nr, + (gup->flags & 1), + pages + i, NULL); + break; + case PIN_BENCHMARK: + nr = pin_user_pages(addr, nr, gup->flags & 1, pages + i, + NULL); + break; default: return -1; } @@ -72,15 +126,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd, } end_time = ktime_get(); + /* Shifting the meaning of nr_pages: now it is actual number pinned: */ + nr_pages = i; + gup->get_delta_usec = ktime_us_delta(end_time, start_time); gup->size = addr - gup->addr; + /* +* Take an un-benchmark-timed moment to
verify DMA pinned +* state: print a warning if any non-dma-pinned pages are found: +*/ + verify_dma_pinned(cmd, pages, nr_pages); + start_time = ktime_get(); - for (i = 0; i < nr_pages; i++) { - if (!pages[i]) - break; - put_page(pages[i]); - } + + put_back_pages(cmd, pages, nr_pages); + end_ti
[PATCH 15/19] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
1. Convert from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages(). 2. As required by pin_user_pages(), release these pages via put_user_page(). In this case, do so via put_user_pages_dirty_lock(). That has the side effect of calling set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] 3. Release each page in mem->hpages[] (instead of mem->hpas[]), because that is the array that pin_longterm_pages() filled in. This is more accurate and should be a little safer from a maintenance point of view. [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Signed-off-by: John Hubbard --- arch/powerpc/mm/book3s64/iommu_api.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c index 56cc84520577..69d79cb50d47 100644 --- a/arch/powerpc/mm/book3s64/iommu_api.c +++ b/arch/powerpc/mm/book3s64/iommu_api.c @@ -103,9 +103,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, for (entry = 0; entry < entries; entry += chunk) { unsigned long n = min(entries - entry, chunk); - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n, - FOLL_WRITE | FOLL_LONGTERM, - mem->hpages + entry, NULL); + ret = pin_longterm_pages(ua + (entry << PAGE_SHIFT), n, +FOLL_WRITE, mem->hpages + entry, NULL); if (ret == n) { pinned += n; continue; @@ -167,9 +166,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, return 0; free_exit: - /* free the reference taken */ - for (i = 0; i < pinned; i++) - put_page(mem->hpages[i]); + /* free the references taken */ + put_user_pages(mem->hpages, pinned); vfree(mem->hpas); kfree(mem); @@ -212,10 +210,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem) if (!page) continue; - if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY) - SetPageDirty(page); + put_user_pages_dirty_lock(&mem->hpages[i], 1, + MM_IOMMU_TABLE_GROUP_PAGE_DIRTY); - put_page(page); mem->hpas[i] = 0; } } -- 2.23.0
[PATCH 19/19] Documentation/vm: add pin_user_pages.rst
Document the new pin_user_pages() and related calls and behavior. Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this documentation. (I've reworded it and expanded on it slightly.) Cc: Jonathan Corbet Signed-off-by: John Hubbard --- Documentation/vm/index.rst | 1 + Documentation/vm/pin_user_pages.rst | 213 2 files changed, 214 insertions(+) create mode 100644 Documentation/vm/pin_user_pages.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index e8d943b21cf9..7194efa3554a 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -44,6 +44,7 @@ descriptions of data structures and algorithms. page_migration page_frags page_owner + pin_user_pages remap_file_pages slub split_page_table_lock diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst new file mode 100644 index ..7110bca3f188 --- /dev/null +++ b/Documentation/vm/pin_user_pages.rst @@ -0,0 +1,213 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +pin_user_pages() and related calls + + +.. contents:: :local: + +Overview + + +This document describes the following functions: :: + + pin_user_pages + pin_user_pages_fast + pin_user_pages_remote + + pin_longterm_pages + pin_longterm_pages_fast + pin_longterm_pages_remote + +Basic description of FOLL_PIN += + +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has +significant interactions and interdependencies with FOLL_LONGTERM, so both are +covered here. + +Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither +FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows +the associated wrapper functions (pin_user_pages and others) to set the correct +combination of these flags, and to check for problems as well. + +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, +multiple threads and call sites are free to pin the same struct pages, via both +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the +other, not the struct page(s). + +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN +uses a different reference counting technique. + +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, +FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN. + +Which flags are set by each wrapper +=== + +Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to +whatever flags the caller provides:: + + Function gup flags (FOLL_PIN or FOLL_LONGTERM only) + -- + pin_user_pages FOLL_PIN + pin_user_pages_fast FOLL_PIN + pin_user_pages_remote FOLL_PIN + + pin_longterm_pages FOLL_PIN | FOLL_LONGTERM + pin_longterm_pages_fast FOLL_PIN | FOLL_LONGTERM + pin_longterm_pages_remote FOLL_PIN | FOLL_LONGTERM + +Tracking dma-pinned pages += + +Some of the key design constraints, and solutions, for tracking dma-pinned +pages: + +* An actual reference count, per struct page, is required. This is because + multiple processes may pin and unpin a page. + +* False positives (reporting that a page is dma-pinned, when in fact it is not) + are acceptable, but false negatives are not. + +* struct page may not be increased in size for this, and all fields are already + used. + +* Given the above, we can overload the page->_refcount field by using, sort of, + the upper bits in that field for a dma-pinned count.
"Sort of", means that, + rather than dividing page->_refcount into bit fields, we simple add a medium- + large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to + page->_refcount. This provides fuzzy behavior: if a page has get_page() called + on it 1024 times, then it will appear to have a single dma-pinned count. + And again, that's acceptable. + +This also leads to limitations: there are only 32-10==22 bits available for a +counter that increments 10 bits at a time. + +TODO: for 1GB and larger huge pages, this is cutting it close. That's because +when pin_user_pages() follows such pages, it increments the head page by "1" +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for +pin_user_pages()) for each tail page. So if you have a 1GB huge page: + +* There are 256K (18 bits) worth of 4 KB tail pages. +* There are 22 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, + 10 bits at a time) +* There are 22 - 18 == 4 bits available to count. Except that there aren't, + because you need to allow for a few normal get_page() calls on the head page, + as well. Fortunately, the approach of us