Re: [PATCH v3 4/8] PCI: replace pci_dev::driver usage that gets the driver name

2021-08-12 Thread Christoph Hellwig
On Wed, Aug 11, 2021 at 10:06:33AM +0200, Uwe Kleine-K??nig wrote:
>  static inline const char *eeh_driver_name(struct pci_dev *pdev)
>  {
> - return (pdev && pdev->driver) ? pdev->driver->name : "";
> + const char *drvstr = pdev ? dev_driver_string(&pdev->dev) : "";
> +
> + if (*drvstr == '\0')
> + return "";
> +
> + return drvstr;

This looks rather obsfucated due to the fact that dev_driver_string
never returns '\0', and due to the strange mix of a tenary operation
and the if on a related condition.


>  }
>  
>  #endif /* CONFIG_EEH */
> diff --git a/drivers/bcma/host_pci.c b/drivers/bcma/host_pci.c
> index 69c10a7b7c61..dc2ffa686964 100644
> --- a/drivers/bcma/host_pci.c
> +++ b/drivers/bcma/host_pci.c
> @@ -175,9 +175,10 @@ static int bcma_host_pci_probe(struct pci_dev *dev,
>   if (err)
>   goto err_kfree_bus;
>  
> - name = dev_name(&dev->dev);
> - if (dev->driver && dev->driver->name)
> - name = dev->driver->name;
> + name = dev_driver_string(&dev->dev);
> + if (*name == '\0')
> + name = dev_name(&dev->dev);

Where does this '\0' check come from?

> +
> + name = dev_driver_string(&dev->dev);
> + if (*name == '\0')
> + name = dev_name(&dev->dev);
> +

More of this weirdness.


Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-12 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit :
>> Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   arch/powerpc/mm/book3s64/radix_tlb.c | 48 
>>   1 file changed, 48 insertions(+)
>> 
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
>> b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..5cca0fe130e7 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -17,6 +17,7 @@
...
>> +
>> +static int __init create_tlb_single_page_flush_ceiling(void)
>> +{
>> +debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
>> +powerpc_debugfs_root, NULL, &fops_tlbflush);
>
> Could you just use debugfs_create_u32() instead of re-implementing simple 
> read and write ?

Yeah AFAICS that should work fine.

It could probably even be a u16?

cheers


Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Xianting TIan



在 2021/8/6 下午10:51, Arnd Bergmann 写道:

On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian
 wrote:

@@ -163,6 +155,13 @@ static void hvc_console_print(struct console *co, const 
char *b,
 if (vtermnos[index] == -1)
 return;

+   list_for_each_entry(hp, &hvc_structs, next)
+   if (hp->vtermno == vtermnos[index])
+   break;
+
+   c = hp->c;
+
+   spin_lock_irqsave(&hp->c_lock, flags);

The loop looks like it might race against changes to the list. It seems strange
that the print function has to actually search for the structure here.

It may be better to have yet another array for the buffer pointers next to
the cons_ops[] and vtermnos[] arrays.


+/*
+ * These sizes are most efficient for vio, because they are the
+ * native transfer size. We could make them selectable in the
+ * future to better deal with backends that want other buffer sizes.
+ */
+#define N_OUTBUF   16
+#define N_INBUF16
+
+#define __ALIGNED__ __attribute__((__aligned__(sizeof(long

I think you need a higher alignment for DMA buffers, instead of sizeof(long),
I would suggest ARCH_DMA_MINALIGN.


As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think 
it 's better remain the code unchanged,


I will send v5 patch soon.



Arnd


Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-12 Thread Aneesh Kumar K.V

On 8/12/21 12:58 PM, Michael Ellerman wrote:

Christophe Leroy  writes:

Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit :

Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.

Signed-off-by: Aneesh Kumar K.V 
---
   arch/powerpc/mm/book3s64/radix_tlb.c | 48 
   1 file changed, 48 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..5cca0fe130e7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -17,6 +17,7 @@

...

+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+   debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+   powerpc_debugfs_root, NULL, &fops_tlbflush);


Could you just use debugfs_create_u32() instead of re-implementing simple read 
and write ?


Yeah AFAICS that should work fine.

It could probably even be a u16?



I was looking at switching all that to u64. Should i fallback to u16, 
considering a tlb_signle_page_flush_ceiling value larger that 2**16 
doesn't make sense?


-aneesh



Re: [PATCH v3 4/8] PCI: replace pci_dev::driver usage that gets the driver name

2021-08-12 Thread Uwe Kleine-König
On Thu, Aug 12, 2021 at 08:07:20AM +0100, Christoph Hellwig wrote:
> On Wed, Aug 11, 2021 at 10:06:33AM +0200, Uwe Kleine-K??nig wrote:
> >  static inline const char *eeh_driver_name(struct pci_dev *pdev)
> >  {
> > -   return (pdev && pdev->driver) ? pdev->driver->name : "";
> > +   const char *drvstr = pdev ? dev_driver_string(&pdev->dev) : "";
> > +
> > +   if (*drvstr == '\0')
> > +   return "";
> > +
> > +   return drvstr;
> 
> This looks rather obsfucated due to the fact that dev_driver_string
> never returns '\0', and due to the strange mix of a tenary operation
> and the if on a related condition.

dev_driver_string() might return "" (via dev_bus_name()). If that happens
*drvstr == '\0' becomes true.

Would the following be better?:

const char *drvstr;

if (pdev)
return "";

drvstr = dev_driver_string(&pdev->dev);

if (!strcmp(drvstr, ""))
return "";

return drvstr;

When I thought about this hunk I considered it ugly to have "" in
it twice.

> >  }
> >  
> >  #endif /* CONFIG_EEH */
> > diff --git a/drivers/bcma/host_pci.c b/drivers/bcma/host_pci.c
> > index 69c10a7b7c61..dc2ffa686964 100644
> > --- a/drivers/bcma/host_pci.c
> > +++ b/drivers/bcma/host_pci.c
> > @@ -175,9 +175,10 @@ static int bcma_host_pci_probe(struct pci_dev *dev,
> > if (err)
> > goto err_kfree_bus;
> >  
> > -   name = dev_name(&dev->dev);
> > -   if (dev->driver && dev->driver->name)
> > -   name = dev->driver->name;
> > +   name = dev_driver_string(&dev->dev);
> > +   if (*name == '\0')
> > +   name = dev_name(&dev->dev);
> 
> Where does this '\0' check come from?

The original code is equivalent to

if (dev->driver && dev->driver->name)
name = dev->driver->name;
else:
name = dev_name(...);

As dev_driver_string() implements something like:

if (dev->driver && dev->driver->name)
return dev->driver->name;
else
return "";

the change looks fine to me. (One could wonder if it's sensible to fall
back to the device name if the driver has no nice name, but this isn't
new with my change.)

> > +   name = dev_driver_string(&dev->dev);
> > +   if (*name == '\0')
> > +   name = dev_name(&dev->dev);
> > +
> 
> More of this weirdness.

I admit it's not pretty. Would it help to use !strcmp(name, "")
instead of *name == '\0'? Any other constructive suggestion?

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature


[PATCH v5 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Xianting Tian
As well known, hvc backend can register its opertions to hvc backend.
the opertions contain put_chars(), get_chars() and so on.

Some hvc backend may do dma in its opertions. eg, put_chars() of
virtio-console. But in the code of hvc framework, it may pass DMA
incapable memory to put_chars() under a specific configuration, which
is explained in commit c4baad5029(virtio-console: avoid DMA from stack):
1, c[] is on stack,
   hvc_console_print():
char c[N_OUTBUF] __ALIGNED__;
cons_ops[index]->put_chars(vtermnos[index], c, i);
2, ch is on stack,
   static void hvc_poll_put_char(,,char ch)
   {
struct tty_struct *tty = driver->ttys[0];
struct hvc_struct *hp = tty->driver_data;
int n;

do {
n = hp->ops->put_chars(hp->vtermno, &ch, 1);
} while (n <= 0);
   }

Commit c4baad5029 is just the fix to avoid DMA from stack memory, which
is passed to virtio-console by hvc framework in above code. But I think
the fix is aggressive, it directly uses kmemdup() to alloc new buffer
from kmalloc area and do memcpy no matter the memory is in kmalloc area
or not. But most importantly, it should better be fixed in the hvc
framework, by changing it to never pass stack memory to the put_chars()
function in the first place. Otherwise, we still face the same issue if
a new hvc backend using dma added in the furture.

We make 'char c[N_OUTBUF]' part of 'struct hvc_struct', so hp->c is no
longer the stack memory. we can use it in above two cases.

Other cleanup is use ARCH_DMA_MINALIGN for align, and make 'hp->outbuf'
aligned and use struct_size() for the size parameter of kzalloc().

With the patch, we can remove the fix c4baad5029.

Signed-off-by: Xianting Tian 
Tested-by: Xianting Tian 
---
 drivers/tty/hvc/hvc_console.c | 40 +--
 drivers/tty/hvc/hvc_console.h | 16 --
 2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 5bb8c4e44..c56564eb7 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -41,16 +41,6 @@
  */
 #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
 
-/*
- * These sizes are most efficient for vio, because they are the
- * native transfer size. We could make them selectable in the
- * future to better deal with backends that want other buffer sizes.
- */
-#define N_OUTBUF   16
-#define N_INBUF16
-
-#define __ALIGNED__ __attribute__((__aligned__(sizeof(long
-
 static struct tty_driver *hvc_driver;
 static struct task_struct *hvc_task;
 
@@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp)
 static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES];
 static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
{[0 ... MAX_NR_HVC_CONSOLES - 1] = -1};
+static char *cons_outbuf[MAX_NR_HVC_CONSOLES];
 
 /*
  * Console APIs, NOT TTY.  These APIs are available immediately when
@@ -151,18 +142,23 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
 static void hvc_console_print(struct console *co, const char *b,
  unsigned count)
 {
-   char c[N_OUTBUF] __ALIGNED__;
+   char *c;
unsigned i = 0, n = 0;
int r, donecr = 0, index = co->index;
+   unsigned long flags;
+   struct hvc_struct *hp;
 
/* Console access attempt outside of acceptable console range. */
if (index >= MAX_NR_HVC_CONSOLES)
return;
 
/* This console adapter was removed so it is not usable. */
-   if (vtermnos[index] == -1)
+   if (vtermnos[index] == -1 || !cons_outbuf[index])
return;
 
+   c = cons_outbuf[index];
+
+   spin_lock_irqsave(&hp->c_lock, flags);
while (count > 0 || i > 0) {
if (count > 0 && i < sizeof(c)) {
if (b[n] == '\n' && !donecr) {
@@ -191,6 +187,7 @@ static void hvc_console_print(struct console *co, const 
char *b,
}
}
}
+   spin_unlock_irqrestore(&hp->c_lock, flags);
hvc_console_flush(cons_ops[index], vtermnos[index]);
 }
 
@@ -878,9 +875,19 @@ static void hvc_poll_put_char(struct tty_driver *driver, 
int line, char ch)
struct tty_struct *tty = driver->ttys[0];
struct hvc_struct *hp = tty->driver_data;
int n;
+   unsigned long flags;
+   char *c;
+
+   if (!hp || !cons_outbuf[hp->index])
+   return;
+
+   c = cons_outbuf[hp->index];
 
do {
-   n = hp->ops->put_chars(hp->vtermno, &ch, 1);
+   spin_lock_irqsave(&hp->c_lock, flags);
+   c[0] = ch;
+   n = hp->ops->put_chars(hp->vtermno, c, 1);
+   spin_unlock_irqrestore(&hp->c_lock, flags);
} while (n <= 0);
 }
 #endif
@@ -922,8 +929,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
return ERR_PTR(err);
}
 
-   hp = kzalloc(ALIGN(size

[PATCH v5 0/2] make hvc pass dma capable memory to its backend

2021-08-12 Thread Xianting Tian
Dear all,

This patch series make hvc framework pass DMA capable memory to
put_chars() of hvc backend(eg, virtio-console), and revert commit
c4baad5029 ("virtio-console: avoid DMA from stack”)

V1
virtio-console: avoid DMA from vmalloc area
https://lkml.org/lkml/2021/7/27/494

For v1 patch, Arnd Bergmann suggests to fix the issue in the first
place:
Make hvc pass DMA capable memory to put_chars()
The fix suggestion is included in v2.

V2
[PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/1/8
[PATCH 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/1/9

For v2 patch, Arnd Bergmann suggests to make new buf part of the
hvc_struct structure, and fix the compile issue.
The fix suggestion is included in v3.

V3
[PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/3/1347
[PATCH v3 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/3/1348

For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of
hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to
calculate the size of hvc_struct. The fix suggestion is included in
v4.

V4
[PATCH v4 0/2] make hvc pass dma capable memory to its backend
https://lkml.org/lkml/2021/8/5/1350
[PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/5/1351
[PATCH v4 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/5/1352

For v4 patch, Arnd Bergmann suggests to introduce another
array(cons_outbuf[]) for the buffer pointers next to the cons_ops[]
and vtermnos[] arrays. This fix included in this v5 patch.

drivers/char/virtio_console.c | 12 ++--
drivers/tty/hvc/hvc_console.c | 40 +--
drivers/tty/hvc/hvc_console.h | 16 --
3 file changed


[PATCH v5 2/2] virtio-console: remove unnecessary kmemdup()

2021-08-12 Thread Xianting Tian
hvc framework will never pass stack memory to the put_chars() function,
So the calling of kmemdup() is unnecessary, we can remove it.

This revert commit c4baad5029 ("virtio-console: avoid DMA from stack")

Signed-off-by: Xianting Tian 
---
 drivers/char/virtio_console.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7eaf303a7..4ed3ffb1d 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int 
count)
 {
struct port *port;
struct scatterlist sg[1];
-   void *data;
-   int ret;
 
if (unlikely(early_put_chars))
return early_put_chars(vtermno, buf, count);
@@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int 
count)
if (!port)
return -EPIPE;
 
-   data = kmemdup(buf, count, GFP_ATOMIC);
-   if (!data)
-   return -ENOMEM;
-
-   sg_init_one(sg, data, count);
-   ret = __send_to_port(port, sg, 1, count, data, false);
-   kfree(data);
-   return ret;
+   sg_init_one(sg, buf, count);
+   return __send_to_port(port, sg, 1, count, (void *)buf, false);
 }
 
 /*
-- 
2.17.1



Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Arnd Bergmann
On Thu, Aug 12, 2021 at 10:08 AM Xianting TIan
 wrote:
> 在 2021/8/6 下午10:51, Arnd Bergmann 写道:
> > On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian
> >> +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long
> > I think you need a higher alignment for DMA buffers, instead of 
> > sizeof(long),
> > I would suggest ARCH_DMA_MINALIGN.
>
> As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think
> it 's better remain the code unchanged,
>
> I will send v5 patch soon.

I think you could just use "L1_CACHE_BYTES" as the alignment in this case.
This will make the structure slightly larger for architectures that do not have
alignment constraints on DMA buffers, but using a smaller alignment is
clearly wrong. Another option would be to use ARCH_KMALLOC_MINALIGN.

Note that there is a patch to add ARCH_DMA_MINALIGN to riscv already,
as some implementations do not have coherent DMA. I had failed to
realized though that on x86 you do not get an ARCH_DMA_MINALIGN
definition.

   Arnd


Re: [PATCH v2 1/2] KVM: Refactor kvm_arch_vcpu_fault() to return a struct page pointer

2021-08-12 Thread David Hildenbrand

On 12.08.21 06:02, Hou Wenlong wrote:

From: Sean Christopherson 

Refactor kvm_arch_vcpu_fault() to return 'struct page *' instead of
'vm_fault_t' to simplify architecture specific implementations that do
more than return SIGBUS.  Currently this only applies to s390, but a
future patch will move x86's pio_data handling into x86 where it belongs.

No functional changed intended.

Cc: Hou Wenlong 
Signed-off-by: Sean Christopherson 
Signed-off-by: Hou Wenlong 
---
  arch/arm64/kvm/arm.c   |  4 ++--
  arch/mips/kvm/mips.c   |  4 ++--
  arch/powerpc/kvm/powerpc.c |  4 ++--
  arch/s390/kvm/kvm-s390.c   | 12 
  arch/x86/kvm/x86.c |  4 ++--
  include/linux/kvm_host.h   |  2 +-
  virt/kvm/kvm_main.c|  5 -
  7 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e9a2b8f27792..83f4ffe3e4f2 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -161,9 +161,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return ret;
  }
  
-vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)

+struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  {
-   return VM_FAULT_SIGBUS;
+   return NULL;
  }
  
  
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c

index af9dd029a4e1..ae79874e6fd2 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1053,9 +1053,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, 
struct kvm_fpu *fpu)
return -ENOIOCTLCMD;
  }
  
-vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)

+struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  {
-   return VM_FAULT_SIGBUS;
+   return NULL;
  }
  
  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index be33b5321a76..b9c21f9ab784 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -2090,9 +2090,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
return r;
  }
  
-vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)

+struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  {
-   return VM_FAULT_SIGBUS;
+   return NULL;
  }
  
  static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 02574d7b3612..e1b69833e228 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4979,17 +4979,13 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
return r;
  }
  
-vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)

+struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  {
  #ifdef CONFIG_KVM_S390_UCONTROL
-   if ((vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET)
-&& (kvm_is_ucontrol(vcpu->kvm))) {
-   vmf->page = virt_to_page(vcpu->arch.sie_block);
-   get_page(vmf->page);
-   return 0;
-   }
+   if (vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET && 
kvm_is_ucontrol(vcpu->kvm))
+   return virt_to_page(vcpu->arch.sie_block);
  #endif
-   return VM_FAULT_SIGBUS;
+   return NULL;
  }
  
  /* Section: memory related */

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3cedc7cc132a..1e3bbe5cd33a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5347,9 +5347,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
return r;
  }
  
-vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)

+struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  {
-   return VM_FAULT_SIGBUS;
+   return NULL;
  }
  
  static int kvm_vm_ioctl_set_tss_addr(struct kvm *kvm, unsigned long addr)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 492d183dd7d0..a949de534722 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -995,7 +995,7 @@ long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
  long kvm_arch_vcpu_ioctl(struct file *filp,
 unsigned int ioctl, unsigned long arg);
-vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
+struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
  
  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
  
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c

index 30d322519253..f7d21418971b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3448,7 +3448,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
&vcpu->dirty_ring,
vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
else
-   return kvm_arch_vcpu_fault(vcpu, vmf);
+   page = kvm_arch_vcpu_fault(vcpu, vmf);
+   if (!page)
+   

Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Xianting TIan



在 2021/8/12 下午4:54, Arnd Bergmann 写道:

On Thu, Aug 12, 2021 at 10:08 AM Xianting TIan
 wrote:

在 2021/8/6 下午10:51, Arnd Bergmann 写道:

On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian

+#define __ALIGNED__ __attribute__((__aligned__(sizeof(long

I think you need a higher alignment for DMA buffers, instead of sizeof(long),
I would suggest ARCH_DMA_MINALIGN.

As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think
it 's better remain the code unchanged,

I will send v5 patch soon.

I think you could just use "L1_CACHE_BYTES" as the alignment in this case.
This will make the structure slightly larger for architectures that do not have
alignment constraints on DMA buffers, but using a smaller alignment is
clearly wrong. Another option would be to use ARCH_KMALLOC_MINALIGN.

yes, I unstand you, the align size must  L1_CACHE_BYTES at least.


Note that there is a patch to add ARCH_DMA_MINALIGN to riscv already,
yes, I summited this patch, it is discussing, seems they don't want to 
apply it.

as some implementations do not have coherent DMA. I had failed to
realized though that on x86 you do not get an ARCH_DMA_MINALIGN
definition.
I didn't find the definition in arch/x86/include/asm/cache.h and other 
place, x86 is dma coherent, it may doesn't need it.


Arnd


[PATCH v6 0/2] make hvc pass dma capable memory to its backend

2021-08-12 Thread Xianting Tian
Dear all,

This patch series make hvc framework pass DMA capable memory to
put_chars() of hvc backend(eg, virtio-console), and revert commit
c4baad5029 ("virtio-console: avoid DMA from stack”)

V1
virtio-console: avoid DMA from vmalloc area
https://lkml.org/lkml/2021/7/27/494

For v1 patch, Arnd Bergmann suggests to fix the issue in the first
place:
Make hvc pass DMA capable memory to put_chars()
The fix suggestion is included in v2.

V2
[PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/1/8
[PATCH 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/1/9

For v2 patch, Arnd Bergmann suggests to make new buf part of the
hvc_struct structure, and fix the compile issue.
The fix suggestion is included in v3.

V3
[PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/3/1347
[PATCH v3 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/3/1348

For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of
hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to
calculate the size of hvc_struct. The fix suggestion is included in
v4.

V4
[PATCH v4 0/2] make hvc pass dma capable memory to its backend
https://lkml.org/lkml/2021/8/5/1350
[PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
https://lkml.org/lkml/2021/8/5/1351
[PATCH v4 2/2] virtio-console: remove unnecessary kmemdup()
https://lkml.org/lkml/2021/8/5/1352

For v4 patch, Arnd Bergmann suggests to introduce another
array(cons_outbuf[]) for the buffer pointers next to the cons_ops[]
and vtermnos[] arrays. This fix included in this v5 patch.

V5
Arnd Bergmann suggests to use "L1_CACHE_BYTES" as dma alignment,
use 'sizeof(long)' as dma alignment is wrong. fix it in v6.


drivers/char/virtio_console.c | 12 ++--
drivers/tty/hvc/hvc_console.c | 40 +--
drivers/tty/hvc/hvc_console.h | 16 --
3 file changed


[PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Xianting Tian
As well known, hvc backend can register its opertions to hvc backend.
the opertions contain put_chars(), get_chars() and so on.

Some hvc backend may do dma in its opertions. eg, put_chars() of
virtio-console. But in the code of hvc framework, it may pass DMA
incapable memory to put_chars() under a specific configuration, which
is explained in commit c4baad5029(virtio-console: avoid DMA from stack):
1, c[] is on stack,
   hvc_console_print():
char c[N_OUTBUF] __ALIGNED__;
cons_ops[index]->put_chars(vtermnos[index], c, i);
2, ch is on stack,
   static void hvc_poll_put_char(,,char ch)
   {
struct tty_struct *tty = driver->ttys[0];
struct hvc_struct *hp = tty->driver_data;
int n;

do {
n = hp->ops->put_chars(hp->vtermno, &ch, 1);
} while (n <= 0);
   }

Commit c4baad5029 is just the fix to avoid DMA from stack memory, which
is passed to virtio-console by hvc framework in above code. But I think
the fix is aggressive, it directly uses kmemdup() to alloc new buffer
from kmalloc area and do memcpy no matter the memory is in kmalloc area
or not. But most importantly, it should better be fixed in the hvc
framework, by changing it to never pass stack memory to the put_chars()
function in the first place. Otherwise, we still face the same issue if
a new hvc backend using dma added in the furture.

We make 'char c[N_OUTBUF]' part of 'struct hvc_struct', so hp->c is no
longer the stack memory. we can use it in above two cases.

Other fix is use L1_CACHE_BYTES as the alignment, use 'sizeof(long)' as
dma alignment is wrong. And use struct_size() to calculate size of
hvc_struct.

Introduce another array(cons_outbuf[]) for the hp->c pointers next to
the cons_ops[] and vtermnos[] arrays.

With the patch, we can remove the fix c4baad5029.

Signed-off-by: Xianting Tian 
Tested-by: Xianting Tian 
---
 drivers/tty/hvc/hvc_console.c | 40 +--
 drivers/tty/hvc/hvc_console.h | 16 --
 2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 5bb8c4e44..c56564eb7 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -41,16 +41,6 @@
  */
 #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
 
-/*
- * These sizes are most efficient for vio, because they are the
- * native transfer size. We could make them selectable in the
- * future to better deal with backends that want other buffer sizes.
- */
-#define N_OUTBUF   16
-#define N_INBUF16
-
-#define __ALIGNED__ __attribute__((__aligned__(sizeof(long
-
 static struct tty_driver *hvc_driver;
 static struct task_struct *hvc_task;
 
@@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp)
 static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES];
 static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
{[0 ... MAX_NR_HVC_CONSOLES - 1] = -1};
+static char *cons_outbuf[MAX_NR_HVC_CONSOLES];
 
 /*
  * Console APIs, NOT TTY.  These APIs are available immediately when
@@ -151,18 +142,23 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
 static void hvc_console_print(struct console *co, const char *b,
  unsigned count)
 {
-   char c[N_OUTBUF] __ALIGNED__;
+   char *c;
unsigned i = 0, n = 0;
int r, donecr = 0, index = co->index;
+   unsigned long flags;
+   struct hvc_struct *hp;
 
/* Console access attempt outside of acceptable console range. */
if (index >= MAX_NR_HVC_CONSOLES)
return;
 
/* This console adapter was removed so it is not usable. */
-   if (vtermnos[index] == -1)
+   if (vtermnos[index] == -1 || !cons_outbuf[index])
return;
 
+   c = cons_outbuf[index];
+
+   spin_lock_irqsave(&hp->c_lock, flags);
while (count > 0 || i > 0) {
if (count > 0 && i < sizeof(c)) {
if (b[n] == '\n' && !donecr) {
@@ -191,6 +187,7 @@ static void hvc_console_print(struct console *co, const 
char *b,
}
}
}
+   spin_unlock_irqrestore(&hp->c_lock, flags);
hvc_console_flush(cons_ops[index], vtermnos[index]);
 }
 
@@ -878,9 +875,19 @@ static void hvc_poll_put_char(struct tty_driver *driver, 
int line, char ch)
struct tty_struct *tty = driver->ttys[0];
struct hvc_struct *hp = tty->driver_data;
int n;
+   unsigned long flags;
+   char *c;
+
+   if (!hp || !cons_outbuf[hp->index])
+   return;
+
+   c = cons_outbuf[hp->index];
 
do {
-   n = hp->ops->put_chars(hp->vtermno, &ch, 1);
+   spin_lock_irqsave(&hp->c_lock, flags);
+   c[0] = ch;
+   n = hp->ops->put_chars(hp->vtermno, c, 1);
+   spin_unlock_irqrestore(&hp->c_lock, flags);
} while (n <= 0);
 }
 #endif
@@ -922,8 +929,7 @@ struct hvc_struct *hvc_a

[PATCH v6 2/2] virtio-console: remove unnecessary kmemdup()

2021-08-12 Thread Xianting Tian
hvc framework will never pass stack memory to the put_chars() function,
So the calling of kmemdup() is unnecessary, we can remove it.

This revert commit c4baad5029 ("virtio-console: avoid DMA from stack")

Signed-off-by: Xianting Tian 
---
 drivers/char/virtio_console.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7eaf303a7..4ed3ffb1d 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int 
count)
 {
struct port *port;
struct scatterlist sg[1];
-   void *data;
-   int ret;
 
if (unlikely(early_put_chars))
return early_put_chars(vtermno, buf, count);
@@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int 
count)
if (!port)
return -EPIPE;
 
-   data = kmemdup(buf, count, GFP_ATOMIC);
-   if (!data)
-   return -ENOMEM;
-
-   sg_init_one(sg, data, count);
-   ret = __send_to_port(port, sg, 1, count, data, false);
-   kfree(data);
-   return ret;
+   sg_init_one(sg, buf, count);
+   return __send_to_port(port, sg, 1, count, (void *)buf, false);
 }
 
 /*
-- 
2.17.1



Re: [PATCH] powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()

2021-08-12 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 11, 2021 2:13 am:
> An interrupt handler shall not be called from another interrupt
> handler otherwise this leads to problems like the following:
> 
>   Kernel attempted to write user page (afd4fa84) - exploit attempt? (uid: 
> 1000)
>   [ cut here ]
>   Bug: Write fault blocked by KUAP!
>   WARNING: CPU: 0 PID: 1617 at arch/powerpc/mm/fault.c:230 
> do_page_fault+0x484/0x720
>   Modules linked in:
>   CPU: 0 PID: 1617 Comm: sshd Tainted: GW 
> 5.13.0-pmac-00010-g8393422eb77 #7
>   NIP:  c001b77c LR: c001b77c CTR: 
>   REGS: cb9e5bc0 TRAP: 0700   Tainted: GW  
> (5.13.0-pmac-00010-g8393422eb77)
>   MSR:  00021032   CR: 24942424  XER: 
> 
>   GPR00: c001b77c cb9e5c80 c1582c00 0021 3bff 085b 0027 
> c8eb644c
>   GPR08: 0023    24942424 0063f8c8  
> 000186a0
>   GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 
> cb9e5e90
>   GPR24: 0040 afd4fa96 0040 0200 c1fda6c0 afd4fa84 0300 
> cb9e5cc0
>   NIP [c001b77c] do_page_fault+0x484/0x720
>   LR [c001b77c] do_page_fault+0x484/0x720
>   Call Trace:
>   [cb9e5c80] [c001b77c] do_page_fault+0x484/0x720 (unreliable)
>   [cb9e5cb0] [c000424c] DataAccess_virt+0xd4/0xe4
>   --- interrupt: 300 at __copy_tofrom_user+0x110/0x20c
>   NIP:  c001f9b4 LR: c03250a0 CTR: 0004
>   REGS: cb9e5cc0 TRAP: 0300   Tainted: GW  
> (5.13.0-pmac-00010-g8393422eb77)
>   MSR:  9032   CR: 48028468  XER: 2000
>   DAR: afd4fa84 DSISR: 0a00
>   GPR00: 20726f6f cb9e5d80 c1582c00 0004 cb9e5e3a 0016 afd4fa80 
> 
>   GPR08: 3835202d 72777872 2d78722d 0004 28028464 0063f8c8  
> 000186a0
>   GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 
> cb9e5e90
>   GPR24: 0040 afd4fa96 0040 cb9e5e0c 0daa a000 cb9e5e98 
> afd4fa56
>   NIP [c001f9b4] __copy_tofrom_user+0x110/0x20c
>   LR [c03250a0] _copy_to_iter+0x144/0x990
>   --- interrupt: 300
>   [cb9e5d80] [c03e89c0] n_tty_read+0xa4/0x598 (unreliable)
>   [cb9e5df0] [c03e2a0c] tty_read+0xdc/0x2b4
>   [cb9e5e80] [c0156bf8] vfs_read+0x274/0x340
>   [cb9e5f00] [c01571ac] ksys_read+0x70/0x118
>   [cb9e5f30] [c0016048] ret_from_syscall+0x0/0x28
>   --- interrupt: c00 at 0xa7855c88
>   NIP:  a7855c88 LR: a7855c5c CTR: 
>   REGS: cb9e5f40 TRAP: 0c00   Tainted: GW  
> (5.13.0-pmac-00010-g8393422eb77)
>   MSR:  d032   CR: 2402446c  XER: 
> 
>   GPR00: 0003 afd4ec70 a72137d0 000b afd4ecac 4000 0065a990 
> 0800
>   GPR08:  a7947930  0004 c15831b0 0063f8c8  
> 000186a0
>   GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 0065a9e0 0001 
> 0065fac0
>   GPR24:  0089 00664050  00668e30 a720c8dc a7943ff4 
> 0065f9b0
>   NIP [a7855c88] 0xa7855c88
>   LR [a7855c5c] 0xa7855c5c
>   --- interrupt: c00
>   Instruction dump:
>   3884aa88 38630178 48076861 807f0080 48042e45 2f83 419e0148 3c80c079
>   3c60c076 38841be4 386301c0 4801f705 <0fe0> 386b 4bfffe30 
> 3c80c06b
>   ---[ end trace fd69b91a8046c2e5 ]---
> 
> Here the problem is that by re-enterring an exception handler,
> kuap_save_and_lock() is called a second time with this time KUAP
> access locked, leading to regs->kuap being overwritten hence
> KUAP not being unlocked at exception exit as expected.
> 
> Do not call do_IRQ() from timer_interrupt() directly. Instead,
> redefine do_IRQ() as a standard function named __do_IRQ(), and
> call it from both do_IRQ() and time_interrupt() handlers.

Reviewed-by: Nicholas Piggin 

> 
> Reported-by: Stan Johnson 
> Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
> Cc: sta...@vger.kernel.org
> Cc: Nicholas Piggin 
> Cc: Finn Thain 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/interrupt.h | 3 +++
>  arch/powerpc/include/asm/irq.h   | 2 +-
>  arch/powerpc/kernel/irq.c| 7 ++-
>  arch/powerpc/kernel/time.c   | 2 +-
>  4 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index fc4702bdd119..1e984a35a39f 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -590,6 +590,9 @@ DECLARE_INTERRUPT_HANDLER_NMI(hmi_exception_realmode);
>  
>  DECLARE_INTERRUPT_HANDLER_ASYNC(TAUException);
>  
> +/* irq.c */
> +DECLARE_INTERRUPT_HANDLER_ASYNC(do_IRQ);
> +
>  void __noreturn unrecoverable_exception(struct pt_regs *regs);
>  
>  void replay_system_reset(void);
> diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
> index 4982f3711fc3..2b32785

Re: [PATCH] powerpc/interrupt: Do not call single_step_exception() from other exceptions

2021-08-12 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 11, 2021 2:13 am:
> single_step_exception() is called by emulate_single_step() which
> is called from (at least) alignment exception() handler and
> program_check_exception() handler.
> 
> Redefine it as a regular __single_step_exception() which is called
> by both single_step_exception() handler and emulate_single_step()
> function.
> 

Reviewed-by: Nicholas Piggin 

> Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
> Cc: sta...@vger.kernel.org
> Cc: Stan Johnson 
> Cc: Nicholas Piggin 
> Cc: Finn Thain 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/traps.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index dfbce527c98e..d56254f05e17 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -1104,7 +1104,7 @@ DEFINE_INTERRUPT_HANDLER(RunModeException)
>   _exception(SIGTRAP, regs, TRAP_UNK, 0);
>  }
>  
> -DEFINE_INTERRUPT_HANDLER(single_step_exception)
> +static void __single_step_exception(struct pt_regs *regs)
>  {
>   clear_single_step(regs);
>   clear_br_trace(regs);
> @@ -1121,6 +1121,11 @@ DEFINE_INTERRUPT_HANDLER(single_step_exception)
>   _exception(SIGTRAP, regs, TRAP_TRACE, regs->nip);
>  }
>  
> +DEFINE_INTERRUPT_HANDLER(single_step_exception)
> +{
> + __single_step_exception(regs);
> +}
> +
>  /*
>   * After we have successfully emulated an instruction, we have to
>   * check if the instruction was being single-stepped, and if so,
> @@ -1130,7 +1135,7 @@ DEFINE_INTERRUPT_HANDLER(single_step_exception)
>  static void emulate_single_step(struct pt_regs *regs)
>  {
>   if (single_stepping(regs))
> - single_step_exception(regs);
> + __single_step_exception(regs);
>  }
>  
>  static inline int __parse_fpscr(unsigned long fpscr)
> -- 
> 2.25.0
> 
> 


Re: [PATCH 07/11] treewide: Replace the use of mem_encrypt_active() with prot_guest_has()

2021-08-12 Thread Kirill A. Shutemov
On Wed, Aug 11, 2021 at 10:52:55AM -0500, Tom Lendacky wrote:
> On 8/11/21 7:19 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 02:48:54PM -0500, Tom Lendacky wrote:
> >> On 8/10/21 1:45 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>
> >>>
> >>> On 7/27/21 3:26 PM, Tom Lendacky wrote:
>  diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
>  index de01903c3735..cafed6456d45 100644
>  --- a/arch/x86/kernel/head64.c
>  +++ b/arch/x86/kernel/head64.c
>  @@ -19,7 +19,7 @@
>    #include 
>    #include 
>    #include 
>  -#include 
>  +#include 
>    #include 
>      #include 
>  @@ -285,7 +285,7 @@ unsigned long __head __startup_64(unsigned long
>  physaddr,
>     * there is no need to zero it after changing the memory encryption
>     * attribute.
>     */
>  -    if (mem_encrypt_active()) {
>  +    if (prot_guest_has(PATTR_MEM_ENCRYPT)) {
>    vaddr = (unsigned long)__start_bss_decrypted;
>    vaddr_end = (unsigned long)__end_bss_decrypted;
> >>>
> >>>
> >>> Since this change is specific to AMD, can you replace PATTR_MEM_ENCRYPT 
> >>> with
> >>> prot_guest_has(PATTR_SME) || prot_guest_has(PATTR_SEV). It is not used in
> >>> TDX.
> >>
> >> This is a direct replacement for now.
> > 
> > With current implementation of prot_guest_has() for TDX it breaks boot for
> > me.
> > 
> > Looking at code agains, now I *think* the reason is accessing a global
> > variable from __startup_64() inside TDX version of prot_guest_has().
> > 
> > __startup_64() is special. If you access any global variable you need to
> > use fixup_pointer(). See comment before __startup_64().
> > 
> > I'm not sure how you get away with accessing sme_me_mask directly from
> > there. Any clues? Maybe just a luck and complier generates code just right
> > for your case, I donno.
> 
> Hmm... yeah, could be that the compiler is using rip-relative addressing
> for it because it lives in the .data section?

I guess. It has to be fixed. It may break with complier upgrade or any
random change around the code.

BTW, does it work with clang for you?

> For the static variables in mem_encrypt_identity.c I did an assembler rip
> relative LEA, but probably could have passed physaddr to sme_enable() and
> used a fixup_pointer() style function, instead.

Sounds like a plan.

> > A separate point is that TDX version of prot_guest_has() relies on
> > cpu_feature_enabled() which is not ready at this point.
> 
> Does TDX have to do anything special to make memory able to be shared with
> the hypervisor?

Yes. But there's nothing that required any changes in early boot. It
handled in ioremap/set_memory.

> You might have to use something that is available earlier
> than cpu_feature_enabled() in that case (should you eventually support
> kvmclock).

Maybe.

> > I think __bss_decrypted fixup has to be done if sme_me_mask is non-zero.
> > Or just do it uncoditionally because it's NOP for sme_me_mask == 0.
> 
> For SNP, we'll have to additionally call the HV to update the RMP to make
> the memory shared. But that could also be done unconditionally since the
> early_snp_set_memory_shared() routine will check for SNP before doing
> anything.

-- 
 Kirill A. Shutemov


Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-12 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:
> On 8/12/21 12:58 PM, Michael Ellerman wrote:
>> Christophe Leroy  writes:
>>> Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit :
 Similar to x86/s390 add a debugfs file to tune 
 tlb_single_page_flush_ceiling.

 Signed-off-by: Aneesh Kumar K.V 
 ---
arch/powerpc/mm/book3s64/radix_tlb.c | 48 
1 file changed, 48 insertions(+)

 diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
 b/arch/powerpc/mm/book3s64/radix_tlb.c
 index aefc100d79a7..5cca0fe130e7 100644
 --- a/arch/powerpc/mm/book3s64/radix_tlb.c
 +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
 @@ -17,6 +17,7 @@
>> ...
 +
 +static int __init create_tlb_single_page_flush_ceiling(void)
 +{
 +  debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
 +  powerpc_debugfs_root, NULL, &fops_tlbflush);
>>>
>>> Could you just use debugfs_create_u32() instead of re-implementing simple 
>>> read and write ?
>> 
>> Yeah AFAICS that should work fine.
>> 
>> It could probably even be a u16?
>
> I was looking at switching all that to u64. Should i fallback to u16, 
> considering a tlb_signle_page_flush_ceiling value larger that 2**16 
> doesn't make sense?

Hmm, if we make it u16 and someone writes a value >= 2^16 it just
truncates the value to 0, which is a bit unfortunate.

So maybe just make it u32, that way if someone writes a stupidly large
value it stays large.

cheers


Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread kernel test robot
Hi Xianting,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tty/tty-testing]
[also build test WARNING on char-misc/char-misc-testing soc/for-next v5.14-rc5 
next-20210812]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Xianting-Tian/make-hvc-pass-dma-capable-memory-to-its-backend/20210812-174847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git 
tty-testing
config: hexagon-randconfig-r041-20210812 (attached as .config)
compiler: clang version 12.0.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/0day-ci/linux/commit/9f2925b5429149ceb0ea6eeaa8c81d422c3124fc
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Xianting-Tian/make-hvc-pass-dma-capable-memory-to-its-backend/20210812-174847
git checkout 9f2925b5429149ceb0ea6eeaa8c81d422c3124fc
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=hexagon 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

>> drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is 
>> uninitialized when used here [-Wuninitialized]
   spin_unlock_irqrestore(&hp->c_lock, flags);
   ^~
   drivers/tty/hvc/hvc_console.c:149:23: note: initialize the variable 'hp' to 
silence this warning
   struct hvc_struct *hp;
^
 = NULL
   1 warning generated.


vim +/hp +190 drivers/tty/hvc/hvc_console.c

   136  
   137  /*
   138   * Console APIs, NOT TTY.  These APIs are available immediately when
   139   * hvc_console_setup() finds adapters.
   140   */
   141  
   142  static void hvc_console_print(struct console *co, const char *b,
   143unsigned count)
   144  {
   145  char *c;
   146  unsigned i = 0, n = 0;
   147  int r, donecr = 0, index = co->index;
   148  unsigned long flags;
   149  struct hvc_struct *hp;
   150  
   151  /* Console access attempt outside of acceptable console range. 
*/
   152  if (index >= MAX_NR_HVC_CONSOLES)
   153  return;
   154  
   155  /* This console adapter was removed so it is not usable. */
   156  if (vtermnos[index] == -1 || !cons_outbuf[index])
   157  return;
   158  
   159  c = cons_outbuf[index];
   160  
   161  spin_lock_irqsave(&hp->c_lock, flags);
   162  while (count > 0 || i > 0) {
   163  if (count > 0 && i < sizeof(c)) {
   164  if (b[n] == '\n' && !donecr) {
   165  c[i++] = '\r';
   166  donecr = 1;
   167  } else {
   168  c[i++] = b[n++];
   169  donecr = 0;
   170  --count;
   171  }
   172  } else {
   173  r = cons_ops[index]->put_chars(vtermnos[index], 
c, i);
   174  if (r <= 0) {
   175  /* throw away characters on error
   176   * but spin in case of -EAGAIN */
   177  if (r != -EAGAIN) {
   178  i = 0;
   179  } else {
   180  
hvc_console_flush(cons_ops[index],
   181vtermnos[index]);
   182  }
   183  } else if (r > 0) {
   184  i -= r;
   185  if (i > 0)
   186  memmove(c, c+r, i);
   187  }
   188  }
   189  }
 > 190  spin_unlock_irqrestore(&hp->c_lock, flags);
   191  hvc_console_flush(cons_ops[index], vtermnos[index]);
   192  }
   193  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


[PATCH -next] trap: Cleanup trap_init()

2021-08-12 Thread Kefeng Wang
There are some empty trap_init() in different ARCHs, introduce
a new weak trap_init() function to cleanup them.

Cc: Vineet Gupta 
Cc: Russell King 
Cc: Yoshinori Sato 
Cc: Ley Foon Tan 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Jeff Dike 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: Andrew Morton 
Signed-off-by: Kefeng Wang 
---
 arch/arc/kernel/traps.c  | 5 -
 arch/arm/kernel/traps.c  | 5 -
 arch/h8300/kernel/traps.c| 4 
 arch/hexagon/kernel/traps.c  | 4 
 arch/nds32/kernel/traps.c| 5 -
 arch/nios2/kernel/traps.c| 5 -
 arch/openrisc/kernel/traps.c | 5 -
 arch/parisc/kernel/traps.c   | 4 
 arch/powerpc/kernel/traps.c  | 5 -
 arch/riscv/kernel/traps.c| 5 -
 arch/um/kernel/trap.c| 4 
 init/main.c  | 2 ++
 12 files changed, 2 insertions(+), 51 deletions(-)

diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c
index 57235e5c0cea..6b83e3f2b41c 100644
--- a/arch/arc/kernel/traps.c
+++ b/arch/arc/kernel/traps.c
@@ -20,11 +20,6 @@
 #include 
 #include 
 
-void __init trap_init(void)
-{
-   return;
-}
-
 void die(const char *str, struct pt_regs *regs, unsigned long address)
 {
show_kernel_fault_diag(str, regs, address);
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 64308e3a5d0c..e9b4f2b49bd8 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -781,11 +781,6 @@ void abort(void)
panic("Oops failed to kill thread");
 }
 
-void __init trap_init(void)
-{
-   return;
-}
-
 #ifdef CONFIG_KUSER_HELPERS
 static void __init kuser_init(void *vectors)
 {
diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c
index 5d8b969cd8f3..bdbe988d8dbc 100644
--- a/arch/h8300/kernel/traps.c
+++ b/arch/h8300/kernel/traps.c
@@ -39,10 +39,6 @@ void __init base_trap_init(void)
 {
 }
 
-void __init trap_init(void)
-{
-}
-
 asmlinkage void set_esp0(unsigned long ssp)
 {
current->thread.esp0 = ssp;
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index 904134b37232..edfc35dafeb1 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -28,10 +28,6 @@
 #define TRAP_SYSCALL   1
 #define TRAP_DEBUG 0xdb
 
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_GENERIC_BUG
 /* Maybe should resemble arch/sh/kernel/traps.c ?? */
 int is_valid_bugaddr(unsigned long addr)
diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c
index ee0d9ae192a5..f06421c645af 100644
--- a/arch/nds32/kernel/traps.c
+++ b/arch/nds32/kernel/traps.c
@@ -183,11 +183,6 @@ void __pgd_error(const char *file, int line, unsigned long 
val)
 }
 
 extern char *exception_vector, *exception_vector_end;
-void __init trap_init(void)
-{
-   return;
-}
-
 void __init early_trap_init(void)
 {
unsigned long ivb = 0;
diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c
index b172da4eb1a9..596986a74a26 100644
--- a/arch/nios2/kernel/traps.c
+++ b/arch/nios2/kernel/traps.c
@@ -105,11 +105,6 @@ void show_stack(struct task_struct *task, unsigned long 
*stack,
printk("%s\n", loglvl);
 }
 
-void __init trap_init(void)
-{
-   /* Nothing to do here */
-}
-
 /* Breakpoint handler */
 asmlinkage void breakpoint_c(struct pt_regs *fp)
 {
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 4d61333c2623..aa1e709405ac 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs *regs, int ea, int 
vector)
die("Oops", regs, 9);
 }
 
-void __init trap_init(void)
-{
-   /* Nothing needs to be done */
-}
-
 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
 {
force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 8d8441d4562a..747c328fb886 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -859,7 +859,3 @@ void  __init early_trap_init(void)
 
initialize_ivt(&fault_vector_20);
 }
-
-void __init trap_init(void)
-{
-}
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index e103b89234cd..91efb5c6f2f3 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -2209,11 +2209,6 @@ DEFINE_INTERRUPT_HANDLER(kernel_bad_stack)
die("Bad kernel stack pointer", regs, SIGABRT);
 }
 
-void __init trap_init(void)
-{
-}
-
-
 #ifdef CONFIG_PPC_EMULATED_STATS
 
 #define WARN_EMULATED_SETUP(type)  .type = { .name = #type }
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 0a98fd0ddfe9..0daaa3e4630d 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -199,11 +199,6 @@ int is_valid_bugaddr(unsigned long pc)
 }
 #endif /* CONFIG_GENERIC_BUG */
 
-/* stvec

Re: [PATCH -next] trap: Cleanup trap_init()

2021-08-12 Thread Russell King (Oracle)
On Thu, Aug 12, 2021 at 08:36:02PM +0800, Kefeng Wang wrote:
> There are some empty trap_init() in different ARCHs, introduce
> a new weak trap_init() function to cleanup them.
> 
> Cc: Vineet Gupta 
> Cc: Russell King 
> Cc: Yoshinori Sato 
> Cc: Ley Foon Tan 
> Cc: Jonas Bonn 
> Cc: Stefan Kristiansson 
> Cc: Stafford Horne 
> Cc: James E.J. Bottomley 
> Cc: Helge Deller 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Paul Walmsley 
> Cc: Jeff Dike 
> Cc: Richard Weinberger 
> Cc: Anton Ivanov 
> Cc: Andrew Morton 
> Signed-off-by: Kefeng Wang 

For 32-bit arm:

Acked-by: Russell King (Oracle) 

Thanks!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE

2021-08-12 Thread Michael Ellerman
"Puvichakravarthy Ramachandran"  writes:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and 
>> memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a 
>> large
>> range. This can be improved by making sure the kernel switch to pid based 
>> flush if the
>> kernel is unmapping a 2M range.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/mm/book3s64/radix_tlb.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > 
>> b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>   * invalidating a full PID, so it has a far lower threshold to change > from
>>   * individual page flushes to full-pid flushes.
>>   */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = 
>> POWER9_TLB_SETS_RADIX * 2;
>> 
>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > 
>> mm_struct *mm,
>>   if (fullmm)
>>   flush_pid = true;
>>   else if (type == FLUSH_TYPE_GLOBAL)
>> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>   else
>>   flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> Additional details on the test environment. This was tested on a 2 Node/8 
> socket Power10 system.
> The LPAR had 105 cores and the LPAR spanned across all the sockets. 
>
> # perf stat -I 1000 -a -e cycles,instructions -e 
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>  Rate of work: = 176 
> #   time counts unit events
>  1.029206442 4198594519  cycles  
>  1.029206442 2458254252  instructions  # 0.59 
> insn per cycle 
>  1.029206442 3004031488  PM_EXEC_STALL   
>  1.029206442 1798186036  PM_EXEC_STALL_TLBIE   
>  Rate of work: = 181 
>  2.054288539 4183883450  cycles  
>  2.054288539 2472178171  instructions  # 0.59 
> insn per cycle 
>  2.054288539 3014609313  PM_EXEC_STALL   
>  2.054288539 1797851642  PM_EXEC_STALL_TLBIE   
>  Rate of work: = 180 
>  3.078306883 4171250717  cycles  
>  3.078306883 2468341094  instructions  # 0.59 
> insn per cycle 
>  3.078306883 2993036205  PM_EXEC_STALL   
>  3.078306883 1798181890  PM_EXEC_STALL_TLBIE   
> .
> . 
>
> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
> 34
>
> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>
> # perf stat -I 1000 -a -e cycles,instructions -e 
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>  Rate of work: = 313 
> #   time counts unit events
>  1.030310506 4206071143  cycles  
>  1.030310506 4314716958  instructions  # 1.03 
> insn per cycle 
>  1.030310506 2157762167  PM_EXEC_STALL   
>  1.030310506  110825573  PM_EXEC_STALL_TLBIE   
>  Rate of work: = 322 
>  2.056034068 4331745630  cycles  
>  2.056034068 4531658304  instructions  # 1.05 
> insn per cycle 
>  2.056034068 2288971361  PM_EXEC_STALL   
>  2.056034068  111267927  PM_EXEC_STALL_TLBIE   
>  Rate of work: = 321 
>  3.081216434 4327050349  cycles  
>  3.081216434 4379679508  instructions  # 1.01 
> insn per cycle 
>  3.081216434 2252602550  PM_EXEC_STALL   
>  3.081216434  110974887  PM_EXEC_STALL_TLBIE   


What is the tlbie test actually doing?

Does it do anything to measure the cost of refilling after the full mm flush?

cheers


Re: [PATCH v2 5/9] powerpc/microwatt: Use standard 16550 UART for console

2021-08-12 Thread Christophe Leroy




Le 18/06/2021 à 05:46, Paul Mackerras a écrit :

From: Benjamin Herrenschmidt 

This adds support to the Microwatt platform to use the standard
16550-style UART which available in the standalone Microwatt FPGA.

Signed-off-by: Benjamin Herrenschmidt 
Signed-off-by: Paul Mackerras 
---
  arch/powerpc/boot/dts/microwatt.dts  | 27 
  arch/powerpc/kernel/udbg_16550.c | 39 
  arch/powerpc/platforms/microwatt/Kconfig |  1 +
  arch/powerpc/platforms/microwatt/setup.c |  2 ++
  4 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/boot/dts/microwatt.dts 
b/arch/powerpc/boot/dts/microwatt.dts
index 04e5dd92270e..974abbdda249 100644
--- a/arch/powerpc/boot/dts/microwatt.dts
+++ b/arch/powerpc/boot/dts/microwatt.dts
@@ -6,6 +6,10 @@ / {
model-name = "microwatt";
compatible = "microwatt-soc";
  
+	aliases {

+   serial0 = &UART0;
+   };
+
reserved-memory {
#size-cells = <0x02>;
#address-cells = <0x02>;
@@ -89,12 +93,6 @@ PowerPC,Microwatt@0 {
};
};
  
-	chosen {

-   bootargs = "";
-   ibm,architecture-vec-5 = [19 00 10 00 00 00 00 00 00 00 00 00 
00 00 00 00
- 00 00 00 00 00 00 00 00 40 00 40];
-   };
-
soc@c000 {
compatible = "simple-bus";
#address-cells = <1>;
@@ -119,5 +117,22 @@ ICS: interrupt-controller@5000 {
#interrupt-cells = <2>;
};
  
+		UART0: serial@2000 {

+   device_type = "serial";
+   compatible = "ns16550";
+   reg = <0x2000 0x8>;
+   clock-frequency = <1>;
+   current-speed = <115200>;
+   reg-shift = <2>;
+   fifo-size = <16>;
+   interrupts = <0x10 0x1>;
+   };
+   };
+
+   chosen {
+   bootargs = "";
+   ibm,architecture-vec-5 = [19 00 10 00 00 00 00 00 00 00 00 00 
00 00 00 00
+ 00 00 00 00 00 00 00 00 40 00 40];
+   stdout-path = &UART0;
};
  };
diff --git a/arch/powerpc/kernel/udbg_16550.c b/arch/powerpc/kernel/udbg_16550.c
index 9356b60d6030..8513aa49614e 100644
--- a/arch/powerpc/kernel/udbg_16550.c
+++ b/arch/powerpc/kernel/udbg_16550.c
@@ -296,3 +296,42 @@ void __init udbg_init_40x_realmode(void)
  }
  
  #endif /* CONFIG_PPC_EARLY_DEBUG_40x */

+
+#ifdef CONFIG_PPC_EARLY_DEBUG_MICROWATT
+
+#define UDBG_UART_MW_ADDR  ((void __iomem *)0xc0002000)
+
+static u8 udbg_uart_in_isa300_rm(unsigned int reg)
+{
+   uint64_t msr = mfmsr();
+   uint8_t  c;
+
+   mtmsr(msr & ~(MSR_EE|MSR_DR));
+   isync();
+   eieio();
+   c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2));
+   mtmsr(msr);
+   isync();
+   return c;
+}


How do you make sure that GCC won't emit any access to the stack between the 
two mtmsr() ?

What about using real_205_readb() and real_205_writeb() instead ?


+
+static void udbg_uart_out_isa300_rm(unsigned int reg, u8 val)
+{
+   uint64_t msr = mfmsr();
+
+   mtmsr(msr & ~(MSR_EE|MSR_DR));
+   isync();
+   eieio();
+   __raw_rm_writeb(val, UDBG_UART_MW_ADDR + (reg << 2));
+   mtmsr(msr);
+   isync();
+}
+
+void __init udbg_init_debug_microwatt(void)
+{
+   udbg_uart_in = udbg_uart_in_isa300_rm;
+   udbg_uart_out = udbg_uart_out_isa300_rm;
+   udbg_use_uart();
+}
+
+#endif /* CONFIG_PPC_EARLY_DEBUG_MICROWATT */
diff --git a/arch/powerpc/platforms/microwatt/Kconfig 
b/arch/powerpc/platforms/microwatt/Kconfig
index b52c869c0eb8..50ed0cedb5f1 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -6,6 +6,7 @@ config PPC_MICROWATT
select PPC_ICS_NATIVE
select PPC_ICP_NATIVE
select PPC_NATIVE
+   select PPC_UDBG_16550
help
This option enables support for FPGA-based Microwatt 
implementations.
  
diff --git a/arch/powerpc/platforms/microwatt/setup.c b/arch/powerpc/platforms/microwatt/setup.c

index 1c1b7791fa57..0b02603bdb74 100644
--- a/arch/powerpc/platforms/microwatt/setup.c
+++ b/arch/powerpc/platforms/microwatt/setup.c
@@ -14,6 +14,7 @@
  #include 
  #include 
  #include 
+#include 
  
  static void __init microwatt_init_IRQ(void)

  {
@@ -35,5 +36,6 @@ define_machine(microwatt) {
.name   = "microwatt",
.probe  = microwatt_probe,
.init_IRQ   = microwatt_init_IRQ,
+   .progress   = udbg_progress,
.calibrate_decr = generic_calibrate_decr,
  };



Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE

2021-08-12 Thread Aneesh Kumar K.V

On 8/12/21 6:19 PM, Michael Ellerman wrote:

"Puvichakravarthy Ramachandran"  writes:

With shared mapping, even though we are unmapping a large range, the kernel
will force a TLB flush with ptl lock held to avoid the race mentioned in
commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory 
freeing parts")
This results in the kernel issuing a high number of TLB flushes even for a large
range. This can be improved by making sure the kernel switch to pid based flush 
if the
kernel is unmapping a 2M range.

Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/book3s64/radix_tlb.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..21d0f098e43b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
   * invalidating a full PID, so it has a far lower threshold to change > from
   * individual page flushes to full-pid flushes.
   */
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
+static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = 
POWER9_TLB_SETS_RADIX * 2;

  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
@@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > 
mm_struct *mm,
   if (fullmm)
   flush_pid = true;
   else if (type == FLUSH_TYPE_GLOBAL)
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
   else
   flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;


Additional details on the test environment. This was tested on a 2 Node/8
socket Power10 system.
The LPAR had 105 cores and the LPAR spanned across all the sockets.

# perf stat -I 1000 -a -e cycles,instructions -e
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
  Rate of work: = 176
#   time counts unit events
  1.029206442 4198594519  cycles
  1.029206442 2458254252  instructions  # 0.59 insn 
per cycle
  1.029206442 3004031488  PM_EXEC_STALL
  1.029206442 1798186036  PM_EXEC_STALL_TLBIE
  Rate of work: = 181
  2.054288539 4183883450  cycles
  2.054288539 2472178171  instructions  # 0.59 insn 
per cycle
  2.054288539 3014609313  PM_EXEC_STALL
  2.054288539 1797851642  PM_EXEC_STALL_TLBIE
  Rate of work: = 180
  3.078306883 4171250717  cycles
  3.078306883 2468341094  instructions  # 0.59 insn 
per cycle
  3.078306883 2993036205  PM_EXEC_STALL
  3.078306883 1798181890  PM_EXEC_STALL_TLBIE
.
.

# cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
34

# echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling

# perf stat -I 1000 -a -e cycles,instructions -e
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
  Rate of work: = 313
#   time counts unit events
  1.030310506 4206071143  cycles
  1.030310506 4314716958  instructions  # 1.03 insn 
per cycle
  1.030310506 2157762167  PM_EXEC_STALL
  1.030310506  110825573  PM_EXEC_STALL_TLBIE
  Rate of work: = 322
  2.056034068 4331745630  cycles
  2.056034068 4531658304  instructions  # 1.05 insn 
per cycle
  2.056034068 2288971361  PM_EXEC_STALL
  2.056034068  111267927  PM_EXEC_STALL_TLBIE
  Rate of work: = 321
  3.081216434 4327050349  cycles
  3.081216434 4379679508  instructions  # 1.01 insn 
per cycle
  3.081216434 2252602550  PM_EXEC_STALL
  3.081216434  110974887  PM_EXEC_STALL_TLBIE



What is the tlbie test actually doing?

Does it do anything to measure the cost of refilling after the full mm flush?




That is essentially

for ()
{
  shmat()
  fillshm()
  shmdt()

}

for a 256MB range. So it is not really a fair benchmark because it 
doesn't take into account the impact of throwing away the full pid 
translation. But even then the TLBIE stalls is an important data point?


-aneesh




[PATCH v8 0/5] Add support for FORM2 associativity

2021-08-12 Thread Aneesh Kumar K.V
Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now use the numa distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1071644672,
"uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
"raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
"daxregion":{
  "id":0,
  "size":1071644672,
  "devices":[
{
  "chardev":"dax0.0",
  "size":1071644672,
  "target_node":4,
  "mode":"devdax"
}
  ]
},
"align":2097152,
"numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output is with a Qemu command line

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa 
dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa 
dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa 
dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa 
dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa 
dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object 
memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}
  \
-device 
nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at 
https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/

Changes from v7:
* Address review feedback 
* fold patch 6 to patch 3

Changes from v6:
* Address review feedback 

Changes from v5:
* Fix build error reported by kernel test robot
* Address review feedback 

Changes from v4:
* Drop DLPAR related device tree property for now because both Qemu nor PowerVM
  will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance 
information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (5):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst   | 104 +
 arch/powerpc/include/asm/firmware.h   |   7 +-
 arch/powerpc/include/asm/prom.h   |   3 +-
 arch/powerpc/include/asm/topology.h   |   6 +-
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 432 ++
 arch/powerpc/platforms/pseries/firmware.c |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 arch/powerpc/platforms/pseries/lpar.c |   4 +-
 10 files changed, 455 insertions(+), 111 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



[PATCH v8 1/5] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-08-12 Thread Aneesh Kumar K.V
No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/numa.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
if (!numa_enabled)
goto out;
 
-   if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   if (of_read_number(associativity, 1) >= primary_domain_index)
+   nid = of_read_number(&associativity[primary_domain_index], 1);
 
/* POWER4 LPAR uses 0x as invalid node */
if (nid == 0x || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-   int depth;
+   int index;
struct device_node *root;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
}
 
if (form1_affinity) {
-   depth = of_read_number(distance_ref_points, 1);
+   index = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
goto err;
}
 
-   depth = of_read_number(&distance_ref_points[1], 1);
+   index = of_read_number(&distance_ref_points[1], 1);
}
 
/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
}
 
of_node_put(root);
-   return depth;
+   return index;
 
 err:
of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
int nid = default_nid;
int rc, index;
 
-   if ((min_common_depth < 0) || !numa_enabled)
+   if ((primary_domain_index < 0) || !numa_enabled)
return default_nid;
 
rc = of_get_assoc_arrays(&aa);
if (rc)
return default_nid;
 
-   if (min_common_depth <= aa.array_sz &&
+   if (primary_domain_index <= aa.array_sz &&
!(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
aa.n_arrays) {
-   index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+   index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
nid = of_read_number(&aa.arrays[index], 1);
 
if (nid == 0x || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
return -1;
}
 
-   min_common_depth = find_min_common_depth();
+   primary_domain_index = find_primary_domain_index();
 
-   if (min_common_depth < 0) {
+   if (primary_domain_index < 0) {
/*
-* if we fail to parse min_common_depth from device tree
+* if we fail to parse primary_domain_index from device tree
 * mark the numa disabled, boot with numa disabled.
 */
numa_enabled = false;
-   return min_common_depth;
+   return primary_domain_index;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   dbg("NUMA associativity depth for CPU/Memory: %d\n", 
primary_domain_index);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
goto out;
}
 
-   max_nodes = of_read_number(&domains[min_common_depth], 1);
+   max_nodes = of_read_number(&domains[primary_domain_index], 1);
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
 
prop_length /= sizeof(int);
-   if (prop_length > min_common_depth + 2)
+   if (prop_length > primary_domain_index + 2)
coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
goto out;
 
index = of_read_number(associativity, 1);
-   if (index > min_common_depth + 1)
+   if (index > primary_domain_index + 1)
return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1



[PATCH v8 2/5] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

2021-08-12 Thread Aneesh Kumar K.V
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   |  4 +--
 arch/powerpc/include/asm/prom.h   |  2 +-
 arch/powerpc/kernel/prom_init.c   |  2 +-
 arch/powerpc/mm/numa.c| 35 ++-
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPALASM_CONST(0x1000)
 #define FW_FEATURE_SET_MODEASM_CONST(0x4000)
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRNASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0008)
@@ -69,7 +69,7 @@ enum {
FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI0x0201  /* PCIe/MSI support */
 #define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
-#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY 0x0580  /* FORM1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index a5bf355ce1d6..57db605ad33a 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1096,7 +1096,7 @@ static const struct ibm_arch_vec 
ibm_architecture_vec_template __initconst = {
 #else
0,
 #endif
-   .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
+   .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..368719b14dcc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
int i;
int distance = LOCAL_DISTANCE;
 
-   if (!form1_affinity)
+   if (affinity_form == FORM0_AFFINITY)
return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
int i;
 
-   if (!form1_affinity)
+   if (affinity_form != FORM1_AFFINITY)
return;
 
for (i = 0; i < distance_ref_points_depth; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
int index;
struct device_node *root;
 
+   /*
+* Check for which form of affinity.
+*/
+   if (firmware_has_feature(FW_FEATURE_OPAL)) {
+   affinity_form = FORM1_AFFINITY;
+   } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
+   affinity_form = FORM1_AFFINITY;
+   } else
+   affinity_form = FORM0_AFFINITY;
+
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal");
else
@@ -318,23 +332,16 @@ static i

[PATCH v8 3/5] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-08-12 Thread Aneesh Kumar K.V
The associativity details of the newly added resourced are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.

Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.

Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity array provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |   2 +
 arch/powerpc/mm/numa.c| 212 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c|   2 +
 4 files changed, 161 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..a6425a70c37b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -64,6 +64,7 @@ static inline int early_cpu_to_node(int cpu)
 }
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+void update_numa_distance(struct device_node *node);
 
 #else
 
@@ -93,6 +94,7 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb 
*lmb)
return first_online_node;
 }
 
+static inline void update_numa_distance(struct device_node *node) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 368719b14dcc..c0a89839310c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,50 +208,35 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-   const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+ int max_array_sz)
 {
-   int i;
+   int nid;
+   /*
+* primary_domain_index is 1 based array index.
+*/
+   int index = primary_domain_index  - 1;
 
-   if (affinity_form != FORM1_AFFINITY)
-   return;
+   if (!numa_enabled || index >= max_array_sz)
+   return NUMA_NO_NODE;
 
-   for (i = 0; i < distance_ref_points_depth; i++) {
-   const __be32 *entry;
+   nid = of_read_number(&associativity[index], 1);
 
-   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-   distance_lookup_table[nid][i] = of_read_number(entry, 1);
-   }
+   /* POWER4 LPAR uses 0x as invalid node */
+   if (nid == 0x || nid >= nr_node_ids)
+   nid = NUMA_NO_NODE;
+   return nid;
 }
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
 {
-   int nid = NUMA_NO_NODE;
-
-   if (!numa_enabled)
-   goto out;
-
-   if (of_read_number(associativity, 1) >= primary_domain_index)
-   nid = of_read_number(&associativity[primary_domain_index], 1);
-
-   /* POWER4 LPAR uses 0x as invalid node */
-   if (nid == 0x || nid >= nr_node_ids)
-   nid = NUMA_NO_NODE;
-
-   if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
-   /*
-* Skip the length field and send start of associativity array
-*/
-   initialize_distance_lookup_table(nid, associativity + 1);
-   }
+   int array_sz = of_read_number(associativity, 1);
 
-out:
-   return nid;
+   /* Skip the first element in the associativity array */
+   return __associativity_to_nid((associativity + 1), array_sz);
 }
 
 /* Returns the nid associated with the given device tree node,
@@ -287,6 +272,60 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity,
+int max_array_sz)
+{
+   int i, nid;
+
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
+   nid = __associativity_to_nid(associativity, max_array_sz);
+   if (nid != NUMA_NO_NODE) {
+   for (i = 0; i < distance_ref_points_depth; i++) {
+   const __be32 *entry;
+   int index = be32_to_cpu(distance_ref_points[i]) - 1;
+
+   /*
+* broken hierarchy, return with broken distance table
+*/
+   if (WARN(index >= max_array_sz, "Broken 
ibm,associativity property"))
+   

[PATCH v8 4/5] powerpc/pseries: Add a helper for form1 cpu distance

2021-08-12 Thread Aneesh Kumar K.V
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c| 10 +-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index a6425a70c37b..a4712ecad3e9 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -84,7 +84,7 @@ static inline void sysfs_remove_device_from_node(struct 
device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c0a89839310c..fdb2472befa4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 
*cpu2_assoc)
 {
int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+   /* We should not get called with FORM0 */
+   VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+   return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int 
last_disp_cpu, int cur_disp_cpu)
if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
return -EIO;
 
-   return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+   return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
if (!disp_cpu_assoc || !vcpu_assoc)
return -EIO;
 
-   return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+   return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1



[PATCH v8 5/5] powerpc/pseries: Add support for FORM2 associativity

2021-08-12 Thread Aneesh Kumar K.V
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/powerpc/associativity.rst   | 104 
 arch/powerpc/include/asm/firmware.h   |   3 +-
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/kernel/prom_init.c   |   3 +-
 arch/powerpc/mm/numa.c| 187 ++
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 262 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..07e7dd3d6c87
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,104 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the oldest format and is now considered 
deprecated.
+
+Hypervisor indicates the type/form of associativity used via 
"ibm,architecture-vec-5 property".
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points, and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or 
more numbers
+(domainID index) that represents the 1 based ordinal in the associativity 
lists.
+The list of domainID indexes represents an increasing hierarchy of resource 
grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node 
id.
+Linux kernel computes NUMA distance between two domains by recursively 
comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value in
+"ibm,associativity-reference-points" property, Form 2 allows a large number of 
primary domain
+ids at the same domainID index representing resource groups of different 
performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers 
representing
+the domainIDs present in the system. The offset of the domainID in this 
property is
+used as an index while computing numa distance information via 
"ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, 
followed by
+N domainID encoded as with encode-int
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 
(2) is used when
+computing the distance of domain 8 from other domains present in the system. 
For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers 
representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, follow

[PATCH v2 2/2] powerpc: rename powerpc_debugfs_root to arch_debugfs_dir

2021-08-12 Thread Aneesh Kumar K.V
No functional change in this patch. arch_debugfs_dir is the generic kernel
name declared in linux/debugfs.h for arch-specific debugfs directory.
Architectures like x86/s390 already use the name. Rename powerpc
specific powerpc_debugfs_root to arch_debugfs_dir.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/debugfs.h  | 13 -
 arch/powerpc/kernel/Makefile|  3 ++-
 arch/powerpc/kernel/dawr.c  |  3 +--
 arch/powerpc/kernel/eeh.c   | 16 
 arch/powerpc/kernel/eeh_cache.c |  4 ++--
 arch/powerpc/kernel/fadump.c|  4 ++--
 arch/powerpc/kernel/hw_breakpoint.c |  1 -
 arch/powerpc/kernel/kdebugfs.c  | 14 ++
 arch/powerpc/kernel/security.c  | 16 
 arch/powerpc/kernel/setup-common.c  | 13 -
 arch/powerpc/kernel/setup_64.c  |  1 -
 arch/powerpc/kernel/traps.c |  4 ++--
 arch/powerpc/kvm/book3s_xics.c  |  6 +++---
 arch/powerpc/kvm/book3s_xive.c  |  3 +--
 arch/powerpc/kvm/book3s_xive_native.c   |  3 +--
 arch/powerpc/mm/book3s64/hash_utils.c   |  4 ++--
 arch/powerpc/mm/book3s64/pgtable.c  |  4 ++--
 arch/powerpc/mm/book3s64/radix_tlb.c|  6 +++---
 arch/powerpc/mm/ptdump/bats.c   |  4 ++--
 arch/powerpc/mm/ptdump/segment_regs.c   |  4 ++--
 arch/powerpc/platforms/cell/axon_msi.c  |  4 ++--
 arch/powerpc/platforms/powernv/memtrace.c   |  3 +--
 arch/powerpc/platforms/powernv/opal-imc.c   |  4 ++--
 arch/powerpc/platforms/powernv/opal-lpc.c   |  4 ++--
 arch/powerpc/platforms/powernv/opal-xscom.c |  4 ++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 ++--
 arch/powerpc/platforms/pseries/dtl.c|  4 ++--
 arch/powerpc/platforms/pseries/lpar.c   |  5 +++--
 arch/powerpc/sysdev/xive/common.c   |  3 +--
 arch/powerpc/xmon/xmon.c|  6 +++---
 30 files changed, 75 insertions(+), 92 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/debugfs.h
 create mode 100644 arch/powerpc/kernel/kdebugfs.c

diff --git a/arch/powerpc/include/asm/debugfs.h 
b/arch/powerpc/include/asm/debugfs.h
deleted file mode 100644
index 2c5c48571d75..
--- a/arch/powerpc/include/asm/debugfs.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_DEBUGFS_H
-#define _ASM_POWERPC_DEBUGFS_H
-
-/*
- * Copyright 2017, Michael Ellerman, IBM Corporation.
- */
-
-#include 
-
-extern struct dentry *powerpc_debugfs_root;
-
-#endif /* _ASM_POWERPC_DEBUGFS_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index f66b63e81c3b..7be36c1e1db6 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -46,7 +46,8 @@ obj-y := cputable.o syscalls.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
   of_platform.o prom_parse.o firmware.o \
-  hw_breakpoint_constraints.o interrupt.o
+  hw_breakpoint_constraints.o interrupt.o \
+  kdebugfs.o
 obj-y  += ptrace/
 obj-$(CONFIG_PPC64)+= setup_64.o \
   paca.o nvram_64.o note.o
diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
index cdc2dccb987d..64e423d2fe0f 100644
--- a/arch/powerpc/kernel/dawr.c
+++ b/arch/powerpc/kernel/dawr.c
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -101,7 +100,7 @@ static int __init dawr_force_setup(void)
if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) {
/* Turn DAWR off by default, but allow admin to turn it on */
debugfs_create_file_unsafe("dawr_enable_dangerous", 0600,
-  powerpc_debugfs_root,
+  arch_debugfs_dir,
   &dawr_force_enable,
   &dawr_enable_fops);
}
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3bbdcc86d01b..e9b597ed423c 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -21,9 +21,9 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -1901,24 +1901,24 @@ static int __init eeh_init_proc(void)
proc_create_single("powerpc/eeh", 0, NULL, proc_eeh_show);
 #ifdef CONFIG_DEBUG_FS
debugfs_create_file_unsafe("eeh_enable", 0600,
-  powerpc_debugfs_root, NULL,
+  arch_debugfs_dir, NULL,
   &eeh_enable_dbgfs_ops);
debugf

[PATCH v2 1/2] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry

2021-08-12 Thread Aneesh Kumar K.V
Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.
Also add a debugfs entry for tlb_local_single_page_flush_ceiling.

Signed-off-by: Aneesh Kumar K.V 
---
Changes from v1:
* switch to debugfs_create_u32

 arch/powerpc/mm/book3s64/radix_tlb.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..1fa2bc6a969e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1106,8 +1107,8 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
  * invalidating a full PID, so it has a far lower threshold to change from
  * individual page flushes to full-pid flushes.
  */
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
-static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = 
POWER9_TLB_SETS_RADIX * 2;
+static u32 tlb_single_page_flush_ceiling __read_mostly = 33;
+static u32 tlb_local_single_page_flush_ceiling __read_mostly = 
POWER9_TLB_SETS_RADIX * 2;
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
unsigned long start, unsigned long 
end)
@@ -1524,3 +1525,14 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned 
long lpid,
 EXPORT_SYMBOL_GPL(do_h_rpt_invalidate_prt);
 
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+   debugfs_create_u32("tlb_single_page_flush_ceiling", 0600,
+  powerpc_debugfs_root, 
&tlb_single_page_flush_ceiling);
+   debugfs_create_u32("tlb_local_single_page_flush_ceiling", 0600,
+  powerpc_debugfs_root, 
&tlb_local_single_page_flush_ceiling);
+   return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
+
-- 
2.31.1



Re: [PATCH v2 1/2] KVM: Refactor kvm_arch_vcpu_fault() to return a struct page pointer

2021-08-12 Thread Paolo Bonzini

On 12/08/21 11:04, David Hildenbrand wrote:


Reviewed-by: David Hildenbrand 

But at the same time I wonder if we should just get rid of 
CONFIG_KVM_S390_UCONTROL and consequently kvm_arch_vcpu_fault().


In practice CONFIG_KVM_S390_UCONTROL, is never enabled in any reasonable 
kernel build and consequently it's never tested; further, exposing the 
sie_block to user space allows user space to generate random SIE 
validity intercepts.


CONFIG_KVM_S390_UCONTROL feels like something that should just be 
maintained out of tree by someone who really needs to hack deep into hw 
virtualization for testing purposes etc.


I have no preference either way.  It should definitely have selftests, 
but in x86 land there are some features that are not covered by QEMU and 
were nevertheless accepted upstream with selftests.


Paolo



Re: [PATCH v2 5/9] powerpc/microwatt: Use standard 16550 UART for console

2021-08-12 Thread Segher Boessenkool
On Thu, Aug 12, 2021 at 03:14:44PM +0200, Christophe Leroy wrote:
> Le 18/06/2021 à 05:46, Paul Mackerras a écrit :
> >+static u8 udbg_uart_in_isa300_rm(unsigned int reg)
> >+{
> >+uint64_t msr = mfmsr();
> >+uint8_t  c;
> >+
> >+mtmsr(msr & ~(MSR_EE|MSR_DR));
> >+isync();
> >+eieio();
> >+c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2));
> >+mtmsr(msr);
> >+isync();
> >+return c;
> >+}
> 
> How do you make sure that GCC won't emit any access to the stack between 
> the two mtmsr() ?

The mtmsr are asm with a memory clobber so nothing will be moved between
these, and it is very unlikely anything will sprout up here out of
nothing.  But yes, this whole thing should be written as real asm (or as
one huge inline asm, but ugh).


Segher


[PATCH] ppc: add "-z notext" flag to disable diagnostic

2021-08-12 Thread Bill Wendling
The "-z notext" flag disables reporting an error if DT_TEXTREL is set on
PPC with CONFIG=kdump:

  ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against
local symbol in readonly segment; recompile object files with -fPIC
or pass '-Wl,-z,notext' to allow text relocations in the output
  >>> defined in built-in.a(arch/powerpc/kernel/misc.o)
  >>> referenced by arch/powerpc/kernel/misc.o:(.text+0x20) in archive
  built-in.a

The BFD linker disables this by default (though it's configurable in
current versions). LLD enables this by default. So we add the flag to
keep LLD from emitting the error.

Signed-off-by: Bill Wendling 

---
The output of the "get_maintainer.pl" run just in case no one believes me. :-)

$ ./scripts/get_maintainer.pl arch/powerpc/Makefile
Michael Ellerman  (supporter:LINUX FOR POWERPC (32-BIT AND 
64-BIT))
Benjamin Herrenschmidt  (reviewer:LINUX FOR POWERPC 
(32-BIT AND 64-BIT))
Paul Mackerras  (reviewer:LINUX FOR POWERPC (32-BIT AND 
64-BIT))
Nathan Chancellor  (supporter:CLANG/LLVM BUILD SUPPORT)
Nick Desaulniers  (supporter:CLANG/LLVM BUILD SUPPORT)
linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 64-BIT))
linux-ker...@vger.kernel.org (open list)
clang-built-li...@googlegroups.com (open list:CLANG/LLVM BUILD SUPPORT)
---
 arch/powerpc/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 6505d66f1193..17a9fbf9b789 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -122,6 +122,7 @@ endif
 
 LDFLAGS_vmlinux-y := -Bstatic
 LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) := -pie
+LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) += -z notext
 LDFLAGS_vmlinux:= $(LDFLAGS_vmlinux-y)
 
 ifdef CONFIG_PPC64
-- 
2.33.0.rc1.237.g0d66db33f3-goog



Re: [PATCH -next] trap: Cleanup trap_init()

2021-08-12 Thread Vineet Gupta

On 8/12/21 5:36 AM, Kefeng Wang wrote:

There are some empty trap_init() in different ARCHs, introduce
a new weak trap_init() function to cleanup them.

Cc: Vineet Gupta
Cc: Russell King
Cc: Yoshinori Sato
Cc: Ley Foon Tan
Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: Stafford Horne
Cc: James E.J. Bottomley
Cc: Helge Deller
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Jeff Dike
Cc: Richard Weinberger
Cc: Anton Ivanov
Cc: Andrew Morton
Signed-off-by: Kefeng Wang
---
  arch/arc/kernel/traps.c  | 5 -


Acked-by: Vineet Gupta   #arch/arc


Re: [PATCH] ppc: add "-z notext" flag to disable diagnostic

2021-08-12 Thread Nick Desaulniers
On Thu, Aug 12, 2021 at 1:49 PM Bill Wendling  wrote:
>
> The "-z notext" flag disables reporting an error if DT_TEXTREL is set on
> PPC with CONFIG=kdump:
>
>   ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against
> local symbol in readonly segment; recompile object files with -fPIC
> or pass '-Wl,-z,notext' to allow text relocations in the output
>   >>> defined in built-in.a(arch/powerpc/kernel/misc.o)
>   >>> referenced by arch/powerpc/kernel/misc.o:(.text+0x20) in archive
>   built-in.a
>
> The BFD linker disables this by default (though it's configurable in
> current versions). LLD enables this by default. So we add the flag to
> keep LLD from emitting the error.
>
> Signed-off-by: Bill Wendling 

Link: https://github.com/ClangBuiltLinux/linux/issues/811
Reported-by: Itaru Kitayama 
Reviewed-by: Nick Desaulniers 

>
> ---
> The output of the "get_maintainer.pl" run just in case no one believes me. :-)

LOL!

>
> $ ./scripts/get_maintainer.pl arch/powerpc/Makefile
> Michael Ellerman  (supporter:LINUX FOR POWERPC (32-BIT 
> AND 64-BIT))
> Benjamin Herrenschmidt  (reviewer:LINUX FOR POWERPC 
> (32-BIT AND 64-BIT))
> Paul Mackerras  (reviewer:LINUX FOR POWERPC (32-BIT AND 
> 64-BIT))
> Nathan Chancellor  (supporter:CLANG/LLVM BUILD SUPPORT)
> Nick Desaulniers  (supporter:CLANG/LLVM BUILD 
> SUPPORT)
> linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
> 64-BIT))
> linux-ker...@vger.kernel.org (open list)
> clang-built-li...@googlegroups.com (open list:CLANG/LLVM BUILD SUPPORT)
> ---
>  arch/powerpc/Makefile | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 6505d66f1193..17a9fbf9b789 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -122,6 +122,7 @@ endif
>
>  LDFLAGS_vmlinux-y := -Bstatic
>  LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) := -pie
> +LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) += -z notext
>  LDFLAGS_vmlinux:= $(LDFLAGS_vmlinux-y)
>
>  ifdef CONFIG_PPC64
> --
> 2.33.0.rc1.237.g0d66db33f3-goog
>


-- 
Thanks,
~Nick Desaulniers


Re: [PATCH -next] trap: Cleanup trap_init()

2021-08-12 Thread Michael Ellerman
Kefeng Wang  writes:
> There are some empty trap_init() in different ARCHs, introduce
> a new weak trap_init() function to cleanup them.
>
> Cc: Vineet Gupta 
> Cc: Russell King 
> Cc: Yoshinori Sato 
> Cc: Ley Foon Tan 
> Cc: Jonas Bonn 
> Cc: Stefan Kristiansson 
> Cc: Stafford Horne 
> Cc: James E.J. Bottomley 
> Cc: Helge Deller 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Paul Walmsley 
> Cc: Jeff Dike 
> Cc: Richard Weinberger 
> Cc: Anton Ivanov 
> Cc: Andrew Morton 
> Signed-off-by: Kefeng Wang 
> ---
>  arch/arc/kernel/traps.c  | 5 -
>  arch/arm/kernel/traps.c  | 5 -
>  arch/h8300/kernel/traps.c| 4 
>  arch/hexagon/kernel/traps.c  | 4 
>  arch/nds32/kernel/traps.c| 5 -
>  arch/nios2/kernel/traps.c| 5 -
>  arch/openrisc/kernel/traps.c | 5 -
>  arch/parisc/kernel/traps.c   | 4 
>  arch/powerpc/kernel/traps.c  | 5 -

Acked-by: Michael Ellerman  (powerpc)

cheers


[powerpc:fixes-test] BUILD SUCCESS cbc06f051c524dcfe52ef0d1f30647828e226d30

2021-08-12 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: cbc06f051c524dcfe52ef0d1f30647828e226d30  powerpc/xive: Do not 
skip CPU-less nodes when creating the IPIs

elapsed time: 721m

configs tested: 134
configs skipped: 108

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-20210812
i386 randconfig-c001-20210813
sh  polaris_defconfig
powerpc  bamboo_defconfig
powerpc   holly_defconfig
powerpcmpc7448_hpc2_defconfig
powerpccell_defconfig
powerpc mpc837x_rdb_defconfig
powerpc redwood_defconfig
arm at91_dt_defconfig
sh   secureedge5410_defconfig
powerpc pseries_defconfig
powerpc  makalu_defconfig
arm  jornada720_defconfig
m68k alldefconfig
sh  rsk7264_defconfig
shtitan_defconfig
powerpc asp8347_defconfig
arc  axs103_defconfig
powerpc mpc5200_defconfig
arm eseries_pxa_defconfig
powerpc mpc8540_ads_defconfig
arm vf610m4_defconfig
powerpc  ppc40x_defconfig
powerpc  ppc44x_defconfig
m68k   m5208evb_defconfig
x86_64   alldefconfig
mipsgpr_defconfig
mips  maltaaprp_defconfig
xtensageneric_kc705_defconfig
arm   stm32_defconfig
arm   omap1_defconfig
arc   tb10x_defconfig
arm  exynos_defconfig
mipsworkpad_defconfig
mips decstation_defconfig
mips  ath25_defconfig
powerpc ppa8548_defconfig
arcnsimosci_defconfig
arm davinci_all_defconfig
mips   gcw0_defconfig
armxcep_defconfig
riscv allnoconfig
powerpc64alldefconfig
powerpc ksi8560_defconfig
powerpc  pcm030_defconfig
arm s5pv210_defconfig
powerpcicon_defconfig
sh ap325rxa_defconfig
arm  gemini_defconfig
sh   alldefconfig
sh apsh4a3a_defconfig
powerpc  cm5200_defconfig
powerpc   allnoconfig
armzeus_defconfig
sh sh03_defconfig
arm   aspeed_g5_defconfig
mipsmaltaup_defconfig
arc nsimosci_hs_defconfig
x86_64allnoconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
x86_64   randconfig-a006-20210812
x86_64   randconfig-a004-20210812
x86_64   randconfig-a003-20210812
x86_64   randconfig-a005-20210812

[powerpc:next-test] BUILD REGRESSION aaf01a9f507c108cc8e86f420b0a81e21354be1b

2021-08-12 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next-test
branch HEAD: aaf01a9f507c108cc8e86f420b0a81e21354be1b  cpufreq: powernv: Fix 
init_chip_info initialization in numa=off

Error/Warning reports:

https://lore.kernel.org/lkml/202108130432.0md3hwu2-...@intel.com

Error/Warning in current branch:

arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: 
'node_to_cpumask_map' undeclared (first use in this function)

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
`-- powerpc-randconfig-r005-20210812
`-- 
arch-powerpc-platforms-pseries-hotplug-cpu.c:error:node_to_cpumask_map-undeclared-(first-use-in-this-function)

elapsed time: 727m

configs tested: 115
configs skipped: 3

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-20210812
i386 randconfig-c001-20210813
powerpc mpc5200_defconfig
parisc  defconfig
powerpc wii_defconfig
arm  integrator_defconfig
armtrizeps4_defconfig
sh  polaris_defconfig
powerpc  bamboo_defconfig
powerpc   holly_defconfig
powerpcmpc7448_hpc2_defconfig
powerpccell_defconfig
powerpc mpc837x_rdb_defconfig
m68k alldefconfig
sh  rsk7264_defconfig
shtitan_defconfig
powerpc asp8347_defconfig
powerpcicon_defconfig
arc  axs103_defconfig
riscvnommu_k210_defconfig
m68k   m5208evb_defconfig
x86_64   alldefconfig
powerpc pseries_defconfig
mipsgpr_defconfig
mips  maltaaprp_defconfig
xtensageneric_kc705_defconfig
arm   stm32_defconfig
arm   omap1_defconfig
arc   tb10x_defconfig
arm  exynos_defconfig
mips  ath25_defconfig
mipsworkpad_defconfig
mips decstation_defconfig
powerpc ppa8548_defconfig
arcnsimosci_defconfig
armzeus_defconfig
sh sh03_defconfig
arm   aspeed_g5_defconfig
mipsmaltaup_defconfig
arc nsimosci_hs_defconfig
x86_64allnoconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
x86_64   randconfig-a006-20210812
x86_64   randconfig-a004-20210812
x86_64   randconfig-a003-20210812
x86_64   randconfig-a005-20210812
x86_64   randconfig-a002-20210812
x86_64   randconfig-a001-20210812
i386 randconfig-a004-20210812
i386 randconfig-a003-20210812
i386 randconfig-a002-20210812
i386 randconfig-a001-20210812
i386 randconfig-a006-20210812
i386 randconfig-a005-20210812
i386 randconfig-a011-20210812
i386 randconfig-a015-20210812
i386 randconfig-a013-202

Re: [RFC PATCH v0 5/5] pseries: Asynchronous page fault support

2021-08-12 Thread Nicholas Piggin
Excerpts from Bharata B Rao's message of August 5, 2021 5:24 pm:
> Add asynchronous page fault support for pseries guests.
> 
> 1. Setup the guest to handle async-pf
>- Issue H_REG_SNS hcall to register the SNS region.
>- Setup the subvention interrupt irq.
>- Enable async-pf by updating the byte_b9 of VPA for each
>  CPU.
> 2. Check if the page fault is an expropriation notification
>(SRR1_PROGTRAP set in SRR1) and if so put the task on
>wait queue based on the expropriation correlation number
>read from the VPA.
> 3. Handle subvention interrupt to wake any waiting tasks.
>The wait and wakeup mechanism from x86 async-pf implementation
>is being reused here.

I don't know too much about the background of this.

How much benefit does this give? What situations? Does PowerVM implement 
it? Do other architectures KVM have something similar?

The SRR1 setting for the DSI is in PAPR? In that case it should be okay,
it might be good to add a small comment in exceptions-64s.S.

[...]

> @@ -395,6 +395,11 @@ static int ___do_page_fault(struct pt_regs *regs, 
> unsigned long address,
>   vm_fault_t fault, major = 0;
>   bool kprobe_fault = kprobe_page_fault(regs, 11);
>  
> +#ifdef CONFIG_PPC_PSERIES
> + if (handle_async_page_fault(regs, address))
> + return 0;
> +#endif
> +
>   if (unlikely(debugger_fault_handler(regs) || kprobe_fault))
>   return 0;

[...]

> +int handle_async_page_fault(struct pt_regs *regs, unsigned long addr)
> +{
> + struct async_pf_sleep_node n;
> + DECLARE_SWAITQUEUE(wait);
> + unsigned long exp_corr_nr;
> +
> + /* Is this Expropriation notification? */
> + if (!(mfspr(SPRN_SRR1) & SRR1_PROGTRAP))
> + return 0;

Yep this should be an inline that is guarded by a static key, and then 
probably have an inline check for SRR1_PROGTRAP. You shouldn't need to
mfspr here, but just use regs->msr.

> +
> + if (unlikely(!user_mode(regs)))
> + panic("Host injected async PF in kernel mode\n");

Hmm. Is there anything in the PAPR interface that specifies that the
OS can only deal with problem state access faults here? Or is that
inherent in the expropriation feature?

Thanks,
Nick


Re: [PATCH v1 17/55] KVM: PPC: Book3S HV P9: Implement PMU save/restore in C

2021-08-12 Thread Nicholas Piggin
Excerpts from Athira Rajeev's message of August 9, 2021 1:03 pm:
> 
> 
>> On 26-Jul-2021, at 9:19 AM, Nicholas Piggin  wrote:


>> +static void freeze_pmu(unsigned long mmcr0, unsigned long mmcra)
>> +{
>> +if (!(mmcr0 & MMCR0_FC))
>> +goto do_freeze;
>> +if (mmcra & MMCRA_SAMPLE_ENABLE)
>> +goto do_freeze;
>> +if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>> +if (!(mmcr0 & MMCR0_PMCCEXT))
>> +goto do_freeze;
>> +if (!(mmcra & MMCRA_BHRB_DISABLE))
>> +goto do_freeze;
>> +}
>> +return;
>> +
>> +do_freeze:
>> +mmcr0 = MMCR0_FC;
>> +mmcra = 0;
>> +if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>> +mmcr0 |= MMCR0_PMCCEXT;
>> +mmcra = MMCRA_BHRB_DISABLE;
>> +}
>> +
>> +mtspr(SPRN_MMCR0, mmcr0);
>> +mtspr(SPRN_MMCRA, mmcra);
>> +isync();
>> +}
>> +
> Hi Nick,
> 
> After feezing pmu, do we need to clear “pmcregs_in_use” as well?

Not until we save the values out of the registers. pmcregs_in_use = 0 
means our hypervisor is free to clear our PMU register contents.

> Also can’t we unconditionally do the MMCR0/MMCRA/ freeze settings in here ? 
> do we need the if conditions for FC/PMCCEXT/BHRB ?

I think it's possible, but pretty minimal advantage. I would prefer to 
set them the way perf does for now. If we can move this code into perf/
it should become easier for you to tweak things.

Thanks,
Nick


Re: [RFC PATCH v0 5/5] pseries: Asynchronous page fault support

2021-08-12 Thread Bharata B Rao
On Fri, Aug 13, 2021 at 02:06:40PM +1000, Nicholas Piggin wrote:
> Excerpts from Bharata B Rao's message of August 5, 2021 5:24 pm:
> > Add asynchronous page fault support for pseries guests.
> > 
> > 1. Setup the guest to handle async-pf
> >- Issue H_REG_SNS hcall to register the SNS region.
> >- Setup the subvention interrupt irq.
> >- Enable async-pf by updating the byte_b9 of VPA for each
> >  CPU.
> > 2. Check if the page fault is an expropriation notification
> >(SRR1_PROGTRAP set in SRR1) and if so put the task on
> >wait queue based on the expropriation correlation number
> >read from the VPA.
> > 3. Handle subvention interrupt to wake any waiting tasks.
> >The wait and wakeup mechanism from x86 async-pf implementation
> >is being reused here.
> 
> I don't know too much about the background of this.
> 
> How much benefit does this give? What situations?

I haven't yet gotten into measuring the benefit of this. Once
the patches are bit more stable than what they are currently,
we need to measure and evaluate the benefits.

> Does PowerVM implement it?

I suppose so, need to check though.

> Do other architectures KVM have something similar?

Yes, x86 and s390 KVM have had this feature for a while now
and generic KVM interfaces exist to support it.

> 
> The SRR1 setting for the DSI is in PAPR? In that case it should be okay,
> it might be good to add a small comment in exceptions-64s.S.

Yes, SRR1 setting is part of PAPR.

> 
> [...]
> 
> > @@ -395,6 +395,11 @@ static int ___do_page_fault(struct pt_regs *regs, 
> > unsigned long address,
> > vm_fault_t fault, major = 0;
> > bool kprobe_fault = kprobe_page_fault(regs, 11);
> >  
> > +#ifdef CONFIG_PPC_PSERIES
> > +   if (handle_async_page_fault(regs, address))
> > +   return 0;
> > +#endif
> > +
> > if (unlikely(debugger_fault_handler(regs) || kprobe_fault))
> > return 0;
> 
> [...]
> 
> > +int handle_async_page_fault(struct pt_regs *regs, unsigned long addr)
> > +{
> > +   struct async_pf_sleep_node n;
> > +   DECLARE_SWAITQUEUE(wait);
> > +   unsigned long exp_corr_nr;
> > +
> > +   /* Is this Expropriation notification? */
> > +   if (!(mfspr(SPRN_SRR1) & SRR1_PROGTRAP))
> > +   return 0;
> 
> Yep this should be an inline that is guarded by a static key, and then 
> probably have an inline check for SRR1_PROGTRAP. You shouldn't need to
> mfspr here, but just use regs->msr.

Right.

> 
> > +
> > +   if (unlikely(!user_mode(regs)))
> > +   panic("Host injected async PF in kernel mode\n");
> 
> Hmm. Is there anything in the PAPR interface that specifies that the
> OS can only deal with problem state access faults here? Or is that
> inherent in the expropriation feature?

Didn't see anything specific to that effect in PAPR. However since
this puts the faulting guest process to sleep until the page
becomes ready in the host, I have limited it to guest user space
faults.

Regards,
Bharata.


Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Jiri Slaby

Hi,

On 12. 08. 21, 14:26, kernel test robot wrote:

drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is uninitialized 
when used here [-Wuninitialized]

spin_unlock_irqrestore(&hp->c_lock, flags);
^~
drivers/tty/hvc/hvc_console.c:149:23: note: initialize the variable 'hp' to 
silence this warning
struct hvc_struct *hp;
 ^
  = NULL


So you clearly didn't test your change as it would crash quite 
instantly. I wonder, where do you intend to get hp from in the 
console::print() hook?


thanks,
--
js
suse labs


Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()

2021-08-12 Thread Xianting TIan



在 2021/8/13 下午1:53, Jiri Slaby 写道:

Hi,

On 12. 08. 21, 14:26, kernel test robot wrote:
drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is 
uninitialized when used here [-Wuninitialized]

    spin_unlock_irqrestore(&hp->c_lock, flags);
    ^~
    drivers/tty/hvc/hvc_console.c:149:23: note: initialize the 
variable 'hp' to silence this warning

    struct hvc_struct *hp;
 ^
  = NULL


So you clearly didn't test your change as it would crash quite 
instantly. I wonder, where do you intend to get hp from in the 
console::print() hook?


I am very sorry for the inconvenience caused.

This is caused by my carelessness:(

I take it for granted that there is no problem when I just switch to use 
array(cons_outbuf[]).


sorry agin.



thanks,


Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-12 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of April 14, 2021 2:38 am:
> powerpc BUG_ON() and WARN_ON() are based on using twnei instruction.
> 
> For catching simple conditions like a variable having value 0, this
> is efficient because it does the test and the trap at the same time.
> But most conditions used with BUG_ON or WARN_ON are more complex and
> forces GCC to format the condition into a 0 or 1 value in a register.
> This will usually require 2 to 3 instructions.
> 
> The most efficient solution would be to use __builtin_trap() because
> GCC is able to optimise the use of the different trap instructions
> based on the requested condition, but this is complex if not
> impossible for the following reasons:
> - __builtin_trap() is a non-recoverable instruction, so it can't be
> used for WARN_ON
> - Knowing which line of code generated the trap would require the
> analysis of DWARF information. This is not a feature we have today.
> 
> As mentioned in commit 8d4fbcfbe0a4 ("Fix WARN_ON() on bitfield ops")
> the way WARN_ON() is implemented is suboptimal. That commit also
> mentions an issue with 'long long' condition. It fixed it for
> WARN_ON() but the same problem still exists today with BUG_ON() on
> PPC32. It will be fixed by using the generic implementation.
> 
> By using the generic implementation, gcc will naturally generate a
> branch to the unconditional trap generated by BUG().
> 
> As modern powerpc implement zero-cycle branch,
> that's even more efficient.
> 
> And for the functions using WARN_ON() and its return, the test
> on return from WARN_ON() is now also used for the WARN_ON() itself.
> 
> On PPC64 we don't want it because we want to be able to use CFAR
> register to track how we entered the code that trapped. The CFAR
> register would be clobbered by the branch.
> 
> A simple test function:
> 
>   unsigned long test9w(unsigned long a, unsigned long b)
>   {
>   if (WARN_ON(!b))
>   return 0;
>   return a / b;
>   }
> 
> Before the patch:
> 
>   046c :
>46c:   7c 89 00 34 cntlzw  r9,r4
>470:   55 29 d9 7e rlwinm  r9,r9,27,5,31
>474:   0f 09 00 00 twnei   r9,0
>478:   2c 04 00 00 cmpwi   r4,0
>47c:   41 82 00 0c beq 488 
>480:   7c 63 23 96 divwu   r3,r3,r4
>484:   4e 80 00 20 blr
> 
>488:   38 60 00 00 li  r3,0
>48c:   4e 80 00 20 blr
> 
> After the patch:
> 
>   0468 :
>468:   2c 04 00 00 cmpwi   r4,0
>46c:   41 82 00 0c beq 478 
>470:   7c 63 23 96 divwu   r3,r3,r4
>474:   4e 80 00 20 blr
> 
>478:   0f e0 00 00 twuir0,0
>47c:   38 60 00 00 li  r3,0
>480:   4e 80 00 20 blr

That's clearly better because we have a branch anyway.

> 
> So we see before the patch we need 3 instructions on the likely path
> to handle the WARN_ON(). With the patch the trap goes on the unlikely
> path.
> 
> See below the difference at the entry of system_call_exception where
> we have several BUG_ON(), allthough less impressing.
> 
> With the patch:
> 
>    :
>  0:   81 6a 00 84 lwz r11,132(r10)
>  4:   90 6a 00 88 stw r3,136(r10)
>  8:   71 60 00 02 andi.   r0,r11,2
>  c:   41 82 00 70 beq 7c 
> 10:   71 60 40 00 andi.   r0,r11,16384
> 14:   41 82 00 6c beq 80 
> 18:   71 6b 80 00 andi.   r11,r11,32768
> 1c:   41 82 00 68 beq 84 
> 20:   94 21 ff e0 stwur1,-32(r1)
> 24:   93 e1 00 1c stw r31,28(r1)
> 28:   7d 8c 42 e6 mftbr12
>   ...
> 7c:   0f e0 00 00 twuir0,0
> 80:   0f e0 00 00 twuir0,0
> 84:   0f e0 00 00 twuir0,0
> 
> Without the patch:
> 
>    :
>  0:   94 21 ff e0 stwur1,-32(r1)
>  4:   93 e1 00 1c stw r31,28(r1)
>  8:   90 6a 00 88 stw r3,136(r10)
>  c:   81 6a 00 84 lwz r11,132(r10)
> 10:   69 60 00 02 xorir0,r11,2
> 14:   54 00 ff fe rlwinm  r0,r0,31,31,31
> 18:   0f 00 00 00 twnei   r0,0
> 1c:   69 60 40 00 xorir0,r11,16384
> 20:   54 00 97 fe rlwinm  r0,r0,18,31,31
> 24:   0f 00 00 00 twnei   r0,0
> 28:   69 6b 80 00 xorir11,r11,32768
> 2c:   55 6b 8f fe rlwinm  r11,r11,17,31,31
> 30:   0f 0b 00 00 twnei   r11,0
> 34:   7d 8c 42 e6 mftbr12

This one possibly the branches end up in predictors, whereas conditional 
trap is always just speculated not to hit. Branches may also have a
throughput limit on execution whereas trap could be more (1 per cycle
vs 4 per cycle on POWER9).

On typical ppc32 CPUs, maybe it's a more obvious win. As you say there
is the CFAR issue as well which makes 

Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-12 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of April 14, 2021 2:38 am:
> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
> flexibility to GCC.
> 
> For that add an entry to the exception table so that
> program_check_exception() knowns where to resume execution
> after a WARNING.

Nice idea. How much does it bloat the exception table?
How easy would it be to make a different exception table for
unimportant stuff like WARN_ON faults?

> 
> Here are two exemples. The first one is done on PPC32 (which
> benefits from the previous patch), the second is on PPC64.
> 
>   unsigned long test(struct pt_regs *regs)
>   {
>   int ret;
> 
>   WARN_ON(regs->msr & MSR_PR);
> 
>   return regs->gpr[3];
>   }
> 
>   unsigned long test9w(unsigned long a, unsigned long b)
>   {
>   if (WARN_ON(!b))
>   return 0;
>   return a / b;
>   }
> 
> Before the patch:
> 
>   03a8 :
>3a8:   81 23 00 84 lwz r9,132(r3)
>3ac:   71 29 40 00 andi.   r9,r9,16384
>3b0:   40 82 00 0c bne 3bc 
>3b4:   80 63 00 0c lwz r3,12(r3)
>3b8:   4e 80 00 20 blr
> 
>3bc:   0f e0 00 00 twuir0,0
>3c0:   80 63 00 0c lwz r3,12(r3)
>3c4:   4e 80 00 20 blr
> 
>   0bf0 <.test9w>:
>bf0:   7c 89 00 74 cntlzd  r9,r4
>bf4:   79 29 d1 82 rldicl  r9,r9,58,6
>bf8:   0b 09 00 00 tdnei   r9,0
>bfc:   2c 24 00 00 cmpdi   r4,0
>c00:   41 82 00 0c beq c0c <.test9w+0x1c>
>c04:   7c 63 23 92 divdu   r3,r3,r4
>c08:   4e 80 00 20 blr
> 
>c0c:   38 60 00 00 li  r3,0
>c10:   4e 80 00 20 blr
> 
> After the patch:
> 
>   03a8 :
>3a8:   81 23 00 84 lwz r9,132(r3)
>3ac:   71 29 40 00 andi.   r9,r9,16384
>3b0:   40 82 00 0c bne 3bc 
>3b4:   80 63 00 0c lwz r3,12(r3)
>3b8:   4e 80 00 20 blr
> 
>3bc:   0f e0 00 00 twuir0,0
> 
>   0c50 <.test9w>:
>c50:   7c 89 00 74 cntlzd  r9,r4
>c54:   79 29 d1 82 rldicl  r9,r9,58,6
>c58:   0b 09 00 00 tdnei   r9,0
>c5c:   7c 63 23 92 divdu   r3,r3,r4
>c60:   4e 80 00 20 blr
> 
>c70:   38 60 00 00 li  r3,0
>c74:   4e 80 00 20 blr

One thing that would be nice for WARN_ON_ONCE is to patch out the trap
after the program interrupt. I've seen sometimes the WARN_ON_ONCE starts
firing a lot and slows down the kernel. That gets harder to do if the
trap is now part of the control flow.

I guess that's less important case for performance though. And in theory
you might be able to replace the trap with an equivalent cmp and branch
(but that's probably going too over engineering it).

Thanks,
Nick