Re: [PATCH v3 4/8] PCI: replace pci_dev::driver usage that gets the driver name
On Wed, Aug 11, 2021 at 10:06:33AM +0200, Uwe Kleine-K??nig wrote: > static inline const char *eeh_driver_name(struct pci_dev *pdev) > { > - return (pdev && pdev->driver) ? pdev->driver->name : ""; > + const char *drvstr = pdev ? dev_driver_string(&pdev->dev) : ""; > + > + if (*drvstr == '\0') > + return ""; > + > + return drvstr; This looks rather obsfucated due to the fact that dev_driver_string never returns '\0', and due to the strange mix of a tenary operation and the if on a related condition. > } > > #endif /* CONFIG_EEH */ > diff --git a/drivers/bcma/host_pci.c b/drivers/bcma/host_pci.c > index 69c10a7b7c61..dc2ffa686964 100644 > --- a/drivers/bcma/host_pci.c > +++ b/drivers/bcma/host_pci.c > @@ -175,9 +175,10 @@ static int bcma_host_pci_probe(struct pci_dev *dev, > if (err) > goto err_kfree_bus; > > - name = dev_name(&dev->dev); > - if (dev->driver && dev->driver->name) > - name = dev->driver->name; > + name = dev_driver_string(&dev->dev); > + if (*name == '\0') > + name = dev_name(&dev->dev); Where does this '\0' check come from? > + > + name = dev_driver_string(&dev->dev); > + if (*name == '\0') > + name = dev_name(&dev->dev); > + More of this weirdness.
Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry
Christophe Leroy writes: > Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit : >> Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/mm/book3s64/radix_tlb.c | 48 >> 1 file changed, 48 insertions(+) >> >> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c >> b/arch/powerpc/mm/book3s64/radix_tlb.c >> index aefc100d79a7..5cca0fe130e7 100644 >> --- a/arch/powerpc/mm/book3s64/radix_tlb.c >> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c >> @@ -17,6 +17,7 @@ ... >> + >> +static int __init create_tlb_single_page_flush_ceiling(void) >> +{ >> +debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR, >> +powerpc_debugfs_root, NULL, &fops_tlbflush); > > Could you just use debugfs_create_u32() instead of re-implementing simple > read and write ? Yeah AFAICS that should work fine. It could probably even be a u16? cheers
Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
在 2021/8/6 下午10:51, Arnd Bergmann 写道: On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian wrote: @@ -163,6 +155,13 @@ static void hvc_console_print(struct console *co, const char *b, if (vtermnos[index] == -1) return; + list_for_each_entry(hp, &hvc_structs, next) + if (hp->vtermno == vtermnos[index]) + break; + + c = hp->c; + + spin_lock_irqsave(&hp->c_lock, flags); The loop looks like it might race against changes to the list. It seems strange that the print function has to actually search for the structure here. It may be better to have yet another array for the buffer pointers next to the cons_ops[] and vtermnos[] arrays. +/* + * These sizes are most efficient for vio, because they are the + * native transfer size. We could make them selectable in the + * future to better deal with backends that want other buffer sizes. + */ +#define N_OUTBUF 16 +#define N_INBUF16 + +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long I think you need a higher alignment for DMA buffers, instead of sizeof(long), I would suggest ARCH_DMA_MINALIGN. As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think it 's better remain the code unchanged, I will send v5 patch soon. Arnd
Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry
On 8/12/21 12:58 PM, Michael Ellerman wrote: Christophe Leroy writes: Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit : Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_tlb.c | 48 1 file changed, 48 insertions(+) diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c index aefc100d79a7..5cca0fe130e7 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -17,6 +17,7 @@ ... + +static int __init create_tlb_single_page_flush_ceiling(void) +{ + debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR, + powerpc_debugfs_root, NULL, &fops_tlbflush); Could you just use debugfs_create_u32() instead of re-implementing simple read and write ? Yeah AFAICS that should work fine. It could probably even be a u16? I was looking at switching all that to u64. Should i fallback to u16, considering a tlb_signle_page_flush_ceiling value larger that 2**16 doesn't make sense? -aneesh
Re: [PATCH v3 4/8] PCI: replace pci_dev::driver usage that gets the driver name
On Thu, Aug 12, 2021 at 08:07:20AM +0100, Christoph Hellwig wrote: > On Wed, Aug 11, 2021 at 10:06:33AM +0200, Uwe Kleine-K??nig wrote: > > static inline const char *eeh_driver_name(struct pci_dev *pdev) > > { > > - return (pdev && pdev->driver) ? pdev->driver->name : ""; > > + const char *drvstr = pdev ? dev_driver_string(&pdev->dev) : ""; > > + > > + if (*drvstr == '\0') > > + return ""; > > + > > + return drvstr; > > This looks rather obsfucated due to the fact that dev_driver_string > never returns '\0', and due to the strange mix of a tenary operation > and the if on a related condition. dev_driver_string() might return "" (via dev_bus_name()). If that happens *drvstr == '\0' becomes true. Would the following be better?: const char *drvstr; if (pdev) return ""; drvstr = dev_driver_string(&pdev->dev); if (!strcmp(drvstr, "")) return ""; return drvstr; When I thought about this hunk I considered it ugly to have "" in it twice. > > } > > > > #endif /* CONFIG_EEH */ > > diff --git a/drivers/bcma/host_pci.c b/drivers/bcma/host_pci.c > > index 69c10a7b7c61..dc2ffa686964 100644 > > --- a/drivers/bcma/host_pci.c > > +++ b/drivers/bcma/host_pci.c > > @@ -175,9 +175,10 @@ static int bcma_host_pci_probe(struct pci_dev *dev, > > if (err) > > goto err_kfree_bus; > > > > - name = dev_name(&dev->dev); > > - if (dev->driver && dev->driver->name) > > - name = dev->driver->name; > > + name = dev_driver_string(&dev->dev); > > + if (*name == '\0') > > + name = dev_name(&dev->dev); > > Where does this '\0' check come from? The original code is equivalent to if (dev->driver && dev->driver->name) name = dev->driver->name; else: name = dev_name(...); As dev_driver_string() implements something like: if (dev->driver && dev->driver->name) return dev->driver->name; else return ""; the change looks fine to me. (One could wonder if it's sensible to fall back to the device name if the driver has no nice name, but this isn't new with my change.) > > + name = dev_driver_string(&dev->dev); > > + if (*name == '\0') > > + name = dev_name(&dev->dev); > > + > > More of this weirdness. I admit it's not pretty. Would it help to use !strcmp(name, "") instead of *name == '\0'? Any other constructive suggestion? Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König| Industrial Linux Solutions | https://www.pengutronix.de/ | signature.asc Description: PGP signature
[PATCH v5 1/2] tty: hvc: pass DMA capable memory to put_chars()
As well known, hvc backend can register its opertions to hvc backend. the opertions contain put_chars(), get_chars() and so on. Some hvc backend may do dma in its opertions. eg, put_chars() of virtio-console. But in the code of hvc framework, it may pass DMA incapable memory to put_chars() under a specific configuration, which is explained in commit c4baad5029(virtio-console: avoid DMA from stack): 1, c[] is on stack, hvc_console_print(): char c[N_OUTBUF] __ALIGNED__; cons_ops[index]->put_chars(vtermnos[index], c, i); 2, ch is on stack, static void hvc_poll_put_char(,,char ch) { struct tty_struct *tty = driver->ttys[0]; struct hvc_struct *hp = tty->driver_data; int n; do { n = hp->ops->put_chars(hp->vtermno, &ch, 1); } while (n <= 0); } Commit c4baad5029 is just the fix to avoid DMA from stack memory, which is passed to virtio-console by hvc framework in above code. But I think the fix is aggressive, it directly uses kmemdup() to alloc new buffer from kmalloc area and do memcpy no matter the memory is in kmalloc area or not. But most importantly, it should better be fixed in the hvc framework, by changing it to never pass stack memory to the put_chars() function in the first place. Otherwise, we still face the same issue if a new hvc backend using dma added in the furture. We make 'char c[N_OUTBUF]' part of 'struct hvc_struct', so hp->c is no longer the stack memory. we can use it in above two cases. Other cleanup is use ARCH_DMA_MINALIGN for align, and make 'hp->outbuf' aligned and use struct_size() for the size parameter of kzalloc(). With the patch, we can remove the fix c4baad5029. Signed-off-by: Xianting Tian Tested-by: Xianting Tian --- drivers/tty/hvc/hvc_console.c | 40 +-- drivers/tty/hvc/hvc_console.h | 16 -- 2 files changed, 38 insertions(+), 18 deletions(-) diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c index 5bb8c4e44..c56564eb7 100644 --- a/drivers/tty/hvc/hvc_console.c +++ b/drivers/tty/hvc/hvc_console.c @@ -41,16 +41,6 @@ */ #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */ -/* - * These sizes are most efficient for vio, because they are the - * native transfer size. We could make them selectable in the - * future to better deal with backends that want other buffer sizes. - */ -#define N_OUTBUF 16 -#define N_INBUF16 - -#define __ALIGNED__ __attribute__((__aligned__(sizeof(long - static struct tty_driver *hvc_driver; static struct task_struct *hvc_task; @@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp) static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES]; static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] = {[0 ... MAX_NR_HVC_CONSOLES - 1] = -1}; +static char *cons_outbuf[MAX_NR_HVC_CONSOLES]; /* * Console APIs, NOT TTY. These APIs are available immediately when @@ -151,18 +142,23 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] = static void hvc_console_print(struct console *co, const char *b, unsigned count) { - char c[N_OUTBUF] __ALIGNED__; + char *c; unsigned i = 0, n = 0; int r, donecr = 0, index = co->index; + unsigned long flags; + struct hvc_struct *hp; /* Console access attempt outside of acceptable console range. */ if (index >= MAX_NR_HVC_CONSOLES) return; /* This console adapter was removed so it is not usable. */ - if (vtermnos[index] == -1) + if (vtermnos[index] == -1 || !cons_outbuf[index]) return; + c = cons_outbuf[index]; + + spin_lock_irqsave(&hp->c_lock, flags); while (count > 0 || i > 0) { if (count > 0 && i < sizeof(c)) { if (b[n] == '\n' && !donecr) { @@ -191,6 +187,7 @@ static void hvc_console_print(struct console *co, const char *b, } } } + spin_unlock_irqrestore(&hp->c_lock, flags); hvc_console_flush(cons_ops[index], vtermnos[index]); } @@ -878,9 +875,19 @@ static void hvc_poll_put_char(struct tty_driver *driver, int line, char ch) struct tty_struct *tty = driver->ttys[0]; struct hvc_struct *hp = tty->driver_data; int n; + unsigned long flags; + char *c; + + if (!hp || !cons_outbuf[hp->index]) + return; + + c = cons_outbuf[hp->index]; do { - n = hp->ops->put_chars(hp->vtermno, &ch, 1); + spin_lock_irqsave(&hp->c_lock, flags); + c[0] = ch; + n = hp->ops->put_chars(hp->vtermno, c, 1); + spin_unlock_irqrestore(&hp->c_lock, flags); } while (n <= 0); } #endif @@ -922,8 +929,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data, return ERR_PTR(err); } - hp = kzalloc(ALIGN(size
[PATCH v5 0/2] make hvc pass dma capable memory to its backend
Dear all, This patch series make hvc framework pass DMA capable memory to put_chars() of hvc backend(eg, virtio-console), and revert commit c4baad5029 ("virtio-console: avoid DMA from stack”) V1 virtio-console: avoid DMA from vmalloc area https://lkml.org/lkml/2021/7/27/494 For v1 patch, Arnd Bergmann suggests to fix the issue in the first place: Make hvc pass DMA capable memory to put_chars() The fix suggestion is included in v2. V2 [PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars() https://lkml.org/lkml/2021/8/1/8 [PATCH 2/2] virtio-console: remove unnecessary kmemdup() https://lkml.org/lkml/2021/8/1/9 For v2 patch, Arnd Bergmann suggests to make new buf part of the hvc_struct structure, and fix the compile issue. The fix suggestion is included in v3. V3 [PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars() https://lkml.org/lkml/2021/8/3/1347 [PATCH v3 2/2] virtio-console: remove unnecessary kmemdup() https://lkml.org/lkml/2021/8/3/1348 For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to calculate the size of hvc_struct. The fix suggestion is included in v4. V4 [PATCH v4 0/2] make hvc pass dma capable memory to its backend https://lkml.org/lkml/2021/8/5/1350 [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars() https://lkml.org/lkml/2021/8/5/1351 [PATCH v4 2/2] virtio-console: remove unnecessary kmemdup() https://lkml.org/lkml/2021/8/5/1352 For v4 patch, Arnd Bergmann suggests to introduce another array(cons_outbuf[]) for the buffer pointers next to the cons_ops[] and vtermnos[] arrays. This fix included in this v5 patch. drivers/char/virtio_console.c | 12 ++-- drivers/tty/hvc/hvc_console.c | 40 +-- drivers/tty/hvc/hvc_console.h | 16 -- 3 file changed
[PATCH v5 2/2] virtio-console: remove unnecessary kmemdup()
hvc framework will never pass stack memory to the put_chars() function, So the calling of kmemdup() is unnecessary, we can remove it. This revert commit c4baad5029 ("virtio-console: avoid DMA from stack") Signed-off-by: Xianting Tian --- drivers/char/virtio_console.c | 12 ++-- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index 7eaf303a7..4ed3ffb1d 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int count) { struct port *port; struct scatterlist sg[1]; - void *data; - int ret; if (unlikely(early_put_chars)) return early_put_chars(vtermno, buf, count); @@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int count) if (!port) return -EPIPE; - data = kmemdup(buf, count, GFP_ATOMIC); - if (!data) - return -ENOMEM; - - sg_init_one(sg, data, count); - ret = __send_to_port(port, sg, 1, count, data, false); - kfree(data); - return ret; + sg_init_one(sg, buf, count); + return __send_to_port(port, sg, 1, count, (void *)buf, false); } /* -- 2.17.1
Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
On Thu, Aug 12, 2021 at 10:08 AM Xianting TIan wrote: > 在 2021/8/6 下午10:51, Arnd Bergmann 写道: > > On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian > >> +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long > > I think you need a higher alignment for DMA buffers, instead of > > sizeof(long), > > I would suggest ARCH_DMA_MINALIGN. > > As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think > it 's better remain the code unchanged, > > I will send v5 patch soon. I think you could just use "L1_CACHE_BYTES" as the alignment in this case. This will make the structure slightly larger for architectures that do not have alignment constraints on DMA buffers, but using a smaller alignment is clearly wrong. Another option would be to use ARCH_KMALLOC_MINALIGN. Note that there is a patch to add ARCH_DMA_MINALIGN to riscv already, as some implementations do not have coherent DMA. I had failed to realized though that on x86 you do not get an ARCH_DMA_MINALIGN definition. Arnd
Re: [PATCH v2 1/2] KVM: Refactor kvm_arch_vcpu_fault() to return a struct page pointer
On 12.08.21 06:02, Hou Wenlong wrote: From: Sean Christopherson Refactor kvm_arch_vcpu_fault() to return 'struct page *' instead of 'vm_fault_t' to simplify architecture specific implementations that do more than return SIGBUS. Currently this only applies to s390, but a future patch will move x86's pio_data handling into x86 where it belongs. No functional changed intended. Cc: Hou Wenlong Signed-off-by: Sean Christopherson Signed-off-by: Hou Wenlong --- arch/arm64/kvm/arm.c | 4 ++-- arch/mips/kvm/mips.c | 4 ++-- arch/powerpc/kvm/powerpc.c | 4 ++-- arch/s390/kvm/kvm-s390.c | 12 arch/x86/kvm/x86.c | 4 ++-- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c| 5 - 7 files changed, 17 insertions(+), 18 deletions(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index e9a2b8f27792..83f4ffe3e4f2 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -161,9 +161,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return ret; } -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) { - return VM_FAULT_SIGBUS; + return NULL; } diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index af9dd029a4e1..ae79874e6fd2 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1053,9 +1053,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) return -ENOIOCTLCMD; } -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) { - return VM_FAULT_SIGBUS; + return NULL; } int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index be33b5321a76..b9c21f9ab784 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -2090,9 +2090,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp, return r; } -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) { - return VM_FAULT_SIGBUS; + return NULL; } static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo) diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 02574d7b3612..e1b69833e228 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -4979,17 +4979,13 @@ long kvm_arch_vcpu_ioctl(struct file *filp, return r; } -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) { #ifdef CONFIG_KVM_S390_UCONTROL - if ((vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET) -&& (kvm_is_ucontrol(vcpu->kvm))) { - vmf->page = virt_to_page(vcpu->arch.sie_block); - get_page(vmf->page); - return 0; - } + if (vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET && kvm_is_ucontrol(vcpu->kvm)) + return virt_to_page(vcpu->arch.sie_block); #endif - return VM_FAULT_SIGBUS; + return NULL; } /* Section: memory related */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3cedc7cc132a..1e3bbe5cd33a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5347,9 +5347,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp, return r; } -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) { - return VM_FAULT_SIGBUS; + return NULL; } static int kvm_vm_ioctl_set_tss_addr(struct kvm *kvm, unsigned long addr) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 492d183dd7d0..a949de534722 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -995,7 +995,7 @@ long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf); +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf); int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 30d322519253..f7d21418971b 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3448,7 +3448,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf) &vcpu->dirty_ring, vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET); else - return kvm_arch_vcpu_fault(vcpu, vmf); + page = kvm_arch_vcpu_fault(vcpu, vmf); + if (!page) +
Re: [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars()
在 2021/8/12 下午4:54, Arnd Bergmann 写道: On Thu, Aug 12, 2021 at 10:08 AM Xianting TIan wrote: 在 2021/8/6 下午10:51, Arnd Bergmann 写道: On Fri, Aug 6, 2021 at 5:01 AM Xianting Tian +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long I think you need a higher alignment for DMA buffers, instead of sizeof(long), I would suggest ARCH_DMA_MINALIGN. As some ARCH(eg, x86, riscv) doesn't define ARCH_DMA_MINALIG, so i think it 's better remain the code unchanged, I will send v5 patch soon. I think you could just use "L1_CACHE_BYTES" as the alignment in this case. This will make the structure slightly larger for architectures that do not have alignment constraints on DMA buffers, but using a smaller alignment is clearly wrong. Another option would be to use ARCH_KMALLOC_MINALIGN. yes, I unstand you, the align size must L1_CACHE_BYTES at least. Note that there is a patch to add ARCH_DMA_MINALIGN to riscv already, yes, I summited this patch, it is discussing, seems they don't want to apply it. as some implementations do not have coherent DMA. I had failed to realized though that on x86 you do not get an ARCH_DMA_MINALIGN definition. I didn't find the definition in arch/x86/include/asm/cache.h and other place, x86 is dma coherent, it may doesn't need it. Arnd
[PATCH v6 0/2] make hvc pass dma capable memory to its backend
Dear all, This patch series make hvc framework pass DMA capable memory to put_chars() of hvc backend(eg, virtio-console), and revert commit c4baad5029 ("virtio-console: avoid DMA from stack”) V1 virtio-console: avoid DMA from vmalloc area https://lkml.org/lkml/2021/7/27/494 For v1 patch, Arnd Bergmann suggests to fix the issue in the first place: Make hvc pass DMA capable memory to put_chars() The fix suggestion is included in v2. V2 [PATCH 1/2] tty: hvc: pass DMA capable memory to put_chars() https://lkml.org/lkml/2021/8/1/8 [PATCH 2/2] virtio-console: remove unnecessary kmemdup() https://lkml.org/lkml/2021/8/1/9 For v2 patch, Arnd Bergmann suggests to make new buf part of the hvc_struct structure, and fix the compile issue. The fix suggestion is included in v3. V3 [PATCH v3 1/2] tty: hvc: pass DMA capable memory to put_chars() https://lkml.org/lkml/2021/8/3/1347 [PATCH v3 2/2] virtio-console: remove unnecessary kmemdup() https://lkml.org/lkml/2021/8/3/1348 For v3 patch, Jiri Slaby suggests to make 'char c[N_OUTBUF]' part of hvc_struct, and make 'hp->outbuf' aligned and use struct_size() to calculate the size of hvc_struct. The fix suggestion is included in v4. V4 [PATCH v4 0/2] make hvc pass dma capable memory to its backend https://lkml.org/lkml/2021/8/5/1350 [PATCH v4 1/2] tty: hvc: pass DMA capable memory to put_chars() https://lkml.org/lkml/2021/8/5/1351 [PATCH v4 2/2] virtio-console: remove unnecessary kmemdup() https://lkml.org/lkml/2021/8/5/1352 For v4 patch, Arnd Bergmann suggests to introduce another array(cons_outbuf[]) for the buffer pointers next to the cons_ops[] and vtermnos[] arrays. This fix included in this v5 patch. V5 Arnd Bergmann suggests to use "L1_CACHE_BYTES" as dma alignment, use 'sizeof(long)' as dma alignment is wrong. fix it in v6. drivers/char/virtio_console.c | 12 ++-- drivers/tty/hvc/hvc_console.c | 40 +-- drivers/tty/hvc/hvc_console.h | 16 -- 3 file changed
[PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()
As well known, hvc backend can register its opertions to hvc backend. the opertions contain put_chars(), get_chars() and so on. Some hvc backend may do dma in its opertions. eg, put_chars() of virtio-console. But in the code of hvc framework, it may pass DMA incapable memory to put_chars() under a specific configuration, which is explained in commit c4baad5029(virtio-console: avoid DMA from stack): 1, c[] is on stack, hvc_console_print(): char c[N_OUTBUF] __ALIGNED__; cons_ops[index]->put_chars(vtermnos[index], c, i); 2, ch is on stack, static void hvc_poll_put_char(,,char ch) { struct tty_struct *tty = driver->ttys[0]; struct hvc_struct *hp = tty->driver_data; int n; do { n = hp->ops->put_chars(hp->vtermno, &ch, 1); } while (n <= 0); } Commit c4baad5029 is just the fix to avoid DMA from stack memory, which is passed to virtio-console by hvc framework in above code. But I think the fix is aggressive, it directly uses kmemdup() to alloc new buffer from kmalloc area and do memcpy no matter the memory is in kmalloc area or not. But most importantly, it should better be fixed in the hvc framework, by changing it to never pass stack memory to the put_chars() function in the first place. Otherwise, we still face the same issue if a new hvc backend using dma added in the furture. We make 'char c[N_OUTBUF]' part of 'struct hvc_struct', so hp->c is no longer the stack memory. we can use it in above two cases. Other fix is use L1_CACHE_BYTES as the alignment, use 'sizeof(long)' as dma alignment is wrong. And use struct_size() to calculate size of hvc_struct. Introduce another array(cons_outbuf[]) for the hp->c pointers next to the cons_ops[] and vtermnos[] arrays. With the patch, we can remove the fix c4baad5029. Signed-off-by: Xianting Tian Tested-by: Xianting Tian --- drivers/tty/hvc/hvc_console.c | 40 +-- drivers/tty/hvc/hvc_console.h | 16 -- 2 files changed, 38 insertions(+), 18 deletions(-) diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c index 5bb8c4e44..c56564eb7 100644 --- a/drivers/tty/hvc/hvc_console.c +++ b/drivers/tty/hvc/hvc_console.c @@ -41,16 +41,6 @@ */ #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */ -/* - * These sizes are most efficient for vio, because they are the - * native transfer size. We could make them selectable in the - * future to better deal with backends that want other buffer sizes. - */ -#define N_OUTBUF 16 -#define N_INBUF16 - -#define __ALIGNED__ __attribute__((__aligned__(sizeof(long - static struct tty_driver *hvc_driver; static struct task_struct *hvc_task; @@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp) static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES]; static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] = {[0 ... MAX_NR_HVC_CONSOLES - 1] = -1}; +static char *cons_outbuf[MAX_NR_HVC_CONSOLES]; /* * Console APIs, NOT TTY. These APIs are available immediately when @@ -151,18 +142,23 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] = static void hvc_console_print(struct console *co, const char *b, unsigned count) { - char c[N_OUTBUF] __ALIGNED__; + char *c; unsigned i = 0, n = 0; int r, donecr = 0, index = co->index; + unsigned long flags; + struct hvc_struct *hp; /* Console access attempt outside of acceptable console range. */ if (index >= MAX_NR_HVC_CONSOLES) return; /* This console adapter was removed so it is not usable. */ - if (vtermnos[index] == -1) + if (vtermnos[index] == -1 || !cons_outbuf[index]) return; + c = cons_outbuf[index]; + + spin_lock_irqsave(&hp->c_lock, flags); while (count > 0 || i > 0) { if (count > 0 && i < sizeof(c)) { if (b[n] == '\n' && !donecr) { @@ -191,6 +187,7 @@ static void hvc_console_print(struct console *co, const char *b, } } } + spin_unlock_irqrestore(&hp->c_lock, flags); hvc_console_flush(cons_ops[index], vtermnos[index]); } @@ -878,9 +875,19 @@ static void hvc_poll_put_char(struct tty_driver *driver, int line, char ch) struct tty_struct *tty = driver->ttys[0]; struct hvc_struct *hp = tty->driver_data; int n; + unsigned long flags; + char *c; + + if (!hp || !cons_outbuf[hp->index]) + return; + + c = cons_outbuf[hp->index]; do { - n = hp->ops->put_chars(hp->vtermno, &ch, 1); + spin_lock_irqsave(&hp->c_lock, flags); + c[0] = ch; + n = hp->ops->put_chars(hp->vtermno, c, 1); + spin_unlock_irqrestore(&hp->c_lock, flags); } while (n <= 0); } #endif @@ -922,8 +929,7 @@ struct hvc_struct *hvc_a
[PATCH v6 2/2] virtio-console: remove unnecessary kmemdup()
hvc framework will never pass stack memory to the put_chars() function, So the calling of kmemdup() is unnecessary, we can remove it. This revert commit c4baad5029 ("virtio-console: avoid DMA from stack") Signed-off-by: Xianting Tian --- drivers/char/virtio_console.c | 12 ++-- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index 7eaf303a7..4ed3ffb1d 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -1117,8 +1117,6 @@ static int put_chars(u32 vtermno, const char *buf, int count) { struct port *port; struct scatterlist sg[1]; - void *data; - int ret; if (unlikely(early_put_chars)) return early_put_chars(vtermno, buf, count); @@ -1127,14 +1125,8 @@ static int put_chars(u32 vtermno, const char *buf, int count) if (!port) return -EPIPE; - data = kmemdup(buf, count, GFP_ATOMIC); - if (!data) - return -ENOMEM; - - sg_init_one(sg, data, count); - ret = __send_to_port(port, sg, 1, count, data, false); - kfree(data); - return ret; + sg_init_one(sg, buf, count); + return __send_to_port(port, sg, 1, count, (void *)buf, false); } /* -- 2.17.1
Re: [PATCH] powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()
Excerpts from Christophe Leroy's message of August 11, 2021 2:13 am: > An interrupt handler shall not be called from another interrupt > handler otherwise this leads to problems like the following: > > Kernel attempted to write user page (afd4fa84) - exploit attempt? (uid: > 1000) > [ cut here ] > Bug: Write fault blocked by KUAP! > WARNING: CPU: 0 PID: 1617 at arch/powerpc/mm/fault.c:230 > do_page_fault+0x484/0x720 > Modules linked in: > CPU: 0 PID: 1617 Comm: sshd Tainted: GW > 5.13.0-pmac-00010-g8393422eb77 #7 > NIP: c001b77c LR: c001b77c CTR: > REGS: cb9e5bc0 TRAP: 0700 Tainted: GW > (5.13.0-pmac-00010-g8393422eb77) > MSR: 00021032 CR: 24942424 XER: > > GPR00: c001b77c cb9e5c80 c1582c00 0021 3bff 085b 0027 > c8eb644c > GPR08: 0023 24942424 0063f8c8 > 000186a0 > GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 > cb9e5e90 > GPR24: 0040 afd4fa96 0040 0200 c1fda6c0 afd4fa84 0300 > cb9e5cc0 > NIP [c001b77c] do_page_fault+0x484/0x720 > LR [c001b77c] do_page_fault+0x484/0x720 > Call Trace: > [cb9e5c80] [c001b77c] do_page_fault+0x484/0x720 (unreliable) > [cb9e5cb0] [c000424c] DataAccess_virt+0xd4/0xe4 > --- interrupt: 300 at __copy_tofrom_user+0x110/0x20c > NIP: c001f9b4 LR: c03250a0 CTR: 0004 > REGS: cb9e5cc0 TRAP: 0300 Tainted: GW > (5.13.0-pmac-00010-g8393422eb77) > MSR: 9032 CR: 48028468 XER: 2000 > DAR: afd4fa84 DSISR: 0a00 > GPR00: 20726f6f cb9e5d80 c1582c00 0004 cb9e5e3a 0016 afd4fa80 > > GPR08: 3835202d 72777872 2d78722d 0004 28028464 0063f8c8 > 000186a0 > GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 > cb9e5e90 > GPR24: 0040 afd4fa96 0040 cb9e5e0c 0daa a000 cb9e5e98 > afd4fa56 > NIP [c001f9b4] __copy_tofrom_user+0x110/0x20c > LR [c03250a0] _copy_to_iter+0x144/0x990 > --- interrupt: 300 > [cb9e5d80] [c03e89c0] n_tty_read+0xa4/0x598 (unreliable) > [cb9e5df0] [c03e2a0c] tty_read+0xdc/0x2b4 > [cb9e5e80] [c0156bf8] vfs_read+0x274/0x340 > [cb9e5f00] [c01571ac] ksys_read+0x70/0x118 > [cb9e5f30] [c0016048] ret_from_syscall+0x0/0x28 > --- interrupt: c00 at 0xa7855c88 > NIP: a7855c88 LR: a7855c5c CTR: > REGS: cb9e5f40 TRAP: 0c00 Tainted: GW > (5.13.0-pmac-00010-g8393422eb77) > MSR: d032 CR: 2402446c XER: > > GPR00: 0003 afd4ec70 a72137d0 000b afd4ecac 4000 0065a990 > 0800 > GPR08: a7947930 0004 c15831b0 0063f8c8 > 000186a0 > GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 0065a9e0 0001 > 0065fac0 > GPR24: 0089 00664050 00668e30 a720c8dc a7943ff4 > 0065f9b0 > NIP [a7855c88] 0xa7855c88 > LR [a7855c5c] 0xa7855c5c > --- interrupt: c00 > Instruction dump: > 3884aa88 38630178 48076861 807f0080 48042e45 2f83 419e0148 3c80c079 > 3c60c076 38841be4 386301c0 4801f705 <0fe0> 386b 4bfffe30 > 3c80c06b > ---[ end trace fd69b91a8046c2e5 ]--- > > Here the problem is that by re-enterring an exception handler, > kuap_save_and_lock() is called a second time with this time KUAP > access locked, leading to regs->kuap being overwritten hence > KUAP not being unlocked at exception exit as expected. > > Do not call do_IRQ() from timer_interrupt() directly. Instead, > redefine do_IRQ() as a standard function named __do_IRQ(), and > call it from both do_IRQ() and time_interrupt() handlers. Reviewed-by: Nicholas Piggin > > Reported-by: Stan Johnson > Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers") > Cc: sta...@vger.kernel.org > Cc: Nicholas Piggin > Cc: Finn Thain > Signed-off-by: Christophe Leroy > --- > arch/powerpc/include/asm/interrupt.h | 3 +++ > arch/powerpc/include/asm/irq.h | 2 +- > arch/powerpc/kernel/irq.c| 7 ++- > arch/powerpc/kernel/time.c | 2 +- > 4 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/include/asm/interrupt.h > b/arch/powerpc/include/asm/interrupt.h > index fc4702bdd119..1e984a35a39f 100644 > --- a/arch/powerpc/include/asm/interrupt.h > +++ b/arch/powerpc/include/asm/interrupt.h > @@ -590,6 +590,9 @@ DECLARE_INTERRUPT_HANDLER_NMI(hmi_exception_realmode); > > DECLARE_INTERRUPT_HANDLER_ASYNC(TAUException); > > +/* irq.c */ > +DECLARE_INTERRUPT_HANDLER_ASYNC(do_IRQ); > + > void __noreturn unrecoverable_exception(struct pt_regs *regs); > > void replay_system_reset(void); > diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h > index 4982f3711fc3..2b32785
Re: [PATCH] powerpc/interrupt: Do not call single_step_exception() from other exceptions
Excerpts from Christophe Leroy's message of August 11, 2021 2:13 am: > single_step_exception() is called by emulate_single_step() which > is called from (at least) alignment exception() handler and > program_check_exception() handler. > > Redefine it as a regular __single_step_exception() which is called > by both single_step_exception() handler and emulate_single_step() > function. > Reviewed-by: Nicholas Piggin > Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers") > Cc: sta...@vger.kernel.org > Cc: Stan Johnson > Cc: Nicholas Piggin > Cc: Finn Thain > Signed-off-by: Christophe Leroy > --- > arch/powerpc/kernel/traps.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c > index dfbce527c98e..d56254f05e17 100644 > --- a/arch/powerpc/kernel/traps.c > +++ b/arch/powerpc/kernel/traps.c > @@ -1104,7 +1104,7 @@ DEFINE_INTERRUPT_HANDLER(RunModeException) > _exception(SIGTRAP, regs, TRAP_UNK, 0); > } > > -DEFINE_INTERRUPT_HANDLER(single_step_exception) > +static void __single_step_exception(struct pt_regs *regs) > { > clear_single_step(regs); > clear_br_trace(regs); > @@ -1121,6 +1121,11 @@ DEFINE_INTERRUPT_HANDLER(single_step_exception) > _exception(SIGTRAP, regs, TRAP_TRACE, regs->nip); > } > > +DEFINE_INTERRUPT_HANDLER(single_step_exception) > +{ > + __single_step_exception(regs); > +} > + > /* > * After we have successfully emulated an instruction, we have to > * check if the instruction was being single-stepped, and if so, > @@ -1130,7 +1135,7 @@ DEFINE_INTERRUPT_HANDLER(single_step_exception) > static void emulate_single_step(struct pt_regs *regs) > { > if (single_stepping(regs)) > - single_step_exception(regs); > + __single_step_exception(regs); > } > > static inline int __parse_fpscr(unsigned long fpscr) > -- > 2.25.0 > >
Re: [PATCH 07/11] treewide: Replace the use of mem_encrypt_active() with prot_guest_has()
On Wed, Aug 11, 2021 at 10:52:55AM -0500, Tom Lendacky wrote: > On 8/11/21 7:19 AM, Kirill A. Shutemov wrote: > > On Tue, Aug 10, 2021 at 02:48:54PM -0500, Tom Lendacky wrote: > >> On 8/10/21 1:45 PM, Kuppuswamy, Sathyanarayanan wrote: > >>> > >>> > >>> On 7/27/21 3:26 PM, Tom Lendacky wrote: > diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c > index de01903c3735..cafed6456d45 100644 > --- a/arch/x86/kernel/head64.c > +++ b/arch/x86/kernel/head64.c > @@ -19,7 +19,7 @@ > #include > #include > #include > -#include > +#include > #include > #include > @@ -285,7 +285,7 @@ unsigned long __head __startup_64(unsigned long > physaddr, > * there is no need to zero it after changing the memory encryption > * attribute. > */ > - if (mem_encrypt_active()) { > + if (prot_guest_has(PATTR_MEM_ENCRYPT)) { > vaddr = (unsigned long)__start_bss_decrypted; > vaddr_end = (unsigned long)__end_bss_decrypted; > >>> > >>> > >>> Since this change is specific to AMD, can you replace PATTR_MEM_ENCRYPT > >>> with > >>> prot_guest_has(PATTR_SME) || prot_guest_has(PATTR_SEV). It is not used in > >>> TDX. > >> > >> This is a direct replacement for now. > > > > With current implementation of prot_guest_has() for TDX it breaks boot for > > me. > > > > Looking at code agains, now I *think* the reason is accessing a global > > variable from __startup_64() inside TDX version of prot_guest_has(). > > > > __startup_64() is special. If you access any global variable you need to > > use fixup_pointer(). See comment before __startup_64(). > > > > I'm not sure how you get away with accessing sme_me_mask directly from > > there. Any clues? Maybe just a luck and complier generates code just right > > for your case, I donno. > > Hmm... yeah, could be that the compiler is using rip-relative addressing > for it because it lives in the .data section? I guess. It has to be fixed. It may break with complier upgrade or any random change around the code. BTW, does it work with clang for you? > For the static variables in mem_encrypt_identity.c I did an assembler rip > relative LEA, but probably could have passed physaddr to sme_enable() and > used a fixup_pointer() style function, instead. Sounds like a plan. > > A separate point is that TDX version of prot_guest_has() relies on > > cpu_feature_enabled() which is not ready at this point. > > Does TDX have to do anything special to make memory able to be shared with > the hypervisor? Yes. But there's nothing that required any changes in early boot. It handled in ioremap/set_memory. > You might have to use something that is available earlier > than cpu_feature_enabled() in that case (should you eventually support > kvmclock). Maybe. > > I think __bss_decrypted fixup has to be done if sme_me_mask is non-zero. > > Or just do it uncoditionally because it's NOP for sme_me_mask == 0. > > For SNP, we'll have to additionally call the HV to update the RMP to make > the memory shared. But that could also be done unconditionally since the > early_snp_set_memory_shared() routine will check for SNP before doing > anything. -- Kirill A. Shutemov
Re: [PATCH] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry
"Aneesh Kumar K.V" writes: > On 8/12/21 12:58 PM, Michael Ellerman wrote: >> Christophe Leroy writes: >>> Le 10/08/2021 à 06:53, Aneesh Kumar K.V a écrit : Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_tlb.c | 48 1 file changed, 48 insertions(+) diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c index aefc100d79a7..5cca0fe130e7 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -17,6 +17,7 @@ >> ... + +static int __init create_tlb_single_page_flush_ceiling(void) +{ + debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR, + powerpc_debugfs_root, NULL, &fops_tlbflush); >>> >>> Could you just use debugfs_create_u32() instead of re-implementing simple >>> read and write ? >> >> Yeah AFAICS that should work fine. >> >> It could probably even be a u16? > > I was looking at switching all that to u64. Should i fallback to u16, > considering a tlb_signle_page_flush_ceiling value larger that 2**16 > doesn't make sense? Hmm, if we make it u16 and someone writes a value >= 2^16 it just truncates the value to 0, which is a bit unfortunate. So maybe just make it u32, that way if someone writes a stupidly large value it stays large. cheers
Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()
Hi Xianting, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tty/tty-testing] [also build test WARNING on char-misc/char-misc-testing soc/for-next v5.14-rc5 next-20210812] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Xianting-Tian/make-hvc-pass-dma-capable-memory-to-its-backend/20210812-174847 base: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git tty-testing config: hexagon-randconfig-r041-20210812 (attached as .config) compiler: clang version 12.0.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/9f2925b5429149ceb0ea6eeaa8c81d422c3124fc git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Xianting-Tian/make-hvc-pass-dma-capable-memory-to-its-backend/20210812-174847 git checkout 9f2925b5429149ceb0ea6eeaa8c81d422c3124fc # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=hexagon If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All warnings (new ones prefixed by >>): >> drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is >> uninitialized when used here [-Wuninitialized] spin_unlock_irqrestore(&hp->c_lock, flags); ^~ drivers/tty/hvc/hvc_console.c:149:23: note: initialize the variable 'hp' to silence this warning struct hvc_struct *hp; ^ = NULL 1 warning generated. vim +/hp +190 drivers/tty/hvc/hvc_console.c 136 137 /* 138 * Console APIs, NOT TTY. These APIs are available immediately when 139 * hvc_console_setup() finds adapters. 140 */ 141 142 static void hvc_console_print(struct console *co, const char *b, 143unsigned count) 144 { 145 char *c; 146 unsigned i = 0, n = 0; 147 int r, donecr = 0, index = co->index; 148 unsigned long flags; 149 struct hvc_struct *hp; 150 151 /* Console access attempt outside of acceptable console range. */ 152 if (index >= MAX_NR_HVC_CONSOLES) 153 return; 154 155 /* This console adapter was removed so it is not usable. */ 156 if (vtermnos[index] == -1 || !cons_outbuf[index]) 157 return; 158 159 c = cons_outbuf[index]; 160 161 spin_lock_irqsave(&hp->c_lock, flags); 162 while (count > 0 || i > 0) { 163 if (count > 0 && i < sizeof(c)) { 164 if (b[n] == '\n' && !donecr) { 165 c[i++] = '\r'; 166 donecr = 1; 167 } else { 168 c[i++] = b[n++]; 169 donecr = 0; 170 --count; 171 } 172 } else { 173 r = cons_ops[index]->put_chars(vtermnos[index], c, i); 174 if (r <= 0) { 175 /* throw away characters on error 176 * but spin in case of -EAGAIN */ 177 if (r != -EAGAIN) { 178 i = 0; 179 } else { 180 hvc_console_flush(cons_ops[index], 181vtermnos[index]); 182 } 183 } else if (r > 0) { 184 i -= r; 185 if (i > 0) 186 memmove(c, c+r, i); 187 } 188 } 189 } > 190 spin_unlock_irqrestore(&hp->c_lock, flags); 191 hvc_console_flush(cons_ops[index], vtermnos[index]); 192 } 193 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org .config.gz Description: application/gzip
[PATCH -next] trap: Cleanup trap_init()
There are some empty trap_init() in different ARCHs, introduce a new weak trap_init() function to cleanup them. Cc: Vineet Gupta Cc: Russell King Cc: Yoshinori Sato Cc: Ley Foon Tan Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: James E.J. Bottomley Cc: Helge Deller Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Jeff Dike Cc: Richard Weinberger Cc: Anton Ivanov Cc: Andrew Morton Signed-off-by: Kefeng Wang --- arch/arc/kernel/traps.c | 5 - arch/arm/kernel/traps.c | 5 - arch/h8300/kernel/traps.c| 4 arch/hexagon/kernel/traps.c | 4 arch/nds32/kernel/traps.c| 5 - arch/nios2/kernel/traps.c| 5 - arch/openrisc/kernel/traps.c | 5 - arch/parisc/kernel/traps.c | 4 arch/powerpc/kernel/traps.c | 5 - arch/riscv/kernel/traps.c| 5 - arch/um/kernel/trap.c| 4 init/main.c | 2 ++ 12 files changed, 2 insertions(+), 51 deletions(-) diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c index 57235e5c0cea..6b83e3f2b41c 100644 --- a/arch/arc/kernel/traps.c +++ b/arch/arc/kernel/traps.c @@ -20,11 +20,6 @@ #include #include -void __init trap_init(void) -{ - return; -} - void die(const char *str, struct pt_regs *regs, unsigned long address) { show_kernel_fault_diag(str, regs, address); diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c index 64308e3a5d0c..e9b4f2b49bd8 100644 --- a/arch/arm/kernel/traps.c +++ b/arch/arm/kernel/traps.c @@ -781,11 +781,6 @@ void abort(void) panic("Oops failed to kill thread"); } -void __init trap_init(void) -{ - return; -} - #ifdef CONFIG_KUSER_HELPERS static void __init kuser_init(void *vectors) { diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c index 5d8b969cd8f3..bdbe988d8dbc 100644 --- a/arch/h8300/kernel/traps.c +++ b/arch/h8300/kernel/traps.c @@ -39,10 +39,6 @@ void __init base_trap_init(void) { } -void __init trap_init(void) -{ -} - asmlinkage void set_esp0(unsigned long ssp) { current->thread.esp0 = ssp; diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c index 904134b37232..edfc35dafeb1 100644 --- a/arch/hexagon/kernel/traps.c +++ b/arch/hexagon/kernel/traps.c @@ -28,10 +28,6 @@ #define TRAP_SYSCALL 1 #define TRAP_DEBUG 0xdb -void __init trap_init(void) -{ -} - #ifdef CONFIG_GENERIC_BUG /* Maybe should resemble arch/sh/kernel/traps.c ?? */ int is_valid_bugaddr(unsigned long addr) diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c index ee0d9ae192a5..f06421c645af 100644 --- a/arch/nds32/kernel/traps.c +++ b/arch/nds32/kernel/traps.c @@ -183,11 +183,6 @@ void __pgd_error(const char *file, int line, unsigned long val) } extern char *exception_vector, *exception_vector_end; -void __init trap_init(void) -{ - return; -} - void __init early_trap_init(void) { unsigned long ivb = 0; diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c index b172da4eb1a9..596986a74a26 100644 --- a/arch/nios2/kernel/traps.c +++ b/arch/nios2/kernel/traps.c @@ -105,11 +105,6 @@ void show_stack(struct task_struct *task, unsigned long *stack, printk("%s\n", loglvl); } -void __init trap_init(void) -{ - /* Nothing to do here */ -} - /* Breakpoint handler */ asmlinkage void breakpoint_c(struct pt_regs *fp) { diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c index 4d61333c2623..aa1e709405ac 100644 --- a/arch/openrisc/kernel/traps.c +++ b/arch/openrisc/kernel/traps.c @@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs *regs, int ea, int vector) die("Oops", regs, 9); } -void __init trap_init(void) -{ - /* Nothing needs to be done */ -} - asmlinkage void do_trap(struct pt_regs *regs, unsigned long address) { force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc); diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c index 8d8441d4562a..747c328fb886 100644 --- a/arch/parisc/kernel/traps.c +++ b/arch/parisc/kernel/traps.c @@ -859,7 +859,3 @@ void __init early_trap_init(void) initialize_ivt(&fault_vector_20); } - -void __init trap_init(void) -{ -} diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index e103b89234cd..91efb5c6f2f3 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -2209,11 +2209,6 @@ DEFINE_INTERRUPT_HANDLER(kernel_bad_stack) die("Bad kernel stack pointer", regs, SIGABRT); } -void __init trap_init(void) -{ -} - - #ifdef CONFIG_PPC_EMULATED_STATS #define WARN_EMULATED_SETUP(type) .type = { .name = #type } diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index 0a98fd0ddfe9..0daaa3e4630d 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -199,11 +199,6 @@ int is_valid_bugaddr(unsigned long pc) } #endif /* CONFIG_GENERIC_BUG */ -/* stvec
Re: [PATCH -next] trap: Cleanup trap_init()
On Thu, Aug 12, 2021 at 08:36:02PM +0800, Kefeng Wang wrote: > There are some empty trap_init() in different ARCHs, introduce > a new weak trap_init() function to cleanup them. > > Cc: Vineet Gupta > Cc: Russell King > Cc: Yoshinori Sato > Cc: Ley Foon Tan > Cc: Jonas Bonn > Cc: Stefan Kristiansson > Cc: Stafford Horne > Cc: James E.J. Bottomley > Cc: Helge Deller > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Paul Walmsley > Cc: Jeff Dike > Cc: Richard Weinberger > Cc: Anton Ivanov > Cc: Andrew Morton > Signed-off-by: Kefeng Wang For 32-bit arm: Acked-by: Russell King (Oracle) Thanks! -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
"Puvichakravarthy Ramachandran" writes: >> With shared mapping, even though we are unmapping a large range, the kernel >> will force a TLB flush with ptl lock held to avoid the race mentioned in >> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and >> memory freeing parts") >> This results in the kernel issuing a high number of TLB flushes even for a >> large >> range. This can be improved by making sure the kernel switch to pid based >> flush if the >> kernel is unmapping a 2M range. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/mm/book3s64/radix_tlb.c | 8 >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > >> b/arch/powerpc/mm/book3s64/radix_tlb.c >> index aefc100d79a7..21d0f098e43b 100644 >> --- a/arch/powerpc/mm/book3s64/radix_tlb.c >> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c >> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range); >> * invalidating a full PID, so it has a far lower threshold to change > from >> * individual page flushes to full-pid flushes. >> */ >> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33; >> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32; >> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = >> POWER9_TLB_SETS_RADIX * 2; >> >> static inline void __radix__flush_tlb_range(struct mm_struct *mm, >> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > >> mm_struct *mm, >> if (fullmm) >> flush_pid = true; >> else if (type == FLUSH_TYPE_GLOBAL) >> - flush_pid = nr_pages > tlb_single_page_flush_ceiling; >> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling; >> else >> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling; > > Additional details on the test environment. This was tested on a 2 Node/8 > socket Power10 system. > The LPAR had 105 cores and the LPAR spanned across all the sockets. > > # perf stat -I 1000 -a -e cycles,instructions -e > "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e > "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1 > Rate of work: = 176 > # time counts unit events > 1.029206442 4198594519 cycles > 1.029206442 2458254252 instructions # 0.59 > insn per cycle > 1.029206442 3004031488 PM_EXEC_STALL > 1.029206442 1798186036 PM_EXEC_STALL_TLBIE > Rate of work: = 181 > 2.054288539 4183883450 cycles > 2.054288539 2472178171 instructions # 0.59 > insn per cycle > 2.054288539 3014609313 PM_EXEC_STALL > 2.054288539 1797851642 PM_EXEC_STALL_TLBIE > Rate of work: = 180 > 3.078306883 4171250717 cycles > 3.078306883 2468341094 instructions # 0.59 > insn per cycle > 3.078306883 2993036205 PM_EXEC_STALL > 3.078306883 1798181890 PM_EXEC_STALL_TLBIE > . > . > > # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling > 34 > > # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling > > # perf stat -I 1000 -a -e cycles,instructions -e > "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e > "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1 > Rate of work: = 313 > # time counts unit events > 1.030310506 4206071143 cycles > 1.030310506 4314716958 instructions # 1.03 > insn per cycle > 1.030310506 2157762167 PM_EXEC_STALL > 1.030310506 110825573 PM_EXEC_STALL_TLBIE > Rate of work: = 322 > 2.056034068 4331745630 cycles > 2.056034068 4531658304 instructions # 1.05 > insn per cycle > 2.056034068 2288971361 PM_EXEC_STALL > 2.056034068 111267927 PM_EXEC_STALL_TLBIE > Rate of work: = 321 > 3.081216434 4327050349 cycles > 3.081216434 4379679508 instructions # 1.01 > insn per cycle > 3.081216434 2252602550 PM_EXEC_STALL > 3.081216434 110974887 PM_EXEC_STALL_TLBIE What is the tlbie test actually doing? Does it do anything to measure the cost of refilling after the full mm flush? cheers
Re: [PATCH v2 5/9] powerpc/microwatt: Use standard 16550 UART for console
Le 18/06/2021 à 05:46, Paul Mackerras a écrit : From: Benjamin Herrenschmidt This adds support to the Microwatt platform to use the standard 16550-style UART which available in the standalone Microwatt FPGA. Signed-off-by: Benjamin Herrenschmidt Signed-off-by: Paul Mackerras --- arch/powerpc/boot/dts/microwatt.dts | 27 arch/powerpc/kernel/udbg_16550.c | 39 arch/powerpc/platforms/microwatt/Kconfig | 1 + arch/powerpc/platforms/microwatt/setup.c | 2 ++ 4 files changed, 63 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/boot/dts/microwatt.dts b/arch/powerpc/boot/dts/microwatt.dts index 04e5dd92270e..974abbdda249 100644 --- a/arch/powerpc/boot/dts/microwatt.dts +++ b/arch/powerpc/boot/dts/microwatt.dts @@ -6,6 +6,10 @@ / { model-name = "microwatt"; compatible = "microwatt-soc"; + aliases { + serial0 = &UART0; + }; + reserved-memory { #size-cells = <0x02>; #address-cells = <0x02>; @@ -89,12 +93,6 @@ PowerPC,Microwatt@0 { }; }; - chosen { - bootargs = ""; - ibm,architecture-vec-5 = [19 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00 40 00 40]; - }; - soc@c000 { compatible = "simple-bus"; #address-cells = <1>; @@ -119,5 +117,22 @@ ICS: interrupt-controller@5000 { #interrupt-cells = <2>; }; + UART0: serial@2000 { + device_type = "serial"; + compatible = "ns16550"; + reg = <0x2000 0x8>; + clock-frequency = <1>; + current-speed = <115200>; + reg-shift = <2>; + fifo-size = <16>; + interrupts = <0x10 0x1>; + }; + }; + + chosen { + bootargs = ""; + ibm,architecture-vec-5 = [19 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 + 00 00 00 00 00 00 00 00 40 00 40]; + stdout-path = &UART0; }; }; diff --git a/arch/powerpc/kernel/udbg_16550.c b/arch/powerpc/kernel/udbg_16550.c index 9356b60d6030..8513aa49614e 100644 --- a/arch/powerpc/kernel/udbg_16550.c +++ b/arch/powerpc/kernel/udbg_16550.c @@ -296,3 +296,42 @@ void __init udbg_init_40x_realmode(void) } #endif /* CONFIG_PPC_EARLY_DEBUG_40x */ + +#ifdef CONFIG_PPC_EARLY_DEBUG_MICROWATT + +#define UDBG_UART_MW_ADDR ((void __iomem *)0xc0002000) + +static u8 udbg_uart_in_isa300_rm(unsigned int reg) +{ + uint64_t msr = mfmsr(); + uint8_t c; + + mtmsr(msr & ~(MSR_EE|MSR_DR)); + isync(); + eieio(); + c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2)); + mtmsr(msr); + isync(); + return c; +} How do you make sure that GCC won't emit any access to the stack between the two mtmsr() ? What about using real_205_readb() and real_205_writeb() instead ? + +static void udbg_uart_out_isa300_rm(unsigned int reg, u8 val) +{ + uint64_t msr = mfmsr(); + + mtmsr(msr & ~(MSR_EE|MSR_DR)); + isync(); + eieio(); + __raw_rm_writeb(val, UDBG_UART_MW_ADDR + (reg << 2)); + mtmsr(msr); + isync(); +} + +void __init udbg_init_debug_microwatt(void) +{ + udbg_uart_in = udbg_uart_in_isa300_rm; + udbg_uart_out = udbg_uart_out_isa300_rm; + udbg_use_uart(); +} + +#endif /* CONFIG_PPC_EARLY_DEBUG_MICROWATT */ diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig index b52c869c0eb8..50ed0cedb5f1 100644 --- a/arch/powerpc/platforms/microwatt/Kconfig +++ b/arch/powerpc/platforms/microwatt/Kconfig @@ -6,6 +6,7 @@ config PPC_MICROWATT select PPC_ICS_NATIVE select PPC_ICP_NATIVE select PPC_NATIVE + select PPC_UDBG_16550 help This option enables support for FPGA-based Microwatt implementations. diff --git a/arch/powerpc/platforms/microwatt/setup.c b/arch/powerpc/platforms/microwatt/setup.c index 1c1b7791fa57..0b02603bdb74 100644 --- a/arch/powerpc/platforms/microwatt/setup.c +++ b/arch/powerpc/platforms/microwatt/setup.c @@ -14,6 +14,7 @@ #include #include #include +#include static void __init microwatt_init_IRQ(void) { @@ -35,5 +36,6 @@ define_machine(microwatt) { .name = "microwatt", .probe = microwatt_probe, .init_IRQ = microwatt_init_IRQ, + .progress = udbg_progress, .calibrate_decr = generic_calibrate_decr, };
Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
On 8/12/21 6:19 PM, Michael Ellerman wrote: "Puvichakravarthy Ramachandran" writes: With shared mapping, even though we are unmapping a large range, the kernel will force a TLB flush with ptl lock held to avoid the race mentioned in commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts") This results in the kernel issuing a high number of TLB flushes even for a large range. This can be improved by making sure the kernel switch to pid based flush if the kernel is unmapping a 2M range. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_tlb.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > b/arch/powerpc/mm/book3s64/radix_tlb.c index aefc100d79a7..21d0f098e43b 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range); * invalidating a full PID, so it has a far lower threshold to change > from * individual page flushes to full-pid flushes. */ -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33; +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32; static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = POWER9_TLB_SETS_RADIX * 2; static inline void __radix__flush_tlb_range(struct mm_struct *mm, @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > mm_struct *mm, if (fullmm) flush_pid = true; else if (type == FLUSH_TYPE_GLOBAL) - flush_pid = nr_pages > tlb_single_page_flush_ceiling; + flush_pid = nr_pages >= tlb_single_page_flush_ceiling; else flush_pid = nr_pages > tlb_local_single_page_flush_ceiling; Additional details on the test environment. This was tested on a 2 Node/8 socket Power10 system. The LPAR had 105 cores and the LPAR spanned across all the sockets. # perf stat -I 1000 -a -e cycles,instructions -e "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1 Rate of work: = 176 # time counts unit events 1.029206442 4198594519 cycles 1.029206442 2458254252 instructions # 0.59 insn per cycle 1.029206442 3004031488 PM_EXEC_STALL 1.029206442 1798186036 PM_EXEC_STALL_TLBIE Rate of work: = 181 2.054288539 4183883450 cycles 2.054288539 2472178171 instructions # 0.59 insn per cycle 2.054288539 3014609313 PM_EXEC_STALL 2.054288539 1797851642 PM_EXEC_STALL_TLBIE Rate of work: = 180 3.078306883 4171250717 cycles 3.078306883 2468341094 instructions # 0.59 insn per cycle 3.078306883 2993036205 PM_EXEC_STALL 3.078306883 1798181890 PM_EXEC_STALL_TLBIE . . # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling 34 # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling # perf stat -I 1000 -a -e cycles,instructions -e "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1 Rate of work: = 313 # time counts unit events 1.030310506 4206071143 cycles 1.030310506 4314716958 instructions # 1.03 insn per cycle 1.030310506 2157762167 PM_EXEC_STALL 1.030310506 110825573 PM_EXEC_STALL_TLBIE Rate of work: = 322 2.056034068 4331745630 cycles 2.056034068 4531658304 instructions # 1.05 insn per cycle 2.056034068 2288971361 PM_EXEC_STALL 2.056034068 111267927 PM_EXEC_STALL_TLBIE Rate of work: = 321 3.081216434 4327050349 cycles 3.081216434 4379679508 instructions # 1.01 insn per cycle 3.081216434 2252602550 PM_EXEC_STALL 3.081216434 110974887 PM_EXEC_STALL_TLBIE What is the tlbie test actually doing? Does it do anything to measure the cost of refilling after the full mm flush? That is essentially for () { shmat() fillshm() shmdt() } for a 256MB range. So it is not really a fair benchmark because it doesn't take into account the impact of throwing away the full pid translation. But even then the TLBIE stalls is an important data point? -aneesh
[PATCH v8 0/5] Add support for FORM2 associativity
Form2 associativity adds a much more flexible NUMA topology layout than what is provided by Form1. More details can be found in patch 7. $ numactl -H ... node distances: node 0 1 2 3 0: 10 11 222 33 1: 44 10 55 66 2: 77 88 10 99 3: 101 121 132 10 $ After DAX kmem memory add # numactl -H available: 5 nodes (0-4) ... node distances: node 0 1 2 3 4 0: 10 11 222 33 240 1: 44 10 55 66 255 2: 77 88 10 99 255 3: 101 121 132 10 230 4: 255 255 255 230 10 PAPR SCM now use the numa distance details to find the numa_node and target_node for the device. kvaneesh@ubuntu-guest:~$ ndctl list -N -v [ { "dev":"namespace0.0", "mode":"devdax", "map":"dev", "size":1071644672, "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2", "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362", "daxregion":{ "id":0, "size":1071644672, "devices":[ { "chardev":"dax0.0", "size":1071644672, "target_node":4, "mode":"devdax" } ] }, "align":2097152, "numa_node":3 } ] kvaneesh@ubuntu-guest:~$ The above output is with a Qemu command line -numa node,nodeid=4 \ -numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \ -numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \ -numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \ -numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \ -numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \ -object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE} \ -device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4 Qemu changes can be found at https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb...@gmail.com/ Changes from v7: * Address review feedback * fold patch 6 to patch 3 Changes from v6: * Address review feedback Changes from v5: * Fix build error reported by kernel test robot * Address review feedback Changes from v4: * Drop DLPAR related device tree property for now because both Qemu nor PowerVM will provide the distance details of all possible NUMA nodes during boot. * Rework numa distance code based on review feedback. Changes from v3: * Drop PAPR SCM specific changes and depend completely on NUMA distance information. Changes from v2: * Add nvdimm list to Cc: * update PATCH 8 commit message. Changes from v1: * Update FORM2 documentation. * rename max_domain_index to max_associativity_domain_index Aneesh Kumar K.V (5): powerpc/pseries: rename min_common_depth to primary_domain_index powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY powerpc/pseries: Consolidate different NUMA distance update code paths powerpc/pseries: Add a helper for form1 cpu distance powerpc/pseries: Add support for FORM2 associativity Documentation/powerpc/associativity.rst | 104 + arch/powerpc/include/asm/firmware.h | 7 +- arch/powerpc/include/asm/prom.h | 3 +- arch/powerpc/include/asm/topology.h | 6 +- arch/powerpc/kernel/prom_init.c | 3 +- arch/powerpc/mm/numa.c| 432 ++ arch/powerpc/platforms/pseries/firmware.c | 3 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 + .../platforms/pseries/hotplug-memory.c| 2 + arch/powerpc/platforms/pseries/lpar.c | 4 +- 10 files changed, 455 insertions(+), 111 deletions(-) create mode 100644 Documentation/powerpc/associativity.rst -- 2.31.1
[PATCH v8 1/5] powerpc/pseries: rename min_common_depth to primary_domain_index
No functional change in this patch. Reviewed-by: David Gibson Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/numa.c | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f2bf98bdcea2..8365b298ec48 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); -static int min_common_depth; +static int primary_domain_index; static int n_mem_addr_cells, n_mem_size_cells; static int form1_affinity; @@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity) if (!numa_enabled) goto out; - if (of_read_number(associativity, 1) >= min_common_depth) - nid = of_read_number(&associativity[min_common_depth], 1); + if (of_read_number(associativity, 1) >= primary_domain_index) + nid = of_read_number(&associativity[primary_domain_index], 1); /* POWER4 LPAR uses 0x as invalid node */ if (nid == 0x || nid >= nr_node_ids) @@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device) } EXPORT_SYMBOL(of_node_to_nid); -static int __init find_min_common_depth(void) +static int __init find_primary_domain_index(void) { - int depth; + int index; struct device_node *root; if (firmware_has_feature(FW_FEATURE_OPAL)) @@ -326,7 +326,7 @@ static int __init find_min_common_depth(void) } if (form1_affinity) { - depth = of_read_number(distance_ref_points, 1); + index = of_read_number(distance_ref_points, 1); } else { if (distance_ref_points_depth < 2) { printk(KERN_WARNING "NUMA: " @@ -334,7 +334,7 @@ static int __init find_min_common_depth(void) goto err; } - depth = of_read_number(&distance_ref_points[1], 1); + index = of_read_number(&distance_ref_points[1], 1); } /* @@ -348,7 +348,7 @@ static int __init find_min_common_depth(void) } of_node_put(root); - return depth; + return index; err: of_node_put(root); @@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb) int nid = default_nid; int rc, index; - if ((min_common_depth < 0) || !numa_enabled) + if ((primary_domain_index < 0) || !numa_enabled) return default_nid; rc = of_get_assoc_arrays(&aa); if (rc) return default_nid; - if (min_common_depth <= aa.array_sz && + if (primary_domain_index <= aa.array_sz && !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) { - index = lmb->aa_index * aa.array_sz + min_common_depth - 1; + index = lmb->aa_index * aa.array_sz + primary_domain_index - 1; nid = of_read_number(&aa.arrays[index], 1); if (nid == 0x || nid >= nr_node_ids) @@ -708,18 +708,18 @@ static int __init parse_numa_properties(void) return -1; } - min_common_depth = find_min_common_depth(); + primary_domain_index = find_primary_domain_index(); - if (min_common_depth < 0) { + if (primary_domain_index < 0) { /* -* if we fail to parse min_common_depth from device tree +* if we fail to parse primary_domain_index from device tree * mark the numa disabled, boot with numa disabled. */ numa_enabled = false; - return min_common_depth; + return primary_domain_index; } - dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth); + dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index); /* * Even though we connect cpus to numa domains later in SMP @@ -919,14 +919,14 @@ static void __init find_possible_nodes(void) goto out; } - max_nodes = of_read_number(&domains[min_common_depth], 1); + max_nodes = of_read_number(&domains[primary_domain_index], 1); for (i = 0; i < max_nodes; i++) { if (!node_possible(i)) node_set(i, node_possible_map); } prop_length /= sizeof(int); - if (prop_length > min_common_depth + 2) + if (prop_length > primary_domain_index + 2) coregroup_enabled = 1; out: @@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu) goto out; index = of_read_number(associativity, 1); - if (index > min_common_depth + 1) + if (index > primary_domain_index + 1) return of_read_number(&associativity[index - 1], 1); out: -- 2.31.1
[PATCH v8 2/5] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
Also make related code cleanup that will allow adding FORM2_AFFINITY in later patches. No functional change in this patch. Reviewed-by: David Gibson Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/firmware.h | 4 +-- arch/powerpc/include/asm/prom.h | 2 +- arch/powerpc/kernel/prom_init.c | 2 +- arch/powerpc/mm/numa.c| 35 ++- arch/powerpc/platforms/pseries/firmware.c | 2 +- 5 files changed, 26 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 7604673787d6..60b631161360 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -44,7 +44,7 @@ #define FW_FEATURE_OPALASM_CONST(0x1000) #define FW_FEATURE_SET_MODEASM_CONST(0x4000) #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000) -#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001) +#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0001) #define FW_FEATURE_PRRNASM_CONST(0x0002) #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004) #define FW_FEATURE_DRC_INFOASM_CONST(0x0008) @@ -69,7 +69,7 @@ enum { FW_FEATURE_SPLPAR | FW_FEATURE_LPAR | FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO | FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY | - FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN | + FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN | FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 | FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h index 324a13351749..df9fec9d232c 100644 --- a/arch/powerpc/include/asm/prom.h +++ b/arch/powerpc/include/asm/prom.h @@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop, #define OV5_MSI0x0201 /* PCIe/MSI support */ #define OV5_CMO0x0480 /* Cooperative Memory Overcommitment */ #define OV5_XCMO 0x0440 /* Page Coalescing */ -#define OV5_TYPE1_AFFINITY 0x0580 /* Type 1 NUMA affinity */ +#define OV5_FORM1_AFFINITY 0x0580 /* FORM1 NUMA affinity */ #define OV5_PRRN 0x0540 /* Platform Resource Reassignment */ #define OV5_HP_EVT 0x0604 /* Hot Plug Event support */ #define OV5_RESIZE_HPT 0x0601 /* Hash Page Table resizing */ diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index a5bf355ce1d6..57db605ad33a 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -1096,7 +1096,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = { #else 0, #endif - .associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN), + .associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN), .bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT), .micro_checkpoint = 0, .reserved0 = 0, diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8365b298ec48..368719b14dcc 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data); static int primary_domain_index; static int n_mem_addr_cells, n_mem_size_cells; -static int form1_affinity; + +#define FORM0_AFFINITY 0 +#define FORM1_AFFINITY 1 +static int affinity_form; #define MAX_DISTANCE_REF_POINTS 4 static int distance_ref_points_depth; @@ -190,7 +193,7 @@ int __node_distance(int a, int b) int i; int distance = LOCAL_DISTANCE; - if (!form1_affinity) + if (affinity_form == FORM0_AFFINITY) return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE); for (i = 0; i < distance_ref_points_depth; i++) { @@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid, { int i; - if (!form1_affinity) + if (affinity_form != FORM1_AFFINITY) return; for (i = 0; i < distance_ref_points_depth; i++) { @@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void) int index; struct device_node *root; + /* +* Check for which form of affinity. +*/ + if (firmware_has_feature(FW_FEATURE_OPAL)) { + affinity_form = FORM1_AFFINITY; + } else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) { + dbg("Using form 1 affinity\n"); + affinity_form = FORM1_AFFINITY; + } else + affinity_form = FORM0_AFFINITY; + if (firmware_has_feature(FW_FEATURE_OPAL)) root = of_find_node_by_path("/ibm,opal"); else @@ -318,23 +332,16 @@ static i
[PATCH v8 3/5] powerpc/pseries: Consolidate different NUMA distance update code paths
The associativity details of the newly added resourced are collected from the hypervisor via "ibm,configure-connector" rtas call. Update the numa distance details of the newly added numa node after the above call. Instead of updating NUMA distance every time we lookup a node id from the associativity property, add helpers that can be used during boot which does this only once. Also remove the distance update from node id lookup helpers. Currently, we duplicate parsing code for ibm,associativity and ibm,associativity-lookup-arrays in the kernel. The associativity array provided by these device tree properties are very similar and hence can use a helper to parse the node id and numa distance details. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/topology.h | 2 + arch/powerpc/mm/numa.c| 212 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 + .../platforms/pseries/hotplug-memory.c| 2 + 4 files changed, 161 insertions(+), 57 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index e4db64c0e184..a6425a70c37b 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -64,6 +64,7 @@ static inline int early_cpu_to_node(int cpu) } int of_drconf_to_nid_single(struct drmem_lmb *lmb); +void update_numa_distance(struct device_node *node); #else @@ -93,6 +94,7 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb) return first_online_node; } +static inline void update_numa_distance(struct device_node *node) {} #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 368719b14dcc..c0a89839310c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -208,50 +208,35 @@ int __node_distance(int a, int b) } EXPORT_SYMBOL(__node_distance); -static void initialize_distance_lookup_table(int nid, - const __be32 *associativity) +static int __associativity_to_nid(const __be32 *associativity, + int max_array_sz) { - int i; + int nid; + /* +* primary_domain_index is 1 based array index. +*/ + int index = primary_domain_index - 1; - if (affinity_form != FORM1_AFFINITY) - return; + if (!numa_enabled || index >= max_array_sz) + return NUMA_NO_NODE; - for (i = 0; i < distance_ref_points_depth; i++) { - const __be32 *entry; + nid = of_read_number(&associativity[index], 1); - entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1]; - distance_lookup_table[nid][i] = of_read_number(entry, 1); - } + /* POWER4 LPAR uses 0x as invalid node */ + if (nid == 0x || nid >= nr_node_ids) + nid = NUMA_NO_NODE; + return nid; } - /* * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA * info is found. */ static int associativity_to_nid(const __be32 *associativity) { - int nid = NUMA_NO_NODE; - - if (!numa_enabled) - goto out; - - if (of_read_number(associativity, 1) >= primary_domain_index) - nid = of_read_number(&associativity[primary_domain_index], 1); - - /* POWER4 LPAR uses 0x as invalid node */ - if (nid == 0x || nid >= nr_node_ids) - nid = NUMA_NO_NODE; - - if (nid > 0 && - of_read_number(associativity, 1) >= distance_ref_points_depth) { - /* -* Skip the length field and send start of associativity array -*/ - initialize_distance_lookup_table(nid, associativity + 1); - } + int array_sz = of_read_number(associativity, 1); -out: - return nid; + /* Skip the first element in the associativity array */ + return __associativity_to_nid((associativity + 1), array_sz); } /* Returns the nid associated with the given device tree node, @@ -287,6 +272,60 @@ int of_node_to_nid(struct device_node *device) } EXPORT_SYMBOL(of_node_to_nid); +static void __initialize_form1_numa_distance(const __be32 *associativity, +int max_array_sz) +{ + int i, nid; + + if (affinity_form != FORM1_AFFINITY) + return; + + nid = __associativity_to_nid(associativity, max_array_sz); + if (nid != NUMA_NO_NODE) { + for (i = 0; i < distance_ref_points_depth; i++) { + const __be32 *entry; + int index = be32_to_cpu(distance_ref_points[i]) - 1; + + /* +* broken hierarchy, return with broken distance table +*/ + if (WARN(index >= max_array_sz, "Broken ibm,associativity property")) +
[PATCH v8 4/5] powerpc/pseries: Add a helper for form1 cpu distance
This helper is only used with the dispatch trace log collection. A later patch will add Form2 affinity support and this change helps in keeping that simpler. Also add a comment explaining we don't expect the code to be called with FORM0 Reviewed-by: David Gibson Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/topology.h | 4 ++-- arch/powerpc/mm/numa.c| 10 +- arch/powerpc/platforms/pseries/lpar.c | 4 ++-- 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index a6425a70c37b..a4712ecad3e9 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus) cpu_all_mask : \ cpumask_of_node(pcibus_to_node(bus))) -extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc); +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc); extern int __node_distance(int, int); #define node_distance(a, b) __node_distance(a, b) @@ -84,7 +84,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev, static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {} -static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) +static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) { return 0; } diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index c0a89839310c..fdb2472befa4 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu) } #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */ -int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) +static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) { int dist = 0; @@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) return dist; } +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) +{ + /* We should not get called with FORM0 */ + VM_WARN_ON(affinity_form == FORM0_AFFINITY); + + return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc); +} + /* must hold reference to node during call */ static const __be32 *of_get_associativity(struct device_node *dev) { diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index dab356e3ff87..afefbdfe768d 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu) if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc) return -EIO; - return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc); + return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc); } static int cpu_home_node_dispatch_distance(int disp_cpu) @@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu) if (!disp_cpu_assoc || !vcpu_assoc) return -EIO; - return cpu_distance(disp_cpu_assoc, vcpu_assoc); + return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc); } static void update_vcpu_disp_stat(int disp_cpu) -- 2.31.1
[PATCH v8 5/5] powerpc/pseries: Add support for FORM2 associativity
PAPR interface currently supports two different ways of communicating resource grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping. Form 0 is the older format and is now considered deprecated. This patch adds another resource grouping named FORM2. Signed-off-by: Daniel Henrique Barboza Signed-off-by: Aneesh Kumar K.V --- Documentation/powerpc/associativity.rst | 104 arch/powerpc/include/asm/firmware.h | 3 +- arch/powerpc/include/asm/prom.h | 1 + arch/powerpc/kernel/prom_init.c | 3 +- arch/powerpc/mm/numa.c| 187 ++ arch/powerpc/platforms/pseries/firmware.c | 1 + 6 files changed, 262 insertions(+), 37 deletions(-) create mode 100644 Documentation/powerpc/associativity.rst diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst new file mode 100644 index ..07e7dd3d6c87 --- /dev/null +++ b/Documentation/powerpc/associativity.rst @@ -0,0 +1,104 @@ + +NUMA resource associativity += + +Associativity represents the groupings of the various platform resources into +domains of substantially similar mean performance relative to resources outside +of that domain. Resources subsets of a given domain that exhibit better +performance relative to each other than relative to other resources subsets +are represented as being members of a sub-grouping domain. This performance +characteristic is presented in terms of NUMA node distance within the Linux kernel. +From the platform view, these groups are also referred to as domains. + +PAPR interface currently supports different ways of communicating these resource +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2 +associativity grouping. Form 0 is the oldest format and is now considered deprecated. + +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property". +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1. +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used. + +Form 0 +- +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE). + +Form 1 +- +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity +device tree properties are used to determine the NUMA distance between resource groups/domains. + +The “ibm,associativity” property contains a list of one or more numbers (domainID) +representing the resource’s platform grouping domains. + +The “ibm,associativity-reference-points” property contains a list of one or more numbers +(domainID index) that represents the 1 based ordinal in the associativity lists. +The list of domainID indexes represents an increasing hierarchy of resource grouping. + +ex: +{ primary domainID index, secondary domainID index, tertiary domainID index.. } + +Linux kernel uses the domainID at the primary domainID index as the NUMA node id. +Linux kernel computes NUMA distance between two domains by recursively comparing +if they belong to the same higher-level domains. For mismatch at every higher +level of the resource group, the kernel doubles the NUMA distance between the +comparing domains. + +Form 2 +--- +Form 2 associativity format adds separate device tree properties representing NUMA node distance +thereby making the node distance computation flexible. Form 2 also allows flexible primary +domain numbering. With numa distance computation now detached from the index value in +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain +ids at the same domainID index representing resource groups of different performance/latency +characteristics. + +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the +"ibm,architecture-vec-5" property. + +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing +the domainIDs present in the system. The offset of the domainID in this property is +used as an index while computing numa distance information via "ibm,numa-distance-table". + +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by +N domainID encoded as with encode-int + +For ex: +"ibm,numa-lookup-index-table" = {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when +computing the distance of domain 8 from other domains present in the system. For the rest of +this document, this offset will be referred to as domain distance offset. + +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA +distance between resource groups/domains present in the system. + +prop-encoded-array: The number N of the distance values encoded as with encode-int, follow
[PATCH v2 2/2] powerpc: rename powerpc_debugfs_root to arch_debugfs_dir
No functional change in this patch. arch_debugfs_dir is the generic kernel name declared in linux/debugfs.h for arch-specific debugfs directory. Architectures like x86/s390 already use the name. Rename powerpc specific powerpc_debugfs_root to arch_debugfs_dir. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/debugfs.h | 13 - arch/powerpc/kernel/Makefile| 3 ++- arch/powerpc/kernel/dawr.c | 3 +-- arch/powerpc/kernel/eeh.c | 16 arch/powerpc/kernel/eeh_cache.c | 4 ++-- arch/powerpc/kernel/fadump.c| 4 ++-- arch/powerpc/kernel/hw_breakpoint.c | 1 - arch/powerpc/kernel/kdebugfs.c | 14 ++ arch/powerpc/kernel/security.c | 16 arch/powerpc/kernel/setup-common.c | 13 - arch/powerpc/kernel/setup_64.c | 1 - arch/powerpc/kernel/traps.c | 4 ++-- arch/powerpc/kvm/book3s_xics.c | 6 +++--- arch/powerpc/kvm/book3s_xive.c | 3 +-- arch/powerpc/kvm/book3s_xive_native.c | 3 +-- arch/powerpc/mm/book3s64/hash_utils.c | 4 ++-- arch/powerpc/mm/book3s64/pgtable.c | 4 ++-- arch/powerpc/mm/book3s64/radix_tlb.c| 6 +++--- arch/powerpc/mm/ptdump/bats.c | 4 ++-- arch/powerpc/mm/ptdump/segment_regs.c | 4 ++-- arch/powerpc/platforms/cell/axon_msi.c | 4 ++-- arch/powerpc/platforms/powernv/memtrace.c | 3 +-- arch/powerpc/platforms/powernv/opal-imc.c | 4 ++-- arch/powerpc/platforms/powernv/opal-lpc.c | 4 ++-- arch/powerpc/platforms/powernv/opal-xscom.c | 4 ++-- arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- arch/powerpc/platforms/pseries/dtl.c| 4 ++-- arch/powerpc/platforms/pseries/lpar.c | 5 +++-- arch/powerpc/sysdev/xive/common.c | 3 +-- arch/powerpc/xmon/xmon.c| 6 +++--- 30 files changed, 75 insertions(+), 92 deletions(-) delete mode 100644 arch/powerpc/include/asm/debugfs.h create mode 100644 arch/powerpc/kernel/kdebugfs.c diff --git a/arch/powerpc/include/asm/debugfs.h b/arch/powerpc/include/asm/debugfs.h deleted file mode 100644 index 2c5c48571d75.. --- a/arch/powerpc/include/asm/debugfs.h +++ /dev/null @@ -1,13 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-or-later */ -#ifndef _ASM_POWERPC_DEBUGFS_H -#define _ASM_POWERPC_DEBUGFS_H - -/* - * Copyright 2017, Michael Ellerman, IBM Corporation. - */ - -#include - -extern struct dentry *powerpc_debugfs_root; - -#endif /* _ASM_POWERPC_DEBUGFS_H */ diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index f66b63e81c3b..7be36c1e1db6 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -46,7 +46,8 @@ obj-y := cputable.o syscalls.o \ prom.o traps.o setup-common.o \ udbg.o misc.o io.o misc_$(BITS).o \ of_platform.o prom_parse.o firmware.o \ - hw_breakpoint_constraints.o interrupt.o + hw_breakpoint_constraints.o interrupt.o \ + kdebugfs.o obj-y += ptrace/ obj-$(CONFIG_PPC64)+= setup_64.o \ paca.o nvram_64.o note.o diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c index cdc2dccb987d..64e423d2fe0f 100644 --- a/arch/powerpc/kernel/dawr.c +++ b/arch/powerpc/kernel/dawr.c @@ -9,7 +9,6 @@ #include #include #include -#include #include #include @@ -101,7 +100,7 @@ static int __init dawr_force_setup(void) if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) { /* Turn DAWR off by default, but allow admin to turn it on */ debugfs_create_file_unsafe("dawr_enable_dangerous", 0600, - powerpc_debugfs_root, + arch_debugfs_dir, &dawr_force_enable, &dawr_enable_fops); } diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index 3bbdcc86d01b..e9b597ed423c 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -21,9 +21,9 @@ #include #include #include +#include #include -#include #include #include #include @@ -1901,24 +1901,24 @@ static int __init eeh_init_proc(void) proc_create_single("powerpc/eeh", 0, NULL, proc_eeh_show); #ifdef CONFIG_DEBUG_FS debugfs_create_file_unsafe("eeh_enable", 0600, - powerpc_debugfs_root, NULL, + arch_debugfs_dir, NULL, &eeh_enable_dbgfs_ops); debugf
[PATCH v2 1/2] powerpc/book3s64/radix: make tlb_single_page_flush_ceiling a debugfs entry
Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling. Also add a debugfs entry for tlb_local_single_page_flush_ceiling. Signed-off-by: Aneesh Kumar K.V --- Changes from v1: * switch to debugfs_create_u32 arch/powerpc/mm/book3s64/radix_tlb.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c index aefc100d79a7..1fa2bc6a969e 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -17,6 +17,7 @@ #include #include #include +#include #include "internal.h" @@ -1106,8 +1107,8 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range); * invalidating a full PID, so it has a far lower threshold to change from * individual page flushes to full-pid flushes. */ -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33; -static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2; +static u32 tlb_single_page_flush_ceiling __read_mostly = 33; +static u32 tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2; static inline void __radix__flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end) @@ -1524,3 +1525,14 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid, EXPORT_SYMBOL_GPL(do_h_rpt_invalidate_prt); #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */ + +static int __init create_tlb_single_page_flush_ceiling(void) +{ + debugfs_create_u32("tlb_single_page_flush_ceiling", 0600, + powerpc_debugfs_root, &tlb_single_page_flush_ceiling); + debugfs_create_u32("tlb_local_single_page_flush_ceiling", 0600, + powerpc_debugfs_root, &tlb_local_single_page_flush_ceiling); + return 0; +} +late_initcall(create_tlb_single_page_flush_ceiling); + -- 2.31.1
Re: [PATCH v2 1/2] KVM: Refactor kvm_arch_vcpu_fault() to return a struct page pointer
On 12/08/21 11:04, David Hildenbrand wrote: Reviewed-by: David Hildenbrand But at the same time I wonder if we should just get rid of CONFIG_KVM_S390_UCONTROL and consequently kvm_arch_vcpu_fault(). In practice CONFIG_KVM_S390_UCONTROL, is never enabled in any reasonable kernel build and consequently it's never tested; further, exposing the sie_block to user space allows user space to generate random SIE validity intercepts. CONFIG_KVM_S390_UCONTROL feels like something that should just be maintained out of tree by someone who really needs to hack deep into hw virtualization for testing purposes etc. I have no preference either way. It should definitely have selftests, but in x86 land there are some features that are not covered by QEMU and were nevertheless accepted upstream with selftests. Paolo
Re: [PATCH v2 5/9] powerpc/microwatt: Use standard 16550 UART for console
On Thu, Aug 12, 2021 at 03:14:44PM +0200, Christophe Leroy wrote: > Le 18/06/2021 à 05:46, Paul Mackerras a écrit : > >+static u8 udbg_uart_in_isa300_rm(unsigned int reg) > >+{ > >+uint64_t msr = mfmsr(); > >+uint8_t c; > >+ > >+mtmsr(msr & ~(MSR_EE|MSR_DR)); > >+isync(); > >+eieio(); > >+c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2)); > >+mtmsr(msr); > >+isync(); > >+return c; > >+} > > How do you make sure that GCC won't emit any access to the stack between > the two mtmsr() ? The mtmsr are asm with a memory clobber so nothing will be moved between these, and it is very unlikely anything will sprout up here out of nothing. But yes, this whole thing should be written as real asm (or as one huge inline asm, but ugh). Segher
[PATCH] ppc: add "-z notext" flag to disable diagnostic
The "-z notext" flag disables reporting an error if DT_TEXTREL is set on PPC with CONFIG=kdump: ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against local symbol in readonly segment; recompile object files with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output >>> defined in built-in.a(arch/powerpc/kernel/misc.o) >>> referenced by arch/powerpc/kernel/misc.o:(.text+0x20) in archive built-in.a The BFD linker disables this by default (though it's configurable in current versions). LLD enables this by default. So we add the flag to keep LLD from emitting the error. Signed-off-by: Bill Wendling --- The output of the "get_maintainer.pl" run just in case no one believes me. :-) $ ./scripts/get_maintainer.pl arch/powerpc/Makefile Michael Ellerman (supporter:LINUX FOR POWERPC (32-BIT AND 64-BIT)) Benjamin Herrenschmidt (reviewer:LINUX FOR POWERPC (32-BIT AND 64-BIT)) Paul Mackerras (reviewer:LINUX FOR POWERPC (32-BIT AND 64-BIT)) Nathan Chancellor (supporter:CLANG/LLVM BUILD SUPPORT) Nick Desaulniers (supporter:CLANG/LLVM BUILD SUPPORT) linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)) linux-ker...@vger.kernel.org (open list) clang-built-li...@googlegroups.com (open list:CLANG/LLVM BUILD SUPPORT) --- arch/powerpc/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 6505d66f1193..17a9fbf9b789 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -122,6 +122,7 @@ endif LDFLAGS_vmlinux-y := -Bstatic LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) := -pie +LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) += -z notext LDFLAGS_vmlinux:= $(LDFLAGS_vmlinux-y) ifdef CONFIG_PPC64 -- 2.33.0.rc1.237.g0d66db33f3-goog
Re: [PATCH -next] trap: Cleanup trap_init()
On 8/12/21 5:36 AM, Kefeng Wang wrote: There are some empty trap_init() in different ARCHs, introduce a new weak trap_init() function to cleanup them. Cc: Vineet Gupta Cc: Russell King Cc: Yoshinori Sato Cc: Ley Foon Tan Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: James E.J. Bottomley Cc: Helge Deller Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Jeff Dike Cc: Richard Weinberger Cc: Anton Ivanov Cc: Andrew Morton Signed-off-by: Kefeng Wang --- arch/arc/kernel/traps.c | 5 - Acked-by: Vineet Gupta #arch/arc
Re: [PATCH] ppc: add "-z notext" flag to disable diagnostic
On Thu, Aug 12, 2021 at 1:49 PM Bill Wendling wrote: > > The "-z notext" flag disables reporting an error if DT_TEXTREL is set on > PPC with CONFIG=kdump: > > ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against > local symbol in readonly segment; recompile object files with -fPIC > or pass '-Wl,-z,notext' to allow text relocations in the output > >>> defined in built-in.a(arch/powerpc/kernel/misc.o) > >>> referenced by arch/powerpc/kernel/misc.o:(.text+0x20) in archive > built-in.a > > The BFD linker disables this by default (though it's configurable in > current versions). LLD enables this by default. So we add the flag to > keep LLD from emitting the error. > > Signed-off-by: Bill Wendling Link: https://github.com/ClangBuiltLinux/linux/issues/811 Reported-by: Itaru Kitayama Reviewed-by: Nick Desaulniers > > --- > The output of the "get_maintainer.pl" run just in case no one believes me. :-) LOL! > > $ ./scripts/get_maintainer.pl arch/powerpc/Makefile > Michael Ellerman (supporter:LINUX FOR POWERPC (32-BIT > AND 64-BIT)) > Benjamin Herrenschmidt (reviewer:LINUX FOR POWERPC > (32-BIT AND 64-BIT)) > Paul Mackerras (reviewer:LINUX FOR POWERPC (32-BIT AND > 64-BIT)) > Nathan Chancellor (supporter:CLANG/LLVM BUILD SUPPORT) > Nick Desaulniers (supporter:CLANG/LLVM BUILD > SUPPORT) > linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND > 64-BIT)) > linux-ker...@vger.kernel.org (open list) > clang-built-li...@googlegroups.com (open list:CLANG/LLVM BUILD SUPPORT) > --- > arch/powerpc/Makefile | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile > index 6505d66f1193..17a9fbf9b789 100644 > --- a/arch/powerpc/Makefile > +++ b/arch/powerpc/Makefile > @@ -122,6 +122,7 @@ endif > > LDFLAGS_vmlinux-y := -Bstatic > LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) := -pie > +LDFLAGS_vmlinux-$(CONFIG_RELOCATABLE) += -z notext > LDFLAGS_vmlinux:= $(LDFLAGS_vmlinux-y) > > ifdef CONFIG_PPC64 > -- > 2.33.0.rc1.237.g0d66db33f3-goog > -- Thanks, ~Nick Desaulniers
Re: [PATCH -next] trap: Cleanup trap_init()
Kefeng Wang writes: > There are some empty trap_init() in different ARCHs, introduce > a new weak trap_init() function to cleanup them. > > Cc: Vineet Gupta > Cc: Russell King > Cc: Yoshinori Sato > Cc: Ley Foon Tan > Cc: Jonas Bonn > Cc: Stefan Kristiansson > Cc: Stafford Horne > Cc: James E.J. Bottomley > Cc: Helge Deller > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Paul Walmsley > Cc: Jeff Dike > Cc: Richard Weinberger > Cc: Anton Ivanov > Cc: Andrew Morton > Signed-off-by: Kefeng Wang > --- > arch/arc/kernel/traps.c | 5 - > arch/arm/kernel/traps.c | 5 - > arch/h8300/kernel/traps.c| 4 > arch/hexagon/kernel/traps.c | 4 > arch/nds32/kernel/traps.c| 5 - > arch/nios2/kernel/traps.c| 5 - > arch/openrisc/kernel/traps.c | 5 - > arch/parisc/kernel/traps.c | 4 > arch/powerpc/kernel/traps.c | 5 - Acked-by: Michael Ellerman (powerpc) cheers
[powerpc:fixes-test] BUILD SUCCESS cbc06f051c524dcfe52ef0d1f30647828e226d30
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test branch HEAD: cbc06f051c524dcfe52ef0d1f30647828e226d30 powerpc/xive: Do not skip CPU-less nodes when creating the IPIs elapsed time: 721m configs tested: 134 configs skipped: 108 The following configs have been built successfully. More configs may be tested in the coming days. gcc tested configs: arm defconfig arm64allyesconfig arm64 defconfig arm allyesconfig arm allmodconfig i386 randconfig-c001-20210812 i386 randconfig-c001-20210813 sh polaris_defconfig powerpc bamboo_defconfig powerpc holly_defconfig powerpcmpc7448_hpc2_defconfig powerpccell_defconfig powerpc mpc837x_rdb_defconfig powerpc redwood_defconfig arm at91_dt_defconfig sh secureedge5410_defconfig powerpc pseries_defconfig powerpc makalu_defconfig arm jornada720_defconfig m68k alldefconfig sh rsk7264_defconfig shtitan_defconfig powerpc asp8347_defconfig arc axs103_defconfig powerpc mpc5200_defconfig arm eseries_pxa_defconfig powerpc mpc8540_ads_defconfig arm vf610m4_defconfig powerpc ppc40x_defconfig powerpc ppc44x_defconfig m68k m5208evb_defconfig x86_64 alldefconfig mipsgpr_defconfig mips maltaaprp_defconfig xtensageneric_kc705_defconfig arm stm32_defconfig arm omap1_defconfig arc tb10x_defconfig arm exynos_defconfig mipsworkpad_defconfig mips decstation_defconfig mips ath25_defconfig powerpc ppa8548_defconfig arcnsimosci_defconfig arm davinci_all_defconfig mips gcw0_defconfig armxcep_defconfig riscv allnoconfig powerpc64alldefconfig powerpc ksi8560_defconfig powerpc pcm030_defconfig arm s5pv210_defconfig powerpcicon_defconfig sh ap325rxa_defconfig arm gemini_defconfig sh alldefconfig sh apsh4a3a_defconfig powerpc cm5200_defconfig powerpc allnoconfig armzeus_defconfig sh sh03_defconfig arm aspeed_g5_defconfig mipsmaltaup_defconfig arc nsimosci_hs_defconfig x86_64allnoconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig m68k allyesconfig nios2 defconfig arc allyesconfig nds32 allnoconfig nds32 defconfig nios2allyesconfig cskydefconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig s390 allmodconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig x86_64 randconfig-a006-20210812 x86_64 randconfig-a004-20210812 x86_64 randconfig-a003-20210812 x86_64 randconfig-a005-20210812
[powerpc:next-test] BUILD REGRESSION aaf01a9f507c108cc8e86f420b0a81e21354be1b
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test branch HEAD: aaf01a9f507c108cc8e86f420b0a81e21354be1b cpufreq: powernv: Fix init_chip_info initialization in numa=off Error/Warning reports: https://lore.kernel.org/lkml/202108130432.0md3hwu2-...@intel.com Error/Warning in current branch: arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: 'node_to_cpumask_map' undeclared (first use in this function) Error/Warning ids grouped by kconfigs: gcc_recent_errors `-- powerpc-randconfig-r005-20210812 `-- arch-powerpc-platforms-pseries-hotplug-cpu.c:error:node_to_cpumask_map-undeclared-(first-use-in-this-function) elapsed time: 727m configs tested: 115 configs skipped: 3 gcc tested configs: arm defconfig arm64allyesconfig arm64 defconfig arm allyesconfig arm allmodconfig i386 randconfig-c001-20210812 i386 randconfig-c001-20210813 powerpc mpc5200_defconfig parisc defconfig powerpc wii_defconfig arm integrator_defconfig armtrizeps4_defconfig sh polaris_defconfig powerpc bamboo_defconfig powerpc holly_defconfig powerpcmpc7448_hpc2_defconfig powerpccell_defconfig powerpc mpc837x_rdb_defconfig m68k alldefconfig sh rsk7264_defconfig shtitan_defconfig powerpc asp8347_defconfig powerpcicon_defconfig arc axs103_defconfig riscvnommu_k210_defconfig m68k m5208evb_defconfig x86_64 alldefconfig powerpc pseries_defconfig mipsgpr_defconfig mips maltaaprp_defconfig xtensageneric_kc705_defconfig arm stm32_defconfig arm omap1_defconfig arc tb10x_defconfig arm exynos_defconfig mips ath25_defconfig mipsworkpad_defconfig mips decstation_defconfig powerpc ppa8548_defconfig arcnsimosci_defconfig armzeus_defconfig sh sh03_defconfig arm aspeed_g5_defconfig mipsmaltaup_defconfig arc nsimosci_hs_defconfig x86_64allnoconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig m68k allyesconfig nios2 defconfig arc allyesconfig nds32 allnoconfig nds32 defconfig nios2allyesconfig cskydefconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig s390 allyesconfig s390 allmodconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig x86_64 randconfig-a006-20210812 x86_64 randconfig-a004-20210812 x86_64 randconfig-a003-20210812 x86_64 randconfig-a005-20210812 x86_64 randconfig-a002-20210812 x86_64 randconfig-a001-20210812 i386 randconfig-a004-20210812 i386 randconfig-a003-20210812 i386 randconfig-a002-20210812 i386 randconfig-a001-20210812 i386 randconfig-a006-20210812 i386 randconfig-a005-20210812 i386 randconfig-a011-20210812 i386 randconfig-a015-20210812 i386 randconfig-a013-202
Re: [RFC PATCH v0 5/5] pseries: Asynchronous page fault support
Excerpts from Bharata B Rao's message of August 5, 2021 5:24 pm: > Add asynchronous page fault support for pseries guests. > > 1. Setup the guest to handle async-pf >- Issue H_REG_SNS hcall to register the SNS region. >- Setup the subvention interrupt irq. >- Enable async-pf by updating the byte_b9 of VPA for each > CPU. > 2. Check if the page fault is an expropriation notification >(SRR1_PROGTRAP set in SRR1) and if so put the task on >wait queue based on the expropriation correlation number >read from the VPA. > 3. Handle subvention interrupt to wake any waiting tasks. >The wait and wakeup mechanism from x86 async-pf implementation >is being reused here. I don't know too much about the background of this. How much benefit does this give? What situations? Does PowerVM implement it? Do other architectures KVM have something similar? The SRR1 setting for the DSI is in PAPR? In that case it should be okay, it might be good to add a small comment in exceptions-64s.S. [...] > @@ -395,6 +395,11 @@ static int ___do_page_fault(struct pt_regs *regs, > unsigned long address, > vm_fault_t fault, major = 0; > bool kprobe_fault = kprobe_page_fault(regs, 11); > > +#ifdef CONFIG_PPC_PSERIES > + if (handle_async_page_fault(regs, address)) > + return 0; > +#endif > + > if (unlikely(debugger_fault_handler(regs) || kprobe_fault)) > return 0; [...] > +int handle_async_page_fault(struct pt_regs *regs, unsigned long addr) > +{ > + struct async_pf_sleep_node n; > + DECLARE_SWAITQUEUE(wait); > + unsigned long exp_corr_nr; > + > + /* Is this Expropriation notification? */ > + if (!(mfspr(SPRN_SRR1) & SRR1_PROGTRAP)) > + return 0; Yep this should be an inline that is guarded by a static key, and then probably have an inline check for SRR1_PROGTRAP. You shouldn't need to mfspr here, but just use regs->msr. > + > + if (unlikely(!user_mode(regs))) > + panic("Host injected async PF in kernel mode\n"); Hmm. Is there anything in the PAPR interface that specifies that the OS can only deal with problem state access faults here? Or is that inherent in the expropriation feature? Thanks, Nick
Re: [PATCH v1 17/55] KVM: PPC: Book3S HV P9: Implement PMU save/restore in C
Excerpts from Athira Rajeev's message of August 9, 2021 1:03 pm: > > >> On 26-Jul-2021, at 9:19 AM, Nicholas Piggin wrote: >> +static void freeze_pmu(unsigned long mmcr0, unsigned long mmcra) >> +{ >> +if (!(mmcr0 & MMCR0_FC)) >> +goto do_freeze; >> +if (mmcra & MMCRA_SAMPLE_ENABLE) >> +goto do_freeze; >> +if (cpu_has_feature(CPU_FTR_ARCH_31)) { >> +if (!(mmcr0 & MMCR0_PMCCEXT)) >> +goto do_freeze; >> +if (!(mmcra & MMCRA_BHRB_DISABLE)) >> +goto do_freeze; >> +} >> +return; >> + >> +do_freeze: >> +mmcr0 = MMCR0_FC; >> +mmcra = 0; >> +if (cpu_has_feature(CPU_FTR_ARCH_31)) { >> +mmcr0 |= MMCR0_PMCCEXT; >> +mmcra = MMCRA_BHRB_DISABLE; >> +} >> + >> +mtspr(SPRN_MMCR0, mmcr0); >> +mtspr(SPRN_MMCRA, mmcra); >> +isync(); >> +} >> + > Hi Nick, > > After feezing pmu, do we need to clear “pmcregs_in_use” as well? Not until we save the values out of the registers. pmcregs_in_use = 0 means our hypervisor is free to clear our PMU register contents. > Also can’t we unconditionally do the MMCR0/MMCRA/ freeze settings in here ? > do we need the if conditions for FC/PMCCEXT/BHRB ? I think it's possible, but pretty minimal advantage. I would prefer to set them the way perf does for now. If we can move this code into perf/ it should become easier for you to tweak things. Thanks, Nick
Re: [RFC PATCH v0 5/5] pseries: Asynchronous page fault support
On Fri, Aug 13, 2021 at 02:06:40PM +1000, Nicholas Piggin wrote: > Excerpts from Bharata B Rao's message of August 5, 2021 5:24 pm: > > Add asynchronous page fault support for pseries guests. > > > > 1. Setup the guest to handle async-pf > >- Issue H_REG_SNS hcall to register the SNS region. > >- Setup the subvention interrupt irq. > >- Enable async-pf by updating the byte_b9 of VPA for each > > CPU. > > 2. Check if the page fault is an expropriation notification > >(SRR1_PROGTRAP set in SRR1) and if so put the task on > >wait queue based on the expropriation correlation number > >read from the VPA. > > 3. Handle subvention interrupt to wake any waiting tasks. > >The wait and wakeup mechanism from x86 async-pf implementation > >is being reused here. > > I don't know too much about the background of this. > > How much benefit does this give? What situations? I haven't yet gotten into measuring the benefit of this. Once the patches are bit more stable than what they are currently, we need to measure and evaluate the benefits. > Does PowerVM implement it? I suppose so, need to check though. > Do other architectures KVM have something similar? Yes, x86 and s390 KVM have had this feature for a while now and generic KVM interfaces exist to support it. > > The SRR1 setting for the DSI is in PAPR? In that case it should be okay, > it might be good to add a small comment in exceptions-64s.S. Yes, SRR1 setting is part of PAPR. > > [...] > > > @@ -395,6 +395,11 @@ static int ___do_page_fault(struct pt_regs *regs, > > unsigned long address, > > vm_fault_t fault, major = 0; > > bool kprobe_fault = kprobe_page_fault(regs, 11); > > > > +#ifdef CONFIG_PPC_PSERIES > > + if (handle_async_page_fault(regs, address)) > > + return 0; > > +#endif > > + > > if (unlikely(debugger_fault_handler(regs) || kprobe_fault)) > > return 0; > > [...] > > > +int handle_async_page_fault(struct pt_regs *regs, unsigned long addr) > > +{ > > + struct async_pf_sleep_node n; > > + DECLARE_SWAITQUEUE(wait); > > + unsigned long exp_corr_nr; > > + > > + /* Is this Expropriation notification? */ > > + if (!(mfspr(SPRN_SRR1) & SRR1_PROGTRAP)) > > + return 0; > > Yep this should be an inline that is guarded by a static key, and then > probably have an inline check for SRR1_PROGTRAP. You shouldn't need to > mfspr here, but just use regs->msr. Right. > > > + > > + if (unlikely(!user_mode(regs))) > > + panic("Host injected async PF in kernel mode\n"); > > Hmm. Is there anything in the PAPR interface that specifies that the > OS can only deal with problem state access faults here? Or is that > inherent in the expropriation feature? Didn't see anything specific to that effect in PAPR. However since this puts the faulting guest process to sleep until the page becomes ready in the host, I have limited it to guest user space faults. Regards, Bharata.
Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()
Hi, On 12. 08. 21, 14:26, kernel test robot wrote: drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is uninitialized when used here [-Wuninitialized] spin_unlock_irqrestore(&hp->c_lock, flags); ^~ drivers/tty/hvc/hvc_console.c:149:23: note: initialize the variable 'hp' to silence this warning struct hvc_struct *hp; ^ = NULL So you clearly didn't test your change as it would crash quite instantly. I wonder, where do you intend to get hp from in the console::print() hook? thanks, -- js suse labs
Re: [PATCH v6 1/2] tty: hvc: pass DMA capable memory to put_chars()
在 2021/8/13 下午1:53, Jiri Slaby 写道: Hi, On 12. 08. 21, 14:26, kernel test robot wrote: drivers/tty/hvc/hvc_console.c:190:26: warning: variable 'hp' is uninitialized when used here [-Wuninitialized] spin_unlock_irqrestore(&hp->c_lock, flags); ^~ drivers/tty/hvc/hvc_console.c:149:23: note: initialize the variable 'hp' to silence this warning struct hvc_struct *hp; ^ = NULL So you clearly didn't test your change as it would crash quite instantly. I wonder, where do you intend to get hp from in the console::print() hook? I am very sorry for the inconvenience caused. This is caused by my carelessness:( I take it for granted that there is no problem when I just switch to use array(cons_outbuf[]). sorry agin. thanks,
Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
Excerpts from Christophe Leroy's message of April 14, 2021 2:38 am: > powerpc BUG_ON() and WARN_ON() are based on using twnei instruction. > > For catching simple conditions like a variable having value 0, this > is efficient because it does the test and the trap at the same time. > But most conditions used with BUG_ON or WARN_ON are more complex and > forces GCC to format the condition into a 0 or 1 value in a register. > This will usually require 2 to 3 instructions. > > The most efficient solution would be to use __builtin_trap() because > GCC is able to optimise the use of the different trap instructions > based on the requested condition, but this is complex if not > impossible for the following reasons: > - __builtin_trap() is a non-recoverable instruction, so it can't be > used for WARN_ON > - Knowing which line of code generated the trap would require the > analysis of DWARF information. This is not a feature we have today. > > As mentioned in commit 8d4fbcfbe0a4 ("Fix WARN_ON() on bitfield ops") > the way WARN_ON() is implemented is suboptimal. That commit also > mentions an issue with 'long long' condition. It fixed it for > WARN_ON() but the same problem still exists today with BUG_ON() on > PPC32. It will be fixed by using the generic implementation. > > By using the generic implementation, gcc will naturally generate a > branch to the unconditional trap generated by BUG(). > > As modern powerpc implement zero-cycle branch, > that's even more efficient. > > And for the functions using WARN_ON() and its return, the test > on return from WARN_ON() is now also used for the WARN_ON() itself. > > On PPC64 we don't want it because we want to be able to use CFAR > register to track how we entered the code that trapped. The CFAR > register would be clobbered by the branch. > > A simple test function: > > unsigned long test9w(unsigned long a, unsigned long b) > { > if (WARN_ON(!b)) > return 0; > return a / b; > } > > Before the patch: > > 046c : >46c: 7c 89 00 34 cntlzw r9,r4 >470: 55 29 d9 7e rlwinm r9,r9,27,5,31 >474: 0f 09 00 00 twnei r9,0 >478: 2c 04 00 00 cmpwi r4,0 >47c: 41 82 00 0c beq 488 >480: 7c 63 23 96 divwu r3,r3,r4 >484: 4e 80 00 20 blr > >488: 38 60 00 00 li r3,0 >48c: 4e 80 00 20 blr > > After the patch: > > 0468 : >468: 2c 04 00 00 cmpwi r4,0 >46c: 41 82 00 0c beq 478 >470: 7c 63 23 96 divwu r3,r3,r4 >474: 4e 80 00 20 blr > >478: 0f e0 00 00 twuir0,0 >47c: 38 60 00 00 li r3,0 >480: 4e 80 00 20 blr That's clearly better because we have a branch anyway. > > So we see before the patch we need 3 instructions on the likely path > to handle the WARN_ON(). With the patch the trap goes on the unlikely > path. > > See below the difference at the entry of system_call_exception where > we have several BUG_ON(), allthough less impressing. > > With the patch: > > : > 0: 81 6a 00 84 lwz r11,132(r10) > 4: 90 6a 00 88 stw r3,136(r10) > 8: 71 60 00 02 andi. r0,r11,2 > c: 41 82 00 70 beq 7c > 10: 71 60 40 00 andi. r0,r11,16384 > 14: 41 82 00 6c beq 80 > 18: 71 6b 80 00 andi. r11,r11,32768 > 1c: 41 82 00 68 beq 84 > 20: 94 21 ff e0 stwur1,-32(r1) > 24: 93 e1 00 1c stw r31,28(r1) > 28: 7d 8c 42 e6 mftbr12 > ... > 7c: 0f e0 00 00 twuir0,0 > 80: 0f e0 00 00 twuir0,0 > 84: 0f e0 00 00 twuir0,0 > > Without the patch: > > : > 0: 94 21 ff e0 stwur1,-32(r1) > 4: 93 e1 00 1c stw r31,28(r1) > 8: 90 6a 00 88 stw r3,136(r10) > c: 81 6a 00 84 lwz r11,132(r10) > 10: 69 60 00 02 xorir0,r11,2 > 14: 54 00 ff fe rlwinm r0,r0,31,31,31 > 18: 0f 00 00 00 twnei r0,0 > 1c: 69 60 40 00 xorir0,r11,16384 > 20: 54 00 97 fe rlwinm r0,r0,18,31,31 > 24: 0f 00 00 00 twnei r0,0 > 28: 69 6b 80 00 xorir11,r11,32768 > 2c: 55 6b 8f fe rlwinm r11,r11,17,31,31 > 30: 0f 0b 00 00 twnei r11,0 > 34: 7d 8c 42 e6 mftbr12 This one possibly the branches end up in predictors, whereas conditional trap is always just speculated not to hit. Branches may also have a throughput limit on execution whereas trap could be more (1 per cycle vs 4 per cycle on POWER9). On typical ppc32 CPUs, maybe it's a more obvious win. As you say there is the CFAR issue as well which makes
Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
Excerpts from Christophe Leroy's message of April 14, 2021 2:38 am: > Using asm goto in __WARN_FLAGS() and WARN_ON() allows more > flexibility to GCC. > > For that add an entry to the exception table so that > program_check_exception() knowns where to resume execution > after a WARNING. Nice idea. How much does it bloat the exception table? How easy would it be to make a different exception table for unimportant stuff like WARN_ON faults? > > Here are two exemples. The first one is done on PPC32 (which > benefits from the previous patch), the second is on PPC64. > > unsigned long test(struct pt_regs *regs) > { > int ret; > > WARN_ON(regs->msr & MSR_PR); > > return regs->gpr[3]; > } > > unsigned long test9w(unsigned long a, unsigned long b) > { > if (WARN_ON(!b)) > return 0; > return a / b; > } > > Before the patch: > > 03a8 : >3a8: 81 23 00 84 lwz r9,132(r3) >3ac: 71 29 40 00 andi. r9,r9,16384 >3b0: 40 82 00 0c bne 3bc >3b4: 80 63 00 0c lwz r3,12(r3) >3b8: 4e 80 00 20 blr > >3bc: 0f e0 00 00 twuir0,0 >3c0: 80 63 00 0c lwz r3,12(r3) >3c4: 4e 80 00 20 blr > > 0bf0 <.test9w>: >bf0: 7c 89 00 74 cntlzd r9,r4 >bf4: 79 29 d1 82 rldicl r9,r9,58,6 >bf8: 0b 09 00 00 tdnei r9,0 >bfc: 2c 24 00 00 cmpdi r4,0 >c00: 41 82 00 0c beq c0c <.test9w+0x1c> >c04: 7c 63 23 92 divdu r3,r3,r4 >c08: 4e 80 00 20 blr > >c0c: 38 60 00 00 li r3,0 >c10: 4e 80 00 20 blr > > After the patch: > > 03a8 : >3a8: 81 23 00 84 lwz r9,132(r3) >3ac: 71 29 40 00 andi. r9,r9,16384 >3b0: 40 82 00 0c bne 3bc >3b4: 80 63 00 0c lwz r3,12(r3) >3b8: 4e 80 00 20 blr > >3bc: 0f e0 00 00 twuir0,0 > > 0c50 <.test9w>: >c50: 7c 89 00 74 cntlzd r9,r4 >c54: 79 29 d1 82 rldicl r9,r9,58,6 >c58: 0b 09 00 00 tdnei r9,0 >c5c: 7c 63 23 92 divdu r3,r3,r4 >c60: 4e 80 00 20 blr > >c70: 38 60 00 00 li r3,0 >c74: 4e 80 00 20 blr One thing that would be nice for WARN_ON_ONCE is to patch out the trap after the program interrupt. I've seen sometimes the WARN_ON_ONCE starts firing a lot and slows down the kernel. That gets harder to do if the trap is now part of the control flow. I guess that's less important case for performance though. And in theory you might be able to replace the trap with an equivalent cmp and branch (but that's probably going too over engineering it). Thanks, Nick