Re: [PATCH kernel v2 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
On Thu, 21 Jun 2018 12:03:21 +1000 David Gibson wrote: > On Sun, Jun 17, 2018 at 09:14:28PM +1000, Alexey Kardashevskiy wrote: > > At the moment we allocate the entire TCE table, twice (hardware part and > > userspace translation cache). This normally works as we normally have > > contigous memory and the guest will map entire RAM for 64bit DMA. > > > > However if we have sparse RAM (one example is a memory device), then > > we will allocate TCEs which will never be used as the guest only maps > > actual memory for DMA. If it is a single level TCE table, there is nothing > > we can really do but if it a multilevel table, we can skip allocating > > TCEs we know we won't need. > > > > This adds ability to allocate only first level, saving memory. > > > > This changes iommu_table::free() to avoid allocating of an extra level; > > iommu_table::set() will do this when needed. > > > > This adds @alloc parameter to iommu_table::exchange() to tell the callback > > if it can allocate an extra level; the flag is set to "false" for > > the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns > > H_TOO_HARD. > > > > This still requires the entire table to be counted in mm::locked_vm. > > > > To be conservative, this only does on-demand allocation when > > the usespace cache table is requested which is the case of VFIO. > > > > The example math for a system replicating a powernv setup with NVLink2 > > in a guest: > > 16GB RAM mapped at 0x0 > > 128GB GPU RAM window (16GB of actual RAM) mapped at 0x2440 > > > > the table to cover that all with 64K pages takes: > > (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB > > > > If we allocate only necessary TCE levels, we will only need: > > (((0x4 + 0x4) >> 16)*8)>>20 = 4MB (plus some for indirect > > levels). > > > > Signed-off-by: Alexey Kardashevskiy > > --- > > Changes: > > v2: > > * fixed bug in cleanup path which forced the entire table to be > > allocated right before destroying > > * added memory allocation error handling pnv_tce() > > --- > > arch/powerpc/include/asm/iommu.h | 7 ++- > > arch/powerpc/platforms/powernv/pci.h | 6 ++- > > arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +- > > arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 > > --- > > arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++-- > > drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- > > 6 files changed, 69 insertions(+), 27 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/iommu.h > > b/arch/powerpc/include/asm/iommu.h > > index 4bdcf22..daa3ee5 100644 > > --- a/arch/powerpc/include/asm/iommu.h > > +++ b/arch/powerpc/include/asm/iommu.h > > @@ -70,7 +70,7 @@ struct iommu_table_ops { > > unsigned long *hpa, > > enum dma_data_direction *direction); > > > > - __be64 *(*useraddrptr)(struct iommu_table *tbl, long index); > > + __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); > > #endif > > void (*clear)(struct iommu_table *tbl, > > long index, long npages); > > @@ -122,10 +122,13 @@ struct iommu_table { > > __be64 *it_userspace; /* userspace view of the table */ > > struct iommu_table_ops *it_ops; > > struct krefit_kref; > > + int it_nid; > > }; > > > > +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \ > > + ((tbl)->it_ops->useraddrptr((tbl), (entry), false)) > > #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ > > - ((tbl)->it_ops->useraddrptr((tbl), (entry))) > > + ((tbl)->it_ops->useraddrptr((tbl), (entry), true)) > > > > /* Pure 2^n version of get_order */ > > static inline __attribute_const__ > > diff --git a/arch/powerpc/platforms/powernv/pci.h > > 
b/arch/powerpc/platforms/powernv/pci.h > > index 5e02408..1fa5590 100644 > > --- a/arch/powerpc/platforms/powernv/pci.h > > +++ b/arch/powerpc/platforms/powernv/pci.h > > @@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long > > index, long npages, > > unsigned long attrs); > > extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages); > > extern int pnv_tce_xchg(struct iommu_table *tbl, long index, > > - unsigned long *hpa, enum dma_data_direction *direction); > > -extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index); > > + unsigned long *hpa, enum dma_data_direction *direction, > > + bool alloc); > > +extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index, > > + bool alloc); > > extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index); > > > > extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset, > > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c > > b/arch/powerpc/kvm/book3s_64_vio_hv.c > > index db0490c..05b4865 100644 > > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > > +++ b/arch/powerpc/kvm/book3s
[PATCH 1/2] powerpc/mm: Check memblock_add against MAX_PHYSMEM_BITS range
With SPARSEMEM config enabled, we make sure that we don't add sections beyond the MAX_PHYSMEM_BITS range. This results in not building the vmemmap mapping for ranges beyond the max range. But our memblock layer looks at the device tree and creates mappings for the full memory range. Prevent this by checking against MAX_PHYSMEM_BITS when doing memblock_add. We don't do a similar check for memblock_reserve_range. If a reserved range is beyond MAX_PHYSMEM_BITS we expect that to be configured with 'nomap'. Any other reserved range should come from existing memblock ranges which we already filtered while adding. This avoids a crash as below when running on a system with system RAM configured above MAX_PHYSMEM_BITS Unable to handle kernel paging request for data at address 0xc00a00100440 Faulting instruction address: 0xc1034118 cpu 0x0: Vector: 300 (Data Access) at [c124fb30] pc: c1034118: __free_pages_bootmem+0xc0/0x1c0 lr: c103b258: free_all_bootmem+0x19c/0x22c sp: c124fdb0 msr: 92001033 dar: c00a00100440 dsisr: 4000 current = 0xc120dd00 paca= 0xc1f6^I irqmask: 0x03^I irq_happened: 0x01 pid = 0, comm = swapper [c124fe20] c103b258 free_all_bootmem+0x19c/0x22c [c124fee0] c1010a68 mem_init+0x3c/0x5c [c124ff00] c100401c start_kernel+0x298/0x5e4 [c124ff90] c000b57c start_here_common+0x1c/0x520 Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/kernel/prom.c | 32 +--- 1 file changed, 29 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index 05e7fb47a7a4..8f32f14ba508 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -440,6 +440,29 @@ static int __init early_init_dt_scan_chosen_ppc(unsigned long node, return 1; } +/* + * Compare the range against max mem limit and update + * size if it cross the limit. + */ + +#ifdef CONFIG_SPARSEMEM +static bool validate_mem_limit(u64 base, u64 *size) +{ + u64 max_mem = 1UL << (MAX_PHYSMEM_BITS); + + if (base >= max_mem) + return false; + if ((base + *size) > max_mem) + *size = max_mem - base; + return true; +} +#else +static bool validate_mem_limit(u64 base, u64 *size) +{ + return true; +} +#endif + #ifdef CONFIG_PPC_PSERIES /* * Interpret the ibm dynamic reconfiguration memory LMBs. @@ -494,7 +517,8 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb, } DBG("Adding: %llx -> %llx\n", base, size); - memblock_add(base, size); + if (validate_mem_limit(base, &size)) + memblock_add(base, size); } while (--rngs); } #endif /* CONFIG_PPC_PSERIES */ @@ -548,8 +572,10 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size) } /* Add the chunk to the MEMBLOCK list */ - if (add_mem_to_memblock) - memblock_add(base, size); + if (add_mem_to_memblock) { + if (validate_mem_limit(base, &size)) + memblock_add(base, size); + } } static void __init early_reserve_mem_dt(void) -- 2.17.1
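As an illustration of the clamping (assuming the current MAX_PHYSMEM_BITS value of 46): max_mem = 1UL << 46 = 64TB, so a memory range with base = 63TB and size = 2TB gets its size trimmed to 1TB before memblock_add(), while a range starting at 65TB fails the base >= max_mem check and is skipped entirely.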
[PATCH 2/2] powerpc/mm: Increase MAX_PHYSMEM_BITS to 128TB with SPARSEMEM_VMEMMAP config
We do this only with VMEMMAP config so that our page_to_[nid/section] etc are not impacted. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/sparsemem.h | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h index bc66712bdc3c..28f5dae25db6 100644 --- a/arch/powerpc/include/asm/sparsemem.h +++ b/arch/powerpc/include/asm/sparsemem.h @@ -6,13 +6,20 @@ #ifdef CONFIG_SPARSEMEM /* * SECTION_SIZE_BITS 2^N: how big each section will be - * MAX_PHYSADDR_BITS 2^N: how much physical address space we have * MAX_PHYSMEM_BITS2^N: how much memory we can have in that space */ #define SECTION_SIZE_BITS 24 - -#define MAX_PHYSADDR_BITS 46 +/* + * If we store section details in page->flags we can't increase the MAX_PHYSMEM_BITS + * if we increase SECTIONS_WIDTH we will not store node details in page->flags and + * page_to_nid does a page->section->node lookup + * Hence only increase for VMEMMAP. + */ +#ifdef CONFIG_SPARSEMEM_VMEMMAP +#define MAX_PHYSMEM_BITS47 +#else #define MAX_PHYSMEM_BITS46 +#endif #endif /* CONFIG_SPARSEMEM */ -- 2.17.1
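For reference, 2^46 bytes is 64TB and 2^47 bytes is 128TB, which is where the 128TB in the subject comes from.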
[RESEND PATCH 1/3] powerpc: dts: use 'atmel' as at24 manufacturer for pdm360ng
Using 'at' as the manufacturer part of the compatible string is now deprecated. Use a correct string: 'atmel,'. Signed-off-by: Bartosz Golaszewski --- arch/powerpc/boot/dts/pdm360ng.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/boot/dts/pdm360ng.dts b/arch/powerpc/boot/dts/pdm360ng.dts index 445b88114009..df1283b63d9b 100644 --- a/arch/powerpc/boot/dts/pdm360ng.dts +++ b/arch/powerpc/boot/dts/pdm360ng.dts @@ -98,7 +98,7 @@ fsl,preserve-clocking; eeprom@50 { - compatible = "at,24c01"; + compatible = "atmel,24c01"; reg = <0x50>; }; -- 2.17.1
[RESEND PATCH 2/3] powerpc: dts: use 'atmel' as at24 manufacturer for kmcent2
Using compatible strings without the manufacturer part for at24 is now deprecated. Use a correct 'atmel,' value. Signed-off-by: Bartosz Golaszewski --- arch/powerpc/boot/dts/fsl/kmcent2.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/boot/dts/fsl/kmcent2.dts b/arch/powerpc/boot/dts/fsl/kmcent2.dts index 5922c1ea0e96..3094df05f5ea 100644 --- a/arch/powerpc/boot/dts/fsl/kmcent2.dts +++ b/arch/powerpc/boot/dts/fsl/kmcent2.dts @@ -130,7 +130,7 @@ #size-cells = <0>; eeprom@54 { - compatible = "24c02"; + compatible = "atmel,24c02"; reg = <0x54>; pagesize = <2>; read-only; -- 2.17.1
[RESEND PATCH 3/3] powerpc: dts: use a correct at24 compatible fallback in ac14xx
Using 'at24' as fallback is now deprecated - use the full 'atmel,' string. Signed-off-by: Bartosz Golaszewski --- arch/powerpc/boot/dts/ac14xx.dts | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/boot/dts/ac14xx.dts b/arch/powerpc/boot/dts/ac14xx.dts index 83bcfd865167..0be5c4f3265d 100644 --- a/arch/powerpc/boot/dts/ac14xx.dts +++ b/arch/powerpc/boot/dts/ac14xx.dts @@ -176,12 +176,12 @@ clock-frequency = <40>; at24@30 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x30>; }; at24@31 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x31>; }; @@ -191,42 +191,42 @@ }; at24@50 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x50>; }; at24@51 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x51>; }; at24@52 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x52>; }; at24@53 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x53>; }; at24@54 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x54>; }; at24@55 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x55>; }; at24@56 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x56>; }; at24@57 { - compatible = "at24,24c01"; + compatible = "atmel,24c01"; reg = <0x57>; }; -- 2.17.1
[PATCH kernel v2 REPOST] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
At the moment we allocate the entire TCE table, twice (hardware part and userspace translation cache). This normally works as we normally have contiguous memory and the guest will map the entire RAM for 64bit DMA. However if we have sparse RAM (one example is a memory device), then we will allocate TCEs which will never be used as the guest only maps actual memory for DMA. If it is a single level TCE table, there is nothing we can really do but if it is a multilevel table, we can skip allocating TCEs we know we won't need. This adds the ability to allocate only the first level, saving memory. This changes iommu_table::free() to avoid allocating an extra level; iommu_table::set() will do this when needed. This adds an @alloc parameter to iommu_table::exchange() to tell the callback if it can allocate an extra level; the flag is set to "false" for the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns H_TOO_HARD. This still requires the entire table to be counted in mm::locked_vm. To be conservative, this only does on-demand allocation when the userspace cache table is requested which is the case for VFIO. The example math for a system replicating a powernv setup with NVLink2 in a guest: 16GB RAM mapped at 0x0 128GB GPU RAM window (16GB of actual RAM) mapped at 0x2440 the table to cover that all with 64K pages takes: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB If we allocate only necessary TCE levels, we will only need: (((0x4 + 0x4) >> 16)*8)>>20 = 4MB (plus some for indirect levels). Signed-off-by: Alexey Kardashevskiy --- This is what I meant to post a few days ago, sorry for the noise. --- Changes: v2: * fixed bug in cleanup path which forced the entire table to be allocated right before destroying * added memory allocation error handling to pnv_tce() --- arch/powerpc/include/asm/iommu.h | 7 ++- arch/powerpc/platforms/powernv/pci.h | 6 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 73 +-- arch/powerpc/platforms/powernv/pci-ioda.c | 8 +-- drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- 6 files changed, 73 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 4bdcf22..daa3ee5 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -70,7 +70,7 @@ struct iommu_table_ops { unsigned long *hpa, enum dma_data_direction *direction); - __be64 *(*useraddrptr)(struct iommu_table *tbl, long index); + __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); #endif void (*clear)(struct iommu_table *tbl, long index, long npages); @@ -122,10 +122,13 @@ struct iommu_table { __be64 *it_userspace; /* userspace view of the table */ struct iommu_table_ops *it_ops; struct kref it_kref; + int it_nid; }; +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \ + ((tbl)->it_ops->useraddrptr((tbl), (entry), false)) #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ - ((tbl)->it_ops->useraddrptr((tbl), (entry))) + ((tbl)->it_ops->useraddrptr((tbl), (entry), true)) /* Pure 2^n version of get_order */ static inline __attribute_const__ diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index 5e02408..1fa5590 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages, unsigned long attrs); extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages); extern int pnv_tce_xchg(struct iommu_table *tbl, long
index, - unsigned long *hpa, enum dma_data_direction *direction); -extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index); + unsigned long *hpa, enum dma_data_direction *direction, + bool alloc); +extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index, + bool alloc); extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index); extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset, diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index 8cc1caf..efb90d8 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm, { struct mm_iommu_table_group_mem_t *mem = NULL; const unsigned long pgsize = 1ULL << tbl->it_page_shift; - __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); + __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry); if (!pua) /* it_userspace allocation might be
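To illustrate the idea behind the @alloc parameter described in the changelog, here is a minimal sketch with hypothetical names and a single indirect level; it is not the actual pnv_tce() implementation, which among other things must cope with two callers racing to install the same level:

#include <linux/gfp.h>
#include <linux/types.h>

struct demo_tce_table {
	unsigned int shift;	/* log2 of the number of TCEs per level */
	__be64 **levels;	/* first level: pointers to on-demand second-level pages */
};

static __be64 *demo_tce_ptr(struct demo_tce_table *tbl, unsigned long idx, bool alloc)
{
	__be64 *dir = tbl->levels[idx >> tbl->shift];

	if (!dir) {
		if (!alloc)
			return NULL;	/* realmode caller cannot allocate; it returns H_TOO_HARD */
		dir = (__be64 *)get_zeroed_page(GFP_KERNEL);
		if (!dir)
			return NULL;	/* out of memory */
		tbl->levels[idx >> tbl->shift] = dir;
	}
	return &dir[idx & ((1UL << tbl->shift) - 1)];
}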
Re: [PATCH v2 0/6] powerpc/pkeys: fixes to pkeys
Florian Weimer writes: > On 06/19/2018 02:40 PM, Michael Ellerman wrote: >>> I tested the whole series with the new selftests, with the printamr.c >>> program I posted earlier, and the glibc test for pkey_alloc &c. The >>> latter required some test fixes, but now passes as well. As far as I >>> can tell, everything looks good now. >>> >>> Tested-By: Florian Weimer >> Thanks. I'll add that to each patch I guess, if you're happy with that? > > Sure, but I only tested the whole series as a whole. Yeah OK. We don't have a good way to express that, other than using a merge which I'd prefer to avoid. So I've tagged them all with your Tested-by. If any of them turn out to have bugs you can blame me :) cheers
[4.18.0-rc1][i40e][9b10df59] WARNING: CPU: 32 PID: 95246 at drivers/net/ethernet/intel/i40e/i40e_ethtool.c:1907 i40e_update_ethtool_fdir_entry.isra.12+0x2264/0x2270
Greeting's WARN_ONCE is triggered when interface is toggled when running 'ethtool -S' command on a powerpc machine running today's mainline kernel related commit looks to be: commit 9b10df596bd4d38f2a58cf87e0780510acc53d8d Author: Jacob Keller Date: Thu May 17 01:08:36 2018 -0700 i40e: use WARN_ONCE to replace the commented BUG_ON size check We don't really want to use BUG_ON here since that would completely crash the kernel, thus the reason we commented it out. We *can't* use BUILD_BUG_ON because at least now (a) the sizes aren't constant (we are fixing this) and (b) not all compilers are smart enough to understand that "p - data" is a constant. Machine: Power 8 bare-metal kernel: 4.18.0-rc1 ethtool: version 4.8 gcc: 4.8.5 config: attached Adapter: Ethernet controller [0200]: Intel Corporation Ethernet Controller X710/X557-AT 10GBASE-T [8086:1589] (rev 02) dmesg: -- i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Down i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Down i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None i40e 0002:01:00.0 net0: NIC Link is Down i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Down i40e 0002:01:00.3 enP2p1s0f3: NIC Link is Down i40e 0002:01:00.0 net0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None i40e 0002:01:00.3 enP2p1s0f3: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None i40e 0002:01:00.0 net0: NIC Link is Down i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Down i40e 0002:01:00.3 enP2p1s0f3: NIC Link is Down i40e 0002:01:00.0 net0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None i40e 0002:01:00.2 enP2p1s0f2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None i40e 0002:01:00.3 enP2p1s0f3: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None [ cut here ] stat strings count mismatch! 
WARNING: CPU: 81 PID: 24268 at drivers/net/ethernet/intel/i40e/i40e_ethtool.c:1907 i40e_update_ethtool_fdir_entry.isra.12+0x2264/0x2270 [i40e] Modules linked in: ipt_MASQUERADE tun bridge stp llc xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_filter btrfs xor zstd_decompress zstd_compress xxhash lzo_compress vmx_crypto raid6_pq powernv_rng rng_core kvm_hv kvm nfsd ip_tables x_tables xfs libcrc32c qla2xxx mpt3sas nvme_fc nvme_fabrics raid_class i40e nvme_core scsi_transport_sas CPU: 81 PID: 24268 Comm: ethtool Not tainted 4.18.0-rc1-autotest-autotest #1 NIP: d00010260634 LR: d00010260630 CTR: c09dc580 REGS: c07fe2c2f730 TRAP: 0700 Not tainted (4.18.0-rc1-autotest-autotest) MSR: 90029033 CR: 48028422 XER: 2000 CFAR: c00f1d98 IRQMASK: 0 GPR00: d00010260630 c07fe2c2f9b0 d0001029cc00 001c GPR04: 0001 000c 0001 c115be00 GPR08: 0001 0007 0006 90001003 GPR12: 8800 c07fff7fe580 GPR16: GPR20: d0001028e500 d0001028e518 d0001028e530 d0001028e548 GPR24: d0001028e598 d0001028e5b0 c03fec19d000 d0001028e5d0 GPR28: 0008 54c0 ab40 d0001b4a54c0 NIP [d00010260634] i40e_update_ethtool_fdir_entry.isra.12+0x2264/0x2270 [i40e] LR [d00010260630] i40e_update_ethtool_fdir_entry.isra.12+0x2260/0x2270 [i40e] Call Trace: [c07fe2c2f9b0] [d00010260630] i40e_update_ethtool_fdir_entry.isra.12+0x2260/0x2270 [i40e] (unreliable) [c07fe2c2fa70] [c08a3990] dev_ethtool+0x1880/0x27b0 [c07fe2c2fba0] [c08c86a0] dev_ioctl+0x560/0x7e0 [c07fe2c2fbf0] [c0869510] sock_do_ioctl+0xc0/0x1a0 [c07fe2c2fc60] [c0869ed0] sock_ioctl+0x1a0/0x460 [c07fe2c2fd20] [c0329dc8] do_vfs_ioctl+0xc8/0x8b0 [c07fe2c2fdc0] [c032a60c] ksys_ioctl+0x5c/0xe0 [c07fe2c2fe10] [c032a6b0] sys_ioctl+0x20/0x80 [c07fe2c2fe30] [c000b9e4] system_call+0x5c/0x70 Instruction dump: 41fefc1c 3d22 e9298e80 8949 2f8a 40fefc08 3c62 e8638e88 3941 9949 48022d45 e8410018 <0fe0> 4bfffbe8 6042 3c4c0004 ---[ end trace 1a1b8331349cf7fd ]--- -- Regard's Abdul Haleem IBM Linux Technology Centre # # Automatically generated file; DO NOT EDIT. # Linux/powerpc 4.11.0-rc4 Kernel Configuration # CONFIG_PPC64=y # # Processor support # CONFIG_PPC_BOOK3S_64=y # CONFIG_PPC_BOOK3E_64 is not set # CONFIG_POWER7_CPU is not set CONFIG_POWER8_CPU=y CONFIG_PPC_BOOK3S=y CONFIG_PPC_FPU=y CONFIG_ALTIVEC=y CONFIG_VSX=y # CONFIG_PPC_ICSWX is not set CONFIG_PPC_STD_MMU=y CONFIG_PPC_STD_MMU_64=y CONFIG_PPC
Re: [PATCH v3] powerpc/32: Remove left over function prototypes
Mathieu Malaterre writes: > In commit 4aea909eeba3 ("powerpc: Add missing prototypes in setup_32.c") I don't have that commit ^ ? That might be because I squashed some of your fixes together or something? > diff --git a/arch/powerpc/kernel/setup.h b/arch/powerpc/kernel/setup.h > index 35ca309848d7..829ed66f0a40 100644 > --- a/arch/powerpc/kernel/setup.h > +++ b/arch/powerpc/kernel/setup.h > @@ -19,9 +19,6 @@ void irqstack_early_init(void); > void setup_power_save(void); > unsigned long __init early_init(unsigned long dt_ptr); > void __init machine_init(u64 dt_ptr); > -int __init ppc_setup_l2cr(char *str); > -int __init ppc_setup_l3cr(char *str); > -int __init ppc_init(void); > #else > static inline void setup_power_save(void) { }; > #endif I have: #ifdef CONFIG_PPC32 void setup_power_save(void); #else static inline void setup_power_save(void) { }; #endif cheers
Re: [PATCH 2/2] powerpc/mm: Increase MAX_PHYSMEM_BITS to 128TB with SPARSEMEM_VMEMMAP config
On Thu, Jun 21, 2018 at 6:31 PM, Aneesh Kumar K.V wrote: > > We do this only with VMEMMAP config so that our page_to_[nid/section] etc are > not > impacted. > > Signed-off-by: Aneesh Kumar K.V Why 128TB, given that it's sparse_vmemmap_extreme by default, why not 1PB directly (50 bits)? Balbir Singh
Re: [PATCH 0/7 v5] Support for fsl-mc bus and its devices in SMMU
Hi Nipun, On Thu, Jun 21, 2018 at 03:59:27AM +, Nipun Gupta wrote: > Will this patch-set be taken for the next kernel release (and via which > tree)? I think you need Acks from Robin and Joerg in order for this to be queued. Robin should be back at the beginning of next month, so there's still time for 4.19. Will
Re: [PATCH 2/2] powerpc/mm: Increase MAX_PHYSMEM_BITS to 128TB with SPARSEMEM_VMEMMAP config
On 06/21/2018 05:02 PM, Balbir Singh wrote: On Thu, Jun 21, 2018 at 6:31 PM, Aneesh Kumar K.V wrote: We do this only with VMEMMAP config so that our page_to_[nid/section] etc are not impacted. Signed-off-by: Aneesh Kumar K.V Why 128TB, given that it's sparse_vmemmap_extreme by default, why not 1PB directly (50 bits)? That will impact config with VMEMMAP_EXTREME with no real immediate benefit. We could possibly make MAX_PHYSMEM_BITS a Kconfig variable. s390 do that. Not sure we want to do that. -aneesh
Re: [PATCH 2/2] powerpc/mm: Increase MAX_PHYSMEM_BITS to 128TB with SPARSEMEM_VMEMMAP config
On 06/21/2018 05:02 PM, Balbir Singh wrote: On Thu, Jun 21, 2018 at 6:31 PM, Aneesh Kumar K.V wrote: We do this only with VMEMMAP config so that our page_to_[nid/section] etc are not impacted. Signed-off-by: Aneesh Kumar K.V Why 128TB, given that it's sparse_vmemmap_extreme by default, why not 1PB directly (50 bits)? That will impact config with VMEMMAP_EXTREME disabled with no real immediate benefit. We could possibly make MAX_PHYSMEM_BITS a Kconfig variable. s390 do that. Not sure we want to do that. -aneesh
Re: [PATCH v2 1/6] powerpc/pkeys: Enable all user-allocatable pkeys at init.
On Thu, Jun 21, 2018 at 02:14:53PM +1000, Michael Ellerman wrote: > Ram Pai writes: > > On Tue, Jun 19, 2018 at 10:39:52PM +1000, Michael Ellerman wrote: > >> Ram Pai writes: > >> > >> > In a multithreaded application, a key allocated by one thread must > >> > be activate and usable on all threads. > >> > > >> > Currently this is not the case, because the UAMOR bits for all keys are > >> > disabled by default. When a new key is allocated in one thread, though > >> > the corresponding UAMOR bits for that thread get enabled, the UAMOR bits > >> > for all other existing threads continue to have their bits disabled. > >> > Other threads have no way to set permissions on the key, effectively > >> > making the key useless. > >> > >> This all seems a bit strongly worded to me. It's arguable whether a key > >> should be usable by the thread that allocated it or all threads. > >> > >> You could conceivably have a design where threads are blocked from using > >> a key until they're given permission to do so by the thread that > >> allocated the key. > >> > >> But we're changing the behaviour to match x86 and because we don't have > >> an API to grant another thread access to a key. Right? > > > > correct. The other threads have no way to access or change the > > permissions on the key. > > OK. > > Though prior to patch 6 all threads have read/write permissions for all > keys, so they don't necessarily need to change permissions on a key > allocated by another thread. > > >> > Enable the UAMOR bits for all keys, at process creation. Since the > >> > contents of UAMOR are inherited at fork, all threads are capable of > >> > modifying the permissions on any key. > >> > > >> > BTW: changing the permission on unallocated keys has no effect, till > >> > those keys are not associated with any PTEs. The kernel will anyway > >> > disallow to association of unallocated keys with PTEs. > >> > >> This is an ABI change, which is bad, but I guess we call it a bug fix > >> because things didn't really work previously? > > > > Yes its a behaviorial change for the better. There is no downside > > to the change because no applications should break. Single threaded > > apps will continue to just work fine. Multithreaded applications, > > which were unable to consume the API/ABI, will now be able to do so. > > Multi-threaded applications were able to use the API, as long as they > were satisfied with the semantics it provided, ie. that restrictions on > a key were only possible on the thread that allocated the key. > > I'm not trying to argue for the sake of it, it's important that we > understand the subtleties of what we're changing and how it affects > existing software - even if we think there is essentially no existing > software. > > I'll try and massage the change log to capture it. > > I ended up with what's below. > > cheers > > powerpc/pkeys: Give all threads control of their key permissions > > Currently in a multithreaded application, a key allocated by one > thread is not usable by other threads. By "not usable" we mean that > other threads are unable to change the access permissions for that > key for themselves. > > When a new key is allocated in one thread, the corresponding UAMOR > bits for that thread get enabled, however the UAMOR bits for that key > for all other threads remain disabled. > > Other threads have no way to set permissions on the key, and the > current default permissions are that read/write is enabled for all > keys, which means the key has no effect for other threads. 
Although > that may be the desired behaviour in some circumstances, having all > threads able to control their permissions for the key is more > flexible. > > The current behaviour also differs from the x86 behaviour, which is > problematic for users. > > To fix this, enable the UAMOR bits for all keys, at process > creation (in start_thread(), ie exec time). Since the contents of > UAMOR are inherited at fork, all threads are capable of modifying the > permissions on any key. > > This is technically an ABI break on powerpc, but pkey support is > fairly new on powerpc and not widely used, and this brings us into > line with x86. Wow! yes it crisply captures the subtle API change and the reasoning behind it. RP
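As a user-space illustration of the semantics being settled on here, a hedged sketch using the glibc pkey wrappers (error handling omitted): once every thread's UAMOR covers the key, a thread that did not allocate the key can still adjust its own access rights for it.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

static int key;
static char *page;

static void *worker(void *arg)
{
	/* This thread did not call pkey_alloc(); it grants itself access
	 * to the key before touching the protected page. */
	pkey_set(key, 0);
	page[0] = 1;
	return NULL;
}

int main(void)
{
	pthread_t t;

	page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	key = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	pkey_mprotect(page, 4096, PROT_READ | PROT_WRITE, key);

	pthread_create(&t, NULL, worker, NULL);
	pthread_join(t, NULL);
	printf("worker wrote %d\n", page[0]);
	return 0;
}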
Re: [PATCH v2 2/6] powerpc/pkeys: Save the pkey registers before fork
On Thu, Jun 21, 2018 at 02:13:40PM +1000, Michael Ellerman wrote: > Ram Pai writes: > > > On Tue, Jun 19, 2018 at 10:39:56PM +1000, Michael Ellerman wrote: > >> Ram Pai writes: > >> > >> > When a thread forks the contents of AMR, IAMR, UAMOR registers in the > >> > newly forked thread are not inherited. > >> > > >> > Save the registers before forking, for content of those > >> > registers to be automatically copied into the new thread. > >> > > >> > CC: Michael Ellerman > >> > CC: Florian Weimer > >> > CC: Andy Lutomirski > >> > CC: Thiago Jung Bauermann > >> > Signed-off-by: Ram Pai > >> > >> Again this is an ABI change but we'll call it a bug fix I guess. > > > > yes. the same defense here too. its a behaviorial change for the better. > > Single threaded applications will not see any behaviorial change. > > Multithreaded apps, which were unable to consume, the behavior will now be > > able to do so. > > Well threads is one thing, but this also affects processes. > > And actually without this fix it's possible that a child process could > fault on a region protected in the parent, if the value in the AMR in > the thread struct happens to block access at the time of fork(). The > value in the thread struct would be whatever was in the AMR the last > time the parent was scheduled in. I think? right. Child processes will see stale value of AMR. Technically this behavior is a bug, since existing applications; if any, cannot rely on this stale AMR value. RP
Re: [PATCH v2 0/6] powerpc/pkeys: fixes to pkeys
On Thu, Jun 21, 2018 at 08:28:47PM +1000, Michael Ellerman wrote: > Florian Weimer writes: > > > On 06/19/2018 02:40 PM, Michael Ellerman wrote: > >>> I tested the whole series with the new selftests, with the printamr.c > >>> program I posted earlier, and the glibc test for pkey_alloc &c. The > >>> latter required some test fixes, but now passes as well. As far as I > >>> can tell, everything looks good now. > >>> > >>> Tested-By: Florian Weimer > >> Thanks. I'll add that to each patch I guess, if you're happy with that? > > > > Sure, but I only tested the whole series as a whole. > > Yeah OK. We don't have a good way to express that, other than using a > merge which I'd prefer to avoid. > > So I've tagged them all with your Tested-by. If any of them turn out to > have bugs you can blame me :) I just tested the patches incrementally using the pkey selftests. So I feel confident these patches are not bugs. I will take the blame if the blame lands on Mpe :) RP
Re: [PATCH v04 9/9] hotplug/pmt: Update topology after PMT
I posted the wrong copy of the file the first time. That is what broke here. I posted the correction almost immediately to the list. The correct one has Message ID <8c437fe5-632c-a7ed-1f11-66c4578a1...@linux.vnet.ibm.com> Sorry for the inconvenience. Michael On 06/20/2018 09:13 PM, kbuild test robot wrote: > Hi Michael, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on powerpc/next] > [also build test ERROR on v4.18-rc1 next-20180620] > [if your patch is applied to the wrong git tree, please drop us a note to > help improve the system] > > url: > https://github.com/0day-ci/linux/commits/Michael-Bringmann/powerpc-hotplug-Update-affinity-for-migrated-CPUs/20180621-085543 > base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next > config: powerpc-defconfig (attached as .config) > compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 > reproduce: > wget > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O > ~/bin/make.cross > chmod +x ~/bin/make.cross > # save the attached .config to linux build tree > GCC_VERSION=7.2.0 make.cross ARCH=powerpc > > All errors (new ones prefixed by >>): > >arch/powerpc/platforms/pseries/dlpar.c: In function 'dlpar_pmt': >>> arch/powerpc/platforms/pseries/dlpar.c:453:2: error: implicit declaration >>> of function 'rebuild_sched_domains' [-Werror=implicit-function-declaration] > rebuild_sched_domains(); > ^ >cc1: all warnings being treated as errors > > vim +/rebuild_sched_domains +453 arch/powerpc/platforms/pseries/dlpar.c > >435 >436static int dlpar_pmt(struct pseries_hp_errorlog *work) >437{ >438struct list_head *pos, *q; >439 >440ssleep(15); >441 >442list_for_each_safe(pos, q, &dlpar_delayed_list) { >443struct pseries_hp_errorlog *tmp; >444 >445tmp = list_entry(pos, struct > pseries_hp_errorlog, list); >446handle_dlpar_errorlog(tmp); >447 >448list_del(pos); >449kfree(tmp); >450} >451 >452ssleep(5); > > 453rebuild_sched_domains(); >454 >455return 0; >456} >457 > > --- > 0-DAY kernel test infrastructureOpen Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation > -- Michael W. Bringmann Linux Technology Center IBM Corporation Tie-Line 363-5196 External: (512) 286-5196 Cell: (512) 466-0650 m...@linux.vnet.ibm.com
[PATCH 13/26] ppc: Convert mmu context allocation to new IDA API
ida_alloc_range is the perfect fit for this use case. Eliminates a custom spinlock, a call to ida_pre_get and a local check for the allocated ID exceeding a maximum. Signed-off-by: Matthew Wilcox --- arch/powerpc/mm/mmu_context_book3s64.c | 44 +++--- 1 file changed, 4 insertions(+), 40 deletions(-) diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index f3d4b4a0e561..5a0cf2cc8ba0 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -26,48 +26,16 @@ #include #include -static DEFINE_SPINLOCK(mmu_context_lock); static DEFINE_IDA(mmu_context_ida); static int alloc_context_id(int min_id, int max_id) { - int index, err; - -again: - if (!ida_pre_get(&mmu_context_ida, GFP_KERNEL)) - return -ENOMEM; - - spin_lock(&mmu_context_lock); - err = ida_get_new_above(&mmu_context_ida, min_id, &index); - spin_unlock(&mmu_context_lock); - - if (err == -EAGAIN) - goto again; - else if (err) - return err; - - if (index > max_id) { - spin_lock(&mmu_context_lock); - ida_remove(&mmu_context_ida, index); - spin_unlock(&mmu_context_lock); - return -ENOMEM; - } - - return index; + return ida_alloc_range(&mmu_context_ida, min_id, max_id, GFP_KERNEL); } void hash__reserve_context_id(int id) { - int rc, result = 0; - - do { - if (!ida_pre_get(&mmu_context_ida, GFP_KERNEL)) - break; - - spin_lock(&mmu_context_lock); - rc = ida_get_new_above(&mmu_context_ida, id, &result); - spin_unlock(&mmu_context_lock); - } while (rc == -EAGAIN); + int result = ida_alloc_range(&mmu_context_ida, id, id, GFP_KERNEL); WARN(result != id, "mmu: Failed to reserve context id %d (rc %d)\n", id, result); } @@ -172,9 +140,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) void __destroy_context(int context_id) { - spin_lock(&mmu_context_lock); - ida_remove(&mmu_context_ida, context_id); - spin_unlock(&mmu_context_lock); + ida_free(&mmu_context_ida, context_id); } EXPORT_SYMBOL_GPL(__destroy_context); @@ -182,13 +148,11 @@ static void destroy_contexts(mm_context_t *ctx) { int index, context_id; - spin_lock(&mmu_context_lock); for (index = 0; index < ARRAY_SIZE(ctx->extended_id); index++) { context_id = ctx->extended_id[index]; if (context_id) - ida_remove(&mmu_context_ida, context_id); + ida_free(&mmu_context_ida, context_id); } - spin_unlock(&mmu_context_lock); } static void pte_frag_destroy(void *pte_frag) -- 2.17.1
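For readers who have not seen the new interface, a minimal sketch of the pattern being converted to (illustrative only, not part of the patch):

#include <linux/gfp.h>
#include <linux/idr.h>

static DEFINE_IDA(example_ida);

/* Returns an ID in [min, max] or a negative errno; the IDA handles its own
 * locking, so no external spinlock and no ida_pre_get() step is required. */
static int example_alloc(unsigned int min, unsigned int max)
{
	return ida_alloc_range(&example_ida, min, max, GFP_KERNEL);
}

static void example_free(int id)
{
	ida_free(&example_ida, id);
}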
[PATCH 15/26] ppc: Convert vas ID allocation to new IDA API
Removes a custom spinlock and simplifies the code. Signed-off-by: Matthew Wilcox --- arch/powerpc/platforms/powernv/vas-window.c | 26 - 1 file changed, 4 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c index ff9f48812331..2a5e68a2664d 100644 --- a/arch/powerpc/platforms/powernv/vas-window.c +++ b/arch/powerpc/platforms/powernv/vas-window.c @@ -515,35 +515,17 @@ int init_winctx_regs(struct vas_window *window, struct vas_winctx *winctx) return 0; } -static DEFINE_SPINLOCK(vas_ida_lock); - static void vas_release_window_id(struct ida *ida, int winid) { - spin_lock(&vas_ida_lock); - ida_remove(ida, winid); - spin_unlock(&vas_ida_lock); + ida_free(ida, winid); } static int vas_assign_window_id(struct ida *ida) { - int rc, winid; - - do { - rc = ida_pre_get(ida, GFP_KERNEL); - if (!rc) - return -EAGAIN; - - spin_lock(&vas_ida_lock); - rc = ida_get_new(ida, &winid); - spin_unlock(&vas_ida_lock); - } while (rc == -EAGAIN); - - if (rc) - return rc; + int winid = ida_alloc_max(ida, VAX_WINDOWS_PER_CHIP, GFP_KERNEL); - if (winid > VAS_WINDOWS_PER_CHIP) { - pr_err("Too many (%d) open windows\n", winid); - vas_release_window_id(ida, winid); + if (winid == -ENOSPC) { + pr_err("Too many (%d) open windows\n", VAX_WINDOWS_PER_CHIP); return -EAGAIN; } -- 2.17.1
Re: [PATCH] selftests/powerpc: Fix strncpy usage
On Wed, Jun 20, 2018 at 07:51:11PM -0300, Breno Leitao wrote: > - strncpy(prog, argv[0], strlen(argv[0])); > + strncpy(prog, argv[0], sizeof(prog) - 1); strncpy(prog, argv[0], sizeof prog); if (prog[sizeof prog - 1]) scream_bloody_murder(); Silently using the wrong data is a worse habit than not checking for overflows ;-) Segher
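A minimal sketch of the suggested pattern (buffer size is hypothetical, and the failure path is just an example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char prog[64];	/* size chosen purely for illustration */

static void set_prog(const char *argv0)
{
	strncpy(prog, argv0, sizeof(prog));
	if (prog[sizeof(prog) - 1] != '\0') {
		fprintf(stderr, "argv[0] is too long for the buffer\n");
		exit(EXIT_FAILURE);
	}
}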
Re: [PATCH kernel v2 REPOST] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
On Thu, Jun 21, 2018 at 07:36:27PM +1000, Alexey Kardashevskiy wrote: > At the moment we allocate the entire TCE table, twice (hardware part and > userspace translation cache). This normally works as we normally have > contigous memory and the guest will map entire RAM for 64bit DMA. > > However if we have sparse RAM (one example is a memory device), then > we will allocate TCEs which will never be used as the guest only maps > actual memory for DMA. If it is a single level TCE table, there is nothing > we can really do but if it a multilevel table, we can skip allocating > TCEs we know we won't need. > > This adds ability to allocate only first level, saving memory. > > This changes iommu_table::free() to avoid allocating of an extra level; > iommu_table::set() will do this when needed. > > This adds @alloc parameter to iommu_table::exchange() to tell the callback > if it can allocate an extra level; the flag is set to "false" for > the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns > H_TOO_HARD. > > This still requires the entire table to be counted in mm::locked_vm. > > To be conservative, this only does on-demand allocation when > the usespace cache table is requested which is the case of VFIO. > > The example math for a system replicating a powernv setup with NVLink2 > in a guest: > 16GB RAM mapped at 0x0 > 128GB GPU RAM window (16GB of actual RAM) mapped at 0x2440 > > the table to cover that all with 64K pages takes: > (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB > > If we allocate only necessary TCE levels, we will only need: > (((0x4 + 0x4) >> 16)*8)>>20 = 4MB (plus some for indirect > levels). > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson > --- > > > This is what I meant to post few days ago, sorry for the noise. 
> > > --- > Changes: > v2: > * fixed bug in cleanup path which forced the entire table to be > allocated right before destroying > * added memory allocation error handling pnv_tce() > --- > arch/powerpc/include/asm/iommu.h | 7 ++- > arch/powerpc/platforms/powernv/pci.h | 6 ++- > arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +- > arch/powerpc/platforms/powernv/pci-ioda-tce.c | 73 > +-- > arch/powerpc/platforms/powernv/pci-ioda.c | 8 +-- > drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- > 6 files changed, 73 insertions(+), 27 deletions(-) > > diff --git a/arch/powerpc/include/asm/iommu.h > b/arch/powerpc/include/asm/iommu.h > index 4bdcf22..daa3ee5 100644 > --- a/arch/powerpc/include/asm/iommu.h > +++ b/arch/powerpc/include/asm/iommu.h > @@ -70,7 +70,7 @@ struct iommu_table_ops { > unsigned long *hpa, > enum dma_data_direction *direction); > > - __be64 *(*useraddrptr)(struct iommu_table *tbl, long index); > + __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); > #endif > void (*clear)(struct iommu_table *tbl, > long index, long npages); > @@ -122,10 +122,13 @@ struct iommu_table { > __be64 *it_userspace; /* userspace view of the table */ > struct iommu_table_ops *it_ops; > struct krefit_kref; > + int it_nid; > }; > > +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \ > + ((tbl)->it_ops->useraddrptr((tbl), (entry), false)) > #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ > - ((tbl)->it_ops->useraddrptr((tbl), (entry))) > + ((tbl)->it_ops->useraddrptr((tbl), (entry), true)) > > /* Pure 2^n version of get_order */ > static inline __attribute_const__ > diff --git a/arch/powerpc/platforms/powernv/pci.h > b/arch/powerpc/platforms/powernv/pci.h > index 5e02408..1fa5590 100644 > --- a/arch/powerpc/platforms/powernv/pci.h > +++ b/arch/powerpc/platforms/powernv/pci.h > @@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long > index, long npages, > unsigned long attrs); > extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages); > extern int pnv_tce_xchg(struct iommu_table *tbl, long index, > - unsigned long *hpa, enum dma_data_direction *direction); > -extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index); > + unsigned long *hpa, enum dma_data_direction *direction, > + bool alloc); > +extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index, > + bool alloc); > extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index); > > extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset, > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c > b/arch/powerpc/kvm/book3s_64_vio_hv.c > index 8cc1caf..efb90d8 100644 > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c > @@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm > *kvm, > { > struct mm_iomm
Re: [PATCH v2] powerpc/numa: Correct kernel message severity
Vipin K Parashar writes: > printk() in unmap_cpu_from_node() uses KERN_ERR message severity, > for a WARNING message. Change it to pr_warn(). > > Signed-off-by: Vipin K Parashar > --- > arch/powerpc/mm/numa.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index edd8d0b..1632f4b 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -163,8 +163,7 @@ static void unmap_cpu_from_node(unsigned long cpu) > if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) { > cpumask_clear_cpu(cpu, node_to_cpumask_map[node]); > } else { > - printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n", > -cpu, node); > + pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node); > } > } The full function is: static void unmap_cpu_from_node(unsigned long cpu) { int node = numa_cpu_lookup_table[cpu]; dbg("removing cpu %lu from node %d\n", cpu, node); if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) { cpumask_clear_cpu(cpu, node_to_cpumask_map[node]); } else { printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n", cpu, node); } } So we look up what node the CPU is on, and then we lookup the set of CPUs on that node, and they don't match. That seems like a bug, not a warning. Have you looked at why we're seeing this warning? It seems like maybe something else is going wrong to get us into this situation to begin with. If there's some good reason why this is happening, and it's truly harmless, then we can just remove the printk() entirely. cheers
Re: [PATCH kernel] KVM: PPC: Fix hardware and emulated TCE tables matching
On Wed, Jun 20, 2018 at 06:42:58PM +1000, Alexey Kardashevskiy wrote: > When attaching a hardware table to LIOBN in KVM, we match table parameters > such as page size, table offset and table size. However the tables are > created via very different paths - VFIO and KVM - and the VFIO path goes > through the platform code which has minimum TCE page size requirement > (which is 4K but since we allocate memory by pages and cannot avoid > alignment anyway, we align to 64k pages for powernv_defconfig). > > So when we match the tables, one might be bigger that the other which > means the hardware table cannot get attached to LIOBN and DMA mapping > fails. > > This removes the table size alignment from the guest visible table. > This does not affect the memory allocation which is still aligned - > kvmppc_tce_pages() takes care of this. > > This relaxes the check we do when attaching tables to allow the hardware > table be bigger than the guest visible table. > > Ideally we want the KVM table to cover the same space as the hardware > table does but since the hardware table may use multiple levels, and > all levels must use the same table size (IODA2 design), the area it can > actually cover might get very different from the window size which > the guest requested, even though the guest won't map it all. > > Fixes: ca1fc489cf "KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages > with smaller physical pages" > Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson > --- > arch/powerpc/kvm/book3s_64_vio.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c > b/arch/powerpc/kvm/book3s_64_vio.c > index 8c456fa..8167ce8 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -180,7 +180,7 @@ extern long kvm_spapr_tce_attach_iommu_group(struct kvm > *kvm, int tablefd, > if ((tbltmp->it_page_shift <= stt->page_shift) && > (tbltmp->it_offset << tbltmp->it_page_shift == >stt->offset << stt->page_shift) && > - (tbltmp->it_size << tbltmp->it_page_shift == > + (tbltmp->it_size << tbltmp->it_page_shift >= >stt->size << stt->page_shift)) { > /* >* Reference the table to avoid races with > @@ -296,7 +296,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > { > struct kvmppc_spapr_tce_table *stt = NULL; > struct kvmppc_spapr_tce_table *siter; > - unsigned long npages, size; > + unsigned long npages, size = args->size; > int ret = -ENOMEM; > int i; > > @@ -304,7 +304,6 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > (args->offset + args->size > (ULLONG_MAX >> args->page_shift))) > return -EINVAL; > > - size = _ALIGN_UP(args->size, PAGE_SIZE >> 3); > npages = kvmppc_tce_pages(size); > ret = kvmppc_account_memlimit(kvmppc_stt_pages(npages), true); > if (ret) -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [PATCH 13/26] ppc: Convert mmu context allocation to new IDA API
On Thu, 21 Jun 2018 14:28:22 -0700 Matthew Wilcox wrote: > ida_alloc_range is the perfect fit for this use case. Eliminates > a custom spinlock, a call to ida_pre_get and a local check for the > allocated ID exceeding a maximum. > > Signed-off-by: Matthew Wilcox > --- > arch/powerpc/mm/mmu_context_book3s64.c | 44 +++--- > 1 file changed, 4 insertions(+), 40 deletions(-) > > diff --git a/arch/powerpc/mm/mmu_context_book3s64.c > b/arch/powerpc/mm/mmu_context_book3s64.c > index f3d4b4a0e561..5a0cf2cc8ba0 100644 > --- a/arch/powerpc/mm/mmu_context_book3s64.c > +++ b/arch/powerpc/mm/mmu_context_book3s64.c > @@ -26,48 +26,16 @@ > #include > #include > > -static DEFINE_SPINLOCK(mmu_context_lock); > static DEFINE_IDA(mmu_context_ida); > > static int alloc_context_id(int min_id, int max_id) > { > - int index, err; > - > -again: > - if (!ida_pre_get(&mmu_context_ida, GFP_KERNEL)) > - return -ENOMEM; > - > - spin_lock(&mmu_context_lock); > - err = ida_get_new_above(&mmu_context_ida, min_id, &index); > - spin_unlock(&mmu_context_lock); > - > - if (err == -EAGAIN) > - goto again; > - else if (err) > - return err; > - > - if (index > max_id) { > - spin_lock(&mmu_context_lock); > - ida_remove(&mmu_context_ida, index); > - spin_unlock(&mmu_context_lock); > - return -ENOMEM; > - } > - > - return index; > + return ida_alloc_range(&mmu_context_ida, min_id, max_id, GFP_KERNEL); > } > > void hash__reserve_context_id(int id) > { > - int rc, result = 0; > - > - do { > - if (!ida_pre_get(&mmu_context_ida, GFP_KERNEL)) > - break; > - > - spin_lock(&mmu_context_lock); > - rc = ida_get_new_above(&mmu_context_ida, id, &result); > - spin_unlock(&mmu_context_lock); > - } while (rc == -EAGAIN); > + int result = ida_alloc_range(&mmu_context_ida, id, id, GFP_KERNEL); > > WARN(result != id, "mmu: Failed to reserve context id %d (rc %d)\n", > id, result); > } > @@ -172,9 +140,7 @@ int init_new_context(struct task_struct *tsk, struct > mm_struct *mm) > > void __destroy_context(int context_id) > { > - spin_lock(&mmu_context_lock); > - ida_remove(&mmu_context_ida, context_id); > - spin_unlock(&mmu_context_lock); > + ida_free(&mmu_context_ida, context_id); > } > EXPORT_SYMBOL_GPL(__destroy_context); > > @@ -182,13 +148,11 @@ static void destroy_contexts(mm_context_t *ctx) > { > int index, context_id; > > - spin_lock(&mmu_context_lock); > for (index = 0; index < ARRAY_SIZE(ctx->extended_id); index++) { > context_id = ctx->extended_id[index]; > if (context_id) > - ida_remove(&mmu_context_ida, context_id); > + ida_free(&mmu_context_ida, context_id); > } > - spin_unlock(&mmu_context_lock); > } > > static void pte_frag_destroy(void *pte_frag) This hunk should be okay because the mmu_context_lock does not protect the extended_id array, right Aneesh? Reviewed-by: Nicholas Piggin Thanks, Nick
Re: [PATCH 13/26] ppc: Convert mmu context allocation to new IDA API
On Fri, Jun 22, 2018 at 12:15:11PM +1000, Nicholas Piggin wrote: > On Thu, 21 Jun 2018 14:28:22 -0700 > Matthew Wilcox wrote: > > static int alloc_context_id(int min_id, int max_id) ... > > - spin_lock(&mmu_context_lock); > > - err = ida_get_new_above(&mmu_context_ida, min_id, &index); > > - spin_unlock(&mmu_context_lock); ... > > @@ -182,13 +148,11 @@ static void destroy_contexts(mm_context_t *ctx) > > { > > int index, context_id; > > > > - spin_lock(&mmu_context_lock); > > for (index = 0; index < ARRAY_SIZE(ctx->extended_id); index++) { > > context_id = ctx->extended_id[index]; > > if (context_id) > > - ida_remove(&mmu_context_ida, context_id); > > + ida_free(&mmu_context_ida, context_id); > > } > > - spin_unlock(&mmu_context_lock); > > } > > > > static void pte_frag_destroy(void *pte_frag) > > This hunk should be okay because the mmu_context_lock does not protect > the extended_id array, right Aneesh? That's my understanding. The code today does this: static inline int alloc_extended_context(struct mm_struct *mm, unsigned long ea) { int context_id; int index = ea >> MAX_EA_BITS_PER_CONTEXT; context_id = hash__alloc_context_id(); if (context_id < 0) return context_id; VM_WARN_ON(mm->context.extended_id[index]); mm->context.extended_id[index] = context_id; so it's not currently protected by this lock. I suppose we are currently protected from destroy_contexts() being called twice simultaneously, but you'll notice that we don't zero the array elements in destroy_contexts(), so if we somehow had a code path which could call it concurrently, we'd be seeing warnings when the second caller tried to remove the context IDs from the IDA. I deduced that something else must be preventing this situation from occurring (like, oh i don't know, this function only being called on process exit, so implicitly only called once per context). > Reviewed-by: Nicholas Piggin Thanks.
Re: [PATCH 13/26] ppc: Convert mmu context allocation to new IDA API
On Thu, 21 Jun 2018 21:38:15 -0700 Matthew Wilcox wrote: > On Fri, Jun 22, 2018 at 12:15:11PM +1000, Nicholas Piggin wrote: > > On Thu, 21 Jun 2018 14:28:22 -0700 > > Matthew Wilcox wrote: > > > static int alloc_context_id(int min_id, int max_id) > ... > > > - spin_lock(&mmu_context_lock); > > > - err = ida_get_new_above(&mmu_context_ida, min_id, &index); > > > - spin_unlock(&mmu_context_lock); > ... > > > @@ -182,13 +148,11 @@ static void destroy_contexts(mm_context_t *ctx) > > > { > > > int index, context_id; > > > > > > - spin_lock(&mmu_context_lock); > > > for (index = 0; index < ARRAY_SIZE(ctx->extended_id); index++) { > > > context_id = ctx->extended_id[index]; > > > if (context_id) > > > - ida_remove(&mmu_context_ida, context_id); > > > + ida_free(&mmu_context_ida, context_id); > > > } > > > - spin_unlock(&mmu_context_lock); > > > } > > > > > > static void pte_frag_destroy(void *pte_frag) > > > > This hunk should be okay because the mmu_context_lock does not protect > > the extended_id array, right Aneesh? > > That's my understanding. The code today does this: > > static inline int alloc_extended_context(struct mm_struct *mm, > unsigned long ea) > { > int context_id; > > int index = ea >> MAX_EA_BITS_PER_CONTEXT; > > context_id = hash__alloc_context_id(); > if (context_id < 0) > return context_id; > > VM_WARN_ON(mm->context.extended_id[index]); > mm->context.extended_id[index] = context_id; > > so it's not currently protected by this lock. I suppose we are currently > protected from destroy_contexts() being called twice simultaneously, but > you'll notice that we don't zero the array elements in destroy_contexts(), > so if we somehow had a code path which could call it concurrently, we'd > be seeing warnings when the second caller tried to remove the context Yeah that'd be an existing bug. > IDs from the IDA. I deduced that something else must be preventing > this situation from occurring (like, oh i don't know, this function only > being called on process exit, so implicitly only called once per context). I think that's exactly right. Thanks, Nick
Re: [PATCH 2/3] [v2] m68k: mac: use time64_t in RTC handling
On Tue, 19 Jun 2018, Arnd Bergmann wrote: > The real-time clock on m68k (and powerpc) mac systems uses an unsigned > 32-bit value starting in 1904, which overflows in 2040, about two years > later than everyone else, but this gets wrapped around in the Linux code > in 2038 already because of the deprecated usage of time_t and/or long in > the conversion. > > Getting rid of the deprecated interfaces makes it work until 2040 as > documented, and it could be easily extended by reinterpreting the > resulting time64_t as a positive number. For the moment, I'm adding a > WARN_ON() that triggers if we encounter a time before 1970 or after 2040 > (the two are indistinguishable). > I really don't like the WARN_ON(), but I'd prefer to address that in a separate patch rather than impede the progress of this patch (or of this series, since 3/3 seems to be unrelated). BTW, have you considered using the same wrap-around test (i.e. YY < 70) that we use for the year register in the other RTC chips? > This brings it in line with the corresponding code that we have on > powerpc macintosh. > Your recent patches to the Mac RTC routines (which are duplicated under arch/m68k and arch/powerpc) conflict with my recent patch that deduplicates the same code. So I will rebase and resubmit after someone merges these fixes. Apparently the PowerMac routines work now, which is sufficient testing for me; the PowerMac routines will get tested on m68k Macs when that code gets deduplicated again. BTW, Joshua tells me that he is not doing code review. We should probably drop the "M68K ON APPLE MACINTOSH" entry from the MAINTAINERS file, like the Amiga and Atari ports... -- > Signed-off-by: Arnd Bergmann > --- > v2: Fix varargs passing bug pointed out by Andreas Schwab > Fix a typo that caused a build regression > --- > arch/m68k/mac/misc.c | 62 > +--- > 1 file changed, 39 insertions(+), 23 deletions(-) > > diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c > index c68054361615..0a2572a6bfe5 100644 > --- a/arch/m68k/mac/misc.c > +++ b/arch/m68k/mac/misc.c > @@ -26,33 +26,39 @@ > > #include > > -/* Offset between Unix time (1970-based) and Mac time (1904-based) */ > +/* Offset between Unix time (1970-based) and Mac time (1904-based). Cuda and > PMU > + * times wrap in 2040. If we need to handle later times, the read_time > functions > + * need to be changed to interpret wrapped times as post-2040. 
*/ > > #define RTC_OFFSET 2082844800 > > static void (*rom_reset)(void); > > #ifdef CONFIG_ADB_CUDA > -static long cuda_read_time(void) > +static time64_t cuda_read_time(void) > { > struct adb_request req; > - long time; > + time64_t time; > > if (cuda_request(&req, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0) > return 0; > while (!req.complete) > cuda_poll(); > > - time = (req.reply[3] << 24) | (req.reply[4] << 16) | > -(req.reply[5] << 8) | req.reply[6]; > + time = (u32)((req.reply[3] << 24) | (req.reply[4] << 16) | > + (req.reply[5] << 8) | req.reply[6]); > + > + /* it's either after year 2040, or the RTC has gone backwards */ > + WARN_ON(time < RTC_OFFSET); > + > return time - RTC_OFFSET; > } > > -static void cuda_write_time(long data) > +static void cuda_write_time(time64_t time) > { > struct adb_request req; > + u32 data = lower_32_bits(time + RTC_OFFSET); > > - data += RTC_OFFSET; > if (cuda_request(&req, NULL, 6, CUDA_PACKET, CUDA_SET_TIME, >(data >> 24) & 0xFF, (data >> 16) & 0xFF, >(data >> 8) & 0xFF, data & 0xFF) < 0) > @@ -86,26 +92,30 @@ static void cuda_write_pram(int offset, __u8 data) > #endif /* CONFIG_ADB_CUDA */ > > #ifdef CONFIG_ADB_PMU68K > -static long pmu_read_time(void) > +static time64_t pmu_read_time(void) > { > struct adb_request req; > - long time; > + time64_t time; > > if (pmu_request(&req, NULL, 1, PMU_READ_RTC) < 0) > return 0; > while (!req.complete) > pmu_poll(); > > - time = (req.reply[1] << 24) | (req.reply[2] << 16) | > -(req.reply[3] << 8) | req.reply[4]; > + time = (u32)((req.reply[1] << 24) | (req.reply[2] << 16) | > + (req.reply[3] << 8) | req.reply[4]); > + > + /* it's either after year 2040, or the RTC has gone backwards */ > + WARN_ON(time < RTC_OFFSET); > + > return time - RTC_OFFSET; > } > > -static void pmu_write_time(long data) > +static void pmu_write_time(time64_t time) > { > struct adb_request req; > + u32 data = lower_32_bits(time + RTC_OFFSET); > > - data += RTC_OFFSET; > if (pmu_request(&req, NULL, 5, PMU_SET_RTC, > (data >> 24) & 0xFF, (data >> 16) & 0xFF, > (data >> 8) & 0xFF, data & 0xFF) < 0) > @@ -269,8 +279,12 @@ static lon
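A minimal standalone sketch of the epoch arithmetic the conversion relies on (not the kernel code; the casts stand in for lower_32_bits() and the names are illustrative): the RTC counts seconds from 1904-01-01 in an unsigned 32-bit register, so Unix time is the value minus RTC_OFFSET computed in 64 bits, and anything below RTC_OFFSET is either pre-1970 or a post-2040 wrap, which is what the new WARN_ON() flags.

#include <stdint.h>
#include <stdio.h>

/* Seconds between 1904-01-01 (Mac epoch) and 1970-01-01 (Unix epoch). */
#define RTC_OFFSET 2082844800ULL

/* Mac RTC value -> Unix time; values below RTC_OFFSET are pre-1970 (or a post-2040 wrap). */
static int64_t mac_to_unix(uint32_t mac_seconds)
{
        return (int64_t)mac_seconds - (int64_t)RTC_OFFSET;
}

/* Unix time -> Mac RTC value; the truncating cast mirrors lower_32_bits() in the patch. */
static uint32_t unix_to_mac(int64_t unix_seconds)
{
        return (uint32_t)(unix_seconds + (int64_t)RTC_OFFSET);
}

int main(void)
{
        uint32_t rtc = 0xD0000000;      /* an arbitrary example RTC reading */

        printf("unix=%lld rtc=%u\n",
               (long long)mac_to_unix(rtc), unix_to_mac(mac_to_unix(rtc)));
        return 0;
}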
Re: [PATCH 2/3] migration/memory: Evaluate LMB assoc changes
Hi Michael, Thank you for the patch! Yet something to improve: [auto build test ERROR on powerpc/next] [also build test ERROR on v4.18-rc1 next-20180621] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Michael-Bringmann/powerpc-migration-Affinity-fix-for-memory/20180621-090151 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next config: powerpc-allmodconfig (attached as .config) compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.2.0 make.cross ARCH=powerpc All errors (new ones prefixed by >>): arch/powerpc/platforms/pseries/hotplug-memory.c: In function 'pseries_update_drconf_memory': >> arch/powerpc/platforms/pseries/hotplug-memory.c:1032:4: error: implicit >> declaration of function 'dlpar_queue_action'; did you mean >> 'waitqueue_active'? [-Werror=implicit-function-declaration] dlpar_queue_action( ^~ waitqueue_active cc1: some warnings being treated as errors vim +1032 arch/powerpc/platforms/pseries/hotplug-memory.c 999 1000 static int pseries_update_drconf_memory(struct drmem_lmb_info *new_dinfo) 1001 { 1002 struct drmem_lmb *old_lmb, *new_lmb; 1003 unsigned long memblock_size; 1004 int rc = 0; 1005 1006 if (rtas_hp_event) 1007 return 0; 1008 1009 memblock_size = pseries_memory_block_size(); 1010 if (!memblock_size) 1011 return -EINVAL; 1012 1013 /* Arrays should have the same size and DRC indexes */ 1014 for_each_pair_dinfo_lmb(drmem_info, old_lmb, new_dinfo, new_lmb) { 1015 1016 if (new_lmb->drc_index != old_lmb->drc_index) 1017 continue; 1018 1019 if ((old_lmb->flags & DRCONF_MEM_ASSIGNED) && 1020 (!(new_lmb->flags & DRCONF_MEM_ASSIGNED))) { 1021 rc = pseries_remove_memblock( 1022 old_lmb->base_addr, memblock_size); 1023 break; 1024 } else if ((!(old_lmb->flags & DRCONF_MEM_ASSIGNED)) && 1025 (new_lmb->flags & DRCONF_MEM_ASSIGNED)) { 1026 rc = memblock_add(old_lmb->base_addr, 1027 memblock_size); 1028 rc = (rc < 0) ? -EINVAL : 0; 1029 break; 1030 } else if ((old_lmb->aa_index != new_lmb->aa_index) && 1031 (new_lmb->flags & DRCONF_MEM_ASSIGNED)) { > 1032 dlpar_queue_action( 1033 PSERIES_HP_ELOG_RESOURCE_MEM, 1034 PSERIES_HP_ELOG_ACTION_READD, 1035 new_lmb->drc_index); 1036 } 1037 } 1038 return rc; 1039 } 1040 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
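For readers unfamiliar with this failure mode: -Werror=implicit-function-declaration means hotplug-memory.c calls dlpar_queue_action() without a prototype being visible in this configuration. A hedged sketch of the general shape of a fix for this class of error follows; the header, config symbol, and function names are hypothetical, not the actual change this series needs.

/* pseries_demo.h -- illustrative only */
#ifndef PSERIES_DEMO_H
#define PSERIES_DEMO_H

#ifdef CONFIG_PSERIES_DEMO_DLPAR
/* Real declaration, visible to every translation unit that calls it. */
int demo_queue_action(int resource, int action, unsigned int drc_index);
#else
/* Stub for configurations that compile the caller but not the provider. */
static inline int demo_queue_action(int resource, int action,
                                    unsigned int drc_index)
{
        return 0;
}
#endif

#endif /* PSERIES_DEMO_H */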
Re: [PATCH 13/26] ppc: Convert mmu context allocation to new IDA API
Nicholas Piggin writes: > On Thu, 21 Jun 2018 14:28:22 -0700 > Matthew Wilcox wrote: > >> ida_alloc_range is the perfect fit for this use case. Eliminates >> a custom spinlock, a call to ida_pre_get and a local check for the >> allocated ID exceeding a maximum. >> >> Signed-off-by: Matthew Wilcox >> --- >> arch/powerpc/mm/mmu_context_book3s64.c | 44 +++--- >> 1 file changed, 4 insertions(+), 40 deletions(-) >> >> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c >> b/arch/powerpc/mm/mmu_context_book3s64.c >> index f3d4b4a0e561..5a0cf2cc8ba0 100644 >> --- a/arch/powerpc/mm/mmu_context_book3s64.c >> +++ b/arch/powerpc/mm/mmu_context_book3s64.c >> @@ -26,48 +26,16 @@ >> #include >> #include >> >> -static DEFINE_SPINLOCK(mmu_context_lock); >> static DEFINE_IDA(mmu_context_ida); >> >> static int alloc_context_id(int min_id, int max_id) >> { >> -int index, err; >> - >> -again: >> -if (!ida_pre_get(&mmu_context_ida, GFP_KERNEL)) >> -return -ENOMEM; >> - >> -spin_lock(&mmu_context_lock); >> -err = ida_get_new_above(&mmu_context_ida, min_id, &index); >> -spin_unlock(&mmu_context_lock); >> - >> -if (err == -EAGAIN) >> -goto again; >> -else if (err) >> -return err; >> - >> -if (index > max_id) { >> -spin_lock(&mmu_context_lock); >> -ida_remove(&mmu_context_ida, index); >> -spin_unlock(&mmu_context_lock); >> -return -ENOMEM; >> -} >> - >> -return index; >> +return ida_alloc_range(&mmu_context_ida, min_id, max_id, GFP_KERNEL); >> } >> >> void hash__reserve_context_id(int id) >> { >> -int rc, result = 0; >> - >> -do { >> -if (!ida_pre_get(&mmu_context_ida, GFP_KERNEL)) >> -break; >> - >> -spin_lock(&mmu_context_lock); >> -rc = ida_get_new_above(&mmu_context_ida, id, &result); >> -spin_unlock(&mmu_context_lock); >> -} while (rc == -EAGAIN); >> +int result = ida_alloc_range(&mmu_context_ida, id, id, GFP_KERNEL); >> >> WARN(result != id, "mmu: Failed to reserve context id %d (rc %d)\n", >> id, result); >> } >> @@ -172,9 +140,7 @@ int init_new_context(struct task_struct *tsk, struct >> mm_struct *mm) >> >> void __destroy_context(int context_id) >> { >> -spin_lock(&mmu_context_lock); >> -ida_remove(&mmu_context_ida, context_id); >> -spin_unlock(&mmu_context_lock); >> +ida_free(&mmu_context_ida, context_id); >> } >> EXPORT_SYMBOL_GPL(__destroy_context); >> >> @@ -182,13 +148,11 @@ static void destroy_contexts(mm_context_t *ctx) >> { >> int index, context_id; >> >> -spin_lock(&mmu_context_lock); >> for (index = 0; index < ARRAY_SIZE(ctx->extended_id); index++) { >> context_id = ctx->extended_id[index]; >> if (context_id) >> -ida_remove(&mmu_context_ida, context_id); >> +ida_free(&mmu_context_ida, context_id); >> } >> -spin_unlock(&mmu_context_lock); >> } >> >> static void pte_frag_destroy(void *pte_frag) > > This hunk should be okay because the mmu_context_lock does not protect > the extended_id array, right Aneesh? Yes. This is called at process exit, so we should not find parallel calls. On the allocation side, we are protected by mmap_sem. We do allocate extended_id when doing mmap. -aneesh
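For reference, a minimal sketch of the calling convention the conversion above adopts (a toy module with illustrative names, not the patch itself): ida_alloc_range() and ida_free() do their own locking, which is why the external mmu_context_lock and the ida_pre_get() retry loops can go away.

#include <linux/gfp.h>
#include <linux/idr.h>
#include <linux/module.h>
#include <linux/printk.h>

static DEFINE_IDA(demo_ida);

static int demo_alloc_id(int min_id, int max_id)
{
        /*
         * Returns the lowest free ID in [min_id, max_id], or a negative
         * errno; no ida_pre_get()/retry loop and no external spinlock
         * are needed.
         */
        return ida_alloc_range(&demo_ida, min_id, max_id, GFP_KERNEL);
}

static void demo_free_id(int id)
{
        ida_free(&demo_ida, id);
}

static int __init demo_init(void)
{
        int id = demo_alloc_id(1, 0x7fffffff);

        if (id < 0)
                return id;
        pr_info("demo_ida: got id %d\n", id);
        demo_free_id(id);
        return 0;
}
module_init(demo_init);
MODULE_LICENSE("GPL");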
Re: [PATCH 13/26] ppc: Convert mmu context allocation to new IDA API
Matthew Wilcox writes: > this situation from occurring (like, oh i don't know, this function only > being called on process exit, so implicitly only called once per context). > That is correct.
[PATCH 2/2] powerpc: Document issues with TM on POWER9
Signed-off-by: Michael Neuling --- .../powerpc/transactional_memory.txt | 44 +++ 1 file changed, 44 insertions(+) diff --git a/Documentation/powerpc/transactional_memory.txt b/Documentation/powerpc/transactional_memory.txt index e32fdbb4c9..b254eab517 100644 --- a/Documentation/powerpc/transactional_memory.txt +++ b/Documentation/powerpc/transactional_memory.txt @@ -198,3 +198,47 @@ presented). The transaction cannot then be continued and will take the failure handler route. Furthermore, the transactional 2nd register state will be inaccessible. GDB can currently be used on programs using TM, but not sensibly in parts within transactions. + +POWER9 +== + +TM on POWER9 has issues with storing the complete register state. This +is described in this commit: + +commit 4bb3c7a0208fc13ca70598efd109901a7cd45ae7 +Author: Paul Mackerras +Date: Wed Mar 21 21:32:01 2018 +1100 +KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 + +To account for this, different POWER9 chips have TM enabled in +different ways. + +On POWER9N DD2.01 and below, TM is disabled. ie +HWCAP2[PPC_FEATURE2_HTM] is not set. + +On POWER9N DD2.1 TM is configured by firmware to always abort a +transaction when TM suspend occurs. So tsuspend will cause a +transaction to be aborted and rolled back. Kernel exceptions will also +cause the transaction to be aborted and rolled back and the exception +will not occur. If userspace constructs a sigcontext that enables TM +suspend, the sigcontext will be rejected by the kernel. This mode is +advertised to users with HWCAP2[PPC_FEATURE2_HTM_NO_SUSPEND] set. +HWCAP2[PPC_FEATURE2_HTM] is not set in this mode. + +On POWER9N DD2.2 and above, KVM and POWERVM emulate TM for guests (as +described in commit 4bb3c7a0208f), hence TM is enabled for guests +ie. HWCAP2[PPC_FEATURE2_HTM] is set for guest userspace. Guests that +make heavy use of TM suspend (tsuspend or kernel suspend) will result +in traps into the hypervisor and hence will suffer a performance +degradation. Host userspace has TM disabled +ie. HWCAP2[PPC_FEATURE2_HTM] is not set. (although we may enable it +at some point in the future if we bring the emulation into host +userspace context switching). + +POWER9C DD1.2 and above are only available with POWERVM and hence +Linux only runs as a guest. On these systems TM is emulated like on +POWER9N DD2.2. + +Guest migration from POWER8 to POWER9 will work with POWER9N DD2.2 and +POWER9C DD1.2. Since earlier POWER9 processors don't support TM +emulation, migration from POWER8 to POWER9 is not supported there. -- 2.17.1
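To connect the modes above to something observable, a hedged userspace sketch of checking the advertised bits follows (assumes a powerpc toolchain with the kernel uapi headers installed; illustrative only, not part of the patch):

#include <stdio.h>
#include <sys/auxv.h>
#include <asm/cputable.h>       /* PPC_FEATURE2_HTM, PPC_FEATURE2_HTM_NO_SUSPEND */

int main(void)
{
        unsigned long hwcap2 = getauxval(AT_HWCAP2);

        if (hwcap2 & PPC_FEATURE2_HTM)
                printf("TM advertised (POWER8, or a POWER9 guest with TM emulation)\n");
        else if (hwcap2 & PPC_FEATURE2_HTM_NO_SUSPEND)
                printf("TM present, but tsuspend aborts the transaction\n");
        else
                printf("TM not advertised\n");
        return 0;
}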
[PATCH 1/2] powerpc: Document issues with the DAWR on POWER9
Signed-off-by: Michael Neuling --- Documentation/powerpc/DAWR-POWER9.txt | 58 +++ 1 file changed, 58 insertions(+) create mode 100644 Documentation/powerpc/DAWR-POWER9.txt diff --git a/Documentation/powerpc/DAWR-POWER9.txt b/Documentation/powerpc/DAWR-POWER9.txt new file mode 100644 index 00..cc6e69b69b --- /dev/null +++ b/Documentation/powerpc/DAWR-POWER9.txt @@ -0,0 +1,58 @@ +DAWR issues on POWER9 + + +On POWER9 the DAWR can cause a checkstop if it points to cache +inhibited (CI) memory. Currently Linux has no way to distinguish CI +memory when configuring the DAWR, so (for now) the DAWR is disabled by +this commit: + +commit 9654153158d3e0684a1bdb76dbababdb7111d5a0 +Author: Michael Neuling +Date: Tue Mar 27 15:37:24 2018 +1100 +powerpc: Disable DAWR in the base POWER9 CPU features + +Technical Details: + + +DAWR has 5 different ways of being set. +1) ptrace +2) h_set_mode(DAWR) +3) h_set_dabr() +4) kvmppc_set_one_reg() +5) xmon + +For ptrace, we now advertise zero breakpoints on POWER9 via the +PPC_PTRACE_GETHWDBGINFO call. This results in GDB falling back to +software emulation of the watchpoint (which is slow). + +h_set_mode(DAWR) and h_set_dabr() will now return an error to the +guest on a POWER9 host. Current Linux guests ignore this error, so +they will silently not get the DAWR. + +kvmppc_set_one_reg() will store the value in the vcpu but won't +actually set it on POWER9 hardware. This is done so we don't break +migration from POWER8 to POWER9, at the cost of silently losing the +DAWR on the migration. + +For xmon, the 'bd' command will return an error on P9. + +Consequences for users + + +For GDB watchpoints (ie 'watch' command) on POWER9 bare metal, GDB +will accept the command. Unfortunately, since there is no hardware +support for the watchpoint, GDB will emulate the watchpoint in software, +making it run very slowly. + +The same will also be true for any guests started on a POWER9 +host. The watchpoint will fail and GDB will fall back to software +emulation. + +If a guest is started on a POWER8 host, GDB will accept the watchpoint +and configure the hardware to use the DAWR. This will run at full +speed since it can use the DAWR hardware. Unfortunately, if this +guest is migrated to a POWER9 host, the watchpoint will be lost on the +POWER9. Loads and stores to the watchpoint locations will not be +trapped in GDB. The watchpoint is remembered, so if the guest is +migrated back to the POWER8 host, it will start working again. + -- 2.17.1
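As a concrete illustration of the ptrace behaviour described above, here is a hedged sketch of how a debugger asks for the number of hardware data breakpoints via PPC_PTRACE_GETHWDBGINFO (powerpc-only, error handling trimmed; on a POWER9 host this is expected to report zero, which is what pushes GDB into software emulation):

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <asm/ptrace.h>         /* struct ppc_debug_info, PPC_PTRACE_GETHWDBGINFO */

int main(void)
{
        struct ppc_debug_info info;
        pid_t child = fork();

        if (child == 0) {
                /* Child: let the parent trace us, then stop until inspected. */
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                raise(SIGSTOP);
                _exit(0);
        }

        waitpid(child, NULL, 0);

        /* Ask the kernel how many hardware data breakpoints are usable. */
        if (ptrace(PPC_PTRACE_GETHWDBGINFO, child, NULL, &info) == 0)
                printf("hardware data breakpoints: %u\n", info.num_data_bps);

        kill(child, SIGKILL);
        return 0;
}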