Re: [PATCH kernel] powerpc/powernv/ioda2: Update iommu table base on ownership change
On 02/20/2017 11:41 PM, Alexey Kardashevskiy wrote: Cc: Gavin Shan Signed-off-by: Alexey Kardashevskiy Tested-by: Mauricio Faria de Oliveira P.S.: sorry, late, but for the record. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: 1M hugepage size being registered on Linux
On 06/21/2017 07:33 AM, Michael Ellerman wrote: I am working on a bug related to 1M hugepage size being registered on Linux (Power 8 Baremetal - Garrison). Wasn't that caused by a firmware bug? Ben/Stewart, does that ring a bell, something new, intended or not? :- ) Thanks, I was checking dmesg and it seems that 1M page size is coming from firmware to Linux. [0.00] base_shift=20: shift=20, sllp=0x0130, avpnm=0x, tlbiel=0, penc=2 [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0 pages Should Linux support this page size? Does it work?:) The user manual says it's a supported size, but I thought it didn't work (in hardware) for some reason. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH] drivers: of: increase MAX_RESERVED_REGIONS to 32
On 09/26/2017 05:40 AM, Stewart Smith wrote: The simple fix is to bump the length of the array to 32 which "should be enough for everyone(TM)". Tested-by: Mauricio Faria de Oliveira # uname -r 4.14.0-rc3 # dmesg [0.00] opal: OPAL detected ! [0.00] crashkernel: memory value expected [0.00] OF: reserved mem: not enough space all defined regions. [0.00] OF: reserved mem: not enough space all defined regions. [0.00] OF: reserved mem: not enough space all defined regions. [0.00] OF: reserved mem: not enough space all defined regions. [0.00] Allocated 2883584 bytes for 2048 pacas at cfd4 <...> # uname -r 4.14.0-rc3.of32maxrsvdmemregions # dmesg | head [0.00] opal: OPAL detected ! [0.00] crashkernel: memory value expected [0.00] Allocated 2883584 bytes for 2048 pacas at cfd4 <...> -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: Kernel 4.15 lost set_robust_list support on POWER 9
Nick, Michael, On 02/05/2018 10:48 AM, Florian Weimer wrote: 7041 set_robust_list(0x7fff93dc3980, 24) = -1 ENOSYS (Function not implemented) The regression was introduced by commit 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table"). The problem is Radix MMU specific (does not occur with 'disable_radix'), and does not occur with that code reverted (ie do not set PIDR to zero). Do you see any reasons why? (wondering if at all related to access_ok() in include/asm/uaccess.h) with: # strace -e set_robust_list -f ./test set_robust_list(0x7fffa4b03910, 24) = -1 ENOSYS (Function not implemented) +++ exited with 1 +++ # uname -r 4.15.0 without: # strace -e set_robust_list -f ./test set_robust_list(0x7fff889c3910, 24) = 0 +++ exited with 0 +++ # uname -r 4.15.0.nopidr
Re: Kernel 4.15 lost set_robust_list support on POWER 9
On 02/05/2018 11:06 PM, Nicholas Piggin wrote: Does this help? powerpc/64s/radix: allocate guard-PID for kernel contexts at boot Yes, the test-case passes: # strace -e set_robust_list -f ./test set_robust_list(0x7fff8d453910, 24) = 0 +++ exited with 0 +++ # uname -r 4.15.0.guardpid Thanks! cheers, Mauricio
[PATCH 1/3] rfi-flush: Move the logic to avoid a redo into the debugfs code
From: Michael Ellerman rfi_flush_enable() includes a check to see if we're already enabled (or disabled), and in that case does nothing. But that means calling setup_rfi_flush() a 2nd time doesn't actually work, which is a bit confusing. Move that check into the debugfs code, where it really belongs. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/setup_64.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index c388cc3..3efc01a 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -846,9 +846,6 @@ static void do_nothing(void *unused) void rfi_flush_enable(bool enable) { - if (rfi_flush == enable) - return; - if (enable) { do_rfi_flush_fixups(enabled_flush_types); on_each_cpu(do_nothing, NULL, 1); @@ -902,13 +899,19 @@ void __init setup_rfi_flush(enum l1d_flush_type types, bool enable) #ifdef CONFIG_DEBUG_FS static int rfi_flush_set(void *data, u64 val) { + bool enable; + if (val == 1) - rfi_flush_enable(true); + enable = true; else if (val == 0) - rfi_flush_enable(false); + enable = false; else return -EINVAL; + /* Only do anything if we're changing state */ + if (enable != rfi_flush) + rfi_flush_enable(enable); + return 0; } -- 2.7.4
[PATCH 0/3] Setup RFI flush after PowerVM LPM migration
This patchset allows for setup_rfi_flush() to be called again after PowerVM LPM (live partition mobility) aka LPAR migration. It's originally written by Michael Ellerman; I just rebased it on top of powerpc/merge as of 2018-02-14 BRST (commit 3f6f556). I tested it with a debug patch (sending shortly) to shortcut the LPM functions migration_store() and post_mobility_fixup() to just reach pseries_setup_rfi_flush(), and abuse the boot option 'no_rfi_flush' not to allocate the fallback flush area at boot (to simulate the migration from patched to unpatched system). Testing: --- Fallback flush area allocated at boot time: # dmesg | grep rfi-flush [0.00] rfi-flush: Using fallback displacement flush [0.00] rfi-flush: patched 8 locations # echo > /sys/kernel/mobility/migration # dmesg | grep rfi-flush [0.00] rfi-flush: Using fallback displacement flush [0.00] rfi-flush: patched 8 locations [ 51.793238] rfi-flush: Using fallback displacement flush [ 51.793258] rfi-flush: patched 8 locations Fallback flush area NOT allocated at boot time: # grep -o no_rfi_flush /proc/cmdline no_rfi_flush # dmesg | grep rfi-flush [0.00] rfi-flush: disabled on command line. [0.00] rfi-flush: re-enabled # echo > /sys/kernel/mobility/migration # dmesg | grep rfi-flush [0.00] rfi-flush: disabled on command line. [0.00] rfi-flush: re-enabled [ 31.817921] rfi-flush: Error unable to use fallback displacement flush! [ 31.817927] rfi-flush: patched 8 locations Michael Ellerman (3): rfi-flush: Move the logic to avoid a redo into the debugfs code rfi-flush: Make it possible to call setup_rfi_flush() again rfi-flush: Call setup_rfi_flush() after LPM migration arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c| 38 +++ arch/powerpc/platforms/pseries/mobility.c | 3 +++ arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c| 2 +- 5 files changed, 36 insertions(+), 11 deletions(-) -- 2.7.4
[PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
From: Michael Ellerman For PowerVM migration we want to be able to call setup_rfi_flush() again after we've migrated the partition. To support that we need to check that we're not trying to allocate the fallback flush area after memblock has gone away. If so we just fail, we don't support migrating from a patched to an unpatched machine. Or we do support it, but there will be no RFI flush enabled on the destination. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c | 25 + 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h index 469b7fd..bbcdf929 100644 --- a/arch/powerpc/include/asm/setup.h +++ b/arch/powerpc/include/asm/setup.h @@ -49,7 +49,7 @@ enum l1d_flush_type { L1D_FLUSH_MTTRIG= 0x8, }; -void __init setup_rfi_flush(enum l1d_flush_type, bool enable); +void setup_rfi_flush(enum l1d_flush_type, bool enable); void do_rfi_flush_fixups(enum l1d_flush_type types); #endif /* !__ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 3efc01a..d692f71 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -855,11 +855,22 @@ void rfi_flush_enable(bool enable) rfi_flush = enable; } -static void init_fallback_flush(void) +static bool init_fallback_flush(void) { u64 l1d_size, limit; int cpu; + if (l1d_flush_fallback_area) + return true; + + /* +* Once the slab allocator is up it's too late to allocate the fallback +* flush area, so return an error. This could happen if we migrated from +* a patched machine to an unpatched machine. +*/ + if (slab_is_available()) + return false; + l1d_size = ppc64_caches.l1d.size; limit = min(ppc64_bolted_size(), ppc64_rma_size); @@ -875,13 +886,19 @@ static void init_fallback_flush(void) paca[cpu].rfi_flush_fallback_area = l1d_flush_fallback_area; paca[cpu].l1d_flush_size = l1d_size; } + + return true; } -void __init setup_rfi_flush(enum l1d_flush_type types, bool enable) +void setup_rfi_flush(enum l1d_flush_type types, bool enable) { if (types & L1D_FLUSH_FALLBACK) { - pr_info("rfi-flush: Using fallback displacement flush\n"); - init_fallback_flush(); + if (init_fallback_flush()) + pr_info("rfi-flush: Using fallback displacement flush\n"); + else { + pr_warn("rfi-flush: Error unable to use fallback displacement flush!\n"); + types &= ~L1D_FLUSH_FALLBACK; + } } if (types & L1D_FLUSH_ORI) -- 2.7.4
[PATCH 3/3] rfi-flush: Call setup_rfi_flush() after LPM migration
From: Michael Ellerman We might have migrated to a machine that uses a different flush type, or doesn't need flushing at all. If we migrate to a machine with no flush support, ie. that would use fallback, we just print an error and switch flushing off. We could support that, but it would complicate the implementation of the fallback flush, and we don't expect anyone will ever do it. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/platforms/pseries/mobility.c | 3 +++ arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c| 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 0f7fb71..8a8033a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -348,6 +348,9 @@ void post_mobility_fixup(void) printk(KERN_ERR "Post-mobility device tree update " "failed: %d\n", rc); + /* Possibly switch to a new RFI flush type */ + pseries_setup_rfi_flush(); + return; } diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index 1ae1d9f..27cdcb6 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -100,4 +100,6 @@ static inline unsigned long cmo_get_page_size(void) int dlpar_workqueue_init(void); +void pseries_setup_rfi_flush(void); + #endif /* _PSERIES_PSERIES_H */ diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 372d7ad..dad8197 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -459,7 +459,7 @@ static void __init find_and_init_phbs(void) of_pci_check_probe_only(); } -static void pseries_setup_rfi_flush(void) +void pseries_setup_rfi_flush(void) { struct h_cpu_char_result result; enum l1d_flush_type types; -- 2.7.4
[PATCH] DEBUG: shortcut mobility fixup/migration store, and abuse no_rfi_flush
Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/setup_64.c| 8 arch/powerpc/platforms/pseries/mobility.c | 4 2 files changed, 12 insertions(+) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index d692f71..a05b9f4 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -892,6 +892,9 @@ static bool init_fallback_flush(void) void setup_rfi_flush(enum l1d_flush_type types, bool enable) { + if (no_rfi_flush) + types = L1D_FLUSH_NONE; + if (types & L1D_FLUSH_FALLBACK) { if (init_fallback_flush()) pr_info("rfi-flush: Using fallback displacement flush\n"); @@ -911,6 +914,11 @@ void setup_rfi_flush(enum l1d_flush_type types, bool enable) if (!no_rfi_flush) rfi_flush_enable(enable); + + if (no_rfi_flush) { + pr_info("rfi-flush: re-enabled\n"); + no_rfi_flush = 0; + } } #ifdef CONFIG_DEBUG_FS diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 8a8033a..201710e 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -326,6 +326,7 @@ int pseries_devicetree_update(s32 scope) void post_mobility_fixup(void) { +#if 0 int rc; int activate_fw_token; @@ -347,6 +348,7 @@ void post_mobility_fixup(void) if (rc) printk(KERN_ERR "Post-mobility device tree update " "failed: %d\n", rc); +#endif /* Possibly switch to a new RFI flush type */ pseries_setup_rfi_flush(); @@ -358,6 +360,7 @@ static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) { +#if 0 u64 streamid; int rc; @@ -373,6 +376,7 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; +#endif post_mobility_fixup(); return count; -- 2.7.4
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
Hi Michal and Michael, On 02/15/2018 05:13 AM, Michal Suchánek wrote: From: Michael Ellerman For PowerVM migration we want to be able to call setup_rfi_flush() again after we've migrated the partition. To support that we need to check that we're not trying to allocate the fallback flush area after memblock has gone away. If so we just fail, we don't support migrating from a patched to an unpatched machine. Or we do support it, but there will be no RFI flush enabled on the destination. This sounds bad to me. Either we support RFI flush or we don't. If we do the fallback area should be allocated at boot so it is always available. [snip] I think the problem with this is the size of the fallback area might have to be different between the origin and destination systems, say, a larger L1 data cache at the destination. In that case, the original size might not be enough to fully flush the L1 data cache. Michael, is that the reason it is done that way? I thought of that, but don't know for sure. Thanks! Mauricio
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
Hi Michael, Michal, I got back from vacation. Checking this one. On 02/20/2018 02:06 PM, Michal Suchánek wrote: I did it the way I did because otherwise we waste memory on every system on earth just to support a use case that we don't actually intend for anyone to ever use - ie. migrating from a patched machine to an unpatched machine. If this thread eventually closes in 'ok, so that memory has to be reserved/wasted anyway', that can be done only in pseries, right? It seems not so much memory for this particular platform/hardware. If you have multiple hosts running some LPMs and want to update them without shutting down the whole thing I suppose it might easily happen that a machine (re)started on a patched host is migrated to unpatched host. Right, but that should be temporary, I think -- after updating some of the hosts, the LPAR(s) can be migrated back to one of them, where the fallback flush is not required anymore. I think I'm inclined to leave it the way it is, unless you feel strongly about it Michal? I think it would be more user friendly to either support the fallback method 100% or remove it and require patched firmware. I beg to disagree. Since the matter is a security issue, the option of still have some sort of fix that works on unpatched firmware does look good and friendly to users (rather than require 'you _must_ get the firmware update') IMHO. cheers, Mauricio
[PATCH] powerpc/64: Fix section mismatch warnings for early boot symbols
Some of the boot code located at the start of kernel text is "init" class, in that it only runs at boot time, however marking it as normal init code is problematic because that puts it into a different section located at the very end of kernel text. e.g., in case the TOC is not set up, we may not be able to tolerate a branch trampoline to reach the init function. Credits: code and message are based on 2016 patch by Nicholas Piggin, and slightly modified so not to rename the powerpc code/symbol names. Subject: [PATCH] powerpc/64: quieten section mismatch warnings From: Nicholas Piggin Date: Fri Dec 23 00:14:19 AEDT 2016 Signed-off-by: Mauricio Faria de Oliveira --- scripts/mod/modpost.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c index 9917f92..c65d5e2 100644 --- a/scripts/mod/modpost.c +++ b/scripts/mod/modpost.c @@ -1174,8 +1174,15 @@ static const struct sectioncheck *section_mismatch( * fromsec = text section * refsymname = *.constprop.* * + * Pattern 6: + * powerpc64 has boot functions that reference init, but must remain in text. + * This pattern is identified by + * tosec = init section + * fromsym = + * **/ -static int secref_whitelist(const struct sectioncheck *mismatch, +static int secref_whitelist(const struct elf_info *elf, + const struct sectioncheck *mismatch, const char *fromsec, const char *fromsym, const char *tosec, const char *tosym) { @@ -1212,6 +1219,17 @@ static int secref_whitelist(const struct sectioncheck *mismatch, match(fromsym, optim_symbols)) return 0; + /* Check for pattern 6 */ + if (elf->hdr->e_machine == EM_PPC64) + if (match(tosec, init_sections) && + (!strncmp(fromsym, "__boot_from_prom", + strlen("__boot_from_prom")) || +!strncmp(fromsym, "start_here_multiplatform", + strlen("start_here_multiplatform")) || +!strncmp(fromsym, "start_here_common", + strlen("start_here_common"))) + return 0; + return 1; } @@ -1552,7 +1570,7 @@ static void default_mismatch_handler(const char *modname, struct elf_info *elf, tosym = sym_name(elf, to); /* check whitelist - we may ignore it */ - if (secref_whitelist(mismatch, + if (secref_whitelist(elf, mismatch, fromsec, fromsym, tosec, tosym)) { report_sec_mismatch(modname, mismatch, fromsec, r->r_offset, fromsym, -- 1.8.3.1
[PATCH] powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping()
Fix the warning messages for stop_machine_change_mapping(), and a number of other affected functions in its call chain. All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit is okay (keeps them / does not discard them). Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu. $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux ... MODPOST vmlinux.o WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping() The function stop_machine_change_mapping() references the function __meminit create_physical_mapping(). This is often because stop_machine_change_mapping lacks a __meminit annotation or the annotation of create_physical_mapping is wrong. WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping() The function stop_machine_change_mapping() references the function __meminit create_physical_mapping(). This is often because stop_machine_change_mapping lacks a __meminit annotation or the annotation of create_physical_mapping is wrong. ... Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/mm/mem.c | 4 ++-- arch/powerpc/mm/pgtable-book3s64.c | 4 ++-- arch/powerpc/mm/pgtable-radix.c| 12 ++-- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index fe8c611..85245ef 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -127,7 +127,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end) return -ENODEV; } -int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, +int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock) { unsigned long start_pfn = start >> PAGE_SHIFT; @@ -148,7 +148,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int __meminit arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c index 422e802..bd6ca74 100644 --- a/arch/powerpc/mm/pgtable-book3s64.c +++ b/arch/powerpc/mm/pgtable-book3s64.c @@ -155,7 +155,7 @@ void mmu_cleanup_all(void) } #ifdef CONFIG_MEMORY_HOTPLUG -int create_section_mapping(unsigned long start, unsigned long end) +int __meminit create_section_mapping(unsigned long start, unsigned long end) { if (radix_enabled()) return radix__create_section_mapping(start, end); @@ -163,7 +163,7 @@ int create_section_mapping(unsigned long start, unsigned long end) return hash__create_section_mapping(start, end); } -int remove_section_mapping(unsigned long start, unsigned long end) +int __meminit remove_section_mapping(unsigned long start, unsigned long end) { if (radix_enabled()) return radix__remove_section_mapping(start, end); diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c index 2e10a96..ab9db0a 100644 --- a/arch/powerpc/mm/pgtable-radix.c +++ b/arch/powerpc/mm/pgtable-radix.c @@ -695,7 +695,7 @@ struct change_mapping_params { unsigned long aligned_end; }; -static int stop_machine_change_mapping(void *data) +static int __meminit stop_machine_change_mapping(void *data) { struct change_mapping_params *params = (struct change_mapping_params *)data; @@ -742,7 +742,7 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr, /* * clear the pte and potentially split the mapping helper */ -static void split_kernel_mapping(unsigned long addr, unsigned long end, +static void __meminit split_kernel_mapping(unsigned long addr, unsigned long end, unsigned long size, pte_t *pte) { unsigned long mask = ~(size - 1); @@ -835,7 +835,7 @@ static void remove_pud_table(pud_t *pud_start, unsigned long addr, } } -static void remove_pagetable(unsigned long start, unsigned long end) +static void __meminit remove_pagetable(unsigned long start, unsigned long end) { unsigned long addr, next; pud_t *pud_base; @@ -863,12 +863,12 @@ static void remove_pagetable(unsigned long start, unsigned long end) radix__flush_tlb_kernel_range(start, end); } -int __ref radix__create_section_mapping(unsigned long start, unsigned long end) +int __meminit radix__create_section_mapping(unsigned long start, unsigned long end) { return create_physical_mapping(star
[PATCH v2] powerpc/64: Fix section mismatch warnings for early boot symbols
Some of the boot code located at the start of kernel text is "init" class, in that it only runs at boot time, however marking it as normal init code is problematic because that puts it into a different section located at the very end of kernel text. e.g., in case the TOC is not set up, we may not be able to tolerate a branch trampoline to reach the init function. Credits: code and message are based on 2016 patch by Nicholas Piggin, and slightly modified so not to rename the powerpc code/symbol names. Subject: [PATCH] powerpc/64: quieten section mismatch warnings From: Nicholas Piggin Date: Fri Dec 23 00:14:19 AEDT 2016 Signed-off-by: Mauricio Faria de Oliveira --- v2: fix missing close parenthesis in conditional (wrong patch file, sorry) scripts/mod/modpost.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c index 9917f92..c65d5e2 100644 --- a/scripts/mod/modpost.c +++ b/scripts/mod/modpost.c @@ -1174,8 +1174,15 @@ static const struct sectioncheck *section_mismatch( * fromsec = text section * refsymname = *.constprop.* * + * Pattern 6: + * powerpc64 has boot functions that reference init, but must remain in text. + * This pattern is identified by + * tosec = init section + * fromsym = + * **/ -static int secref_whitelist(const struct sectioncheck *mismatch, +static int secref_whitelist(const struct elf_info *elf, + const struct sectioncheck *mismatch, const char *fromsec, const char *fromsym, const char *tosec, const char *tosym) { @@ -1212,6 +1219,17 @@ static int secref_whitelist(const struct sectioncheck *mismatch, match(fromsym, optim_symbols)) return 0; + /* Check for pattern 6 */ + if (elf->hdr->e_machine == EM_PPC64) + if (match(tosec, init_sections) && + (!strncmp(fromsym, "__boot_from_prom", + strlen("__boot_from_prom")) || +!strncmp(fromsym, "start_here_multiplatform", + strlen("start_here_multiplatform")) || +!strncmp(fromsym, "start_here_common", + strlen("start_here_common" + return 0; + return 1; } @@ -1552,7 +1570,7 @@ static void default_mismatch_handler(const char *modname, struct elf_info *elf, tosym = sym_name(elf, to); /* check whitelist - we may ignore it */ - if (secref_whitelist(mismatch, + if (secref_whitelist(elf, mismatch, fromsec, fromsym, tosec, tosym)) { report_sec_mismatch(modname, mismatch, fromsec, r->r_offset, fromsym, -- 1.8.3.1
Re: [PATCH] powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping()
Balbir, On 03/11/2018 03:23 AM, Balbir Singh wrote: Looks reasonable, I'd recommend trying to compile with MEMORY_HOTPLUG and MEMORY_HOTREMOVE enabled/disabled as well Thanks for reviewing. I should have mentioned it :) I did that (disable CONFIG_MEMORY_HOTPLUG) and all the related functions are not included in vmlinux, since they are guarded under that config option. cheers, Mauricio
[PATCH v2 0/4] Setup RFI flush after PowerVM LPM migration
This patchset allows for setup_rfi_flush() to be called again after PowerVM LPM (live partition mobility) aka LPAR migration, in order to possibly switch to a different flush method. The patches are mostly from Michael Ellerman, I have rebased to powerpc/linux.git merge branch as of commit 249d7ba (March 12th), and added one extra commit to force init of fallback flush area on pseries. Testing: --- I've disabled most code in the LPM functions migration_store() and post_mobility_fixup() to just reach pseries_setup_rfi_flush(), and modified pseries_setup_rfi_flush() a few times to fake the first and non-first calls to either enable instructions or the fallback flush method, to cover all 4 possible scenarios. Case 1) fallback -> fallback # dmesg | grep rfi-flush [0.00] rfi-flush: Using fallback displacement flush [0.00] rfi-flush: patched 8 locations # echo > /sys/kernel/mobility/migration # dmesg | grep rfi-flush [0.00] rfi-flush: Using fallback displacement flush [0.00] rfi-flush: patched 8 locations [ 20.583029] rfi-flush: Using fallback displacement flush [ 20.583038] rfi-flush: patched 8 locations Case 2) instructions -> fallback # dmesg | grep rfi-flush [0.00] rfi-flush: Using ori type flush [0.00] rfi-flush: Using mttrig type flush [0.00] rfi-flush: patched 8 locations # echo > /sys/kernel/mobility/migration # dmesg | grep rfi-flush [0.00] rfi-flush: Using ori type flush [0.00] rfi-flush: Using mttrig type flush [0.00] rfi-flush: patched 8 locations [ 36.023759] rfi-flush: Using fallback displacement flush [ 36.023764] rfi-flush: patched 8 locations Case 3) fallback -> instructions # dmesg | grep rfi-flush [0.00] rfi-flush: Using fallback displacement flush [0.00] rfi-flush: patched 8 locations # echo > /sys/kernel/mobility/migration # dmesg | grep rfi-flush [0.00] rfi-flush: Using fallback displacement flush [0.00] rfi-flush: patched 8 locations [ 23.373037] rfi-flush: Using ori type flush [ 23.373039] rfi-flush: Using mttrig type flush [ 23.373044] rfi-flush: patched 8 locations Case 4) instructions -> instructions # dmesg | grep rfi-flush [0.00] rfi-flush: Using ori type flush [0.00] rfi-flush: Using mttrig type flush [0.00] rfi-flush: patched 8 locations # echo > /sys/kernel/mobility/migration # dmesg | grep rfi-flush [0.00] rfi-flush: Using ori type flush [0.00] rfi-flush: Using mttrig type flush [0.00] rfi-flush: patched 8 locations [ 23.243564] rfi-flush: Using ori type flush [ 23.243566] rfi-flush: Using mttrig type flush [ 23.243570] rfi-flush: patched 8 locations Mauricio Faria de Oliveira (1): rfi-flush: Allow pseries to force init of fallback flush area Michael Ellerman (3): rfi-flush: Move the logic to avoid a redo into the debugfs code rfi-flush: Make it possible to call setup_rfi_flush() again rfi-flush: Call setup_rfi_flush() after LPM migration arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c| 22 +++--- arch/powerpc/platforms/powernv/setup.c| 3 ++- arch/powerpc/platforms/pseries/mobility.c | 3 +++ arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c| 6 +++--- 6 files changed, 26 insertions(+), 12 deletions(-) -- 2.7.4
[PATCH v2 1/4] rfi-flush: Move the logic to avoid a redo into the debugfs code
From: Michael Ellerman rfi_flush_enable() includes a check to see if we're already enabled (or disabled), and in that case does nothing. But that means calling setup_rfi_flush() a 2nd time doesn't actually work, which is a bit confusing. Move that check into the debugfs code, where it really belongs. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/setup_64.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index c388cc3..3efc01a 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -846,9 +846,6 @@ static void do_nothing(void *unused) void rfi_flush_enable(bool enable) { - if (rfi_flush == enable) - return; - if (enable) { do_rfi_flush_fixups(enabled_flush_types); on_each_cpu(do_nothing, NULL, 1); @@ -902,13 +899,19 @@ void __init setup_rfi_flush(enum l1d_flush_type types, bool enable) #ifdef CONFIG_DEBUG_FS static int rfi_flush_set(void *data, u64 val) { + bool enable; + if (val == 1) - rfi_flush_enable(true); + enable = true; else if (val == 0) - rfi_flush_enable(false); + enable = false; else return -EINVAL; + /* Only do anything if we're changing state */ + if (enable != rfi_flush) + rfi_flush_enable(enable); + return 0; } -- 2.7.4
[PATCH v2 2/4] rfi-flush: Make it possible to call setup_rfi_flush() again
From: Michael Ellerman For PowerVM migration we want to be able to call setup_rfi_flush() again after we've migrated the partition. To support that we need to check that we're not trying to allocate the fallback flush area after memblock has gone away (i.e., boot-time only). Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c | 6 +- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h index 469b7fd..bbcdf929 100644 --- a/arch/powerpc/include/asm/setup.h +++ b/arch/powerpc/include/asm/setup.h @@ -49,7 +49,7 @@ enum l1d_flush_type { L1D_FLUSH_MTTRIG= 0x8, }; -void __init setup_rfi_flush(enum l1d_flush_type, bool enable); +void setup_rfi_flush(enum l1d_flush_type, bool enable); void do_rfi_flush_fixups(enum l1d_flush_type types); #endif /* !__ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 3efc01a..d60e2f7 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -860,6 +860,10 @@ static void init_fallback_flush(void) u64 l1d_size, limit; int cpu; + /* Only allocate the fallback flush area once (at boot time). */ + if (l1d_flush_fallback_area) + return; + l1d_size = ppc64_caches.l1d.size; limit = min(ppc64_bolted_size(), ppc64_rma_size); @@ -877,7 +881,7 @@ static void init_fallback_flush(void) } } -void __init setup_rfi_flush(enum l1d_flush_type types, bool enable) +void setup_rfi_flush(enum l1d_flush_type types, bool enable) { if (types & L1D_FLUSH_FALLBACK) { pr_info("rfi-flush: Using fallback displacement flush\n"); -- 2.7.4
[PATCH v2 3/4] rfi-flush: Allow pseries to force init of fallback flush area
This ensures the fallback flush area is always allocated at boot time on the pseries platform, so PowerVM migration to an unpatched system can rely on the fallback flush method. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c | 5 +++-- arch/powerpc/platforms/powernv/setup.c | 3 ++- arch/powerpc/platforms/pseries/setup.c | 4 ++-- 4 files changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h index bbcdf929..efdaf10 100644 --- a/arch/powerpc/include/asm/setup.h +++ b/arch/powerpc/include/asm/setup.h @@ -49,7 +49,7 @@ enum l1d_flush_type { L1D_FLUSH_MTTRIG= 0x8, }; -void setup_rfi_flush(enum l1d_flush_type, bool enable); +void setup_rfi_flush(enum l1d_flush_type, bool enable, bool force_init_fallback); void do_rfi_flush_fixups(enum l1d_flush_type types); #endif /* !__ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index d60e2f7..cb886a8 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -881,12 +881,13 @@ static void init_fallback_flush(void) } } -void setup_rfi_flush(enum l1d_flush_type types, bool enable) +void setup_rfi_flush(enum l1d_flush_type types, bool enable, bool force_init_fallback) { if (types & L1D_FLUSH_FALLBACK) { pr_info("rfi-flush: Using fallback displacement flush\n"); init_fallback_flush(); - } + } else if (force_init_fallback) + init_fallback_flush(); if (types & L1D_FLUSH_ORI) pr_info("rfi-flush: Using ori type flush\n"); diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c index 092715b..a4ab8f6 100644 --- a/arch/powerpc/platforms/powernv/setup.c +++ b/arch/powerpc/platforms/powernv/setup.c @@ -46,6 +46,7 @@ static void pnv_setup_rfi_flush(void) struct device_node *np, *fw_features; enum l1d_flush_type type; int enable; + bool force_init_fallback = false; /* Default to fallback in case fw-features are not available */ type = L1D_FLUSH_FALLBACK; @@ -88,7 +89,7 @@ static void pnv_setup_rfi_flush(void) of_node_put(fw_features); } - setup_rfi_flush(type, enable > 0); + setup_rfi_flush(type, enable > 0, force_init_fallback); } static void __init pnv_setup_arch(void) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 1a52762..2f82bbb 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -463,7 +463,7 @@ static void pseries_setup_rfi_flush(void) { struct h_cpu_char_result result; enum l1d_flush_type types; - bool enable; + bool enable, force_init_fallback = true; long rc; /* Enable by default */ @@ -490,7 +490,7 @@ static void pseries_setup_rfi_flush(void) types = L1D_FLUSH_FALLBACK; } - setup_rfi_flush(types, enable); + setup_rfi_flush(types, enable, force_init_fallback); } #ifdef CONFIG_PCI_IOV -- 2.7.4
[PATCH v2 4/4] rfi-flush: Call setup_rfi_flush() after LPM migration
From: Michael Ellerman We might have migrated to a machine that uses a different flush type, or doesn't need flushing at all. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/platforms/pseries/mobility.c | 3 +++ arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c| 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 0f7fb71..8a8033a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -348,6 +348,9 @@ void post_mobility_fixup(void) printk(KERN_ERR "Post-mobility device tree update " "failed: %d\n", rc); + /* Possibly switch to a new RFI flush type */ + pseries_setup_rfi_flush(); + return; } diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index 1ae1d9f..27cdcb6 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -100,4 +100,6 @@ static inline unsigned long cmo_get_page_size(void) int dlpar_workqueue_init(void); +void pseries_setup_rfi_flush(void); + #endif /* _PSERIES_PSERIES_H */ diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 2f82bbb..360efe7 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -459,7 +459,7 @@ static void __init find_and_init_phbs(void) of_pci_check_probe_only(); } -static void pseries_setup_rfi_flush(void) +void pseries_setup_rfi_flush(void) { struct h_cpu_char_result result; enum l1d_flush_type types; -- 2.7.4
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
Hi Michael and Michal, Got back to this; sorry for the delay. On 03/06/2018 09:55 AM, Michal Suchánek wrote: Michael Ellerman wrote: I*think* the patch below is all we need, as well as some tweaking of patch 2, are you able to test and repost? Enabling the fallback flush always looks a bit dodgy but do_rfi_flush_fixups will overwrite the jump so long any other fixup is enabled. I agree; the 'Using fallback displacement flush' message is misleading (is the system slower/fallback or not? Ô_o) The suggested patch is definitely more contained within pseries, but the informational part seemed dodgy IMHO too. So I wrote something with a new function parameter to force the init of the fallback flush area (true in pseries, false in powernv). Not that contained, but it seemed to convey the intent here in a clear way. That's v2, just sent. cheers, Mauricio
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
Hi Michael, On 03/13/2018 08:39 AM, Michael Ellerman wrote: I agree; the 'Using fallback displacement flush' message is misleading (is the system slower/fallback or not? Ô_o) That message is actually just wrong. It still prints that even if enable=false. So we should change all those messages, perhaps: pr_info("rfi-flush: fallback displacement flush available\n"); pr_info("rfi-flush: ori type flush available\n"); pr_info("rfi-flush: mttrig type flush available\n"); Indeed. So I wrote something with a new function parameter to force the init of the fallback flush area (true in pseries, false in powernv). Not that contained, but it seemed to convey the intent here in a clear way. That's v2, just sent. OK thanks. I don't really like it :D - sorry! No worries :) fair enough. Well, I didn't like it much, either, TBH. It's a lot of plumbing of that bool just to avoid the message, whereas I think we could just change the message like above. Yup. And what you think about a more descriptive confirmation of what flush instructions/methods are _actually_ being used? Currently and w/ your suggestion aobve, all that is known is what is _available_, not what has gone in (or out, in the disable case) the nop slots. cheers, mauricio
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
On 03/13/2018 02:59 PM, Michal Suchánek wrote: Maybe it would make more sense to move the messages to the function that actually patches in the instructions? That helps, but if the instructions are not patched (e.g., no_rfi_flush) then there is no information about what the system actually supports, which is useful for diagnostics/debugging (and patch verification! :-) ) cheers, mauricio
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
On 03/13/2018 03:36 PM, Michal Suchánek wrote: On Tue, 13 Mar 2018 15:13:11 -0300 Mauricio Faria de Oliveira wrote: On 03/13/2018 02:59 PM, Michal Suchánek wrote: Maybe it would make more sense to move the messages to the function that actually patches in the instructions? That helps, but if the instructions are not patched (e.g., no_rfi_flush) then there is no information about what the system actually supports, which is useful for diagnostics/debugging (and patch verification!:-) ) Can't you patch with debugfs in that case? For development purposes, yes, sure; but unfortunately sometimes only a dmesg output or other offline/postmortem data is available. And there's the user case where he is not aware/willing/allowed to use the debugfs switch. I still think the correct, informative messages are a good way to go :) cheers, mauricio
[PATCH v3 0/5] Setup RFI flush after PowerVM LPM migration
This patchset allows for setup_rfi_flush() to be called again after PowerVM LPM (live partition mobility) aka LPAR migration, in order to possibly switch to a different flush method. The patches are mostly from Michael Ellerman, I have rebased to powerpc/linux.git merge branch as of commit d8443ef (March 14). Testcase and results sent as the last email in the series. v3: add patch 4 to tell flush types are 'available' and not 'Using'. remove plumbing patch. v2: add plumbing patch between platforms and setup code to force init fallback flush. Mauricio Faria de Oliveira (1): rfi-flush: Differentiate enabled and patched flush types Michael Ellerman (4): rfi-flush: Move the logic to avoid a redo into the debugfs code rfi-flush: Make it possible to call setup_rfi_flush() again rfi-flush: Always enable fallback flush on pseries rfi-flush: Call setup_rfi_flush() after LPM migration arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c| 25 - arch/powerpc/lib/feature-fixups.c | 9 - arch/powerpc/platforms/pseries/mobility.c | 3 +++ arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c| 12 ++-- 6 files changed, 32 insertions(+), 21 deletions(-) -- 2.7.4
[PATCH v3 1/5] rfi-flush: Move the logic to avoid a redo into the debugfs code
From: Michael Ellerman rfi_flush_enable() includes a check to see if we're already enabled (or disabled), and in that case does nothing. But that means calling setup_rfi_flush() a 2nd time doesn't actually work, which is a bit confusing. Move that check into the debugfs code, where it really belongs. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/setup_64.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index c388cc3..3efc01a 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -846,9 +846,6 @@ static void do_nothing(void *unused) void rfi_flush_enable(bool enable) { - if (rfi_flush == enable) - return; - if (enable) { do_rfi_flush_fixups(enabled_flush_types); on_each_cpu(do_nothing, NULL, 1); @@ -902,13 +899,19 @@ void __init setup_rfi_flush(enum l1d_flush_type types, bool enable) #ifdef CONFIG_DEBUG_FS static int rfi_flush_set(void *data, u64 val) { + bool enable; + if (val == 1) - rfi_flush_enable(true); + enable = true; else if (val == 0) - rfi_flush_enable(false); + enable = false; else return -EINVAL; + /* Only do anything if we're changing state */ + if (enable != rfi_flush) + rfi_flush_enable(enable); + return 0; } -- 2.7.4
[PATCH v3 2/5] rfi-flush: Make it possible to call setup_rfi_flush() again
From: Michael Ellerman For PowerVM migration we want to be able to call setup_rfi_flush() again after we've migrated the partition. To support that we need to check that we're not trying to allocate the fallback flush area after memblock has gone away (i.e., boot-time only). Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/setup.h | 2 +- arch/powerpc/kernel/setup_64.c | 6 +- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h index 469b7fd..bbcdf92 100644 --- a/arch/powerpc/include/asm/setup.h +++ b/arch/powerpc/include/asm/setup.h @@ -49,7 +49,7 @@ enum l1d_flush_type { L1D_FLUSH_MTTRIG= 0x8, }; -void __init setup_rfi_flush(enum l1d_flush_type, bool enable); +void setup_rfi_flush(enum l1d_flush_type, bool enable); void do_rfi_flush_fixups(enum l1d_flush_type types); #endif /* !__ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 3efc01a..d60e2f7 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -860,6 +860,10 @@ static void init_fallback_flush(void) u64 l1d_size, limit; int cpu; + /* Only allocate the fallback flush area once (at boot time). */ + if (l1d_flush_fallback_area) + return; + l1d_size = ppc64_caches.l1d.size; limit = min(ppc64_bolted_size(), ppc64_rma_size); @@ -877,7 +881,7 @@ static void init_fallback_flush(void) } } -void __init setup_rfi_flush(enum l1d_flush_type types, bool enable) +void setup_rfi_flush(enum l1d_flush_type types, bool enable) { if (types & L1D_FLUSH_FALLBACK) { pr_info("rfi-flush: Using fallback displacement flush\n"); -- 2.7.4
[PATCH v3 3/5] rfi-flush: Always enable fallback flush on pseries
From: Michael Ellerman This ensures the fallback flush area is always allocated on pseries, so in case a LPAR is migrated from a patched to an unpatched system, it is possible to enable the fallback flush in the target system. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/platforms/pseries/setup.c | 10 +- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 4642e48..b20d107 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -468,26 +468,18 @@ static void pseries_setup_rfi_flush(void) /* Enable by default */ enable = true; + types = L1D_FLUSH_FALLBACK; rc = plpar_get_cpu_characteristics(&result); if (rc == H_SUCCESS) { - types = L1D_FLUSH_NONE; - if (result.character & H_CPU_CHAR_L1D_FLUSH_TRIG2) types |= L1D_FLUSH_MTTRIG; if (result.character & H_CPU_CHAR_L1D_FLUSH_ORI30) types |= L1D_FLUSH_ORI; - /* Use fallback if nothing set in hcall */ - if (types == L1D_FLUSH_NONE) - types = L1D_FLUSH_FALLBACK; - if ((!(result.behaviour & H_CPU_BEHAV_L1D_FLUSH_PR)) || (!(result.behaviour & H_CPU_BEHAV_FAVOUR_SECURITY))) enable = false; - } else { - /* Default to fallback if case hcall is not available */ - types = L1D_FLUSH_FALLBACK; } setup_rfi_flush(types, enable); -- 2.7.4
[PATCH v3 4/5] rfi-flush: Differentiate enabled and patched flush types
Currently the rfi-flush messages print 'Using flush' for all enabled_flush_types, but that is not necessarily true -- as now the fallback flush is always enabled on pseries, but the fixup function overwrites its nop/branch slot with other flush types, if available. So, replace the 'Using flush' messages with ' flush is available'. Also, print the patched flush types in the fixup function, so users can know what is (not) being used (e.g., the slower, fallback flush, or no flush type at all if flush is disabled via the debugfs switch). Suggested-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- P.S.: Michael, you suggested only hunk 1. please feel free to discard hunk 2 if you don't like it. arch/powerpc/kernel/setup_64.c| 6 +++--- arch/powerpc/lib/feature-fixups.c | 9 - 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index d60e2f7..4ec4a27 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -884,15 +884,15 @@ static void init_fallback_flush(void) void setup_rfi_flush(enum l1d_flush_type types, bool enable) { if (types & L1D_FLUSH_FALLBACK) { - pr_info("rfi-flush: Using fallback displacement flush\n"); + pr_info("rfi-flush: fallback displacement flush available\n"); init_fallback_flush(); } if (types & L1D_FLUSH_ORI) - pr_info("rfi-flush: Using ori type flush\n"); + pr_info("rfi-flush: ori type flush available\n"); if (types & L1D_FLUSH_MTTRIG) - pr_info("rfi-flush: Using mttrig type flush\n"); + pr_info("rfi-flush: mttrig type flush available\n"); enabled_flush_types = types; diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c index 73697c4..35f80ab 100644 --- a/arch/powerpc/lib/feature-fixups.c +++ b/arch/powerpc/lib/feature-fixups.c @@ -153,7 +153,14 @@ void do_rfi_flush_fixups(enum l1d_flush_type types) patch_instruction(dest + 2, instrs[2]); } - printk(KERN_DEBUG "rfi-flush: patched %d locations\n", i); + printk(KERN_DEBUG "rfi-flush: patched %d locations (%s flush)\n", i, + (types == L1D_FLUSH_NONE) ? "no" : + (types == L1D_FLUSH_FALLBACK) ? "fallback displacement" : + (types & L1D_FLUSH_ORI)? (types & L1D_FLUSH_MTTRIG) + ? "ori+mttrig type" + : "ori type" : + (types & L1D_FLUSH_MTTRIG) ? "mttrig type" + : "unknown"); } #endif /* CONFIG_PPC_BOOK3S_64 */ -- 2.7.4
[PATCH v3 5/5] rfi-flush: Call setup_rfi_flush() after LPM migration
From: Michael Ellerman We might have migrated to a machine that uses a different flush type, or doesn't need flushing at all. Signed-off-by: Michael Ellerman Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/platforms/pseries/mobility.c | 3 +++ arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c| 2 +- 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 0f7fb71..8a8033a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -348,6 +348,9 @@ void post_mobility_fixup(void) printk(KERN_ERR "Post-mobility device tree update " "failed: %d\n", rc); + /* Possibly switch to a new RFI flush type */ + pseries_setup_rfi_flush(); + return; } diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index c73351c..60db2ee 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -108,4 +108,6 @@ static inline unsigned long cmo_get_page_size(void) int dlpar_workqueue_init(void); +void pseries_setup_rfi_flush(void); + #endif /* _PSERIES_PSERIES_H */ diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index b20d107..f34f908 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -459,7 +459,7 @@ static void __init find_and_init_phbs(void) of_pci_check_probe_only(); } -static void pseries_setup_rfi_flush(void) +void pseries_setup_rfi_flush(void) { struct h_cpu_char_result result; enum l1d_flush_type types; -- 2.7.4
[TESTS] Create 'enabled_flush_types' boot option, and short-circuit LPM
fallback -> fallback: # dmesg -c | grep rfi-flush [0.00] rfi-flush (debug): enabled flush types: 0x0 [0.00] rfi-flush: fallback displacement flush available [0.00] rfi-flush: patched 8 locations (fallback displacement flush) # echo > /sys/kernel/mobility/migration # dmesg -c | grep rfi-flush [ 32.525443] rfi-flush: fallback displacement flush available [ 32.526269] rfi-flush: patched 8 locations (fallback displacement flush) fallback -> instructions: # dmesg -c | grep rfi-flush [0.00] rfi-flush (debug): enabled flush types: 0x0 [0.00] rfi-flush: fallback displacement flush available [0.00] rfi-flush: patched 8 locations (fallback displacement flush) # echo 24 > /sys/kernel/debug/powerpc/rfi_flush # dmesg -c | grep rfi-flush [ 30.984247] rfi-flush (debug): enabled flush types: 0xc # echo > /sys/kernel/mobility/migration # dmesg -c | grep rfi-flush [ 51.554357] rfi-flush: fallback displacement flush available [ 51.554360] rfi-flush: ori type flush available [ 51.554361] rfi-flush: mttrig type flush available [ 51.554366] rfi-flush: patched 8 locations (ori+mttrig type flush) instructions -> instructions: # dmesg -c | grep rfi-flush [0.00] rfi-flush (debug): enabled flush types: 0xc [0.00] rfi-flush: fallback displacement flush available [0.00] rfi-flush: ori type flush available [0.00] rfi-flush: mttrig type flush available [0.00] rfi-flush: patched 8 locations (ori+mttrig type flush) # echo > /sys/kernel/mobility/migration # dmesg -c | grep rfi-flush [ 55.100566] rfi-flush: fallback displacement flush available [ 55.100570] rfi-flush: ori type flush available [ 55.100571] rfi-flush: mttrig type flush available [ 55.100575] rfi-flush: patched 8 locations (ori+mttrig type flush) instructions -> fallback: # dmesg -c | grep rfi-flush [0.00] rfi-flush (debug): enabled flush types: 0xc [0.00] rfi-flush: fallback displacement flush available [0.00] rfi-flush: ori type flush available [0.00] rfi-flush: mttrig type flush available [0.00] rfi-flush: patched 8 locations (ori+mttrig type flush) # echo > /sys/kernel/debug/powerpc/rfi_flush # dmesg -c | grep rfi-flush [ 18.730782] rfi-flush (debug): enabled flush types: 0x0 # echo > /sys/kernel/mobility/migration # dmesg -c | grep rfi-flush [ 27.120071] rfi-flush: fallback displacement flush available [ 27.120078] rfi-flush: patched 8 locations (fallback displacement flush) debugfs switch: # echo 0 > /sys/kernel/debug/powerpc/rfi_flush # echo 1 > /sys/kernel/debug/powerpc/rfi_flush # dmesg -c | grep rfi-flush [ 106.031993] rfi-flush: patched 8 locations (no flush) [ 109.670966] rfi-flush: patched 8 locations (fallback displacement flush) ori type only: # echo 8 > /sys/kernel/debug/powerpc/rfi_flush # echo 0 > /sys/kernel/debug/powerpc/rfi_flush # echo 1 > /sys/kernel/debug/powerpc/rfi_flush # dmesg -c | grep rfi-flush [ 308.988958] rfi-flush (debug): enabled flush types: 0x4 [ 314.206548] rfi-flush: patched 8 locations (no flush) [ 316.349916] rfi-flush: patched 8 locations (ori type flush) mttrig type only: # echo 16 > /sys/kernel/debug/powerpc/rfi_flush # echo 0 > /sys/kernel/debug/powerpc/rfi_flush # echo 1 > /sys/kernel/debug/powerpc/rfi_flush # dmesg -c | grep rfi-flush [ 355.993189] rfi-flush (debug): enabled flush types: 0x8 [ 360.644291] rfi-flush: patched 8 locations (no flush) [ 365.300547] rfi-flush: patched 8 locations (mttrig type flush) Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/setup_64.c| 29 + arch/powerpc/platforms/pseries/mobility.c | 4 2 files changed, 33 insertions(+) diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 4ec4a27..9c9568e 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -816,6 +816,24 @@ static void *l1d_flush_fallback_area; static bool no_rfi_flush; bool rfi_flush; +static int __init handle_enabled_flush_types(char *p) +{ + int rc; + enum l1d_flush_type types; + + rc = kstrtoul(p, 0, (long unsigned int *)&types); + if (!rc) { + enabled_flush_types = types; + pr_info("rfi-flush (debug): enabled flush types: 0x%x\n", enabled_flush_types); + } else { + enabled_flush_types = L1D_FLUSH_NONE; + pr_info("rfi-flush (debug): enabled f
Re: [PATCH v3 1/5] rfi-flush: Move the logic to avoid a redo into the debugfs code
Hi Murilo and Michal, On 03/16/2018 05:52 AM, Michal Suchánek wrote: Do we need to take into account if no_rfi_flush == true? I think it makes sense you are able to override that using debugfs. It's interface used for diagnostics and testing. If this was in sysfs it would be a different story. Yes, I agree. The debugfs is way to override the cmdline option. Thanks for looking carefully at this :) cheers, Mauricio
Re: [PATCH 2/3] rfi-flush: Make it possible to call setup_rfi_flush() again
Michael, On 03/16/2018 11:18 AM, Michael Ellerman wrote: I still think the correct, informative messages are a good way to go:) Yeah I agree. We probably want to do both, print what's available at boot, and print what's actually patched when the patching happens. Nice. Not sure you had a chance to review yet, but 'PATCH v3 4/5' does exactly that :- ) I think its implementation of the latter part looks a bit strange, but I couldn't figure an elegant way to fit that in (either that one or string array indexed by flush-type possible values or a long if-else chain). I'd be happy with suggestions if it's preferred in some other way. cheers, Mauricio
[PATCH] powerpc/pseries: Fix to clear security feature flags
The H_CPU_BEHAV_* flags should be checked for in the 'behaviour' field of 'struct h_cpu_char_result' -- 'character' is for H_CPU_CHAR_* flags. Found it by playing around with QEMU's implementation of the hypercall: Example: H_CPU_CHAR=0xf000 H_CPU_BEHAV=0x This clears H_CPU_BEHAV_FAVOUR_SECURITY and H_CPU_BEHAV_L1D_FLUSH_PR so pseries_setup_rfi_flush() disables 'rfi_flush'; and it also clears H_CPU_CHAR_L1D_THREAD_PRIV flag. So there is no RFI flush mitigation at all for cpu_show_meltdown() to report; but currently it does: Original kernel: # cat /sys/devices/system/cpu/vulnerabilities/meltdown Mitigation: RFI Flush Patched kernel: # cat /sys/devices/system/cpu/vulnerabilities/meltdown Not affected Example: H_CPU_CHAR=0x H_CPU_BEHAV=0xf000 This sets H_CPU_BEHAV_BNDS_CHK_SPEC_BAR so cpu_show_spectre_v1() should report vulnerable; but currently it doesn't: Original kernel: # cat /sys/devices/system/cpu/vulnerabilities/spectre_v1 Not affected Patched kernel: # cat /sys/devices/system/cpu/vulnerabilities/spectre_v1 Vulnerable Fixes: f636c14790ea ("powerpc/pseries: Set or clear security feature flags") Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/platforms/pseries/setup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 1f12235..b11564f 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -484,13 +484,13 @@ static void init_cpu_char_feature_flags(struct h_cpu_char_result *result) * The features below are enabled by default, so we instead look to see * if firmware has *disabled* them, and clear them if so. */ - if (!(result->character & H_CPU_BEHAV_FAVOUR_SECURITY)) + if (!(result->behaviour & H_CPU_BEHAV_FAVOUR_SECURITY)) security_ftr_clear(SEC_FTR_FAVOUR_SECURITY); - if (!(result->character & H_CPU_BEHAV_L1D_FLUSH_PR)) + if (!(result->behaviour & H_CPU_BEHAV_L1D_FLUSH_PR)) security_ftr_clear(SEC_FTR_L1D_FLUSH_PR); - if (!(result->character & H_CPU_BEHAV_BNDS_CHK_SPEC_BAR)) + if (!(result->behaviour & H_CPU_BEHAV_BNDS_CHK_SPEC_BAR)) security_ftr_clear(SEC_FTR_BNDS_CHK_SPEC_BAR); } -- 1.8.3.1
Re: [PATCH v2 03/10] powerpc/pseries: Set or clear security feature flags
Hi Michael, On 03/27/2018 09:01 AM, Michael Ellerman wrote: + if (!(result->character & H_CPU_BEHAV_FAVOUR_SECURITY)) + security_ftr_clear(SEC_FTR_FAVOUR_SECURITY); + + if (!(result->character & H_CPU_BEHAV_L1D_FLUSH_PR)) + security_ftr_clear(SEC_FTR_L1D_FLUSH_PR); + + if (!(result->character & H_CPU_BEHAV_BNDS_CHK_SPEC_BAR)) + security_ftr_clear(SEC_FTR_BNDS_CHK_SPEC_BAR); Oops, I missed this.. The H_CPU_BEHAV flags should be checked for in 'result->behaviour'. Just sent '[PATCH] powerpc/pseries: Fix to clear security feature flags' cheers, Mauricio
[PATCH 1/2] powerpc: Move default security feature flags
This moves the definition of the default security feature flags (i.e., enabled by default) closer to the security feature flags. This can be used to restore current flags to the default flags. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/security_features.h | 8 arch/powerpc/kernel/security.c | 7 +-- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/security_features.h b/arch/powerpc/include/asm/security_features.h index 400a905..fa4d2e1 100644 --- a/arch/powerpc/include/asm/security_features.h +++ b/arch/powerpc/include/asm/security_features.h @@ -63,4 +63,12 @@ static inline bool security_ftr_enabled(unsigned long feature) // Firmware configuration indicates user favours security over performance #define SEC_FTR_FAVOUR_SECURITY0x0200ull + +// Features enabled by default +#define SEC_FTR_DEFAULT \ + (SEC_FTR_L1D_FLUSH_HV | \ +SEC_FTR_L1D_FLUSH_PR | \ +SEC_FTR_BNDS_CHK_SPEC_BAR | \ +SEC_FTR_FAVOUR_SECURITY) + #endif /* _ASM_POWERPC_SECURITY_FEATURES_H */ diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c index 2cee3dc..bab5a27 100644 --- a/arch/powerpc/kernel/security.c +++ b/arch/powerpc/kernel/security.c @@ -11,12 +11,7 @@ #include -unsigned long powerpc_security_features __read_mostly = \ - SEC_FTR_L1D_FLUSH_HV | \ - SEC_FTR_L1D_FLUSH_PR | \ - SEC_FTR_BNDS_CHK_SPEC_BAR | \ - SEC_FTR_FAVOUR_SECURITY; - +unsigned long powerpc_security_features __read_mostly = SEC_FTR_DEFAULT; ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf) { -- 1.8.3.1
[PATCH 2/2] powerpc/pseries: Restore default security feature flags on setup
After migration the security feature flags might have changed (e.g., destination system with unpatched firmware), but some flags are not set/clear again in init_cpu_char_feature_flags() because it assumes the security flags to be the defaults. Additionally, if the H_GET_CPU_CHARACTERISTICS hypercall fails then init_cpu_char_feature_flags() does not run again, which potentially might leave the system in an insecure or sub-optimal configuration. So, just restore the security feature flags to the defaults assumed by init_cpu_char_feature_flags() so it can set/clear them correctly, and to ensure safe settings are in place in case the hypercall fail. Fixes: f636c14790ea ("powerpc/pseries: Set or clear security feature flags") Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/platforms/pseries/setup.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index b11564f..2581fc8 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -462,6 +462,10 @@ static void __init find_and_init_phbs(void) static void init_cpu_char_feature_flags(struct h_cpu_char_result *result) { + /* +* The features below are disabled by default, so we instead look to see +* if firmware has *enabled* them, and set them if so. +*/ if (result->character & H_CPU_CHAR_SPEC_BAR_ORI31) security_ftr_set(SEC_FTR_SPEC_BAR_ORI31); @@ -501,6 +505,13 @@ void pseries_setup_rfi_flush(void) bool enable; long rc; + /* +* Set features to the defaults assumed by init_cpu_char_feature_flags() +* so it can set/clear again any features that might have changed after +* migration, and in case the hypercall fails and it is not even called. +*/ + powerpc_security_features = SEC_FTR_DEFAULT; + rc = plpar_get_cpu_characteristics(&result); if (rc == H_SUCCESS) init_cpu_char_feature_flags(&result); -- 1.8.3.1
[PATCH v3] powerpc/64: Fix section mismatch warnings for early boot symbols
Some of the boot code located at the start of kernel text is "init" class, in that it only runs at boot time, however marking it as normal init code is problematic because that puts it into a different section located at the very end of kernel text. e.g., in case the TOC is not set up, we may not be able to tolerate a branch trampoline to reach the init function. Credits: code and message are based on 2016 patch by Nicholas Piggin, and slightly modified so not to rename the powerpc code/symbol names. Subject: [PATCH] powerpc/64: quieten section mismatch warnings From: Nicholas Piggin Date: Fri Dec 23 00:14:19 AEDT 2016 This resolves the following section mismatch warnings: WARNING: vmlinux.o(.text+0x2fa8): Section mismatch in reference from the variable __boot_from_prom to the function .init.text:prom_init() The function __boot_from_prom() references the function __init prom_init(). This is often because __boot_from_prom lacks a __init annotation or the annotation of prom_init is wrong. WARNING: vmlinux.o(.text+0x3238): Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:early_setup() The function start_here_multiplatform() references the function __init early_setup(). This is often because start_here_multiplatform lacks a __init annotation or the annotation of early_setup is wrong. WARNING: vmlinux.o(.text+0x326c): Section mismatch in reference from the variable start_here_common to the function .init.text:start_kernel() The function start_here_common() references the function __init start_kernel(). This is often because start_here_common lacks a __init annotation or the annotation of start_kernel is wrong. Signed-off-by: Mauricio Faria de Oliveira --- v3: reword some comments and include errors in commit message v2: fix build error due to missing parenthesis scripts/mod/modpost.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c index 4ff08a0..d10c9d8 100644 --- a/scripts/mod/modpost.c +++ b/scripts/mod/modpost.c @@ -1173,8 +1173,15 @@ static const struct sectioncheck *section_mismatch( * fromsec = text section * refsymname = *.constprop.* * + * Pattern 6: + * powerpc64 has boot functions that reference init, but must remain in text. + * This pattern is identified by + * tosec = init section + * fromsym = __boot_from_prom, start_here_common, start_here_multiplatform + * **/ -static int secref_whitelist(const struct sectioncheck *mismatch, +static int secref_whitelist(const struct elf_info *elf, + const struct sectioncheck *mismatch, const char *fromsec, const char *fromsym, const char *tosec, const char *tosym) { @@ -1211,6 +1218,17 @@ static int secref_whitelist(const struct sectioncheck *mismatch, match(fromsym, optim_symbols)) return 0; + /* Check for pattern 6 */ + if (elf->hdr->e_machine == EM_PPC64) + if (match(tosec, init_sections) && + (!strncmp(fromsym, "__boot_from_prom", + strlen("__boot_from_prom")) || +!strncmp(fromsym, "start_here_common", + strlen("start_here_common")) || +!strncmp(fromsym, "start_here_multiplatform", + strlen("start_here_multiplatform" + return 0; + return 1; } @@ -1551,7 +1569,7 @@ static void default_mismatch_handler(const char *modname, struct elf_info *elf, tosym = sym_name(elf, to); /* check whitelist - we may ignore it */ - if (secref_whitelist(mismatch, + if (secref_whitelist(elf, mismatch, fromsec, fromsym, tosec, tosym)) { report_sec_mismatch(modname, mismatch, fromsec, r->r_offset, fromsym, -- 1.8.3.1
Re: [PATCH v3] powerpc/64: Fix section mismatch warnings for early boot symbols
On 04/09/2018 11:51 PM, Michael Ellerman wrote: Thanks for picking this one up. I hate to be a pain ... but before we merge this and proliferate these names, I'd like to change the names of some of these early asm functions. They're terribly named due to historical reasons. Indeed :) No worries. I haven't actually thought of good names yet though:) I'll try and come up with some and post a patch doing the renames. Alright. Could you please copy me on that, and I can post an update. cheers, Mauricio
Re: powerpc: add kernel parameter iommu_alloc_quiet
Hi Michael, Thanks for the review. On 09/21/2016 10:51 AM, Michael Ellerman wrote: That is important/requirement for the distribution kernels - where the DMA_ATTR_NO_WARN changes to 'enum dma_attr' are not acceptable because it breaks the kernel ABI. [snip] > Removing an entry from an enum would break the API (Note _P_). But I don't see > how adding one does. Agree.. I should have worded 'already-built out-of-tree modules'. If I understand it correctly, this chunk will change the value of DMA_ATTR_MAX, and thus break the ABI for out-of-tree modules that depend on it. (Not that I know of any that use it, but that's the reason that causes distro kernel builds to fail w/ this applied.) @@ -19,6 +19,7 @@ enum dma_attr { DMA_ATTR_SKIP_CPU_SYNC, DMA_ATTR_FORCE_CONTIGUOUS, DMA_ATTR_ALLOC_SINGLE_PAGES, + DMA_ATTR_NO_WARN, DMA_ATTR_MAX, }; Also kernel command line parameters are really poor in terms of usability. Folks really don't want to have to change their kernel command line. If we really need a hack for distros then I'd suggest we add a global struct driver * in the powerpc iommu code, and then when nvme loads it sets that to point at itself, and then the check becomes: Agree w/ cmdline args are poor for usability, and that this new hack is certainly smarter -- however, it does not provide any option/choice for the user of whether enable/disable it (ie driver automatically sets it). Although that isn't a problem for older downstream nvme drivers, which have a single OK-to-fail dma mapping callsite (so the message doesn't matter at all), the newer downstream drivers also have a Not-OK-to-fail callsite, which is still worth to get the 'iommu_alloc failed' message (per Ben's point of 'message useful for debugging'). For the latter, I think an upstream patch like this suggestion is not a generally acceptable approach. So, the intent is to have a single/common hack that upstream is OK w/, then apply that downstream in the multiple distros. Of course, if you are not OK w/ this patch (which is obviously fair) we'll have to try it downstream only, and there we can diverge in the implementations. Please let me know your thoughts on it. Thanks! -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: powerpc: add kernel parameter iommu_alloc_quiet
Hi Michael, On 09/21/2016 11:34 AM, Mauricio Faria de Oliveira wrote: So, the intent is to have a single/common hack that upstream is OK w/, then apply that downstream in the multiple distros. Of course, if you are not OK w/ this patch (which is obviously fair) we'll have to try it downstream only, and there we can diverge in the implementations. Please let me know your thoughts on it. Thanks! -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH 1/3] powerpc: export cpu_to_core_id()
Export cpu_to_core_id(). This will be used by the lpfc driver. Tested on next-20160601. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/smp.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 55c924b..67136e7 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -593,6 +593,7 @@ out: of_node_put(np); return id; } +EXPORT_SYMBOL_GPL(cpu_to_core_id); /* Helper routines for cpu to core mapping */ int cpu_core_index_of_thread(int cpu) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: export cpu_to_core_id()
Please ignore the 'PATCH 1/3' in the subject; this is a single patch. -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: export cpu_to_core_id()
Hi Michael, On 06/02/2016 04:41 AM, Michael Ellerman wrote: On Wed, 2016-06-01 at 17:16 -0300, Mauricio Faria de Oliveira wrote: Export cpu_to_core_id(). This will be used by the lpfc driver. Can you explain why? Yup, I would have thought there'd be architecture neutral APIs you can use - and if there aren't maybe we should write them. I actually use topology_core_id() from in lpfc [1] (defined to cpu_to_core_id() by arch/powerpc/include/asm/topology.h). That is arch-neutral, used by eg /sys/devices/system/cpu/cpu*/topology, but drivers/base/topology.c is built-in (obj-y in ./Makefile), and thus didn't need the export. Thus, since the module uses topology_core_id() and this is defined to cpu_to_core_id(), it needs the export: ERROR: "cpu_to_core_id" [drivers/scsi/lpfc/lpfc.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 Thanks, [1] http://marc.info/?l=linux-scsi&m=146481382301686 -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: export cpu_to_core_id()
On 06/02/2016 06:00 AM, Mauricio Faria de Oliveira wrote: I actually use topology_core_id() from in lpfc [1] Er, .. kinda early here :) -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: export cpu_to_core_id()
On 06/02/2016 08:03 AM, Michael Ellerman wrote: Can you send me a v2 with a change log that includes all that detail. Sure; should have done it. Thanks. -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc: export cpu_to_core_id()
Export cpu_to_core_id(). This will be used by the lpfc driver. This enables topology_core_id() from (defined to cpu_to_core_id() in arch/powerpc/include/asm/topology.h) to be used by (non-builtin) modules. That is arch-neutral, already used by eg, drivers/base/topology.c, but it is builtin (obj-y in Makefile) thus didn't need the export. Since the module uses topology_core_id() and this is defined to cpu_to_core_id(), it needs the export, otherwise: ERROR: "cpu_to_core_id" [drivers/scsi/lpfc/lpfc.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 Tested on next-20160601. Changelog: - v2: include details about the need for this patch with regards to the architecture-neutral topology API. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/smp.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 55c924b..67136e7 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -593,6 +593,7 @@ out: of_node_put(np); return id; } +EXPORT_SYMBOL_GPL(cpu_to_core_id); /* Helper routines for cpu to core mapping */ int cpu_core_index_of_thread(int cpu) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
This prevents flooding the logs with 'iommu_alloc failed' messages while I/O is performed (normally) to very fast devices (e.g. NVMe). That error is not necessarily a problem; device drivers can retry later / reschedule the requests for which the allocation failed, and handle things gracefully for the caller stack on top of them. This helps at least with NVMe devices without "64-bit"/direct DMA window scenarios (e.g., systems with more than a few terabytes of memory, on which DDW cannot be enabled, currently), where just an 'dd' command can trigger errors. # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=512k <...> # echo $? 0 # dmesg nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c00151c9 npages 16 nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c00151c9 npages 16 nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c00151c9 npages 16 <...> ppc_iommu_map_sg: 8186 callbacks suppressed nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c000fa5c npages 16 nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c0010044 npages 16 nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c0010044 npages 16 <...> ppc_iommu_map_sg: 5707 callbacks suppressed nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c000b5f5 npages 16 nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c000b5c6 npages 16 nvme :00:06.0: iommu_alloc failed, tbl c001fa67c520 vaddr c000b4b3 npages 16 <...> Tested on next-20160609. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/iommu.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index a8e3490..b585bdc 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -479,10 +479,9 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl, /* Handle failure */ if (unlikely(entry == DMA_ERROR_CODE)) { - if (printk_ratelimit()) - dev_info(dev, "iommu_alloc failed, tbl %p " -"vaddr %lx npages %lu\n", tbl, vaddr, -npages); + dev_dbg_ratelimited(dev, "iommu_alloc failed, tbl %p " +"vaddr %lx npages %lu\n", tbl, vaddr, +npages); goto failure; } @@ -776,11 +775,9 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, mask >> tbl->it_page_shift, align, attrs); if (dma_handle == DMA_ERROR_CODE) { - if (printk_ratelimit()) { - dev_info(dev, "iommu_alloc failed, tbl %p " -"vaddr %p npages %d\n", tbl, vaddr, -npages); - } + dev_dbg_ratelimited(dev, "iommu_alloc failed, tbl %p " +"vaddr %p npages %d\n", tbl, vaddr, +npages); } else dma_handle |= (uaddr & ~IOMMU_PAGE_MASK(tbl)); } -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
Hi Ben, On 06/11/2016 08:02 PM, Benjamin Herrenschmidt wrote: I'm not fan of this. This is a very useful message to diagnose why, for example, your network adapter is not working properly. A lot of drivers don't deal well with IOMMU errors. The fact that NVME trigger these is a problem that needs to be solved differently. Ok, I understand your points. Thanks for the review -- helps in setting another direction. Kind regards, -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
Hi Ben, On 06/13/2016 06:26 PM, Benjamin Herrenschmidt wrote: I've been thinking about this a bit... it might be worthwhile adding a dma_* call to query the approximate size of the IOMMU window, as a way for the device to adjust its requirements dynamically. Ok, cool; something like it was one of the options being discussed here. What do you mean by 'approximate'? Maybe the size of 'free regions' in the pools? -- not sure because iiuic the window size is static / 2 gig, so didn't get why (or of what) to provide an approximation (for). Another option would be to use a dma_attr for silencing mapping errors which NVME could use provided it does handle them gracefully ... Ah, that's new. Interesting. Thanks for suggestion! -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
On 06/13/2016 06:51 PM, Benjamin Herrenschmidt wrote: Approximate wasn't a great choice of word but what I meant is: - The size doesn't mean you can do an allocation that size (pools layout etc..) - And it might be shared with another device (though less likely these days). Ah, yup - ok, now I get it; thanks. -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: fix oops in pcibios_release_device() after pcibios_free_controller()
a2d400 pcibios_free_controller QUIRK! ptr = 0123456789abcdef rpadlpar_io: slot PHB 33 removed # kill -9 $pid pcibios_release_device() phb = c001dea2d400, ptr = 0123456789abcdef, found 0 Signed-off-by: Mauricio Faria de Oliveira Tested on 4.7.0-rc6. --- arch/powerpc/kernel/pci-common.c | 5 +++-- arch/powerpc/kernel/pci-hotplug.c | 22 -- 2 files changed, 23 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 0f7a60f..00a22e7 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -41,7 +41,7 @@ #include #include -static DEFINE_SPINLOCK(hose_spinlock); +DEFINE_SPINLOCK(hose_spinlock); LIST_HEAD(hose_list); /* XXX kill that some day ... */ @@ -95,10 +95,11 @@ void pcibios_free_controller(struct pci_controller *phb) { spin_lock(&hose_spinlock); list_del(&phb->list_node); - spin_unlock(&hose_spinlock); if (phb->is_dynamic) kfree(phb); + + spin_unlock(&hose_spinlock); } EXPORT_SYMBOL_GPL(pcibios_free_controller); diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c index 2d71269..49b5388 100644 --- a/arch/powerpc/kernel/pci-hotplug.c +++ b/arch/powerpc/kernel/pci-hotplug.c @@ -21,6 +21,9 @@ #include #include +extern spinlock_t hose_spinlock; +extern struct list_head hose_list; + static struct pci_bus *find_bus_among_children(struct pci_bus *bus, struct device_node *dn) { @@ -58,12 +61,27 @@ EXPORT_SYMBOL_GPL(pci_find_bus_by_node); */ void pcibios_release_device(struct pci_dev *dev) { - struct pci_controller *phb = pci_bus_to_host(dev->bus); + struct pci_controller *phb = pci_bus_to_host(dev->bus), *e; + int found = 0; eeh_remove_device(dev); - if (phb->controller_ops.release_device) + /* +* Only access phb if it's still in hose_list; otherwise +* it's been freed and may contain corrupt data and oops. +*/ + spin_lock(&hose_spinlock); + list_for_each_entry(e, &hose_list, list_node) { + if (e == phb) { + found = 1; + break; + } + } + + if (found && phb->controller_ops.release_device) phb->controller_ops.release_device(dev); + + spin_unlock(&hose_spinlock); } /** -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: fix oops in pcibios_release_device() after pcibios_free_controller()
On 07/04/2016 11:55 PM, Benjamin Herrenschmidt wrote: Have you considered instead adding a kref to the PHB and only freeing it when all devices have been freed ? Or it's too hard to tract device creation ? Yes, considered it, but felt leery of possibly leaving the PHB unfreed (or block until it is freed) in the call that is supposed to remove it: for example, if it's not freed yet (because of an userspace program that holds references, maybe misbehaving or in error)... - Should the call block? -- in the sense of what would it mean to the DLPAR tools (say, drmgr and hmc) that might see a timeout as failure, and perhaps won't retry it? Or if the administrator requested the adapter/PHB to be removed, maybe he's aware of the situation and actually wants the removal to happen anyway. - If it should not block, the phb struct would be left there at mercy of the reference holders, and perhaps complicate things if the PHB is re-added later on? (say, bring the adapter back to the LPAR) Could the tools think it's already added/still present, and then stop in error? Now, putting the complicated scenarios aside, implementing krefs would touch several other parts of code that I'm less comfortable to change at this moment, which would probably take a lot more time to do. So, for simplicity and schedule/backport/time constraints, I've shot for a less intrusive change that still resolves the problem. Do you think it's an acceptable solution as is, or needs change or to be implemented in a different way? Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/3] dma, nvme, powerpc: introduce and implement DMA_ATTR_NO_WARN
This patchset introduces dma_attr DMA_ATTR_NO_WARN (just like __GFP_NOWARN), which tells the DMA-mapping subsystem to supress allocation failure reports. On some architectures allocation failures are reported with error messages to the system logs. Although this can help to identify and debug problems, drivers which handle failures (eg, retry later) have no problems with them, and can actually flood the system logs with error messages that aren't any problem at all, depending on the implementation of the retry mechanism. So, this provides a way for drivers to avoid those error messages on calls where allocation failures are not a problem, and shouldn't bother the logs. - Patch 1/3 introduces and documents the new dma_attr. - Patch 2/3 implements it on the nvme driver (which might repeatedly trip on allocation failures due to high load, flooding system logs with error messages at least on powerpc: "iommu_alloc failed") - Patch 3/3 implements support for it on powerpc arch (where this problem was observed. It's possible to extend support for more archs if the patchset is welcome). Mauricio Faria de Oliveira (3): dma: introduce DMA_ATTR_NO_WARN nvme: implement DMA_ATTR_NO_WARN powerpc: implement DMA_ATTR_NO_WARN Documentation/DMA-attributes.txt | 17 + arch/powerpc/kernel/iommu.c | 4 ++-- drivers/nvme/host/pci.c | 10 -- include/linux/dma-attrs.h| 1 + 4 files changed, 28 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/3] dma: introduce DMA_ATTR_NO_WARN
Introduce the DMA_ATTR_NO_WARN attribute, and document it. Signed-off-by: Mauricio Faria de Oliveira --- Documentation/DMA-attributes.txt | 17 + include/linux/dma-attrs.h| 1 + 2 files changed, 18 insertions(+) diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt index e8cf9cf..841e771 100644 --- a/Documentation/DMA-attributes.txt +++ b/Documentation/DMA-attributes.txt @@ -126,3 +126,20 @@ means that we won't try quite as hard to get them. NOTE: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, though ARM64 patches will likely be posted soon. + +DMA_ATTR_NO_WARN + + +This tells the DMA-mapping subsystem to supress allocation failure reports +(similarly to __GFP_NOWARN). + +On some architectures allocation failures are reported with error messages +to the system logs. Although this can help to identify and debug problems, +drivers which handle failures (eg, retry later) have no problems with them, +and can actually flood the system logs with error messages that aren't any +problem at all, depending on the implementation of the retry mechanism. + +So, this provides a way for drivers to avoid those error messages on calls +where allocation failures are not a problem, and shouldn't bother the logs. + +NOTE: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h index f3c5aea..0577389 100644 --- a/include/linux/dma-attrs.h +++ b/include/linux/dma-attrs.h @@ -19,6 +19,7 @@ enum dma_attr { DMA_ATTR_SKIP_CPU_SYNC, DMA_ATTR_FORCE_CONTIGUOUS, DMA_ATTR_ALLOC_SINGLE_PAGES, + DMA_ATTR_NO_WARN, DMA_ATTR_MAX, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/3] nvme: implement DMA_ATTR_NO_WARN
Use the DMA_ATTR_NO_WARN attribute on dma_map_sg() calls of nvme driver. Signed-off-by: Mauricio Faria de Oliveira --- drivers/nvme/host/pci.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d1a8259..c3c3348 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -65,6 +66,8 @@ MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes"); static struct workqueue_struct *nvme_workq; +static DEFINE_DMA_ATTRS(nvme_dma_attrs); + struct nvme_dev; struct nvme_queue; @@ -498,7 +501,7 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req, goto out; ret = BLK_MQ_RQ_QUEUE_BUSY; - if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir)) + if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir, &nvme_dma_attrs)) goto out; if (!nvme_setup_prps(dev, req, size)) @@ -516,7 +519,7 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req, if (rq_data_dir(req)) nvme_dif_remap(req, nvme_dif_prep); - if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir)) + if (!dma_map_sg_attrs(dev->dev, &iod->meta_sg, 1, dma_dir, &nvme_dma_attrs)) goto out_unmap; } @@ -2118,6 +2121,9 @@ static int __init nvme_init(void) result = pci_register_driver(&nvme_driver); if (result) destroy_workqueue(nvme_workq); + + dma_set_attr(DMA_ATTR_NO_WARN, &nvme_dma_attrs); + return result; } -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/3] powerpc: implement DMA_ATTR_NO_WARN
Add support for the DMA_ATTR_NO_WARN attribute on powerpc iommu code. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/iommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index a8e3490..69bb17f 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -479,7 +479,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl, /* Handle failure */ if (unlikely(entry == DMA_ERROR_CODE)) { - if (printk_ratelimit()) + if (unlikely(!dma_get_attr(DMA_ATTR_NO_WARN, attrs)) && printk_ratelimit()) dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %lx npages %lu\n", tbl, vaddr, npages); @@ -776,7 +776,7 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, mask >> tbl->it_page_shift, align, attrs); if (dma_handle == DMA_ERROR_CODE) { - if (printk_ratelimit()) { + if (unlikely(!dma_get_attr(DMA_ATTR_NO_WARN, attrs)) && printk_ratelimit()) { dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %p npages %d\n", tbl, vaddr, npages); -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 0/3] dma, nvme, powerpc: introduce and implement DMA_ATTR_NO_WARN
This patchset introduces dma_attr DMA_ATTR_NO_WARN (just like __GFP_NOWARN), which tells the DMA-mapping subsystem to suppress allocation failure reports. On some architectures allocation failures are reported with error messages to the system logs. Although this can help to identify and debug problems, drivers which handle failures (eg, retry later) have no problems with them, and can actually flood the system logs with error messages that aren't any problem at all, depending on the implementation of the retry mechanism. So, this provides a way for drivers to avoid those error messages on calls where allocation failures are not a problem, and shouldn't bother the logs. - Patch 1/3 introduces and documents the new dma_attr. - Patch 2/3 implements it on the nvme driver (which might repeatedly trip on allocation failures due to high load, flooding system logs with error messages at least on powerpc: "iommu_alloc failed") - Patch 3/3 implements support for it on powerpc arch (where this problem was observed. It's possible to extend support for more archs if the patchset is welcome). Changelog: v2: - address warnings from checkpatch.pl (line wrapping and typos) Mauricio Faria de Oliveira (3): dma: introduce DMA_ATTR_NO_WARN nvme: implement DMA_ATTR_NO_WARN powerpc: implement DMA_ATTR_NO_WARN Documentation/DMA-attributes.txt | 17 + arch/powerpc/kernel/iommu.c | 6 -- drivers/nvme/host/pci.c | 12 ++-- include/linux/dma-attrs.h| 1 + 4 files changed, 32 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 1/3] dma: introduce DMA_ATTR_NO_WARN
Introduce the DMA_ATTR_NO_WARN attribute, and document it. Signed-off-by: Mauricio Faria de Oliveira --- Changelog: v2: - address warnings from checkpatch.pl (line wrapping and typos) Documentation/DMA-attributes.txt | 17 + include/linux/dma-attrs.h| 1 + 2 files changed, 18 insertions(+) diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt index e8cf9cf..48150c6 100644 --- a/Documentation/DMA-attributes.txt +++ b/Documentation/DMA-attributes.txt @@ -126,3 +126,20 @@ means that we won't try quite as hard to get them. NOTE: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, though ARM64 patches will likely be posted soon. + +DMA_ATTR_NO_WARN + + +This tells the DMA-mapping subsystem to suppress allocation failure reports +(similarly to __GFP_NOWARN). + +On some architectures allocation failures are reported with error messages +to the system logs. Although this can help to identify and debug problems, +drivers which handle failures (eg, retry later) have no problems with them, +and can actually flood the system logs with error messages that aren't any +problem at all, depending on the implementation of the retry mechanism. + +So, this provides a way for drivers to avoid those error messages on calls +where allocation failures are not a problem, and shouldn't bother the logs. + +NOTE: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h index f3c5aea..0577389 100644 --- a/include/linux/dma-attrs.h +++ b/include/linux/dma-attrs.h @@ -19,6 +19,7 @@ enum dma_attr { DMA_ATTR_SKIP_CPU_SYNC, DMA_ATTR_FORCE_CONTIGUOUS, DMA_ATTR_ALLOC_SINGLE_PAGES, + DMA_ATTR_NO_WARN, DMA_ATTR_MAX, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 2/3] nvme: implement DMA_ATTR_NO_WARN
Use the DMA_ATTR_NO_WARN attribute on dma_map_sg() calls of nvme driver. Signed-off-by: Mauricio Faria de Oliveira Reviewed-by: Gabriel Krisman Bertazi --- Changelog: v2: - address warnings from checkpatch.pl (line wrapping and typos) drivers/nvme/host/pci.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d1a8259..a7ccad8 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -65,6 +66,8 @@ MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes"); static struct workqueue_struct *nvme_workq; +static DEFINE_DMA_ATTRS(nvme_dma_attrs); + struct nvme_dev; struct nvme_queue; @@ -498,7 +501,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req, goto out; ret = BLK_MQ_RQ_QUEUE_BUSY; - if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir)) + if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir, + &nvme_dma_attrs)) goto out; if (!nvme_setup_prps(dev, req, size)) @@ -516,7 +520,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req, if (rq_data_dir(req)) nvme_dif_remap(req, nvme_dif_prep); - if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir)) + if (!dma_map_sg_attrs(dev->dev, &iod->meta_sg, 1, dma_dir, + &nvme_dma_attrs)) goto out_unmap; } @@ -2118,6 +2123,9 @@ static int __init nvme_init(void) result = pci_register_driver(&nvme_driver); if (result) destroy_workqueue(nvme_workq); + + dma_set_attr(DMA_ATTR_NO_WARN, &nvme_dma_attrs); + return result; } -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 3/3] powerpc: implement DMA_ATTR_NO_WARN
Add support for the DMA_ATTR_NO_WARN attribute on powerpc iommu code. Signed-off-by: Mauricio Faria de Oliveira --- Changelog: v2: - address warnings from checkpatch.pl (line wrapping and typos) arch/powerpc/kernel/iommu.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index a8e3490..f1e20ea 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -479,7 +479,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl, /* Handle failure */ if (unlikely(entry == DMA_ERROR_CODE)) { - if (printk_ratelimit()) + if (unlikely(!dma_get_attr(DMA_ATTR_NO_WARN, attrs)) && + printk_ratelimit()) dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %lx npages %lu\n", tbl, vaddr, npages); @@ -776,7 +777,8 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, mask >> tbl->it_page_shift, align, attrs); if (dma_handle == DMA_ERROR_CODE) { - if (printk_ratelimit()) { + if (unlikely(!dma_get_attr(DMA_ATTR_NO_WARN, attrs)) && + printk_ratelimit()) { dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %p npages %d\n", tbl, vaddr, npages); -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/3] nvme: implement DMA_ATTR_NO_WARN
On 07/06/2016 09:41 PM, Gabriel Krisman Bertazi wrote: checkpatch.pl complains about line wrapping. Other than that, this looks good to me. I'll submit a v2 w/ that and typos fixed. Thanks for reviewing. -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 2/3] nvme: implement DMA_ATTR_NO_WARN
On 07/08/2016 04:54 AM, Masayoshi Mizuma wrote: Here, I think the error messages should not be suppressed because the return value of nvme_map_data() is BLK_MQ_RQ_QUEUE_ERROR, so the IO returns as -EIO. Agree; good point. fixed in v3. Thanks for reviewing. -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 0/3] dma, nvme, powerpc: introduce and implement DMA_ATTR_NO_WARN
This patchset introduces dma_attr DMA_ATTR_NO_WARN (just like __GFP_NOWARN), which tells the DMA-mapping subsystem to suppress allocation failure reports. On some architectures allocation failures are reported with error messages to the system logs. Although this can help to identify and debug problems, drivers which handle failures (eg, retry later) have no problems with them, and can actually flood the system logs with error messages that aren't any problem at all, depending on the implementation of the retry mechanism. So, this provides a way for drivers to avoid those error messages on calls where allocation failures are not a problem, and shouldn't bother the logs. - Patch 1/3 introduces and documents the new dma_attr. - Patch 2/3 implements it on the nvme driver (which might repeatedly trip on allocation failures due to high load, flooding system logs with error messages at least on powerpc: "iommu_alloc failed") - Patch 3/3 implements support for it on powerpc arch (where this problem was observed. It's possible to extend support for more archs if the patchset is welcome). Changelog: v3: - nvme: use DMA_ATTR_NO_WARN when ret = BLK_MQ_RQ_QUEUE_BUSY (io will be requeued) but not when ret = BLK_MQ_RQ_QUEUE_ERROR (io will be failed). thanks: Masayoshi Mizuma v2: - all: address warnings from checkpatch.pl (line wrapping and typos) Mauricio Faria de Oliveira (3): dma: introduce DMA_ATTR_NO_WARN nvme: implement DMA_ATTR_NO_WARN powerpc: implement DMA_ATTR_NO_WARN Documentation/DMA-attributes.txt | 17 + arch/powerpc/kernel/iommu.c | 6 -- drivers/nvme/host/pci.c | 9 - include/linux/dma-attrs.h| 1 + 4 files changed, 30 insertions(+), 3 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/3] dma: introduce DMA_ATTR_NO_WARN
Introduce the DMA_ATTR_NO_WARN attribute, and document it. Signed-off-by: Mauricio Faria de Oliveira --- Changelog: v3: - dma: none. v2: - all: address warnings from checkpatch.pl (line wrapping and typos) Documentation/DMA-attributes.txt | 17 + include/linux/dma-attrs.h| 1 + 2 files changed, 18 insertions(+) diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt index e8cf9cf..48150c6 100644 --- a/Documentation/DMA-attributes.txt +++ b/Documentation/DMA-attributes.txt @@ -126,3 +126,20 @@ means that we won't try quite as hard to get them. NOTE: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, though ARM64 patches will likely be posted soon. + +DMA_ATTR_NO_WARN + + +This tells the DMA-mapping subsystem to suppress allocation failure reports +(similarly to __GFP_NOWARN). + +On some architectures allocation failures are reported with error messages +to the system logs. Although this can help to identify and debug problems, +drivers which handle failures (eg, retry later) have no problems with them, +and can actually flood the system logs with error messages that aren't any +problem at all, depending on the implementation of the retry mechanism. + +So, this provides a way for drivers to avoid those error messages on calls +where allocation failures are not a problem, and shouldn't bother the logs. + +NOTE: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h index f3c5aea..0577389 100644 --- a/include/linux/dma-attrs.h +++ b/include/linux/dma-attrs.h @@ -19,6 +19,7 @@ enum dma_attr { DMA_ATTR_SKIP_CPU_SYNC, DMA_ATTR_FORCE_CONTIGUOUS, DMA_ATTR_ALLOC_SINGLE_PAGES, + DMA_ATTR_NO_WARN, DMA_ATTR_MAX, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 2/3] nvme: implement DMA_ATTR_NO_WARN
Use the DMA_ATTR_NO_WARN attribute on dma_map_sg() calls of nvme driver. Signed-off-by: Mauricio Faria de Oliveira Reviewed-by: Gabriel Krisman Bertazi --- Changelog: v3: - nvme: use DMA_ATTR_NO_WARN when ret = BLK_MQ_RQ_QUEUE_BUSY (io will be requeued) but not when ret = BLK_MQ_RQ_QUEUE_ERROR (io will be failed). thanks: Masayoshi Mizuma v2: - all: address warnings from checkpatch.pl (line wrapping and typos) drivers/nvme/host/pci.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d1a8259..187aa6b 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -65,6 +66,8 @@ MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes"); static struct workqueue_struct *nvme_workq; +static DEFINE_DMA_ATTRS(nvme_dma_attrs); + struct nvme_dev; struct nvme_queue; @@ -498,7 +501,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req, goto out; ret = BLK_MQ_RQ_QUEUE_BUSY; - if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir)) + if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir, + &nvme_dma_attrs)) goto out; if (!nvme_setup_prps(dev, req, size)) @@ -2118,6 +2122,9 @@ static int __init nvme_init(void) result = pci_register_driver(&nvme_driver); if (result) destroy_workqueue(nvme_workq); + + dma_set_attr(DMA_ATTR_NO_WARN, &nvme_dma_attrs); + return result; } -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 3/3] powerpc: implement DMA_ATTR_NO_WARN
Add support for the DMA_ATTR_NO_WARN attribute on powerpc iommu code. Signed-off-by: Mauricio Faria de Oliveira --- Changelog: v3: - powerpc: none v2: - all: address warnings from checkpatch.pl (line wrapping and typos) arch/powerpc/kernel/iommu.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index a8e3490..f1e20ea 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -479,7 +479,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl, /* Handle failure */ if (unlikely(entry == DMA_ERROR_CODE)) { - if (printk_ratelimit()) + if (unlikely(!dma_get_attr(DMA_ATTR_NO_WARN, attrs)) && + printk_ratelimit()) dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %lx npages %lu\n", tbl, vaddr, npages); @@ -776,7 +777,8 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, mask >> tbl->it_page_shift, align, attrs); if (dma_handle == DMA_ERROR_CODE) { - if (printk_ratelimit()) { + if (unlikely(!dma_get_attr(DMA_ATTR_NO_WARN, attrs)) && + printk_ratelimit()) { dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %p npages %d\n", tbl, vaddr, npages); -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: fix oops in pcibios_release_device() after pcibios_free_controller()
Ben, On 07/04/2016 11:55 PM, Benjamin Herrenschmidt wrote: Have you considered instead adding a kref to the PHB and only freeing it when all devices have been freed ? Or it's too hard to tract device creation ? Can you clarify which are the devices that should be tracked w/ krefs to the PHB? I've been wondering if it's just the root bus (phb->bus) -- which relays on it (ie its phb->bus->children and phb->bus->devices) being eventually freed in order to free the phb, or perhaps track the children & devices directly. If that's too far from sensible, can you point some interesting places to look at? I've read much of arch/powerpc/kernel/pci-{common,hotplug}.c and arch/powerpc/include/asm/pci-bridge.h, and some more in drivers/pci, but things weren't as obvious to a newcomer in this area. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: fix oops in pcibios_release_device() after pcibios_free_controller()
On 07/12/2016 08:07 PM, Mauricio Faria de Oliveira wrote: Can you clarify which are the devices that should be tracked w/ krefs to the PHB? Last night I had forgotten about the fundamental point of krefs - track references to pointers - and this answers the question. I'm looking at the holders of pointers to the phb struct, and it seems I wasn't too far off w/ the child buses and devices idea -- as the root bus (phb->bus) is assigned the phb pointer in the bus->sysdata field, and it's inherited from parent by child buses and devices. I'll continue here. Comments always welcome. -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/3] dma-mapping: introduce the DMA_ATTR_NO_WARN attribute
Introduce the DMA_ATTR_NO_WARN attribute, and document it. Signed-off-by: Mauricio Faria de Oliveira --- Documentation/DMA-attributes.txt | 17 + include/linux/dma-mapping.h | 5 + 2 files changed, 22 insertions(+) diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt index 2d455a5..98bf7ac 100644 --- a/Documentation/DMA-attributes.txt +++ b/Documentation/DMA-attributes.txt @@ -126,3 +126,20 @@ means that we won't try quite as hard to get them. NOTE: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, though ARM64 patches will likely be posted soon. + +DMA_ATTR_NO_WARN + + +This tells the DMA-mapping subsystem to suppress allocation failure reports +(similarly to __GFP_NOWARN). + +On some architectures allocation failures are reported with error messages +to the system logs. Although this can help to identify and debug problems, +drivers which handle failures (eg, retry later) have no problems with them, +and can actually flood the system logs with error messages that aren't any +problem at all, depending on the implementation of the retry mechanism. + +So, this provides a way for drivers to avoid those error messages on calls +where allocation failures are not a problem, and shouldn't bother the logs. + +NOTE: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 66533e1..6efbd27 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -56,6 +56,11 @@ * that gives better TLB efficiency. */ #define DMA_ATTR_ALLOC_SINGLE_PAGES(1UL << 7) +/* + * DMA_ATTR_NO_WARN: This tells the DMA-mapping subsystem to suppress + * allocation failure reports (similarly to __GFP_NOWARN). + */ +#define DMA_ATTR_NO_WARN (1UL << 8) /* * A dma_addr_t can hold any valid DMA or bus address for the platform. -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4 0/3] dma-mapping, powerpc, nvme: introduce the DMA_ATTR_NO_WARN attribute
This patchset introduces dma_attr DMA_ATTR_NO_WARN (just like __GFP_NOWARN), which tells the DMA-mapping subsystem to suppress allocation failure reports. On some architectures allocation failures are reported with error messages to the system logs. Although this can help to identify and debug problems, drivers which handle failures (eg, retry later) have no problems with them, and can actually flood the system logs with error messages that aren't any problem at all, depending on the implementation of the retry mechanism. So, this provides a way for drivers to avoid those error messages on calls where allocation failures are not a problem, and shouldn't bother the logs. - Patch 1/3 introduces the dma_attr DMA_ATTR_NO_WARN. - Patch 2/3 implements support for it on powerpc arch (where this problem was observed; it's possible to extend support for more archs) - Patch 3/3 implements it on the nvme driver (which might repeatedly trip on allocation failures due to high load, flooding system logs with error messages at least on powerpc: "iommu_alloc failed") Changelog: v4: - rebase for commit 53a4b60 dma-mapping: use unsigned long for dma_attrs. - reorder patches 2/3 and 3/3. v3: - nvme: use DMA_ATTR_NO_WARN when ret = BLK_MQ_RQ_QUEUE_BUSY (io will be requeued) but not when ret = BLK_MQ_RQ_QUEUE_ERROR (io will be failed). thanks: Masayoshi Mizuma v2: - all: address warnings from checkpatch.pl (line wrapping and typos) Tested on next-20160801. Mauricio Faria de Oliveira (3): dma-mapping: introduce the DMA_ATTR_NO_WARN attribute powerpc: implement the DMA_ATTR_NO_WARN attribute nvme: use the DMA_ATTR_NO_WARN attribute Documentation/DMA-attributes.txt | 17 + arch/powerpc/kernel/iommu.c | 6 -- drivers/nvme/host/pci.c | 3 ++- include/linux/dma-mapping.h | 5 + 4 files changed, 28 insertions(+), 3 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/3] nvme: use the DMA_ATTR_NO_WARN attribute
Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR). Signed-off-by: Mauricio Faria de Oliveira Reviewed-by: Gabriel Krisman Bertazi --- drivers/nvme/host/pci.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d7c33f9..eeaeae0 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -503,7 +503,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req, goto out; ret = BLK_MQ_RQ_QUEUE_BUSY; - if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir)) + if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir, + DMA_ATTR_NO_WARN)) goto out; if (!nvme_setup_prps(dev, req, size)) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/3] powerpc: implement the DMA_ATTR_NO_WARN attribute
Add support for the DMA_ATTR_NO_WARN attribute on powerpc iommu code. Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/iommu.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 37d6e74..5f202a5 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -479,7 +479,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl, /* Handle failure */ if (unlikely(entry == DMA_ERROR_CODE)) { - if (printk_ratelimit()) + if (!(attrs & DMA_ATTR_NO_WARN) && + printk_ratelimit()) dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %lx npages %lu\n", tbl, vaddr, npages); @@ -776,7 +777,8 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, mask >> tbl->it_page_shift, align, attrs); if (dma_handle == DMA_ERROR_CODE) { - if (printk_ratelimit()) { + if (!(attrs & DMA_ATTR_NO_WARN) && + printk_ratelimit()) { dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %p npages %d\n", tbl, vaddr, npages); -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
Hi Ben, On 06/13/2016 06:26 PM, Benjamin Herrenschmidt wrote: Another option would be to use a dma_attr for silencing mapping errors which NVME could use provided it does handle them gracefully ... I recently submitted patches that implement your suggestion [1]. May you please review/comment if they're OK with you? Thanks! [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/146850.html -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
On 08/03/2016 06:34 PM, Benjamin Herrenschmidt wrote: I think this is best done by the relevant community maintainer, I just threw an idea but I'm not that familiar with the details:-) Ok, sure; got it. Did you send them to the lkml list ? Yup, plus a few others lists from get_maintainer.pl iirc. Mailing list archive links: - linux-kernel: http://marc.info/?l=linux-kernel&m=146798084822100&w=2 - linux-doc: http://marc.info/?l=linux-doc&m=146798085522104&w=2 - linux-nvme: http://lists.infradead.org/pipermail/linux-nvme/2016-July/005349.html - linuxppc-dev: https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-July/145624.html Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 0/2] powerpc: fix oops in pcibios_release_device() after pcibios_free_controller()
This patchset addresses the problem/suggestion discussed previously [1], by adding a kref reference counting to the PHB (struct pci_controller): > It's possible to hit an oops/crash if pcibios_release_device() accesses the > phb struct and it had been freed earlier -- by pcibios_free_controller() -- > as the memory it pointed to can be reused. > If after reuse 'phb->controller_ops.release_device' is non-NULL it will be > called, but it points to an invalid location (that function pointer is not > set anywhere in the code, so if it's non-NULL, that's not correct), and so > it hits an oops and the system crashes. > That problem can happen with the pSeries platform's DLPAR remove operation > if references to devices are held until after the pcibios_free_controller() > function runs, and then released - exercising pcibios_release_device() path. More problem details/call trace are described in the original submission [1]. With the patch applied (tested on 4.8-rc1), the test-case demonstrates that the PHB is only released/freed after the last reference to the PCI device(s) is dropped: Enable debugging messages: # echo 'file pci-common.c +pf; file pci-hotplug.c +pf' > /sys/kernel/debug/dynamic_debug/control # echo 8 > /proc/sys/kernel/printk Hold references to both PCI devices in the PHB: # ls -ld /sys/block/sd* | grep -m1 0021:01:00.0 <...> /sys/block/sdaa -> ../devices/pci0021:01/0021:01:00.0/<...> # ls -ld /sys/block/sd* | grep -m1 0021:01:00.1 <...> /sys/block/sdab -> ../devices/pci0021:01/0021:01:00.1/<...> # cat > /dev/sdaa & pid1=$! # cat > /dev/sdab & pid2=$! Perform DLPAR remove of the PHB: # drmgr -w 5 -d 1 -c phb -s 'PHB 33' -r Validating PHB DLPAR capability...yes. [ 888.776964] pci_hp_remove_devices: PCI: Removing devices on bus 0021:01 [ 888.776983] pci_hp_remove_devices:Removing 0021:01:00.0... ... [ 893.696431] pci_hp_remove_devices:Removing 0021:01:00.1... ... [ 908.352717] pci_bus 0021:01: busn_res: [bus 01-ff] is released [ 908.352744] pcibios_remove_bus: PCI 0021:01, pci_bus c001e7d59400, phb c001e7d57400 [ 908.352753] controller_put: PCI domain 33, phb c001e7d57400 [ 908.352811] pcibios_free_controller: PCI domain 33, phb c001e7d57400, phb->is_dynamic 1 [ 908.352820] controller_put: PCI domain 33, phb c001e7d57400 [ 908.352832] rpadlpar_io: slot PHB 33 removed Notice the PHB was not freed yet (controller_free() was not called) Drop the last references to the PCI devices: # kill -9 $pid1 [ 991.221998] pcibios_release_device: PCI 0021:01:00.0, pci_dev c001ee0b7000, phb c001e7d57400 [ 991.222005] controller_put: PCI domain 33, phb c001e7d57400 # kill -9 $pid2 [ 996.076293] pcibios_release_device: PCI 0021:01:00.1, pci_dev c001ee0b3800, phb c001e7d57400 [ 996.076299] controller_put: PCI domain 33, phb c001e7d57400 [ 996.076303] controller_free: PCI domain: 33, phb c001e7d57400, phb->is_dynamic 1 Notice that only now the PHB was freed. Note: this patchset currently covers references from struct pci_dev/pci_bus, which _is_ enough to resolve this particular problem; it does not yet cover references from struct pci_dn/eeh_pe/eeh_dev (but since those are unchanged by/unrelated to this patchset, they remain working in the very same manner). I have gone to great lengths in time studying the relevant code for EEH in order to implement those too, but am not yet sure of all the details (e.g., lifetime of eeh_dev, removal of pci_dn, etc) that need to be considered to kfree() them - will likely ask Gavin & maintainers for RFC after some time. Links: [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-July/145264.html Changelog: v2: change approach to use krefs (suggestion by benh & mpe). Mauricio Faria de Oliveira (2): powerpc: add refcount to struct pci_controller powerpc: update pci_controller.refcount for PCI devices and buses arch/powerpc/include/asm/pci-bridge.h | 15 arch/powerpc/kernel/pci-common.c | 72 +-- arch/powerpc/kernel/pci-hotplug.c | 29 ++ arch/powerpc/kernel/pci_of_scan.c | 1 + 4 files changed, 114 insertions(+), 3 deletions(-) -- 1.8.3.1
[PATCH v2 1/2] powerpc: add refcount to struct pci_controller
This commit introduces the 'refcount' field in struct pci_controller, along with the corresponding functions 'controller_(get|put|free)()'. The functions 'pcibios_(alloc|free)_controller()' are modified to use that in a backwards compatible manner. (i.e., kfree(phb) is performed when pcibios_free_controller() is called.) So, this patch adds the infrastructure with no functional differences to current users of pcibios_(alloc|free)_controller(). Notably, only the pseries platform calls pcibios_free_controller() for some purpose other than to release the pci_controller in case of errors just after the call to pcibios_alloc_controller() (i.e., 'goto error' scenarios). Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/pci-bridge.h | 15 +++ arch/powerpc/kernel/pci-common.c | 47 --- 2 files changed, 59 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index b5e88e4..6fde0a9 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -10,6 +10,7 @@ #include #include #include +#include struct device_node; @@ -128,9 +129,23 @@ struct pci_controller { struct pci_dn *pci_data; #endif /* CONFIG_PPC64 */ + /* +* Reference counting for the structures: +* - TODO pci_dev +* - TODO pci_bus +* - TODO pci_dn +* - TODO eeh_pe +* - TODO eeh_dev +*/ + struct kref refcount; + void *private_data; }; +void controller_get(struct pci_controller *phb); +void controller_put(struct pci_controller *phb); +void controller_free(struct kref *kref); + /* These are used for config access before all the PCI probing has been done. */ extern int early_read_config_byte(struct pci_controller *hose, int bus, diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 08afddf..29b37d3 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -114,6 +114,8 @@ struct pci_controller *pcibios_alloc_controller(struct device_node *dev) phb = zalloc_maybe_bootmem(sizeof(struct pci_controller), GFP_KERNEL); if (phb == NULL) return NULL; + + kref_init(&phb->refcount); /* use first reference for hose_list entry */ spin_lock(&hose_spinlock); phb->global_number = get_phb_number(dev); list_add_tail(&phb->list_node, &hose_list); @@ -130,12 +132,53 @@ struct pci_controller *pcibios_alloc_controller(struct device_node *dev) PHB_SET_NODE(phb, nid); } #endif + + pr_debug("PCI domain %d, phb %p, phb->is_dynamic %d\n", + phb->global_number, phb, phb->is_dynamic); + return phb; } EXPORT_SYMBOL_GPL(pcibios_alloc_controller); +void controller_get(struct pci_controller *phb) +{ + if (unlikely(!phb)) { + pr_warn("%s: null PHB; refcount bug!\n", __func__); + WARN_ON(1); + } else { + pr_debug("PCI domain %d, phb %p\n", phb->global_number, phb); + kref_get(&phb->refcount); + } +} + +void controller_put(struct pci_controller *phb) +{ + if (unlikely(!phb)) { + pr_warn("%s: null PHB; refcount bug!\n", __func__); + WARN_ON(1); + } else { + pr_debug("PCI domain %d, phb %p\n", phb->global_number, phb); + kref_put(&phb->refcount, controller_free); + } +} + +void controller_free(struct kref *kref) +{ + struct pci_controller *phb = container_of(kref, struct pci_controller, + refcount); + + pr_info("%s: PCI domain: %d, phb %p, phb->is_dynamic %d\n", + __func__, phb->global_number, phb, phb->is_dynamic); + + if (phb->is_dynamic) + kfree(phb); +} + void pcibios_free_controller(struct pci_controller *phb) { + pr_debug("PCI domain %d, phb %p, phb->is_dynamic %d\n", + phb->global_number, phb, phb->is_dynamic); + spin_lock(&hose_spinlock); /* Clear bit of phb_bitmap to allow reuse of this PHB number. */ @@ -143,10 +186,8 @@ void pcibios_free_controller(struct pci_controller *phb) clear_bit(phb->global_number, phb_bitmap); list_del(&phb->list_node); + controller_put(phb); spin_unlock(&hose_spinlock); - - if (phb->is_dynamic) - kfree(phb); } EXPORT_SYMBOL_GPL(pcibios_free_controller); -- 1.8.3.1
[PATCH v2 2/2] powerpc: update pci_controller.refcount for PCI devices and buses
This patch employs the refcount in struct pci_controller to track the references from PCI devices and buses (struct pci_dev/pci_bus). In order to do that without modifying any PCI scan/probe approach (e.g., PCI_PROBE_DEVTREE and PCI_PROBE_NORMAL), it leverages some of the PCI arch-specific callback: pci_(add|release)_device() and pci_(add|remove)_bus(). (a small change is required for PCI_PROBE_DEVTREE, which makes it consistent with PCI_PROBE_NORMAL - the pci_dev should inherit the parent pci_bus's phb pointer - see pci_setup_device() in probe.c) This also has the advantage that locking for kref_(get|put)() is satisfied by the 'pci_rescan_remove_lock' mutex, which is normal practice for usage of the PCI subsystem - thus already in place. More details added in comment on pcibios_release_device(). Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/include/asm/pci-bridge.h | 4 ++-- arch/powerpc/kernel/pci-common.c | 25 + arch/powerpc/kernel/pci-hotplug.c | 29 + arch/powerpc/kernel/pci_of_scan.c | 1 + 4 files changed, 57 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 6fde0a9..d10eee3 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -131,8 +131,8 @@ struct pci_controller { /* * Reference counting for the structures: -* - TODO pci_dev -* - TODO pci_bus +* - pci_dev +* - pci_bus * - TODO pci_dn * - TODO eeh_pe * - TODO eeh_dev diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 29b37d3..c55e9c0 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -1047,6 +1047,17 @@ static void pcibios_setup_device(struct pci_dev *dev) int pcibios_add_device(struct pci_dev *dev) { + struct pci_controller *phb = pci_bus_to_host(dev->bus); + + pr_debug("PCI %s, pci_dev %p, phb %p\n", dev_name(&dev->dev), dev, phb); + + if (!phb) + pr_warn("%s: PCI device %s has null PHB; refcount bug!", + __func__, dev_name(&dev->dev)); /* WARN_ON ahead */ + + /* locking: see comment on pcibios_release_device(). */ + controller_get(phb); + /* * We can only call pcibios_setup_device() after bus setup is complete, * since some of the platform specific DMA setup code depends on it. @@ -1062,6 +1073,20 @@ int pcibios_add_device(struct pci_dev *dev) return 0; } +void pcibios_add_bus(struct pci_bus *bus) +{ + struct pci_controller *phb = pci_bus_to_host(bus); + + pr_debug("PCI %s, pci_bus %p, phb %p\n", dev_name(&bus->dev), bus, phb); + + if (!phb) + pr_warn("%s: PCI bus %s has null PHB; refcount bug!", + __func__, dev_name(&bus->dev)); /* WARN_ON ahead */ + + /* locking: see comment on pcibios_release_device(). */ + controller_get(phb); +} + void pcibios_setup_bus_devices(struct pci_bus *bus) { struct pci_dev *dev; diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c index 2d71269..24d1a2a 100644 --- a/arch/powerpc/kernel/pci-hotplug.c +++ b/arch/powerpc/kernel/pci-hotplug.c @@ -55,15 +55,44 @@ EXPORT_SYMBOL_GPL(pci_find_bus_by_node); * @dev: PCI device * * The function is called before releasing the indicated PCI device. + * + * The locking for kref_get() and kref_put() of the PHB/pci_controller + * in pcibios_(add|release)_device() and pcibios_(add|remove)_bus() is + * satisfied by the pci_rescan_remove_lock mutex (required for rescan/ + * remove paths of PCI devices/buses; the scan path doesn't require it, + * as there is only addition of devices/buses - no removal at all.) */ void pcibios_release_device(struct pci_dev *dev) { struct pci_controller *phb = pci_bus_to_host(dev->bus); + pr_debug("PCI %s, pci_dev %p, phb %p\n", dev_name(&dev->dev), dev, phb); + eeh_remove_device(dev); if (phb->controller_ops.release_device) phb->controller_ops.release_device(dev); + + if (unlikely(!phb)) + pr_warn("%s: PCI device %s has null PHB; refcount bug!", + __func__, dev_name(&dev->dev)); /* WARN_ON ahead */ + + /* locking: see comment on pcibios_release_device(). */ + controller_put(phb); +} + +void pcibios_remove_bus(struct pci_bus *bus) +{ + struct pci_controller *phb = pci_bus_to_host(bus); + + pr_debug("PCI %s, pci_bus %p, phb %p\n", dev_name(&bus->dev), bus, phb); + + if (unlikely(!phb)) + pr_warn("%s: PCI bus %s has null PHB; refcount bug!", + __func__, dev_name(&a
Re: [PATCH v2 1/2] powerpc: add refcount to struct pci_controller
On 08/09/2016 10:45 PM, Andrew Donnellan wrote: [snip] Notably, only the pseries platform calls pcibios_free_controller() for some purpose other than to release the pci_controller in case of errors just after the call to pcibios_alloc_controller() (i.e., 'goto error' scenarios). cxl's vPHB API also uses pcibios_free_controller() [snip] Cool. I see I missed this report line from grep; thanks. I was mostly biased at arch/powerpc/ and driver/pci/ these days. I'm currently working on a cxl defect found by an IBM test team where we run into this - will review this patch more thoroughly and test it shortly. That's great; thanks! -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v2 2/2] powerpc: update pci_controller.refcount for PCI devices and buses
On 08/10/2016 12:35 AM, Andrew Donnellan wrote: if (phb->controller_ops.release_device) phb->controller_ops.release_device(dev); + +if (unlikely(!phb)) +pr_warn("%s: PCI device %s has null PHB; refcount bug!", +__func__, dev_name(&dev->dev)); /* WARN_ON ahead */ This should probably go before trying to dereference phb->controller_ops above? You're right; I misplaced this check; will fix that. Just a bit of explanation/history: While trying to understand why I didn't hit that null dereference when I initially hit the WARN_ON (the reason for the 'small change' in the commit description), I found that back then I checked for 'pci_dev->sysdata' directly (not 'phb' -- early stages of the patch). The former indeed was null, as it didn't inherit 'pci_bus->sysdata' on pseries, but the code uses 'phb = dev->bus->sysdata' (and this was not null as pci_bus->sysdata was actually set). -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v2 1/2] powerpc: add refcount to struct pci_controller
On 08/09/2016 10:45 PM, Andrew Donnellan wrote: I'm currently working on a cxl defect found by an IBM test team where we run into this - will review this patch more thoroughly and test it shortly. Gavin provided a review/suggestions via chat, pointing to rely on the refcount that already exists in the PCI subsystem (not reinvent another) and leverage the release of the PCI root bus -- which is much simpler! He replied there should be no problems w/ the EEH reset path (PCI root bus not released) nor w/ other structs w/ refs to the PHB (PCI DN, EEH PE, EEH DEV). I'll go down that path for a PATCH v3. -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH v3] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
This patch leverages 'struct pci_host_bridge' from the PCI subsystem in order to free the pci_controller only after the last reference to its devices is dropped (avoiding an oops in pcibios_release_device() if the last reference is dropped after pcibios_free_controller()). The patch relies on pci_host_bridge.release_fn() (and .release_data), which is called automatically by the PCI subsystem when the root bus is released (i.e., the last reference is dropped). Those fields are set via pci_set_host_bridge_release() (e.g. in the platform-specific implementation of pcibios_root_bridge_prepare()). It introduces the 'pcibios_host_bridge_release()' function to be set as .release_fn(), which expects .release_data to hold the pointer to the pci_controller to kfree(). It enables that functionality for pseries (although it isn't platform -specific, and may be used by cxl). It keeps pcibios_free_controller() backwards-compatible (i.e., kfree(phb) in it) in case no .release_fn() is defined for the pci_controller. Details on not-so-elegant design choices: - Added 'pci_controller.bridge' field (pointer to associated 'struct pci_host_bridge') so *not* to use 'pci_find_host_bridge(phb->bus)' in pcibios_free_controller(). That's because remove_phb_dynamic() sets 'phb->bus = NULL' before pcibios_free_controller(). That seems to be very important, with commit title 'powerpc/pci: Fix various pseries PCI hotplug issues' (so I'll not remove it just to avoid this null pointer dereference). - Used 'pci_host_bridge.release_data' field (pointer to associated 'struct pci_controller') so *not* to 'pci_bus_to_host(bridge->bus)' in pcibios_host_bridge_release(). That's because pci_remove_root_bus() sets 'host_bridge->bus = NULL' (so, if the last reference is released after pci_remove_root_bus() runs, which eventually reaches pcibios_host_bridge_release(), that would hit a null pointer dereference). The cxl/vphb.c code calls pci_remove_root_bus(), and the cxl folks are interested in this fix. Test-case: # ls -ld /sys/block/sd* | grep -m1 0021:01:00.0 <...> /sys/block/sdaa -> ../devices/pci0021:01/0021:01:00.0/<...> # ls -ld /sys/block/sd* | grep -m1 0021:01:00.1 <...> /sys/block/sdab -> ../devices/pci0021:01/0021:01:00.1/<...> # cat >/dev/sdaa & pid1=$! # cat >/dev/sdab & pid2=$! # drmgr -w 5 -d 1 -c phb -s 'PHB 33' -r Validating PHB DLPAR capability...yes. [ 479.547020] pci_hp_remove_devices: PCI: Removing devices on bus 0021:01 [ 479.547049] pci_hp_remove_devices:Removing 0021:01:00.0... ... [ 483.536303] pci_hp_remove_devices:Removing 0021:01:00.1... ... [ 497.072130] pci_bus 0021:01: busn_res: [bus 01-ff] is released [ 497.072209] rpadlpar_io: slot PHB 33 removed # kill -9 $pid1 # kill -9 $pid2 [ 506.604458] pcibios_host_bridge_release: domain 33, dynamic 1 Suggested-By: Gavin Shan Signed-off-by: Mauricio Faria de Oliveira Changelog: - v3: different approach: struct pci_host_bridge.release_fn() - v2: different approach: struct pci_controller.refcount --- arch/powerpc/include/asm/pci-bridge.h | 2 ++ arch/powerpc/kernel/pci-common.c | 15 ++- arch/powerpc/platforms/pseries/pci.c | 3 +++ 3 files changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index b5e88e4..9b11631 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -54,6 +54,7 @@ struct pci_controller_ops { */ struct pci_controller { struct pci_bus *bus; + struct pci_host_bridge *bridge; /* associated 'PHB' in PCI subsystem */ char is_dynamic; #ifdef CONFIG_PPC64 int node; @@ -301,6 +302,7 @@ extern void pci_process_bridge_OF_ranges(struct pci_controller *hose, /* Allocate & free a PCI host bridge structure */ extern struct pci_controller *pcibios_alloc_controller(struct device_node *dev); extern void pcibios_free_controller(struct pci_controller *phb); +extern void pcibios_host_bridge_release(struct pci_host_bridge *bridge); #ifdef CONFIG_PCI extern int pcibios_vaddr_is_ioport(void __iomem *address); diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index a5c0153..c5b5f60 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -145,11 +145,23 @@ void pcibios_free_controller(struct pci_controller *phb) list_del(&phb->list_node); spin_unlock(&hose_spinlock); - if (phb->is_dynamic) + /* if the associated pci_host_bridge has a release_fn(), rely on that. */ + if (!phb->bridge->release_fn && phb->is_dynamic) kfree(phb); } EXP
Re: [PATCH v2 1/2] powerpc: add refcount to struct pci_controller
On 08/10/2016 10:53 AM, Mauricio Faria de Oliveira wrote: I'll go down that path for a PATCH v3. That is, 'powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)' -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v3] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
On 08/10/2016 06:45 PM, Mauricio Faria de Oliveira wrote: Changelog: - v3: different approach: struct pci_host_bridge.release_fn() - v2: different approach: struct pci_controller.refcount Oops, the v3 submission has no cover letter, so the subject changed a bit from what was in v2. This is the old subject & archive link: [PATCH v2 0/2] powerpc: fix oops in pcibios_release_device() after pcibios_free_controller() [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/147351.html -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v3] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
On 08/11/2016 02:01 AM, Andrew Donnellan wrote: In cxl, we currently call: pci_remove_root_bus(phb->bus); pcibios_free_controller(phb); which appears to break with this patch after I wire up pci_set_host_bridge_release() in cxl, as phb can be freed before we call pcibios_free_controller(). Ugh; you're right. I believe the user is expected to use either one way or another, but now I see it's not that intuitive -- a design fault. I'll address this w/ the other review/suggestion by Gavin; replying it. Missing a '---' here :) Changelog: Ok, thanks! -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v3] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
Hi Gavin, tl;dr: thanks for the comments & suggestions; i'll submit v4. On 08/11/2016 03:40 AM, Gavin Shan wrote: [added some line breaks] It seems the user has two options here: (1) Setup bridge's release_fn() and call pcibios_free_controller() explicitly; I think the v3 design was non-intuitive in that point -- it does not seem right for an user to use both options: if release_fn() is set and is called before pcibios_free_controller() (normal case w/ DLPAR/PCI hotplug/cxl, as buses/devices are supposed to be removed before the controller is released) the latter will use an invalid 'phb' pointer. (what Andrew reported) In that scenario, it's not even possible for pcibios_free_controller() to try to detect if release_fn() was already run or not, as the only information it has is the 'phb' pointer, which may be invalid. So, I believe the elegant way out of this is your suggestion to have "immediate or deferred release" and make the user *choose* either one. Obviously, let's make this explicit to the user -- w/ rename & comments. > (2) Call pcibios_free_controller() without a valid bridge's release_fn() initialized. Ok, that looks legitimate for those using immediate release (default). i.e., once an user decides to use deferred released, it's understood that pcibios_free_controller() should not be called. > I think we can provide better interface to users: what we do in pcibios_free_controller() and pcibios_host_bridge_release() should be (almost) same. pcibios_host_bridge_release() can be a wrapper of pcibios_free_controller(). Right; I implemented only kfree() in pcibios_host_bridge_release() because I was focused on when it runs *after* pcibios_free_controller(); but it turns out that if it runs *before*, phb becomes invalid pointer. So, you're right -- both functions are expected to have the same effect (slightly different code), that is all of what pcibios_free_controller() does. The only difference should be the timing. (good point on wrapper) > With this, the users have two options: (1) Rely on bridge's release_fn() to free the PCI controller; (2) Call pcibios_free_controller() as we're doing currently. Those two options corresponds to immediately or deferred releasing. Looks very good. I'll submit a v4 like this: -rename pcibios_host_bridge_release()/pcibios_free_controller_deferred() -add comments about using _either_ one or another -pcibios_free_controller_deferred() calls pcibios_free_controller(). -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH v4] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
This patch leverages 'struct pci_host_bridge' from the PCI subsystem in order to free the pci_controller only after the last reference to its devices is dropped (avoiding an oops in pcibios_release_device() if the last reference is dropped after pcibios_free_controller()). The patch relies on pci_host_bridge.release_fn() (and .release_data), which is called automatically by the PCI subsystem when the root bus is released (i.e., the last reference is dropped). Those fields are set via pci_set_host_bridge_release() (e.g. in the platform-specific implementation of pcibios_root_bridge_prepare()). It introduces the 'pcibios_free_controller_deferred()' .release_fn() and it expects .release_data to hold a pointer to the pci_controller. The function implictly calls 'pcibios_free_controller()', so an user must *NOT* explicitly call it if using the new _deferred() callback. The functionality is enabled for pseries (although it isn't platform specific, and may be used by cxl). Details on not-so-elegant design choices: - Use 'pci_host_bridge.release_data' field as pointer to associated 'struct pci_controller' so *not* to 'pci_bus_to_host(bridge->bus)' in pcibios_free_controller_deferred(). That's because pci_remove_root_bus() sets 'host_bridge->bus = NULL' (so, if the last reference is released after pci_remove_root_bus() runs, which eventually reaches pcibios_free_controller_deferred(), that would hit a null pointer dereference). The cxl/vphb.c code calls pci_remove_root_bus(), and the cxl folks are interested in this fix. Test-case #1 (hold references) # ls -ld /sys/block/sd* | grep -m1 0021:01:00.0 <...> /sys/block/sdaa -> ../devices/pci0021:01/0021:01:00.0/<...> # ls -ld /sys/block/sd* | grep -m1 0021:01:00.1 <...> /sys/block/sdab -> ../devices/pci0021:01/0021:01:00.1/<...> # cat >/dev/sdaa & pid1=$! # cat >/dev/sdab & pid2=$! # drmgr -w 5 -d 1 -c phb -s 'PHB 33' -r Validating PHB DLPAR capability...yes. [ 594.306719] pci_hp_remove_devices: PCI: Removing devices on bus 0021:01 [ 594.306738] pci_hp_remove_devices:Removing 0021:01:00.0... ... [ 598.236381] pci_hp_remove_devices:Removing 0021:01:00.1... ... [ 611.972077] pci_bus 0021:01: busn_res: [bus 01-ff] is released [ 611.972140] rpadlpar_io: slot PHB 33 removed # kill -9 $pid1 # kill -9 $pid2 [ 632.918088] pcibios_free_controller_deferred: domain 33, dynamic 1 Test-case #2 (don't hold references) # drmgr -w 5 -d 1 -c phb -s 'PHB 33' -r Validating PHB DLPAR capability...yes. [ 916.357363] pci_hp_remove_devices: PCI: Removing devices on bus 0021:01 [ 916.357386] pci_hp_remove_devices:Removing 0021:01:00.0... ... [ 920.566527] pci_hp_remove_devices:Removing 0021:01:00.1... ... [ 933.955873] pci_bus 0021:01: busn_res: [bus 01-ff] is released [ 933.955977] pcibios_free_controller_deferred: domain 33, dynamic 1 [ 933.955999] rpadlpar_io: slot PHB 33 removed Suggested-By: Gavin Shan Signed-off-by: Mauricio Faria de Oliveira --- Changelog: - v4: improve usability/design/documentation: - rename function to pcibios_free_controller_deferred() - from function call pcibios_free_controller() - no more struct pci_controller.bridge field thanks: Gavin Shan, Andrew Donnellan - v3: different approach: struct pci_host_bridge.release_fn() - v2: different approach: struct pci_controller.refcount arch/powerpc/include/asm/pci-bridge.h | 1 + arch/powerpc/kernel/pci-common.c | 36 ++ arch/powerpc/platforms/pseries/pci.c | 4 arch/powerpc/platforms/pseries/pci_dlpar.c | 7 -- 4 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index b5e88e4..c0309c5 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -301,6 +301,7 @@ extern void pci_process_bridge_OF_ranges(struct pci_controller *hose, /* Allocate & free a PCI host bridge structure */ extern struct pci_controller *pcibios_alloc_controller(struct device_node *dev); extern void pcibios_free_controller(struct pci_controller *phb); +extern void pcibios_free_controller_deferred(struct pci_host_bridge *bridge); #ifdef CONFIG_PCI extern int pcibios_vaddr_is_ioport(void __iomem *address); diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index a5c0153..8c48a78 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -151,6 +151,42 @@ void pcibios_free_controller(struct pci_controller *phb) EXPORT_SYMBOL_GPL(pcibios_free_controller); /* + * This function is used to call pcibios_free_controller() + * in a deferred manner: a callback from the PCI subsystem. +
Re: [PATCH v4] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
On 08/12/2016 03:03 AM, Andrew Donnellan wrote: On 12/08/16 15:54, Gavin Shan wrote: It might be nicer for users to implement their own pcibios_free_controller_deferred(), meaning pSeries needs its own implementation for now. The reason is more user (pSeries) specific objects can be released together with the PHB. However, I'm still fine without the comment to be covered. That's probably not a bad idea, though from a cxl perspective I'm fine with using the current version. I agree, but in that case the user _still_ can use another function. This patch just provides an implementation that defaults to what was already done, in a deferred manner. If some users need something more specific, they can wire it up too :) the same way we did - and it's explained in the function's comments. Thanks for the review, suggestions and reaching a better and much more interesting patch. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v4] powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
Michael, On 08/12/2016 02:54 AM, Gavin Shan wrote: > I don't have more obvious comments except below one nitpicky: I just addressed/replied this in the other e-mail; i think it's OK, and Andrew/cxl is OK w/ it too. > Reviewed-by: Gavin Shan On 08/12/2016 03:06 AM, Andrew Donnellan wrote: Suggested-By: Gavin Shan Signed-off-by: Mauricio Faria de Oliveira Reviewed-by: Andrew Donnellan Tested-by: Andrew Donnellan # cxl Does this justify a Cc: stable? I'd think so, since this prevents an oops & crash (w/ panic_on_oops). Do you agree? Thanks! -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH] powerpc: add kernel parameter iommu_alloc_quiet
This patch introduces the 'iommu_alloc_quiet=driver_name' parameter to suppress the 'iommu_alloc failures' messages for that one driver. This is an additional approach for this 'problem' of flooding logs, not as fine-grained and not enabled by default as DMA_ATTR_NO_WARN, but it has the advantage that it doesn't introduce any ABI changes. That is important/requirement for the distribution kernels - where the DMA_ATTR_NO_WARN changes to 'enum dma_attr' are not acceptable because it breaks the kernel ABI. Tested on next-20160825 + nvme changed not to use DMA_ATTR_NO_WARN. - test-case: default / no iommu_alloc_quiet - result:messages occur # dmesg -c | grep iommu_alloc_quiet # # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c <...> [ 31.753230] nvme :00:06.0: iommu_alloc failed, tbl c003f7080c00 vaddr c0022bf3 npages 16 - test-case: iommu_alloc_quiet=(null) - result:messages occur # dmesg -c | grep iommu_alloc_quiet [0.00] Kernel command line: root=<...> ro disable_ddw iommu_alloc_quiet= [0.00] iommu_alloc_quiet: driver '' # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c <...> [ 29.032848] nvme :00:06.0: iommu_alloc failed, tbl c003f7190c00 vaddr c00238fc npages 16 - test-case: iommu_alloc_quiet=(length overflow) - result:messages occur # dmesg -c | grep iommu_alloc_quiet [0.00] Kernel command line: root=<...> ro disable_ddw iommu_alloc_quiet=0123456789abcdef0123456789abcdef [0.00] iommu_alloc_quiet: driver '0123456789abcde' # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c <...> [ 54.913279] nvme :00:06.0: iommu_alloc failed, tbl c003f7120c00 vaddr c0022d96 npages 16 - test-case: iommu_alloc_quiet=nvme - result:messages do not occur # dmesg -c | grep iommu_alloc_quiet [0.00] Kernel command line: root=<...> ro disable_ddw iommu_alloc_quiet=nvme [0.00] iommu_alloc_quiet: driver 'nvme' # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c # dd if=/dev/zero of=/dev/nvme0n1p1 bs=64k count=128k 2>/dev/null; dmesg -c Signed-off-by: Mauricio Faria de Oliveira --- arch/powerpc/kernel/iommu.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 5f202a5..8524d91 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -65,6 +65,23 @@ static int __init setup_iommu(char *str) __setup("iommu=", setup_iommu); +/* + * iommu_alloc_quiet: string with one driver name + * not to print 'iommu_alloc failed' messages for. + */ +#define IOMMU_ALLOC_QUIET_LEN 16 /* includes '\0' */ +static char iommu_alloc_quiet[IOMMU_ALLOC_QUIET_LEN]; + +static int __init setup_iommu_alloc_quiet(char *str) +{ + strncpy(iommu_alloc_quiet, str, IOMMU_ALLOC_QUIET_LEN); + iommu_alloc_quiet[IOMMU_ALLOC_QUIET_LEN - 1] = '\0'; + pr_info("iommu_alloc_quiet: driver '%s'\n", iommu_alloc_quiet); + return 1; +} + +__setup("iommu_alloc_quiet=", setup_iommu_alloc_quiet); + static DEFINE_PER_CPU(unsigned int, iommu_pool_hash); /* @@ -479,8 +496,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl, /* Handle failure */ if (unlikely(entry == DMA_ERROR_CODE)) { - if (!(attrs & DMA_ATTR_NO_WARN) && - printk_ratelimit()) + if (strncmp(iommu_alloc_quiet, dev->driver->name, IOMMU_ALLOC_QUIET_LEN) && + !(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit()) dev_info(dev, "iommu_alloc failed, tbl %p " "vaddr %lx npages %lu\n", tbl, vaddr, npages); @@ -777,8 +794,8 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl, mask >> tbl->it_page_shift, align, attrs); if (dma_handle == DMA_ERROR_CODE) { - if (!(attrs & DMA_ATTR_NO_WARN) && - printk_ratelimit()) { + if (strncmp(iommu_alloc_q
Re: [PATCH] powerpc: add kernel parameter iommu_alloc_quiet
Michael / Ben, On 09/01/2016 09:56 AM, Mauricio Faria de Oliveira wrote: +#define IOMMU_ALLOC_QUIET_LEN 16 /* includes '\0' */ Guilherme suggested MAX_PARAM_PREFIX_LEN for this, which looks better (a few extra bytes). Would you mind to s/16/MAX_PARAM_PREFIX_LEN/ if you like that? I can send a v2 too; just let me know. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH] powerpc: add kernel parameter iommu_alloc_quiet
On 09/01/2016 10:39 AM, Torsten Duwe wrote: JFYI, my strongly preferred solution would still be to just dev_dbg() the whole thing. Which group of people would be interested in these messages, after all? Certainly understandable. Ben didn't like the idea to convert the messages to dynamic debug [1], (with which I agree/understand nowadays) and suggested us to look for a different approach. [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-June/144196.html -- Mauricio Faria de Oliveira IBM Linux Technology Center