[RFC PATCH v1] pseries/drmem: don't cache node id in drmem_lmb struct
At memory hot-remove time we can retrieve an LMB's nid from its corresponding memory_block. There is no need to store the nid in multiple locations. Signed-off-by: Scott Cheloha --- The linear search in powerpc's memory_add_physaddr_to_nid() has become a bottleneck at boot on systems with many LMBs. As described in this patch here: https://lore.kernel.org/linuxppc-dev/20200221172901.1596249-2-chel...@linux.ibm.com/ the linear search seriously cripples drmem_init(). The obvious solution (shown in that patch) is to just make the search in memory_add_physaddr_to_nid() faster. An XArray seems well-suited to the task of mapping an address range to an LMB object. The less obvious approach is to just call memory_add_physaddr_to_nid() in fewer places. I'm not sure which approach is correct, hence the RFC. The attached patch is an example of how we could just eliminate the memory_add_physaddr_to_nid() calls during drmem_init(). Unless this is somehow an abuse of the memory_block I think retrieving the nid in this way deduplicates memoization. There is another spot where we don't *need* to search by address to find the node id. In dlpar_memory_add() we're calling lmb_set_nid(), which is just memory_add_physaddr_to_nid(). But we could easily bypass the search and call of_drconf_to_nid_single() directly with the LMB we already have a pointer to. Then again, I see callers like the xen balloon driver and probe_store() that seemingly need to do an address-to-nid lookup. So, does memory_add_physaddr_to_nid() need to be fast? Should I bite the bullet and make it fast so we can leverage it without slowdown? Or is calling it in fewer places an acceptable approach? arch/powerpc/include/asm/drmem.h | 21 --- arch/powerpc/mm/drmem.c | 6 +- .../platforms/pseries/hotplug-memory.c| 19 ++--- 3 files changed, 13 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 3d76e1c388c2..4e7b6b70e366 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -13,9 +13,6 @@ struct drmem_lmb { u32 drc_index; u32 aa_index; u32 flags; -#ifdef CONFIG_MEMORY_HOTPLUG - int nid; -#endif }; struct drmem_lmb_info { @@ -103,22 +100,4 @@ static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb) lmb->aa_index = 0x; } -#ifdef CONFIG_MEMORY_HOTPLUG -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ - lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr); -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ - lmb->nid = -1; -} -#else -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ -} -#endif - #endif /* _ASM_POWERPC_LMB_H */ diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 59327cefbc6a..873fcfc7b875 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop) if (!drmem_info->lmbs) return; - for_each_drmem_lmb(lmb) { + for_each_drmem_lmb(lmb) read_drconf_v1_cell(lmb, &prop); - lmb_set_nid(lmb); - } } static void __init init_drmem_v2_lmbs(const __be32 *prop) @@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop) lmb->aa_index = dr_cell.aa_index; lmb->flags = dr_cell.flags; - - lmb_set_nid(lmb); } } } diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index a4d40a3ceea3..f6d4236286cf 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -376,25 +376,29 @@ static int dlpar_add_lmb(struct drmem_lmb *); static int dlpar_remove_lmb(struct drmem_lmb *lmb) { + struct memory_block *mem_block; unsigned long block_sz; int rc; if (!lmb_is_removable(lmb)) return -EINVAL; + mem_block = lmb_to_memblock(lmb); + if (mem_block == NULL) + return -EINVAL; + rc = dlpar_offline_lmb(lmb); if (rc) return rc; block_sz = pseries_memory_block_size(); - __remove_memory(lmb->nid, lmb->base_addr, block_sz); + __remove_memory(mem_block->nid, lmb->base_addr, block_sz); /* Update memory regions for memory remove */ memblock_remove(lmb->base_addr, block_sz); invalidate_lmb_associativity_index(lmb); - lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; return 0; @@ -651,7 +655,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to
Re: [RFC PATCH v1] pseries/drmem: don't cache node id in drmem_lmb struct
Hi Michal, On Thu, Mar 12, 2020 at 06:02:37AM +0100, Michal Suchánek wrote: > On Wed, Mar 11, 2020 at 06:08:15PM -0500, Scott Cheloha wrote: > > At memory hot-remove time we can retrieve an LMB's nid from its > > corresponding memory_block. There is no need to store the nid > > in multiple locations. > > > > Signed-off-by: Scott Cheloha > > --- > > The linear search in powerpc's memory_add_physaddr_to_nid() has become a > > bottleneck at boot on systems with many LMBs. > > > > As described in this patch here: > > > > https://lore.kernel.org/linuxppc-dev/20200221172901.1596249-2-chel...@linux.ibm.com/ > > > > the linear search seriously cripples drmem_init(). > > > > The obvious solution (shown in that patch) is to just make the search > > in memory_add_physaddr_to_nid() faster. An XArray seems well-suited > > to the task of mapping an address range to an LMB object. > > > > The less obvious approach is to just call memory_add_physaddr_to_nid() > > in fewer places. > > > > I'm not sure which approach is correct, hence the RFC. > > You basically revert the below which will likely cause the very error > that was fixed there: > > commit b2d3b5ee66f2a04a918cc043cec0c9ed3de58f40 > Author: Nathan Fontenot > Date: Tue Oct 2 10:35:59 2018 -0500 > > powerpc/pseries: Track LMB nid instead of using device tree > > When removing memory we need to remove the memory from the node > it was added to instead of looking up the node it should be in > in the device tree. > > During testing we have seen scenarios where the affinity for a > LMB changes due to a partition migration or PRRN event. In these > cases the node the LMB exists in may not match the node the device > tree indicates it belongs in. This can lead to a system crash > when trying to DLPAR remove the LMB after a migration or PRRN > event. The current code looks up the node in the device tree to > remove the LMB from, the crash occurs when we try to offline this > node and it does not have any data, i.e. node_data[nid] == NULL. I'm aware of this patch and that this is a near-total revert. I'm not reintroducing the original behavior, though. Instead of going to the device tree to recompute the expected node id I'm retrieving it from the LMB's corresponding memory_block. That crucial difference is this chunk: --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -376,25 +376,29 @@ static int dlpar_add_lmb(struct drmem_lmb *); static int dlpar_remove_lmb(struct drmem_lmb *lmb) { + struct memory_block *mem_block; unsigned long block_sz; int rc; if (!lmb_is_removable(lmb)) return -EINVAL; + mem_block = lmb_to_memblock(lmb); + if (mem_block == NULL) + return -EINVAL; + rc = dlpar_offline_lmb(lmb); if (rc) return rc; block_sz = pseries_memory_block_size(); - __remove_memory(lmb->nid, lmb->base_addr, block_sz); + __remove_memory(mem_block->nid, lmb->base_addr, block_sz); /* Update memory regions for memory remove */ memblock_remove(lmb->base_addr, block_sz); invalidate_lmb_associativity_index(lmb); - lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; return 0; --- Unless it's possible that the call: __add_memory(my_nid, my_addr, my_size); does not yield the following: memory_block.nid == my_nid on success, then I think the solution in my patch is equivalent to and simpler than the one introduced in the patch you quote. Can you see a way the above would not hold? Then again, maybe there's a good reason not to retrieve the node id in this way. I'm thinking David Hildenbrand and/or Nathan Fontenont may have some insight on that.
[PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
LMB lookup is currently an O(n) linear search. This scales poorly when there are many LMBs. If we cache each LMB by both its base address and its DRC index in an xarray we can cut lookups to O(log n), greatly accelerating drmem initialization and memory hotplug. This patch introduces two xarrays of of LMBs and fills them during drmem initialization. The patch also adds two interfaces for LMB lookup. The first interface, drmem_find_lmb_by_base_addr(), is employed in hot_add_drconf_scn_to_nid() to replace a linear search. This speeds up memory_add_physaddr_to_nid(), which is called by lmb_set_nid(), an interface used during drmem initialization and memory hotplug. The second interface, drmem_find_lmb_by_drc_index(), is employed in get_lmb_range() to replace a linear search. This speeds up dlpar_memory_add_by_ic() and dlpar_memory_remove_by_ic(), interfaces used during memory hotplug. These substitutions yield significant improvements: 1. A POWER9 VM with a maximum memory of 10TB and 256MB LMBs has 40960 LMBs. With this patch it completes drmem_init() ~1138ms faster. Before: [0.542244] drmem: initializing drmem v1 [1.768787] drmem: initialized 40960 LMBs After: [0.543611] drmem: initializing drmem v1 [0.631386] drmem: initialized 40960 LMBs 2. A POWER9 VM with a maximum memory of 4TB and 256MB LMBs has 16384 LMBs. Via the qemu monitor we can hot-add memory as virtual DIMMs. Each DIMM is 256 LMBs. With this patch we hot-add every possible LMB about 60 seconds faster. Before: [ 17.422177] pseries-hotplug-mem: Attempting to hot-add 256 LMB(s) at index 8100 [...] [ 167.285563] pseries-hotplug-mem: Memory at 3fff000 (drc index 80003fff) was hot-added After: [ 14.753480] pseries-hotplug-mem: Attempting to hot-add 256 LMB(s) at index 8100 [...] [ 103.934092] pseries-hotplug-mem: Memory at 3fff000 (drc index 80003fff) was hot-added Signed-off-by: Scott Cheloha --- These linear searches become a serious bottleneck as the machine approaches 64TB. There are just too many LMBs to use a linear search. On a 60TB machine we recently saw the following soft lockup during drmem_init(): [ 60.602386] watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [swapper/0:1] [ 60.602414] Modules linked in: [ 60.602417] Supported: No, Unreleased kernel [ 60.602423] CPU: 9 PID: 1 Comm: swapper/0 Not tainted 5.3.18-2-default #1 SLE15-SP2 (unreleased) [ 60.602426] NIP: c0095c0c LR: c0095bb0 CTR: [ 60.602430] REGS: c00022c7fc497830 TRAP: 0901 Not tainted (5.3.18-2-default) [ 60.602432] MSR: 82009033 CR: 44000244 XER: [ 60.602442] CFAR: c0095c18 IRQMASK: 0 GPR00: c0095bb0 c00022c7fc497ac0 c162cc00 c0003bff5e08 GPR04: c0ea539a 002f 1000 GPR08: c7fc5f59ffb8 14bc5000 c7fc5f1f1a30 c14f2fb8 GPR12: c0001e980600 [ 60.602464] NIP [c0095c0c] hot_add_scn_to_nid+0xbc/0x400 [ 60.602467] LR [c0095bb0] hot_add_scn_to_nid+0x60/0x400 [ 60.602470] Call Trace: [ 60.602473] [c00022c7fc497ac0] [c0095bb0] hot_add_scn_to_nid+0x60/0x400 (unreliable) [ 60.602478] [c00022c7fc497b20] [c007a6a0] memory_add_physaddr_to_nid+0x20/0x60 [ 60.602483] [c00022c7fc497b40] [c10235a4] drmem_init+0x258/0x2d8 [ 60.602485] [c00022c7fc497c10] [c0010694] do_one_initcall+0x64/0x300 [ 60.602489] [c00022c7fc497ce0] [c10144f8] kernel_init_freeable+0x2e8/0x3fc [ 60.602491] [c00022c7fc497db0] [c0010b0c] kernel_init+0x2c/0x160 [ 60.602497] [c00022c7fc497e20] [c000b960] ret_from_kernel_thread+0x5c/0x7c [ 60.602498] Instruction dump: [ 60.602501] 7d0a4214 7faa4040 419d0328 e92a0010 71290088 2fa90008 409e001c e92a [ 60.602506] 7fbe4840 419c0010 7d274a14 7fbe4840 <419c00e4> 394a0018 7faa4040 409dffd0 This patch should eliminate the drmem_init() bottleneck during boot. One other important thing to note is that this only addresses part of the slowdown during memory hotplug when there are many LMBs. A far larger part of it is caused by the linear memblock search in find_memory_block(). That problem is addressed with this patch: https://lore.kernel.org/lkml/20200121231028.13699-1-chel...@linux.ibm.com/ which is in linux-next. The numbers I quote here in the commit message for time improvements during hotplug are taken with that patch applied. Without it, hotplug is even slower. arch/powerpc/include/asm/drmem.h | 3 ++ arch/powerpc/mm/drmem.c | 33 +++ arch/powerpc/mm/numa.c| 30 +++-- .../platforms/pseries/hotplug-memory.c| 11 ++- 4 files changed, 50 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/p
Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote: > Scott Cheloha writes: > > LMB lookup is currently an O(n) linear search. This scales poorly when > > there are many LMBs. > > > > If we cache each LMB by both its base address and its DRC index > > in an xarray we can cut lookups to O(log n), greatly accelerating > > drmem initialization and memory hotplug. > > > > This patch introduces two xarrays of of LMBs and fills them during > > drmem initialization. The patch also adds two interfaces for LMB > > lookup. > > Good but can you replace the array of LMBs altogether > (drmem_info->lmbs)? xarray allows iteration over the members if needed. I don't think we can without potentially changing the current behavior. The current behavior in dlpar_memory_{add,remove}_by_ic() is to advance linearly through the array from the LMB with the matching DRC index. Iteration through the xarray via xa_for_each_start() will return LMBs indexed with monotonically increasing DRC indices. Are they equivalent? Or can we have an LMB with a smaller DRC index appear at a greater offset in the array? If the following condition is possible: drmem_info->lmbs[i].drc_index > drmem_info->lmbs[j].drc_index where i < j, then we have a possible behavior change because xa_for_each_start() may not return a contiguous array slice. It might "leap backwards" in the array. Or it might skip over a chunk of LMBs.
Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
On Thu, Jan 30, 2020 at 10:09:32AM -0600, Fontenot, Nathan wrote: > On 1/29/2020 12:10 PM, Scott Cheloha wrote: > > On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote: > >> Scott Cheloha writes: > >>> LMB lookup is currently an O(n) linear search. This scales poorly when > >>> there are many LMBs. > >>> > >>> If we cache each LMB by both its base address and its DRC index > >>> in an xarray we can cut lookups to O(log n), greatly accelerating > >>> drmem initialization and memory hotplug. > >>> > >>> This patch introduces two xarrays of of LMBs and fills them during > >>> drmem initialization. The patch also adds two interfaces for LMB > >>> lookup. > >> > >> Good but can you replace the array of LMBs altogether > >> (drmem_info->lmbs)? xarray allows iteration over the members if needed. > > > > I don't think we can without potentially changing the current behavior. > > > > The current behavior in dlpar_memory_{add,remove}_by_ic() is to advance > > linearly through the array from the LMB with the matching DRC index. > > > > Iteration through the xarray via xa_for_each_start() will return LMBs > > indexed with monotonically increasing DRC indices.> > > Are they equivalent? Or can we have an LMB with a smaller DRC index > > appear at a greater offset in the array? > > > > If the following condition is possible: > > > > drmem_info->lmbs[i].drc_index > drmem_info->lmbs[j].drc_index > > > > where i < j, then we have a possible behavior change because > > xa_for_each_start() may not return a contiguous array slice. It might > > "leap backwards" in the array. Or it might skip over a chunk of LMBs. > > > > The LMB array should have each LMB in monotonically increasing DRC Index > value. Note that this is set up based on the DT property but I don't recall > ever seeing the DT specify LMBs out of order or not being contiguous. Is that ordering guaranteed by the PAPR or some other spec or is that just a convention? Code like drmem_update_dt_v1() makes me very nervous: static int drmem_update_dt_v1(struct device_node *memory, struct property *prop) { struct property *new_prop; struct of_drconf_cell_v1 *dr_cell; struct drmem_lmb *lmb; u32 *p; new_prop = clone_property(prop, prop->length); if (!new_prop) return -1; p = new_prop->value; *p++ = cpu_to_be32(drmem_info->n_lmbs); dr_cell = (struct of_drconf_cell_v1 *)p; for_each_drmem_lmb(lmb) { dr_cell->base_addr = cpu_to_be64(lmb->base_addr); dr_cell->drc_index = cpu_to_be32(lmb->drc_index); dr_cell->aa_index = cpu_to_be32(lmb->aa_index); dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb)); dr_cell++; } of_update_property(memory, new_prop); return 0; } If for whatever reason the firmware has a DRC that isn't monotonically increasing and we update a firmware property at the wrong offset I have no idea what would happen. With the array we preserve the order. Without it we might violate some assumption the firmware has made.
Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote: > Scott Cheloha writes: > > LMB lookup is currently an O(n) linear search. This scales poorly when > > there are many LMBs. > > > > If we cache each LMB by both its base address and its DRC index > > in an xarray we can cut lookups to O(log n), greatly accelerating > > drmem initialization and memory hotplug. > > > > This patch introduces two xarrays of of LMBs and fills them during > > drmem initialization. The patch also adds two interfaces for LMB > > lookup. > > Good but can you replace the array of LMBs altogether > (drmem_info->lmbs)? xarray allows iteration over the members if needed. I would like to try to "solve one problem at a time". We can fix the linear search performance scaling problems without removing the array of LMBs. As I've shown in my diff, we can do it with minimal change to the existing code. If it turns out that the PAPR guarantees the ordering of the memory DRCs then in a subsequent patch (series) we can replace the LMB array (__drmem_info.lmbs) with an xarray indexed by DRC and use e.g. xa_for_each() in the hotplug code.
[PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray
On PowerPC, memory_add_physaddr_to_nid() uses a linear search to find an LMB matching the given address. This scales very poorly when there are many LMBs. The poor scaling cripples drmem_init() during boot: lmb_set_nid(), which calls memory_add_physaddr_to_nid(), is called for each LMB. If we index each LMB in an xarray by its base address we can achieve O(log n) search during memory_add_physaddr_to_nid(), which scales much better. For example, in the lab we have a 64TB P9 machine with 256MB LMBs. So, suring drmem_init() we instantiate 249854 LMBs. On a vanilla kernel it completes drmem_init() in ~35 seconds with a soft lockup trace. On the patched kernel it completes drmem_init() in ~0.5 seconds. Before: [ 53.721639] drmem: initializing drmem v2 [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1] [ 80.604377] Modules linked in: [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) [ 80.604412] MSR: 82009033 CR: 44000248 XER: 000d [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 c0003cfede30 [ 80.604431] GPR04: c0f4095a 002f 1000 [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 c00c0002fdfb2001 [ 80.604431] GPR12: c0001e8ec200 [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 [ 80.604492] Call Trace: [ 80.604498] [c0002dbff8493ac0] [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable) [ 80.604509] [c0002dbff8493b20] [c0087c10] memory_add_physaddr_to_nid+0x20/0x60 [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 [ 80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0 [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] kernel_init_freeable+0x2d8/0x3a0 [ 80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 [ 80.604560] [c0002dbff8493e20] [c000b648] ret_from_kernel_thread+0x5c/0x74 [ 80.604567] Instruction dump: [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 7d094214 [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 7fbe5040 [ 89.047390] drmem: 249854 LMB(s) After: [ 53.424702] drmem: initializing drmem v2 [ 53.898813] drmem: 249854 LMB(s) lmb_set_nid() is called from dlpar_lmb_add() so this patch will also improve memory hot-add speeds on big machines. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/drmem.h | 1 + arch/powerpc/mm/drmem.c | 24 arch/powerpc/mm/numa.c | 29 ++--- 3 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 3d76e1c388c2..90a5a9ad872b 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -88,6 +88,7 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb) return lmb->flags & DRMEM_LMB_RESERVED; } +struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr); u64 drmem_lmb_memory_max(void); void __init walk_drmem_lmbs(struct device_node *dn, void (*func)(struct drmem_lmb *, const __be32 **)); diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 44bfbdae920c..62cbe79e3860 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -11,12 +11,31 @@ #include #include #include +#include #include #include +static DEFINE_XARRAY(drmem_lmb_xa_base_addr); static struct drmem_lmb_info __drmem_info; struct drmem_lmb_info *drmem_info = &__drmem_info; +static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb) +{ + void *ret; + + ret = xa_store(&drmem_lmb_xa_base_addr, lmb->base_addr, lmb, + GFP_KERNEL); + if (xa_is_err(ret)) + return xa_err(ret); + + return 0; +} + +struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr) +{ + return xa_load(&drmem_lmb_xa_base_addr, base_addr); +} + u64 drmem_lmb_memory_max(void) { struct drmem_lmb *last_lmb; @@ -364,6 +383,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop) for_each_drmem_lmb(lmb) { read_drconf_v1_cell(lmb, &prop); + if (drmem_cache_lmb_for_lookup(lmb) != 0) + return; lmb_set_nid(lmb); } } @@ -411,6 +432,9 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop) lmb->aa_index = dr_cell.aa_index; lmb->flags = dr_ce
pseries: accelerate drmem and simplify hotplug with xarrays
This patch series introduces two xarrays of LMBs in an effort to speed up the drmem code and simplify the hotplug code on pseries machines. The first patch introduces an xarray of LMBs indexed by physical address. xa_load() is then used to accelerate LMB lookup during memory_add_physaddr_to_nid(). The interface is used during boot, in drmem_init(), and memory hot-add, in dlpar_add_lmb(), on pseries machines. Its linear LMB search is a serious bottleneck on larger machines that xa_load() flattens nicely as shown in the before/after example. The second patch introduces a second xarray of LMBs, this one indexed by DRC index. The xarray API is leveraged to replace the custom LMB search/iteration/marking code currently used in the pseries memory hotplug module. The result is cleaner and, again thanks to xa_load(), faster. v1: One big patch. v2 changes: - Split up the big patch from v1 into a series. - Provide a more dramatic example in patch 1/2 to emphasize the linear search bottleneck in memory_add_physaddr_to_nid(). - Expand the use of the xarray API in patch 2/2 to replace more custom code.
[PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code
The xarray API has entry marking (xa_set_mark/xa_clear_mark) and a range-based iterator (xa_for_each_range), so there is no need for the pseries hotplug code to maintain its own implementation of these features. This patch introduces an xarray of drmem_lmb structures indexed by each LMB's DRC index (drmem_lmb.drc_index). The xarray is protected by the hotplug lock. LMBs are indexed into the xarray during drmem_init() and accessed during hotplug operations. Custom LMB search, iteration, and marking code is replaced with xarray equivalents. The result is more compact. The code ought to run faster, too: several linear searches have been replaced with xa_load(), which runs in sub-linear time. The array of LMBs, drmem_info.lmbs[], is kept to preserve the ordering of LMBs read from the firmware in drmem_init() during firmware writes in drmem_update_dt(). Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/drmem.h | 26 +-- arch/powerpc/mm/drmem.c | 38 ++-- .../platforms/pseries/hotplug-memory.c| 207 ++ 3 files changed, 92 insertions(+), 179 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 90a5a9ad872b..97e07eec7eda 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -26,13 +26,8 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; -#define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) <= (end); (lmb)++) - -#define for_each_drmem_lmb(lmb)\ - for_each_drmem_lmb_in_range((lmb), \ - &drmem_info->lmbs[0], \ - &drmem_info->lmbs[drmem_info->n_lmbs - 1]) +struct xarray; +extern struct xarray *drmem_lmb_xa; /* * The of_drconf_cell_v1 struct defines the layout of the LMB data @@ -71,23 +66,6 @@ static inline u32 drmem_lmb_size(void) return drmem_info->lmb_size; } -#define DRMEM_LMB_RESERVED 0x8000 - -static inline void drmem_mark_lmb_reserved(struct drmem_lmb *lmb) -{ - lmb->flags |= DRMEM_LMB_RESERVED; -} - -static inline void drmem_remove_lmb_reservation(struct drmem_lmb *lmb) -{ - lmb->flags &= ~DRMEM_LMB_RESERVED; -} - -static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb) -{ - return lmb->flags & DRMEM_LMB_RESERVED; -} - struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr); u64 drmem_lmb_memory_max(void); void __init walk_drmem_lmbs(struct device_node *dn, diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 62cbe79e3860..013ab2689bd8 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -16,6 +16,8 @@ #include static DEFINE_XARRAY(drmem_lmb_xa_base_addr); +static DEFINE_XARRAY(drmem_lmb_xa_drc_index); +struct xarray *drmem_lmb_xa = &drmem_lmb_xa_drc_index; static struct drmem_lmb_info __drmem_info; struct drmem_lmb_info *drmem_info = &__drmem_info; @@ -27,6 +29,10 @@ static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb) GFP_KERNEL); if (xa_is_err(ret)) return xa_err(ret); + ret = xa_store(&drmem_lmb_xa_drc_index, lmb->drc_index, lmb, + GFP_KERNEL); + if (xa_is_err(ret)) + return xa_err(ret); return 0; } @@ -44,15 +50,6 @@ u64 drmem_lmb_memory_max(void) return last_lmb->base_addr + drmem_lmb_size(); } -static u32 drmem_lmb_flags(struct drmem_lmb *lmb) -{ - /* -* Return the value of the lmb flags field minus the reserved -* bit used internally for hotplug processing. -*/ - return lmb->flags & ~DRMEM_LMB_RESERVED; -} - static struct property *clone_property(struct property *prop, u32 prop_sz) { struct property *new_prop; @@ -84,6 +81,7 @@ static int drmem_update_dt_v1(struct device_node *memory, struct of_drconf_cell_v1 *dr_cell; struct drmem_lmb *lmb; u32 *p; + int i; new_prop = clone_property(prop, prop->length); if (!new_prop) @@ -94,11 +92,12 @@ static int drmem_update_dt_v1(struct device_node *memory, dr_cell = (struct of_drconf_cell_v1 *)p; - for_each_drmem_lmb(lmb) { + for (i = 0; i < drmem_info->n_lmbs; i++) { + lmb = &drmem_info->lmbs[i]; dr_cell->base_addr = cpu_to_be64(lmb->base_addr); dr_cell->drc_index = cpu_to_be32(lmb->drc_index); dr_cell->aa_index = cpu_to_be32(lmb->aa_index); - dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb)); + dr_cell->flags = cpu_to_be32(lmb->flags); dr_cell++; } @@ -113,7 +112,7 @@ static void init_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell, dr_cell-&g
Re: [PATCH 2/3] powerpc/pseries/memhp: Remove unbalanced dlpar_release_drc() call
> On Nov 28, 2023, at 9:21 AM, Nathan Lynch wrote: > > Nick Child writes: >> Hi Nathan, >> Patches 1 and 3 LGTM > > thanks. > >> Regarding this patch, dlpar_memory_remove_by_count() calls >> dlpar_add_lmb() and does not free drc on add error. >> dlpar_add_lmb() is called here in error recovery so probably >> not a big deal. >> >> This is all new code to me but it looks like if the requested >> number of lmbs cannot be removed then it attempts to add back >> the ones that were successfully removed. So if you cannot add >> an lmb that WAS successfully removed, it seems sane to also >> release the drc. > > Maybe I'll drop this one for now and turn my attention to removing all > the high-level rollback logic in this code. There's no reliable way to > accomplish what it's trying to do (i.e. restore the original quantity of > LMBs) and it just complicates things. Removing all of the rollback code is a wonderful idea.
[PATCH v1] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid() to determine which node id (nid) to use when later calling __add_memory(). This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an appropriate nid for a given address by looking up the LMB containing the address and then passing that LMB to of_drconf_to_nid_single() to get the nid. In dlpar_add_lmb() we get this address from the LMB itself. In short, we have a pointer to an LMB and then we are searching for that LMB *again* in order to find its nid. If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we can skip the redundant lookup. The only error handling we need to duplicate from memory_add_physaddr_to_nid() is the fallback to the default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or an invalid nid. Skipping the extra lookup makes hot-add operations faster, especially on machines with many LMBs. Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000 LMBs on an upatched kernel took ~3.5 hours while a patched kernel completed the same operation in ~2 hours: Unpatched (12450 seconds): Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000 Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added Patched (7065 seconds): Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000 Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added It should be noted that the speedup grows more substantial when hot-adding LMBs at the end of the drconf range. This is because we are skipping a linear LMB search. To see the distinction, consider smaller hot-add test on the same LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096 LMBs completed less than 1 second faster on a patched kernel: Unpatched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,753.42 msec task-clock#0.992 CPUs utilized ( +- 0.55% ) 4,708 context-switches #0.045 K/sec ( +- 0.69% ) 2,444 cpu-migrations#0.023 K/sec ( +- 1.25% ) 394 page-faults #0.004 K/sec ( +- 0.22% ) 445,902,503,057 cycles#4.257 GHz ( +- 0.55% ) (66.67%) 8,558,376,740 stalled-cycles-frontend #1.92% frontend cycles idle ( +- 0.88% ) (49.99%) 300,346,181,651 stalled-cycles-backend# 67.36% backend cycles idle ( +- 0.76% ) (50.01%) 258,091,488,691 instructions #0.58 insn per cycle #1.16 stalled cycles per insn ( +- 0.22% ) (66.67%) 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%) 3,100,725,426 branch-misses #4.39% of all branches ( +- 0.20% ) (49.99%) 105.583 +- 0.589 seconds time elapsed ( +- 0.56% ) Patched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,055.69 msec task-clock#0.993 CPUs utilized ( +- 0.32% ) 4,606 context-switches #0.044 K/sec ( +- 0.20% ) 2,463 cpu-migrations#0.024 K/sec ( +- 0.93% ) 394 page-faults #0.004 K/sec ( +- 0.25% ) 442,951,129,921 cycles#4.257 GHz ( +- 0.32% ) (66.66%) 8,710,413,329 stalled-cycles-frontend #1.97% frontend cycles idle ( +- 0.47% ) (50.06%) 299,656,905,836 stalled-cycles-backend# 67.65% backend cycles idle ( +- 0.39% ) (50.02%) 252,731,168,193 instructions #0.57 insn per cycle #1.19 stalled cycles per insn ( +- 0.20% ) (66.66%) 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%) 3,100,242,882 branch-misses #4.50% of all branches ( +- 0.15% ) (49.98%) 104.829 +- 0.325 seconds time elapsed ( +- 0.31% ) This is consistent. An add-by-count hot-add operation adds LMBs greedily, so LMBs near the start of the drconf range are considered first. On an otherwise idle LPAR with so many LMBs we would expect to find the LMBs we need near the start of the drconf range, hence the smaller speedup. Signed-o
Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
On Fri, Aug 21, 2020 at 10:33:10AM +0200, Laurent Dufour wrote: > Le 11/08/2020 à 03:51, Scott Cheloha a écrit : > > > > [...] > > > > @@ -631,7 +638,7 @@ static int dlpar_memory_remove_by_ic(u32 > > lmbs_to_remove, u32 drc_index) > > static int dlpar_add_lmb(struct drmem_lmb *lmb) > > { > > unsigned long block_sz; > > - int rc; > > + int nid, rc; > > > > if (lmb->flags & DRCONF_MEM_ASSIGNED) > > return -EINVAL; > > @@ -642,11 +649,13 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) > > return rc; > > } > > > > - lmb_set_nid(lmb); > > block_sz = memory_block_size_bytes(); > > > > + /* Find the node id for this address. */ > > + nid = memory_add_physaddr_to_nid(lmb->base_addr); > > I think we could be more efficient here. > Here is the call stack behind memory_add_physaddr_to_nid(): > > memory_add_physaddr_to_nid(lmb->base_addr) > hot_add_scn_to_nid() > if (of_find_node_by_path("/ibm,dynamic-reconfiguration-memory")) == > true* > then > hot_add_drconf_scn_to_nid() > for_each_drmem_lmb() to find the LMB based on lmb->base_addr > of_drconf_to_nid_single(found LMB) > use lmb->aa_index to get the nid. > > * that test is necessarily true when called from dlpar_add_lmb() > otherwise the call to update_lmb_associativity_index() would have > failed earlier. > > Basically, we have a LMB and we later walk all the LMBs to find that lmb > again. In the case of dlpar_add_lmb(), it would be more efficient to > directly call of_drconf_to_nid_single(). That function is not exported > from arch/powerpc/mm/numa.c but it may be good to export it through that > patch. I've posted a patch for this: https://lore.kernel.org/linuxppc-dev/20200910175637.2865160-1-chel...@linux.ibm.com/T/#u The speedup is nice, especially for LMBs at the end of the array.
[PATCH v2] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid() to determine which node id (nid) to use when later calling __add_memory(). This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an appropriate nid for a given address by looking up the LMB containing the address and then passing that LMB to of_drconf_to_nid_single() to get the nid. In dlpar_add_lmb() we get this address from the LMB itself. In short, we have a pointer to an LMB and then we are searching for that LMB *again* in order to find its nid. If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we can skip the redundant lookup. The only error handling we need to duplicate from memory_add_physaddr_to_nid() is the fallback to the default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or an invalid nid. Skipping the extra lookup makes hot-add operations faster, especially on machines with many LMBs. Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000 LMBs on an upatched kernel took ~3.5 hours while a patched kernel completed the same operation in ~2 hours: Unpatched (12450 seconds): Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000 Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added Patched (7065 seconds): Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000 Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added It should be noted that the speedup grows more substantial when hot-adding LMBs at the end of the drconf range. This is because we are skipping a linear LMB search. To see the distinction, consider smaller hot-add test on the same LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096 LMBs completed less than 1 second faster on a patched kernel: Unpatched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,753.42 msec task-clock#0.992 CPUs utilized ( +- 0.55% ) 4,708 context-switches #0.045 K/sec ( +- 0.69% ) 2,444 cpu-migrations#0.023 K/sec ( +- 1.25% ) 394 page-faults #0.004 K/sec ( +- 0.22% ) 445,902,503,057 cycles#4.257 GHz ( +- 0.55% ) (66.67%) 8,558,376,740 stalled-cycles-frontend #1.92% frontend cycles idle ( +- 0.88% ) (49.99%) 300,346,181,651 stalled-cycles-backend# 67.36% backend cycles idle ( +- 0.76% ) (50.01%) 258,091,488,691 instructions #0.58 insn per cycle #1.16 stalled cycles per insn ( +- 0.22% ) (66.67%) 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%) 3,100,725,426 branch-misses #4.39% of all branches ( +- 0.20% ) (49.99%) 105.583 +- 0.589 seconds time elapsed ( +- 0.56% ) Patched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,055.69 msec task-clock#0.993 CPUs utilized ( +- 0.32% ) 4,606 context-switches #0.044 K/sec ( +- 0.20% ) 2,463 cpu-migrations#0.024 K/sec ( +- 0.93% ) 394 page-faults #0.004 K/sec ( +- 0.25% ) 442,951,129,921 cycles#4.257 GHz ( +- 0.32% ) (66.66%) 8,710,413,329 stalled-cycles-frontend #1.97% frontend cycles idle ( +- 0.47% ) (50.06%) 299,656,905,836 stalled-cycles-backend# 67.65% backend cycles idle ( +- 0.39% ) (50.02%) 252,731,168,193 instructions #0.57 insn per cycle #1.19 stalled cycles per insn ( +- 0.20% ) (66.66%) 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%) 3,100,242,882 branch-misses #4.50% of all branches ( +- 0.15% ) (49.98%) 104.829 +- 0.325 seconds time elapsed ( +- 0.31% ) This is consistent. An add-by-count hot-add operation adds LMBs greedily, so LMBs near the start of the drconf range are considered first. On an otherwise idle LPAR with so many LMBs we would expect to find the LMBs we need near the start of the drconf range, hence the smaller speedup. Signed-o
[PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid() to determine which node id (nid) to use when later calling __add_memory(). This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an appropriate nid for a given address by looking up the LMB containing the address and then passing that LMB to of_drconf_to_nid_single() to get the nid. In dlpar_add_lmb() we get this address from the LMB itself. In short, we have a pointer to an LMB and then we are searching for that LMB *again* in order to find its nid. If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we can skip the redundant lookup. The only error handling we need to duplicate from memory_add_physaddr_to_nid() is the fallback to the default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or an invalid nid. Skipping the extra lookup makes hot-add operations faster, especially on machines with many LMBs. Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000 LMBs on an upatched kernel took ~3.5 hours while a patched kernel completed the same operation in ~2 hours: Unpatched (12450 seconds): Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000 Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added Patched (7065 seconds): Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000 Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added It should be noted that the speedup grows more substantial when hot-adding LMBs at the end of the drconf range. This is because we are skipping a linear LMB search. To see the distinction, consider smaller hot-add test on the same LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096 LMBs completed less than 1 second faster on a patched kernel: Unpatched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,753.42 msec task-clock#0.992 CPUs utilized ( +- 0.55% ) 4,708 context-switches #0.045 K/sec ( +- 0.69% ) 2,444 cpu-migrations#0.023 K/sec ( +- 1.25% ) 394 page-faults #0.004 K/sec ( +- 0.22% ) 445,902,503,057 cycles#4.257 GHz ( +- 0.55% ) (66.67%) 8,558,376,740 stalled-cycles-frontend #1.92% frontend cycles idle ( +- 0.88% ) (49.99%) 300,346,181,651 stalled-cycles-backend# 67.36% backend cycles idle ( +- 0.76% ) (50.01%) 258,091,488,691 instructions #0.58 insn per cycle #1.16 stalled cycles per insn ( +- 0.22% ) (66.67%) 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%) 3,100,725,426 branch-misses #4.39% of all branches ( +- 0.20% ) (49.99%) 105.583 +- 0.589 seconds time elapsed ( +- 0.56% ) Patched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,055.69 msec task-clock#0.993 CPUs utilized ( +- 0.32% ) 4,606 context-switches #0.044 K/sec ( +- 0.20% ) 2,463 cpu-migrations#0.024 K/sec ( +- 0.93% ) 394 page-faults #0.004 K/sec ( +- 0.25% ) 442,951,129,921 cycles#4.257 GHz ( +- 0.32% ) (66.66%) 8,710,413,329 stalled-cycles-frontend #1.97% frontend cycles idle ( +- 0.47% ) (50.06%) 299,656,905,836 stalled-cycles-backend# 67.65% backend cycles idle ( +- 0.39% ) (50.02%) 252,731,168,193 instructions #0.57 insn per cycle #1.19 stalled cycles per insn ( +- 0.20% ) (66.66%) 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%) 3,100,242,882 branch-misses #4.50% of all branches ( +- 0.15% ) (49.98%) 104.829 +- 0.325 seconds time elapsed ( +- 0.31% ) This is consistent. An add-by-count hot-add operation adds LMBs greedily, so LMBs near the start of the drconf range are considered first. On an otherwise idle LPAR with so many LMBs we would expect to find the LMBs we need near the start of the drconf range, hence the smaller speedup. Signed-o
Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
On Wed, Sep 16, 2020 at 09:39:53AM +0200, David Hildenbrand wrote: > On 15.09.20 21:46, Scott Cheloha wrote: > > During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid() > > to determine which node id (nid) to use when later calling __add_memory(). > > > > This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an > > appropriate nid for a given address by looking up the LMB containing the > > address and then passing that LMB to of_drconf_to_nid_single() to get the > > nid. In dlpar_add_lmb() we get this address from the LMB itself. > > > > In short, we have a pointer to an LMB and then we are searching for > > that LMB *again* in order to find its nid. > > > > If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we > > can skip the redundant lookup. The only error handling we need to > > duplicate from memory_add_physaddr_to_nid() is the fallback to the > > default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or > > an invalid nid. > > > > Skipping the extra lookup makes hot-add operations faster, especially > > on machines with many LMBs. > > > > Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000 > > LMBs on an upatched kernel took ~3.5 hours while a patched kernel > > completed the same operation in ~2 hours: > > > > Unpatched (12450 seconds): > > Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000 > > Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to > > hot-add 126000 LMB(s) > > [...] > > Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 > > (drc index 8002) was hot-added > > > > Patched (7065 seconds): > > Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000 > > Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to > > hot-add 126000 LMB(s) > > [...] > > Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 > > (drc index 8002) was hot-added > > > > It should be noted that the speedup grows more substantial when > > hot-adding LMBs at the end of the drconf range. This is because we > > are skipping a linear LMB search. > > > > To see the distinction, consider smaller hot-add test on the same > > LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096 > > LMBs completed less than 1 second faster on a patched kernel: > > > > Unpatched: > > Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): > > > > 104,753.42 msec task-clock#0.992 CPUs utilized > > ( +- 0.55% ) > > 4,708 context-switches #0.045 K/sec > > ( +- 0.69% ) > > 2,444 cpu-migrations#0.023 K/sec > > ( +- 1.25% ) > >394 page-faults #0.004 K/sec > > ( +- 0.22% ) > >445,902,503,057 cycles#4.257 GHz > > ( +- 0.55% ) (66.67%) > > 8,558,376,740 stalled-cycles-frontend #1.92% frontend > > cycles idle ( +- 0.88% ) (49.99%) > >300,346,181,651 stalled-cycles-backend# 67.36% backend cycles > > idle ( +- 0.76% ) (50.01%) > >258,091,488,691 instructions #0.58 insn per cycle > > #1.16 stalled cycles > > per insn ( +- 0.22% ) (66.67%) > > 70,568,169,256 branches # 673.660 M/sec > > ( +- 0.17% ) (50.01%) > > 3,100,725,426 branch-misses #4.39% of all > > branches ( +- 0.20% ) (49.99%) > > > >105.583 +- 0.589 seconds time elapsed ( +- 0.56% ) > > > > Patched: > > Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): > > > > 104,055.69 msec task-clock#0.993 CPUs utilized > > ( +- 0.32% ) > > 4,606 context-switches #0.044 K/sec > > ( +- 0.20% ) > > 2,463 cpu-migrations#0.024 K/sec > > ( +- 0.93% ) > >394 page-faults #0.004 K/sec > > ( +- 0.25% ) > >442,951,129,921 cycles#4.257 GHz > > ( +- 0.32% ) (66.66%) &g
[PATCH v4] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid() to determine which node id (nid) to use when later calling __add_memory(). This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an appropriate nid for a given address by looking up the LMB containing the address and then passing that LMB to of_drconf_to_nid_single() to get the nid. In dlpar_add_lmb() we get this address from the LMB itself. In short, we have a pointer to an LMB and then we are searching for that LMB *again* in order to find its nid. If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we can skip the redundant lookup. The only error handling we need to duplicate from memory_add_physaddr_to_nid() is the fallback to the default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or an invalid nid. Skipping the extra lookup makes hot-add operations faster, especially on machines with many LMBs. Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000 LMBs on an upatched kernel took ~3.5 hours while a patched kernel completed the same operation in ~2 hours: Unpatched (12450 seconds): Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000 Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added Patched (7065 seconds): Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000 Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 2000 (drc index 8002) was hot-added It should be noted that the speedup grows more substantial when hot-adding LMBs at the end of the drconf range. This is because we are skipping a linear LMB search. To see the distinction, consider smaller hot-add test on the same LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096 LMBs completed less than 1 second faster on a patched kernel: Unpatched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,753.42 msec task-clock#0.992 CPUs utilized ( +- 0.55% ) 4,708 context-switches #0.045 K/sec ( +- 0.69% ) 2,444 cpu-migrations#0.023 K/sec ( +- 1.25% ) 394 page-faults #0.004 K/sec ( +- 0.22% ) 445,902,503,057 cycles#4.257 GHz ( +- 0.55% ) (66.67%) 8,558,376,740 stalled-cycles-frontend #1.92% frontend cycles idle ( +- 0.88% ) (49.99%) 300,346,181,651 stalled-cycles-backend# 67.36% backend cycles idle ( +- 0.76% ) (50.01%) 258,091,488,691 instructions #0.58 insn per cycle #1.16 stalled cycles per insn ( +- 0.22% ) (66.67%) 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%) 3,100,725,426 branch-misses #4.39% of all branches ( +- 0.20% ) (49.99%) 105.583 +- 0.589 seconds time elapsed ( +- 0.56% ) Patched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,055.69 msec task-clock#0.993 CPUs utilized ( +- 0.32% ) 4,606 context-switches #0.044 K/sec ( +- 0.20% ) 2,463 cpu-migrations#0.024 K/sec ( +- 0.93% ) 394 page-faults #0.004 K/sec ( +- 0.25% ) 442,951,129,921 cycles#4.257 GHz ( +- 0.32% ) (66.66%) 8,710,413,329 stalled-cycles-frontend #1.97% frontend cycles idle ( +- 0.47% ) (50.06%) 299,656,905,836 stalled-cycles-backend# 67.65% backend cycles idle ( +- 0.39% ) (50.02%) 252,731,168,193 instructions #0.57 insn per cycle #1.19 stalled cycles per insn ( +- 0.20% ) (66.66%) 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%) 3,100,242,882 branch-misses #4.50% of all branches ( +- 0.15% ) (49.98%) 104.829 +- 0.325 seconds time elapsed ( +- 0.31% ) This is consistent. An add-by-count hot-add operation adds LMBs greedily, so LMBs near the start of the drconf range are considered first. On an otherwise idle LPAR with so many LMBs we would expect to find the LMBs we need near the start of the drconf range, hence the smaller speedup. Signed-o
[PATCH] powerpc: topology.h: fix build when CONFIG_NUMA=n
Add a non-NUMA definition for of_drconf_to_nid_single() to topology.h so we have one even if powerpc/mm/numa.c is not compiled. On a non-NUMA kernel the appropriate node id is always first_online_node. Signed-off-by: Scott Cheloha Reported-by: kernel test robot Fixes: 72cdd117c449 ("pseries/hotplug-memory: hot-add: skip redundant LMB lookup") --- arch/powerpc/include/asm/topology.h | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 8728590f514a..90d2424418b5 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -61,6 +61,10 @@ static inline int early_cpu_to_node(int cpu) */ return (nid < 0) ? 0 : nid; } + +struct drmem_lmb; +extern int of_drconf_to_nid_single(struct drmem_lmb *lmb); + #else static inline int early_cpu_to_node(int cpu) { return 0; } @@ -84,10 +88,13 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) return 0; } -#endif /* CONFIG_NUMA */ - struct drmem_lmb; -int of_drconf_to_nid_single(struct drmem_lmb *lmb); +static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb) +{ + return first_online_node; +} + +#endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) extern int find_and_online_cpu_nid(int cpu); -- 2.28.0
[PATCH v2] powerpc: topology.h: fix build when CONFIG_NUMA=n
Add a non-NUMA definition for of_drconf_to_nid_single() to topology.h so we have one even if powerpc/mm/numa.c is not compiled. On a non-NUMA kernel the appropriate node id is always first_online_node. Signed-off-by: Scott Cheloha Reported-by: kernel test robot Fixes: 72cdd117c449 ("pseries/hotplug-memory: hot-add: skip redundant LMB lookup") --- v1: Initial patch. v2: Incorporate suggested cleanups from Christophe Leroy. arch/powerpc/include/asm/topology.h | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 8728590f514a..3beeb030cd78 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -6,6 +6,7 @@ struct device; struct device_node; +struct drmem_lmb; #ifdef CONFIG_NUMA @@ -61,6 +62,9 @@ static inline int early_cpu_to_node(int cpu) */ return (nid < 0) ? 0 : nid; } + +int of_drconf_to_nid_single(struct drmem_lmb *lmb); + #else static inline int early_cpu_to_node(int cpu) { return 0; } @@ -84,10 +88,12 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) return 0; } -#endif /* CONFIG_NUMA */ +static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb) +{ + return first_online_node; +} -struct drmem_lmb; -int of_drconf_to_nid_single(struct drmem_lmb *lmb); +#endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) extern int find_and_online_cpu_nid(int cpu); -- 2.28.0
[PATCH v1 1/4] powerpc/pseries: hvcall.h: add H_WATCHDOG opcode, H_NOOP return code
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. Add the opcode for the H_WATCHDOG hypercall to hvcall.h. While here, add a definition for H_NOOP, a possible return code for H_WATCHDOG. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/hvcall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index d92a20a85395..4b4f69c35b4f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -87,6 +87,7 @@ #define H_P7 -60 #define H_P8 -61 #define H_P9 -62 +#define H_NOOP -63 #define H_TOO_BIG -64 #define H_UNSUPPORTED -67 #define H_OVERLAP -68 @@ -324,7 +325,8 @@ #define H_RPT_INVALIDATE 0x448 #define H_SCM_FLUSH0x44C #define H_GET_ENERGY_SCALE_INFO0x450 -#define MAX_HCALL_OPCODE H_GET_ENERGY_SCALE_INFO +#define H_WATCHDOG 0x45C +#define MAX_HCALL_OPCODE H_WATCHDOG /* Scope args for H_SCM_UNBIND_ALL */ #define H_UNBIND_SCOPE_ALL (0x1) -- 2.27.0
[PATCH v1 2/4] powerpc/pseries: add FW_FEATURE_WATCHDOG flag
PAPR v2.12 specifies a new optional function set, "hcall-watchdog", for the /rtas/ibm,hypertas-functions property. The presence of this function set indicates support for the H_WATCHDOG hypercall. Check for this function set and, if present, set the new FW_FEATURE_WATCHDOG flag. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/firmware.h | 3 ++- arch/powerpc/platforms/pseries/firmware.c | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 834b8ecf..398e0b5e485f 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -55,6 +55,7 @@ #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100) #define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0200) #define FW_FEATURE_ENERGY_SCALE_INFO ASM_CONST(0x0400) +#define FW_FEATURE_WATCHDOGASM_CONST(0x0800) #ifndef __ASSEMBLY__ @@ -76,7 +77,7 @@ enum { FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY | - FW_FEATURE_ENERGY_SCALE_INFO, + FW_FEATURE_ENERGY_SCALE_INFO | FW_FEATURE_WATCHDOG, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c index 09c119b2f623..080108d129ed 100644 --- a/arch/powerpc/platforms/pseries/firmware.c +++ b/arch/powerpc/platforms/pseries/firmware.c @@ -67,6 +67,7 @@ hypertas_fw_features_table[] = { {FW_FEATURE_PAPR_SCM, "hcall-scm"}, {FW_FEATURE_RPT_INVALIDATE, "hcall-rpt-invalidate"}, {FW_FEATURE_ENERGY_SCALE_INFO, "hcall-energy-scale-info"}, + {FW_FEATURE_WATCHDOG, "hcall-watchdog"}, }; /* Build up the firmware features bitmask using the contents of -- 2.27.0
[PATCH v1 3/4] powerpc/pseries: register pseries-wdt device with platform bus
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. These timers do not conform to PowerPC device conventions. They are not affixed to any extant bus, nor do they have full representation in the device tree. As a workaround we represent them as platform devices. This patch registers a single platform device, "pseries-wdt", with the platform bus if the FW_FEATURE_WATCHDOG flag is set. A driver for this device, "pseries-wdt", will be introduced in a subsequent patch. Signed-off-by: Scott Cheloha --- arch/powerpc/platforms/pseries/setup.c | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 955ff8aa1644..6e8b67ea2a33 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -169,6 +170,22 @@ static void __init fwnmi_init(void) #endif } +/* + * Affix a device for the first timer to the platform bus if + * we have firmware support for the H_WATCHDOG hypercall. + */ +static struct platform_device *pseries_wdt_pdev; + +static __init int pseries_wdt_init(void) +{ + if (!firmware_has_feature(FW_FEATURE_WATCHDOG)) + return 0; + pseries_wdt_pdev = platform_device_register_simple("pseries-wdt", + 0, NULL, 0); + return 0; +} +machine_subsys_initcall(pseries, pseries_wdt_init); + static void pseries_8259_cascade(struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); -- 2.27.0
[PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 337 ++ 4 files changed, 358 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..4ffe725e796c 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 1 (power off), 2 (restart), + 3 (dump and restart). (default=2) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..f41bc4d3b7a2 --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. + */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * H_WATCHDOG Hypercall Input + * + * R4: "flags": + * + * A 64-bit value structured as follows: + * + * Bits 0-46: Reserved (must be zero). + */ +#define PSERIES_WDTF_RESERVED PPC_BITMASK(0, 46) + +/* + * Bit 47: "leaveOtherWatchdogsRunningOnTimeout" + * + * 0 Stop outstanding watchdogs on timeout. + * 1 Leave outstanding watchdogs running on timeout. + */ +#define PSERIES_WDTF_LEAVE_OTHER PPC_BIT(47) + +/* + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + * 0x02 Stop Watchdog + * 0x03 Query Watchdog Capabilities + * 0x04 Query Watchdog LPM Requirement + */ +#define PSERIES_WDTF_OP(op)SETFIELD((op), 48, 55) +#define PSERIES_WDTF_OP_START PSERIES_WDTF_OP(0x1) +#define PSERIES_WDTF_OP_STOP PSERIES_WDTF_OP(0x2) +#define PSERIES_WDTF_OP_QUERY PSERIES_WDTF_OP(0x3) +#define PSERIES_WDTF_OP_QUERY_LPM PSERIES_WDTF_OP(0x4) + +/* + * Bits 56-63: "timeoutAction" + * + * 0x01 Hard poweroff + * 0x02 Hard restart + * 0x03 Dump restart + */ +#define PSERIES_WDTF_ACTION(ac)SETFIELD(ac, 56, 63) +#define PSERIES_WDTF_ACTION_HARD_POWERO
[PATCH v1 0/4] pseries-wdt: initial support for PAPR virtual watchdog timers
This series is preceded by two RFCs: RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/ RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/ Changes of note from RFC v2: - Separate platform device registration in pseries/setup.c into a separate patch to avoid cross-subsystem patches. - Change the "action" module parameter to use integers instead of strings. - Add a proper .get method, action_get(), for reading the "action" module parameter value. - Document pseries-wdt module parameters in watchdog-parameters.rst. - Correct many checkpatch.pl warnings. - For now, assume there is only one pseries-wdt device and always assign it watchdogNumber 1.
[PATCH v1 1/4] powerpc/pseries: hvcall.h: add H_WATCHDOG opcode, H_NOOP return code
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. Add the opcode for the H_WATCHDOG hypercall to hvcall.h. While here, add a definition for H_NOOP, a possible return code for H_WATCHDOG. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/hvcall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index d92a20a85395..4b4f69c35b4f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -87,6 +87,7 @@ #define H_P7 -60 #define H_P8 -61 #define H_P9 -62 +#define H_NOOP -63 #define H_TOO_BIG -64 #define H_UNSUPPORTED -67 #define H_OVERLAP -68 @@ -324,7 +325,8 @@ #define H_RPT_INVALIDATE 0x448 #define H_SCM_FLUSH0x44C #define H_GET_ENERGY_SCALE_INFO0x450 -#define MAX_HCALL_OPCODE H_GET_ENERGY_SCALE_INFO +#define H_WATCHDOG 0x45C +#define MAX_HCALL_OPCODE H_WATCHDOG /* Scope args for H_SCM_UNBIND_ALL */ #define H_UNBIND_SCOPE_ALL (0x1) -- 2.27.0
[PATCH v1 2/4] powerpc/pseries: add FW_FEATURE_WATCHDOG flag
PAPR v2.12 specifies a new optional function set, "hcall-watchdog", for the /rtas/ibm,hypertas-functions property. The presence of this function set indicates support for the H_WATCHDOG hypercall. Check for this function set and, if present, set the new FW_FEATURE_WATCHDOG flag. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/firmware.h | 3 ++- arch/powerpc/platforms/pseries/firmware.c | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 834b8ecf..398e0b5e485f 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -55,6 +55,7 @@ #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100) #define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0200) #define FW_FEATURE_ENERGY_SCALE_INFO ASM_CONST(0x0400) +#define FW_FEATURE_WATCHDOGASM_CONST(0x0800) #ifndef __ASSEMBLY__ @@ -76,7 +77,7 @@ enum { FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY | - FW_FEATURE_ENERGY_SCALE_INFO, + FW_FEATURE_ENERGY_SCALE_INFO | FW_FEATURE_WATCHDOG, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c index 09c119b2f623..080108d129ed 100644 --- a/arch/powerpc/platforms/pseries/firmware.c +++ b/arch/powerpc/platforms/pseries/firmware.c @@ -67,6 +67,7 @@ hypertas_fw_features_table[] = { {FW_FEATURE_PAPR_SCM, "hcall-scm"}, {FW_FEATURE_RPT_INVALIDATE, "hcall-rpt-invalidate"}, {FW_FEATURE_ENERGY_SCALE_INFO, "hcall-energy-scale-info"}, + {FW_FEATURE_WATCHDOG, "hcall-watchdog"}, }; /* Build up the firmware features bitmask using the contents of -- 2.27.0
[PATCH v1 3/4] powerpc/pseries: register pseries-wdt device with platform bus
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. These timers do not conform to PowerPC device conventions. They are not affixed to any extant bus, nor do they have full representation in the device tree. As a workaround we represent them as platform devices. This patch registers a single platform device, "pseries-wdt", with the platform bus if the FW_FEATURE_WATCHDOG flag is set. A driver for this device, "pseries-wdt", will be introduced in a subsequent patch. Signed-off-by: Scott Cheloha --- arch/powerpc/platforms/pseries/setup.c | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 955ff8aa1644..6e8b67ea2a33 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -169,6 +170,22 @@ static void __init fwnmi_init(void) #endif } +/* + * Affix a device for the first timer to the platform bus if + * we have firmware support for the H_WATCHDOG hypercall. + */ +static struct platform_device *pseries_wdt_pdev; + +static __init int pseries_wdt_init(void) +{ + if (!firmware_has_feature(FW_FEATURE_WATCHDOG)) + return 0; + pseries_wdt_pdev = platform_device_register_simple("pseries-wdt", + 0, NULL, 0); + return 0; +} +machine_subsys_initcall(pseries, pseries_wdt_init); + static void pseries_8259_cascade(struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); -- 2.27.0
[PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 337 ++ 4 files changed, 358 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..4ffe725e796c 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 1 (power off), 2 (restart), + 3 (dump and restart). (default=2) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..f41bc4d3b7a2 --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. + */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * H_WATCHDOG Hypercall Input + * + * R4: "flags": + * + * A 64-bit value structured as follows: + * + * Bits 0-46: Reserved (must be zero). + */ +#define PSERIES_WDTF_RESERVED PPC_BITMASK(0, 46) + +/* + * Bit 47: "leaveOtherWatchdogsRunningOnTimeout" + * + * 0 Stop outstanding watchdogs on timeout. + * 1 Leave outstanding watchdogs running on timeout. + */ +#define PSERIES_WDTF_LEAVE_OTHER PPC_BIT(47) + +/* + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + * 0x02 Stop Watchdog + * 0x03 Query Watchdog Capabilities + * 0x04 Query Watchdog LPM Requirement + */ +#define PSERIES_WDTF_OP(op)SETFIELD((op), 48, 55) +#define PSERIES_WDTF_OP_START PSERIES_WDTF_OP(0x1) +#define PSERIES_WDTF_OP_STOP PSERIES_WDTF_OP(0x2) +#define PSERIES_WDTF_OP_QUERY PSERIES_WDTF_OP(0x3) +#define PSERIES_WDTF_OP_QUERY_LPM PSERIES_WDTF_OP(0x4) + +/* + * Bits 56-63: "timeoutAction" + * + * 0x01 Hard poweroff + * 0x02 Hard restart + * 0x03 Dump restart + */ +#define PSERIES_WDTF_ACTION(ac)SETFIELD(ac, 56, 63) +#define PSERIES_WDTF_ACTION_HARD_POWERO
[PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are useful to modules outside of perf/, so move them into asm/hvcall.h to live alongside the other powerpc hypercall structs. Leave the perf-specific GPCI stuff in perf/hv-gpci.h. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/hvcall.h | 36 +++ arch/powerpc/perf/hv-gpci.c | 9 arch/powerpc/perf/hv-gpci.h | 27 --- 3 files changed, 36 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index e90c073e437e..c338480b4551 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -527,6 +527,42 @@ struct hv_guest_state { /* Latest version of hv_guest_state structure */ #define HV_GUEST_STATE_VERSION 1 +/* + * From the document "H_GetPerformanceCounterInfo Interface" v1.07 + * + * H_GET_PERF_COUNTER_INFO argument + */ +struct hv_get_perf_counter_info_params { + __be32 counter_request; /* I */ + __be32 starting_index; /* IO */ + __be16 secondary_index; /* IO */ + __be16 returned_values; /* O */ + __be32 detail_rc; /* O, only needed when called via *_norets() */ + + /* +* O, size each of counter_value element in bytes, only set for version +* >= 0x3 +*/ + __be16 cv_element_size; + + /* I, 0 (zero) for versions < 0x3 */ + __u8 counter_info_version_in; + + /* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */ + __u8 counter_info_version_out; + __u8 reserved[0xC]; + __u8 counter_value[]; +} __packed; + +#define HGPCI_REQ_BUFFER_SIZE 4096 +#define HGPCI_MAX_DATA_BYTES \ + (HGPCI_REQ_BUFFER_SIZE - sizeof(struct hv_get_perf_counter_info_params)) + +struct hv_gpci_request_buffer { + struct hv_get_perf_counter_info_params params; + uint8_t bytes[HGPCI_MAX_DATA_BYTES]; +} __packed; + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_HVCALL_H */ diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c index 6884d16ec19b..1667315b82e9 100644 --- a/arch/powerpc/perf/hv-gpci.c +++ b/arch/powerpc/perf/hv-gpci.c @@ -123,17 +123,8 @@ static const struct attribute_group *attr_groups[] = { NULL, }; -#define HGPCI_REQ_BUFFER_SIZE 4096 -#define HGPCI_MAX_DATA_BYTES \ - (HGPCI_REQ_BUFFER_SIZE - sizeof(struct hv_get_perf_counter_info_params)) - static DEFINE_PER_CPU(char, hv_gpci_reqb[HGPCI_REQ_BUFFER_SIZE]) __aligned(sizeof(uint64_t)); -struct hv_gpci_request_buffer { - struct hv_get_perf_counter_info_params params; - uint8_t bytes[HGPCI_MAX_DATA_BYTES]; -} __packed; - static unsigned long single_gpci_request(u32 req, u32 starting_index, u16 secondary_index, u8 version_in, u32 offset, u8 length, u64 *value) diff --git a/arch/powerpc/perf/hv-gpci.h b/arch/powerpc/perf/hv-gpci.h index a3053eda5dcc..4d108262bed7 100644 --- a/arch/powerpc/perf/hv-gpci.h +++ b/arch/powerpc/perf/hv-gpci.h @@ -2,33 +2,6 @@ #ifndef LINUX_POWERPC_PERF_HV_GPCI_H_ #define LINUX_POWERPC_PERF_HV_GPCI_H_ -#include - -/* From the document "H_GetPerformanceCounterInfo Interface" v1.07 */ - -/* H_GET_PERF_COUNTER_INFO argument */ -struct hv_get_perf_counter_info_params { - __be32 counter_request; /* I */ - __be32 starting_index; /* IO */ - __be16 secondary_index; /* IO */ - __be16 returned_values; /* O */ - __be32 detail_rc; /* O, only needed when called via *_norets() */ - - /* -* O, size each of counter_value element in bytes, only set for version -* >= 0x3 -*/ - __be16 cv_element_size; - - /* I, 0 (zero) for versions < 0x3 */ - __u8 counter_info_version_in; - - /* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */ - __u8 counter_info_version_out; - __u8 reserved[0xC]; - __u8 counter_value[]; -} __packed; - /* * counter info version => fw version/reference (spec version) * -- 2.24.1
[PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall, Affinity_Domain_Info_By_Partition, which returns, among other things, a "partition affinity score" for a given LPAR. This score, a value on [0-100], represents the processor-memory affinity for the LPAR in question. A score of 0 indicates the worst possible affinity while a score of 100 indicates perfect affinity. The score can be used to reason about performance. This patch adds the score for the local LPAR to the lparcfg procfile under a new 'partition_affinity_score' key. Signed-off-by: Scott Cheloha --- arch/powerpc/platforms/pseries/lparcfg.c | 35 1 file changed, 35 insertions(+) diff --git a/arch/powerpc/platforms/pseries/lparcfg.c b/arch/powerpc/platforms/pseries/lparcfg.c index b8d28ab88178..e278390ab28d 100644 --- a/arch/powerpc/platforms/pseries/lparcfg.c +++ b/arch/powerpc/platforms/pseries/lparcfg.c @@ -136,6 +136,39 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data) return rc; } +static void show_gpci_data(struct seq_file *m) +{ + struct hv_gpci_request_buffer *buf; + unsigned int affinity_score; + long ret; + + buf = kmalloc(sizeof(*buf), GFP_KERNEL); + if (buf == NULL) + return; + + /* +* Show the local LPAR's affinity score. +* +* 0xB1 selects the Affinity_Domain_Info_By_Partition subcall. +* The score is at byte 0xB in the output buffer. +*/ + memset(&buf->params, 0, sizeof(buf->params)); + buf->params.counter_request = cpu_to_be32(0xB1); + buf->params.starting_index = cpu_to_be32(-1); /* local LPAR */ + buf->params.counter_info_version_in = 0x5; /* v5+ for score */ + ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf), +sizeof(*buf)); + if (ret != H_SUCCESS) { + pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n", +ret, be32_to_cpu(buf->params.detail_rc)); + goto out; + } + affinity_score = buf->bytes[0xB]; + seq_printf(m, "partition_affinity_score=%u\n", affinity_score); +out: + kfree(buf); +} + static unsigned h_pic(unsigned long *pool_idle_time, unsigned long *num_procs) { @@ -487,6 +520,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v) partition_active_processors * 100); } + show_gpci_data(m); + seq_printf(m, "partition_active_processors=%d\n", partition_active_processors); -- 2.24.1
[PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct
At memory hot-remove time we can retrieve an LMB's nid from its corresponding memory_block. There is no need to store the nid in multiple locations. Note that lmb_to_memblock() uses find_memory_block() to get the corresponding memory_block. As find_memory_block() runs in sub-linear time this approach is negligibly slower than what we do at present. In exchange for this lookup at hot-remove time we no longer need to call memory_add_physaddr_to_nid() during drmem_init() for each LMB. On powerpc, memory_add_physaddr_to_nid() is a linear search, so this spares us an O(n^2) initialization during boot. On systems with many LMBs that initialization overhead is palpable and disruptive. For example, on a box with 249854 LMBs we're seeing drmem_init() take upwards of 30 seconds to complete: [ 53.721639] drmem: initializing drmem v2 [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1] [ 80.604377] Modules linked in: [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) [ 80.604412] MSR: 82009033 CR: 44000248 XER: 000d [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 c0003cfede30 [ 80.604431] GPR04: c0f4095a 002f 1000 [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 c00c0002fdfb2001 [ 80.604431] GPR12: c0001e8ec200 [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 [ 80.604492] Call Trace: [ 80.604498] [c0002dbff8493ac0] [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable) [ 80.604509] [c0002dbff8493b20] [c0087c10] memory_add_physaddr_to_nid+0x20/0x60 [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 [ 80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0 [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] kernel_init_freeable+0x2d8/0x3a0 [ 80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 [ 80.604560] [c0002dbff8493e20] [c000b648] ret_from_kernel_thread+0x5c/0x74 [ 80.604567] Instruction dump: [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 7d094214 [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 7fbe5040 [ 89.047390] drmem: 249854 LMB(s) With a patched kernel on the same machine we're no longer seeing the soft lockup. drmem_init() now completes in negligible time, even when the LMB count is large. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/drmem.h | 21 --- arch/powerpc/mm/drmem.c | 6 +- .../platforms/pseries/hotplug-memory.c| 19 ++--- 3 files changed, 13 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 414d209f45bb..34e4e9b257f5 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -13,9 +13,6 @@ struct drmem_lmb { u32 drc_index; u32 aa_index; u32 flags; -#ifdef CONFIG_MEMORY_HOTPLUG - int nid; -#endif }; struct drmem_lmb_info { @@ -104,22 +101,4 @@ static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb) lmb->aa_index = 0x; } -#ifdef CONFIG_MEMORY_HOTPLUG -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ - lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr); -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ - lmb->nid = -1; -} -#else -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ -} -#endif - #endif /* _ASM_POWERPC_LMB_H */ diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 59327cefbc6a..873fcfc7b875 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop) if (!drmem_info->lmbs) return; - for_each_drmem_lmb(lmb) { + for_each_drmem_lmb(lmb) read_drconf_v1_cell(lmb, &prop); - lmb_set_nid(lmb); - } } static void __init init_drmem_v2_lmbs(const __be32 *prop) @@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop) lmb->aa_index = dr_cell.aa_index; lmb->flags = dr_cell.flags; - - lmb_set_nid(lmb); } } } diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c inde
[PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
At memory hot-remove time we can retrieve an LMB's nid from its corresponding memory_block. There is no need to store the nid in multiple locations. Note that lmb_to_memblock() uses find_memory_block() to get the corresponding memory_block. As find_memory_block() runs in sub-linear time this approach is negligibly slower than what we do at present. In exchange for this lookup at hot-remove time we no longer need to call memory_add_physaddr_to_nid() during drmem_init() for each LMB. On powerpc, memory_add_physaddr_to_nid() is a linear search, so this spares us an O(n^2) initialization during boot. On systems with many LMBs that initialization overhead is palpable and disruptive. For example, on a box with 249854 LMBs we're seeing drmem_init() take upwards of 30 seconds to complete: [ 53.721639] drmem: initializing drmem v2 [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1] [ 80.604377] Modules linked in: [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) [ 80.604412] MSR: 82009033 CR: 44000248 XER: 000d [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 c0003cfede30 [ 80.604431] GPR04: c0f4095a 002f 1000 [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 c00c0002fdfb2001 [ 80.604431] GPR12: c0001e8ec200 [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 [ 80.604492] Call Trace: [ 80.604498] [c0002dbff8493ac0] [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable) [ 80.604509] [c0002dbff8493b20] [c0087c10] memory_add_physaddr_to_nid+0x20/0x60 [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 [ 80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0 [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] kernel_init_freeable+0x2d8/0x3a0 [ 80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 [ 80.604560] [c0002dbff8493e20] [c000b648] ret_from_kernel_thread+0x5c/0x74 [ 80.604567] Instruction dump: [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 7d094214 [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 7fbe5040 [ 89.047390] drmem: 249854 LMB(s) With a patched kernel on the same machine we're no longer seeing the soft lockup. drmem_init() now completes in negligible time, even when the LMB count is large. Signed-off-by: Scott Cheloha --- v1: - RFC v2: - Adjusted commit message. - Miscellaneous cleanup. v3: - Correct issue found by Laurent Dufour : - Add missing put_device() call in dlpar_remove_lmb() for the lmb's associated mem_block. arch/powerpc/include/asm/drmem.h | 21 arch/powerpc/mm/drmem.c | 6 + .../platforms/pseries/hotplug-memory.c| 24 --- 3 files changed, 17 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 414d209f45bb..34e4e9b257f5 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -13,9 +13,6 @@ struct drmem_lmb { u32 drc_index; u32 aa_index; u32 flags; -#ifdef CONFIG_MEMORY_HOTPLUG - int nid; -#endif }; struct drmem_lmb_info { @@ -104,22 +101,4 @@ static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb) lmb->aa_index = 0x; } -#ifdef CONFIG_MEMORY_HOTPLUG -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ - lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr); -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ - lmb->nid = -1; -} -#else -static inline void lmb_set_nid(struct drmem_lmb *lmb) -{ -} -static inline void lmb_clear_nid(struct drmem_lmb *lmb) -{ -} -#endif - #endif /* _ASM_POWERPC_LMB_H */ diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 59327cefbc6a..873fcfc7b875 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop) if (!drmem_info->lmbs) return; - for_each_drmem_lmb(lmb) { + for_each_drmem_lmb(lmb) read_drconf_v1_cell(lmb, &prop); - lmb_set_nid(lmb); - } } static void __init init_drmem_v2_lmbs(const __be32 *prop) @@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop) lmb->aa_index = dr_cell.aa_index;
[PATCH] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
The H_GetPerformanceCounterInfo PHYP hypercall has a subcall, Affinity_Domain_Info_By_Partition, which returns, among other things, a "partition affinity score" for a given LPAR. This score, a value on [0-100], represents the processor-memory affinity for the LPAR in question. A score of 0 indicates the worst possible affinity while a score of 100 indicates perfect affinity. The score can be used to reason about performance. This patch adds the score for the local LPAR to the lparcfg procfile under a new 'partition_affinity_score' key. The H_GetPerformanceCounterInfo hypercall is already used elsewhere in the kernel, in powerpc/perf/hv-gpci.c. Refactoring that code and this code into a more general API might be worthwhile if additional modules require the hypercall in the future. Signed-off-by: Scott Cheloha --- arch/powerpc/platforms/pseries/lparcfg.c | 53 1 file changed, 53 insertions(+) diff --git a/arch/powerpc/platforms/pseries/lparcfg.c b/arch/powerpc/platforms/pseries/lparcfg.c index b8d28ab88178..b75151eee0f0 100644 --- a/arch/powerpc/platforms/pseries/lparcfg.c +++ b/arch/powerpc/platforms/pseries/lparcfg.c @@ -136,6 +136,57 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data) return rc; } +/* + * Based on H_GetPerformanceCounterInfo v1.10. + */ +static void show_gpci_data(struct seq_file *m) +{ + struct perf_counter_info_params { + __be32 counter_request; + __be32 starting_index; + __be16 secondary_index; + __be16 returned_values; + __be32 detail_rc; + __be16 counter_value_element_size; + u8 counter_info_version_in; + u8 counter_info_version_out; + u8 reserved[0xC]; + } __packed; + struct hv_gpci_request_buffer { + struct perf_counter_info_params params; + u8 output[4096 - sizeof(struct perf_counter_info_params)]; + } __packed; + struct hv_gpci_request_buffer *buf; + long ret; + unsigned int affinity_score; + + buf = kmalloc(sizeof(*buf), GFP_KERNEL); + if (buf == NULL) + return; + + /* +* Show the local LPAR's affinity score. +* +* 0xB1 selects the Affinity_Domain_Info_By_Partition subcall. +* The score is at byte 0xB in the output buffer. +*/ + memset(&buf->params, 0, sizeof(buf->params)); + buf->params.counter_request = cpu_to_be32(0xB1); + buf->params.starting_index = cpu_to_be32(-1); /* local LPAR */ + buf->params.counter_info_version_in = 0x5; /* v5+ for score */ + ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf), +sizeof(*buf)); + if (ret != H_SUCCESS) { + pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n", +ret, be32_to_cpu(buf->params.detail_rc)); + goto out; + } + affinity_score = buf->output[0xB]; + seq_printf(m, "partition_affinity_score=%u\n", affinity_score); +out: + kfree(buf); +} + static unsigned h_pic(unsigned long *pool_idle_time, unsigned long *num_procs) { @@ -487,6 +538,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v) partition_active_processors * 100); } + show_gpci_data(m); + seq_printf(m, "partition_active_processors=%d\n", partition_active_processors); -- 2.24.1
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
On Wed, May 25, 2022 at 04:35:11PM +1000, Alexey Kardashevskiy wrote: > > On 5/21/22 04:35, Scott Cheloha wrote: > > PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits > > guest control of one or more virtual watchdog timers. The timers have > > millisecond granularity. The guest is terminated when a timer > > expires. > > > > This patch adds a watchdog driver for these timers, "pseries-wdt". > > > > pseries_wdt_probe() currently assumes the existence of only one > > platform device and always assigns it watchdogNumber 1. If we ever > > expose more than one timer to userspace we will need to devise a way > > to assign a distinct watchdogNumber to each platform device at device > > registration time. > > This one should go before 4/4 in the series for bisectability. > > What is platform_device_register_simple("pseries-wdt",...) going to do > without the driver? This is a chicken-and-egg problem without an obvious solution. A device without a driver is a body without a soul. A driver without a device is a ghost without a machine. ... or something like that, don't quote me :) Absent some very compelling reasoning, I would like to keep the current order. It feels logical to me to keep the powerpc/pseries patches adjacent and prior to the watchdog driver patch. > > Signed-off-by: Scott Cheloha > > --- > > .../watchdog/watchdog-parameters.rst | 12 + > > drivers/watchdog/Kconfig | 8 + > > drivers/watchdog/Makefile | 1 + > > drivers/watchdog/pseries-wdt.c| 337 ++ > > 4 files changed, 358 insertions(+) > > create mode 100644 drivers/watchdog/pseries-wdt.c > > > > diff --git a/Documentation/watchdog/watchdog-parameters.rst > > b/Documentation/watchdog/watchdog-parameters.rst > > index 223c99361a30..4ffe725e796c 100644 > > --- a/Documentation/watchdog/watchdog-parameters.rst > > +++ b/Documentation/watchdog/watchdog-parameters.rst > > @@ -425,6 +425,18 @@ pnx833x_wdt: > > - > > +pseries-wdt: > > +action: > > + Action taken when watchdog expires: 1 (power off), 2 (restart), > > + 3 (dump and restart). (default=2) > > +timeout: > > + Initial watchdog timeout in seconds. (default=60) > > +nowayout: > > + Watchdog cannot be stopped once started. > > + (default=kernel config parameter) > > + > > +- > > + > > rc32434_wdt: > > timeout: > > Watchdog timeout value, in seconds (default=20) > > diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig > > index c4e82a8d863f..06b412603f3e 100644 > > --- a/drivers/watchdog/Kconfig > > +++ b/drivers/watchdog/Kconfig > > @@ -1932,6 +1932,14 @@ config MEN_A21_WDT > > # PPC64 Architecture > > +config PSERIES_WDT > > + tristate "POWER Architecture Platform Watchdog Timer" > > + depends on PPC_PSERIES > > + select WATCHDOG_CORE > > + help > > + Driver for virtual watchdog timers provided by PAPR > > + hypervisors (e.g. PowerVM, KVM). > > + > > config WATCHDOG_RTAS > > tristate "RTAS watchdog" > > depends on PPC_RTAS > > diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile > > index f7da867e8782..f35660409f17 100644 > > --- a/drivers/watchdog/Makefile > > +++ b/drivers/watchdog/Makefile > > @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o > > obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o > > # PPC64 Architecture > > +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o > > obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o > > # S390 Architecture > > diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c > > new file mode 100644 > > index ..f41bc4d3b7a2 > > --- /dev/null > > +++ b/drivers/watchdog/pseries-wdt.c > > @@ -0,0 +1,337 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > +/* > > + * Copyright (c) 2022 International Business Machines, Inc. > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#define DRV_NAME "pseries-wdt" > > + > > +/* > > + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify > > + * defining bitfields as described in the PAPR without nee
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
On Wed, May 25, 2022 at 12:52:09AM -0700, Guenter Roeck wrote: > On 5/24/22 23:35, Alexey Kardashevskiy wrote: > > > > On 5/21/22 04:35, Scott Cheloha wrote: > > > PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits > > > guest control of one or more virtual watchdog timers. The timers have > > > millisecond granularity. The guest is terminated when a timer > > > expires. > > > > > > This patch adds a watchdog driver for these timers, "pseries-wdt". > > > > > > pseries_wdt_probe() currently assumes the existence of only one > > > platform device and always assigns it watchdogNumber 1. If we ever > > > expose more than one timer to userspace we will need to devise a way > > > to assign a distinct watchdogNumber to each platform device at device > > > registration time. > > > > This one should go before 4/4 in the series for bisectability. > > > > What is platform_device_register_simple("pseries-wdt",...) going to do > > without the driver? > > > > > > > > Signed-off-by: Scott Cheloha > > > --- > > > .../watchdog/watchdog-parameters.rst | 12 + > > > drivers/watchdog/Kconfig | 8 + > > > drivers/watchdog/Makefile | 1 + > > > drivers/watchdog/pseries-wdt.c | 337 ++ > > > 4 files changed, 358 insertions(+) > > > create mode 100644 drivers/watchdog/pseries-wdt.c > > > > > > diff --git a/Documentation/watchdog/watchdog-parameters.rst > > > b/Documentation/watchdog/watchdog-parameters.rst > > > index 223c99361a30..4ffe725e796c 100644 > > > --- a/Documentation/watchdog/watchdog-parameters.rst > > > +++ b/Documentation/watchdog/watchdog-parameters.rst > > > @@ -425,6 +425,18 @@ pnx833x_wdt: > > > - > > > +pseries-wdt: > > > + action: > > > + Action taken when watchdog expires: 1 (power off), 2 (restart), > > > + 3 (dump and restart). (default=2) > > > + timeout: > > > + Initial watchdog timeout in seconds. (default=60) > > > + nowayout: > > > + Watchdog cannot be stopped once started. > > > + (default=kernel config parameter) > > > + > > > +- > > > + > > > rc32434_wdt: > > > timeout: > > > Watchdog timeout value, in seconds (default=20) > > > diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig > > > index c4e82a8d863f..06b412603f3e 100644 > > > --- a/drivers/watchdog/Kconfig > > > +++ b/drivers/watchdog/Kconfig > > > @@ -1932,6 +1932,14 @@ config MEN_A21_WDT > > > # PPC64 Architecture > > > +config PSERIES_WDT > > > + tristate "POWER Architecture Platform Watchdog Timer" > > > + depends on PPC_PSERIES > > > + select WATCHDOG_CORE > > > + help > > > + Driver for virtual watchdog timers provided by PAPR > > > + hypervisors (e.g. PowerVM, KVM). > > > + > > > config WATCHDOG_RTAS > > > tristate "RTAS watchdog" > > > depends on PPC_RTAS > > > diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile > > > index f7da867e8782..f35660409f17 100644 > > > --- a/drivers/watchdog/Makefile > > > +++ b/drivers/watchdog/Makefile > > > @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o > > > obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o > > > # PPC64 Architecture > > > +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o > > > obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o > > > # S390 Architecture > > > diff --git a/drivers/watchdog/pseries-wdt.c > > > b/drivers/watchdog/pseries-wdt.c > > > new file mode 100644 > > > index ..f41bc4d3b7a2 > > > --- /dev/null > > > +++ b/drivers/watchdog/pseries-wdt.c > > > @@ -0,0 +1,337 @@ > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > +/* > > > + * Copyright (c) 2022 International Business Machines, Inc. > > > + */ > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +#define DRV_NAME "pseries-wdt" > > > + > > >
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
On Wed, Jun 01, 2022 at 08:45:03AM -0700, Guenter Roeck wrote: > On 6/1/22 08:07, Scott Cheloha wrote: > [ ... ] > > > > > +static unsigned long action = PSERIES_WDTF_ACTION_HARD_RESTART; > > > > > + > > > > > +static int action_get(char *buf, const struct kernel_param *kp) > > > > > +{ > > > > > + int val; > > > > > + > > > > > + switch (action) { > > > > > + case PSERIES_WDTF_ACTION_HARD_POWEROFF: > > > > > + val = 1; > > > > > + break; > > > > > + case PSERIES_WDTF_ACTION_HARD_RESTART: > > > > > + val = 2; > > > > > + break; > > > > > + case PSERIES_WDTF_ACTION_DUMP_RESTART: > > > > > + val = 3; > > > > > + break; > > > > > + default: > > > > > + return -EINVAL; > > > > > + } > > > > > + return sprintf(buf, "%d\n", val); > > > > > +} > > > > > + > > > > > +static int action_set(const char *val, const struct kernel_param *kp) > > > > > +{ > > > > > + int choice; > > > > > + > > > > > + if (kstrtoint(val, 10, &choice)) > > > > > + return -EINVAL; > > > > > + switch (choice) { > > > > > + case 1: > > > > > + action = PSERIES_WDTF_ACTION_HARD_POWEROFF; > > > > > + return 0; > > > > > + case 2: > > > > > + action = PSERIES_WDTF_ACTION_HARD_RESTART; > > > > > + return 0; > > > > > + case 3: > > > > > + action = PSERIES_WDTF_ACTION_DUMP_RESTART; > > > > > + return 0; > > > > > + } > > > > > + return -EINVAL; > > > > > +} > > > > > + > > > > > +static const struct kernel_param_ops action_ops = { > > > > > + .get = action_get, > > > > > + .set = action_set, > > > > > +}; > > > > > +module_param_cb(action, &action_ops, NULL, 0444); > > > > > > > > > > > > 0644 here and below? > > > > > > > > > > That would make the module parameters have to be runtime > > > configurable, which does not make sense at least for > > > the two parameters below. > > > > Agreed. > > > > > I don't know though if it is really valuable to have all the > > > above code instead of just > > > storing the action numbers and doing the conversion to action > > > once in the probe function. The above code really only > > > makes sense if the action is changeable during runtime and more > > > is done that just converting it to another value. > > > > Having a setter that runs exactly once during module attach is > > obvious. We need a corresponding .get() method to convert on the way > > out anyway. > > > > Why would a get method be needed if the module parameter is just kept as-is ? In my original plan they weren't kept as-is. They were converted on the way in and on the way out. > > I don't see any upside to moving the action_set() code into > > pseries_wdt_probe() aside from maybe shaving a few SLOC. The module > > is already very compact. > > I disagree. The get method is unnecessary. The module parameter values (1..3) > add unnecessary complexity. It could as well be 0..2, making it easier to > convert. Yes, we could do that. > The actual action could be stored in struct pseries_wdt, or converted using > something > like > > u8 pseries_actions[] = { > PSERIES_WDTF_ACTION_HARD_POWEROFF, > PSERIES_WDTF_ACTION_HARD_RESTART, > PSERIES_WDTF_ACTION_DUMP_RESTART > }; > > flags = pseries_actions[action] | PSERIES_WDTF_OP_START; > > or, alternatively, in probe > > if (action > 2) > return -EINVAL; > pw->action = pseries_actions[action]; > and, in the start function, > flags = pw->action | PSERIES_WDTF_OP_START; I like this, we'll go with this. > That not only reduces code size but also improves readability. > get/set methods are useful, but should be limited to cases where they > are really needed, ie do something besides converting values. You could argue > that you want to be able to change the reboot method on the fly, by making > the module parameter writeable, but then the .set method should actually > change the method accordingly and not just convert values. And even then > I'd argue that it would be better not to convert the 'action' value itself > but to keep it at 0, 1, 2 (or 1, 2, 3 if you prefer) and use param_get_uint > (or param_get_ulong) for the get method. Okay, that makes sense. > Regarding max_timeout, we have calculations such as > > unsigned int t = wdd->timeout * 1000; > > in the assumption that timeouts larger than UINT_MAX/1000 seconds (or ~50 > days) > don't really make much sense. watchdog_timeout_invalid() will also return > -EINVAL > if the provided timeout value is larger than UINT_MAX / 1000. Oh, I see. Changed in v2.
[PATCH v2 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. This patch series adds support for this hypercall to powerpc/pseries kernels and introduces a new watchdog driver, "pseries-wdt", for the virtual timers exposed by the hypercall. This series is preceded by the following: RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/ RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/ PATCH v1: https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/ Changes of note from PATCH v1: - Trim down the large comment documenting the H_WATCHDOG hypercall. The comment is likely to rot, so remove anything we aren't using and anything overly obvious. - Remove any preprocessor definitions not actually used in the module right now. If we want to use other features offered by the hypercall we can add them in later. They're just clutter until then. - Simplify the "action" module parameter. The value is now an index into an array of possible timeoutAction values. This design removes the need for the custom get/set methods used in PATCH v1. Now we merely need to check that the "action" value is a valid index during pseries_wdt_probe(). Easy. - Make the timeoutAction a member of pseries_wdt, "action". This eliminates the use of a global variable during pseries_wdt_start(). - Use watchdog_init_timeout() idiomatically. Check its return value and error out of pseries_wdt_probe() if it fails.
[PATCH v2 1/4] powerpc/pseries: hvcall.h: add H_WATCHDOG opcode, H_NOOP return code
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. Add the opcode for the H_WATCHDOG hypercall to hvcall.h. While here, add a definition for H_NOOP, a possible return code for H_WATCHDOG. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/hvcall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index d92a20a85395..4b4f69c35b4f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -87,6 +87,7 @@ #define H_P7 -60 #define H_P8 -61 #define H_P9 -62 +#define H_NOOP -63 #define H_TOO_BIG -64 #define H_UNSUPPORTED -67 #define H_OVERLAP -68 @@ -324,7 +325,8 @@ #define H_RPT_INVALIDATE 0x448 #define H_SCM_FLUSH0x44C #define H_GET_ENERGY_SCALE_INFO0x450 -#define MAX_HCALL_OPCODE H_GET_ENERGY_SCALE_INFO +#define H_WATCHDOG 0x45C +#define MAX_HCALL_OPCODE H_WATCHDOG /* Scope args for H_SCM_UNBIND_ALL */ #define H_UNBIND_SCOPE_ALL (0x1) -- 2.27.0
[PATCH v2 2/4] powerpc/pseries: add FW_FEATURE_WATCHDOG flag
PAPR v2.12 specifies a new optional function set, "hcall-watchdog", for the /rtas/ibm,hypertas-functions property. The presence of this function set indicates support for the H_WATCHDOG hypercall. Check for this function set and, if present, set the new FW_FEATURE_WATCHDOG flag. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/firmware.h | 3 ++- arch/powerpc/platforms/pseries/firmware.c | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 834b8ecf..398e0b5e485f 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -55,6 +55,7 @@ #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100) #define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0200) #define FW_FEATURE_ENERGY_SCALE_INFO ASM_CONST(0x0400) +#define FW_FEATURE_WATCHDOGASM_CONST(0x0800) #ifndef __ASSEMBLY__ @@ -76,7 +77,7 @@ enum { FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY | - FW_FEATURE_ENERGY_SCALE_INFO, + FW_FEATURE_ENERGY_SCALE_INFO | FW_FEATURE_WATCHDOG, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c index 09c119b2f623..080108d129ed 100644 --- a/arch/powerpc/platforms/pseries/firmware.c +++ b/arch/powerpc/platforms/pseries/firmware.c @@ -67,6 +67,7 @@ hypertas_fw_features_table[] = { {FW_FEATURE_PAPR_SCM, "hcall-scm"}, {FW_FEATURE_RPT_INVALIDATE, "hcall-rpt-invalidate"}, {FW_FEATURE_ENERGY_SCALE_INFO, "hcall-energy-scale-info"}, + {FW_FEATURE_WATCHDOG, "hcall-watchdog"}, }; /* Build up the firmware features bitmask using the contents of -- 2.27.0
[PATCH v2 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 264 ++ 4 files changed, 285 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..29153eed6689 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 0 (power off), 1 (restart), + 2 (dump and restart). (default=1) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..cfe53587457d --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,264 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. + */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * The H_WATCHDOG hypercall first appears in PAPR v2.12 and is + * described fully in sections 14.5 and 14.15.6. + * + * + * H_WATCHDOG Input + * + * R4: "flags": + * + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + * 0x02 Stop Watchdog + * 0x03 Query Watchdog Capabilities + */ +#define PSERIES_WDTF_OP(op)SETFIELD((op), 48, 55) +#define PSERIES_WDTF_OP_START PSERIES_WDTF_OP(0x1) +#define PSERIES_WDTF_OP_STOP PSERIES_WDTF_OP(0x2) +#define PSERIES_WDTF_OP_QUERY PSERIES_WDTF_OP(0x3) + +/* + * Bits 56-63: "timeoutAction" (for "Start Watchdog" only) + * + * 0x01 Hard poweroff + * 0x02 Hard restart + * 0x03 Dump restart + */ +#define PSERIES_WDTF_ACTION(ac)SETFIELD(ac, 56, 63) +#define PSERIES_WDTF_ACTION_HARD_POWEROFF PSERIES_WDTF_ACTION(0x1) +#define PSERIES_WDTF_ACTION_HARD_RESTART PSERIES_WDTF_ACTION(0x2) +#define PSERIES_WDTF_ACTION_DUMP_RESTART PSERIES_WDTF_ACTION(0x3) + +/* + * H_WATCHDOG Output + * + * R3: Return code + * + * H_SUCCESSThe operation completed. + * + * H_BUSY The hypervisor is too busy; retry the operation. + * + * H_PARAMETER The g
[PATCH v2 3/4] powerpc/pseries: register pseries-wdt device with platform bus
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. These timers do not conform to PowerPC device conventions. They are not affixed to any extant bus, nor do they have full representation in the device tree. As a workaround we represent them as platform devices. This patch registers a single platform device, "pseries-wdt", with the platform bus if the FW_FEATURE_WATCHDOG flag is set. A driver for this device, "pseries-wdt", will be introduced in a subsequent patch. Signed-off-by: Scott Cheloha --- arch/powerpc/platforms/pseries/setup.c | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index afb074269b42..233c64f59815 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -169,6 +170,22 @@ static void __init fwnmi_init(void) #endif } +/* + * Affix a device for the first timer to the platform bus if + * we have firmware support for the H_WATCHDOG hypercall. + */ +static struct platform_device *pseries_wdt_pdev; + +static __init int pseries_wdt_init(void) +{ + if (!firmware_has_feature(FW_FEATURE_WATCHDOG)) + return 0; + pseries_wdt_pdev = platform_device_register_simple("pseries-wdt", + 0, NULL, 0); + return 0; +} +machine_subsys_initcall(pseries, pseries_wdt_init); + static void pseries_8259_cascade(struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); -- 2.27.0
Re: [PATCH v2 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On Fri, Jun 24, 2022 at 11:51:01PM +1000, Michael Ellerman wrote: > Scott Cheloha writes: > ... > > + > > +static struct platform_driver pseries_wdt_driver = { > > + .driver = { > > + .name = DRV_NAME, > > + .owner = THIS_MODULE, > > That owner assignment is not required. > > It's set for you by platform_driver_register() via > module_platform_driver(). Great, removed.
Re: [PATCH v2 3/4] powerpc/pseries: register pseries-wdt device with platform bus
On Fri, Jun 24, 2022 at 11:27:36PM +1000, Michael Ellerman wrote: > Nathan Lynch writes: > > Scott Cheloha writes: > >> PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits > >> guest control of one or more virtual watchdog timers. > ... > > > > Seems like we don't need pseries_wdt_pdev as it's unused elsewhere? But > > that's quite minor. > > It's minor but please drop it in the next version. Dropped.
Re: [PATCH v2 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On Fri, Jun 24, 2022 at 11:27:24PM +1000, Michael Ellerman wrote: > Hi Scott, > > A few comments below ... > > Scott Cheloha writes: > > > > [...] > > > > diff --git a/Documentation/watchdog/watchdog-parameters.rst > > b/Documentation/watchdog/watchdog-parameters.rst > > index 223c99361a30..29153eed6689 100644 > > --- a/Documentation/watchdog/watchdog-parameters.rst > > +++ b/Documentation/watchdog/watchdog-parameters.rst > > @@ -425,6 +425,18 @@ pnx833x_wdt: > > > > - > > > > +pseries-wdt: > > +action: > > + Action taken when watchdog expires: 0 (power off), 1 (restart), > > + 2 (dump and restart). (default=1) > > I doesn't look like these values match what other drivers use to any > great extent. > > So why not use the values from PAPR directly? > > ie. 1 = power off, 2 = hard reset, 3 = dump & restart. > > It seems like it would be easier to follow if the values map directly. > > It's possible in future PAPR adds 247 to mean something, in which case > maybe we'd want to map that to a less silly value, but at least for now > the PAPR values are sensible enough. I tried using 1-2-3 in Patch v1 but Guenter objected and we switched: https://lore.kernel.org/linux-watchdog/a6090ef3-f597-e10b-010b-cc32bff08...@roeck-us.net/ I think the code is fine to read as-is. We're not expecting the administrator to read the PAPR, right? So 1-2-3 is not any more intuitive for the user than 0-1-2. Given that it's all arbitrary and there aren't any hard rules for module parameters outside of general programmer "that seems fine"-ness, I would really like to leave the numbers as-is. > > +timeout: > > + Initial watchdog timeout in seconds. (default=60) > > That seems like a pretty common value, I don't see any guidance in PAPR. > Do we have any input from PowerVM on whether that's a good value? Currently the minimum timeout is 500ms on all the builds I've tried. I doubt the minimum will ever be anywhere near as large as 60s on a practical H_WATCHDOG implementation, so I don't think there is any risk of the driver failing to probe. Real software using the watchdog API will set a timeout to a smaller value if it needs to. 60 seconds gives userland ample time to reconfigure the watchdog without risk of it expiring in the midst of a bunch of ioctl(2) calls before they reach the main loop. > > diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c > > new file mode 100644 > > index ..cfe53587457d > > --- /dev/null > > +++ b/drivers/watchdog/pseries-wdt.c > > @@ -0,0 +1,264 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > +/* > > + * Copyright (c) 2022 International Business Machines, Inc. > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#define DRV_NAME "pseries-wdt" > > + > > +/* > > + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify > > + * defining bitfields as described in the PAPR without needing to > > + * transpose values to the more C-like 63->0 ordering. > > + */ > > +#define SETFIELD(_v, _b, _e) \ > > + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) > > +#define GETFIELD(_v, _b, _e) \ > > + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) > > This will probably sound like a cranky maintainer rant, but ..., > I really dislike these GETFIELD/SETFIELD macros. > > I know you didn't invent them, but I would be much happier if you didn't > use them. > > I know they (slightly) simplify things when you're transcribing values > from PAPR into the source, but that happens only once. > > And then for the rest of eternity the source is harder to read because > there's this ridiculous level of indirection through insane macros just > to define some constants. > > Anyone trying to use a debugger against this code will see a value in > memory like 0x200 and have to sit down and work out which SETFIELD() > macro it corresponds to. Don't look at me, I never would have come up with them. I got them from Alexey :) I will drop them. > > +/* > > + * The H_WATCHDOG hypercall first appears in PAPR v2.12 and is > > + * described fully in sections 14.5 and 14.15.6. > > + * > > + * > > + * H_WATCHDOG Input > > + * > > + * R4: "
[PATCH v3 0/4] pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. This patch series adds support for this hypercall to powerpc/pseries kernels and introduces a new watchdog driver, "pseries-wdt", for the virtual timers exposed by the hypercall. This series is preceded by the following: RFC v1: https://lore.kernel.org/linux-watchdog/20220413165104.179144-1-chel...@linux.ibm.com/ RFC v2: https://lore.kernel.org/linux-watchdog/20220509174357.5448-1-chel...@linux.ibm.com/ PATCH v1: https://lore.kernel.org/linux-watchdog/20220520183552.33426-1-chel...@linux.ibm.com/ PATCH v2: https://lore.kernel.org/linux-watchdog/20220602175353.68942-1-chel...@linux.ibm.com/ Changes of note from PATCH v2: - Don't keep a pointer to the platform device at registration time. We don't use the pointer for anything and we cannot hotplug the "device". - Drop the GETFIELD() and SETFIELD() macros: Michael Ellerman really doesn't like them. Use plain integer constants and custom bitfield extraction macros for the capability output instead. (After making the change I can see the upside to plain constants.) - Actually use PSERIES_WDTQ_MAX_NUMBER(): check that the hypervisor gave us at least one timer to work with. - Use MSEC_PER_SEC in a few places instead of the literal 1000 to show the reader what we're doing. - Use "reverse xmas tree" sorting for automatic variable declarations. - Note where the max_timeout of (UINT_MAX / 1000) comes from. - Nix email addresses from the MODULE_AUTHOR() macros, they tend to rot.
[PATCH v3 1/4] powerpc/pseries: hvcall.h: add H_WATCHDOG opcode, H_NOOP return code
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. Add the opcode for the H_WATCHDOG hypercall to hvcall.h. While here, add a definition for H_NOOP, a possible return code for H_WATCHDOG. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/hvcall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index d92a20a85395..4b4f69c35b4f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -87,6 +87,7 @@ #define H_P7 -60 #define H_P8 -61 #define H_P9 -62 +#define H_NOOP -63 #define H_TOO_BIG -64 #define H_UNSUPPORTED -67 #define H_OVERLAP -68 @@ -324,7 +325,8 @@ #define H_RPT_INVALIDATE 0x448 #define H_SCM_FLUSH0x44C #define H_GET_ENERGY_SCALE_INFO0x450 -#define MAX_HCALL_OPCODE H_GET_ENERGY_SCALE_INFO +#define H_WATCHDOG 0x45C +#define MAX_HCALL_OPCODE H_WATCHDOG /* Scope args for H_SCM_UNBIND_ALL */ #define H_UNBIND_SCOPE_ALL (0x1) -- 2.27.0
[PATCH v3 2/4] powerpc/pseries: add FW_FEATURE_WATCHDOG flag
PAPR v2.12 specifies a new optional function set, "hcall-watchdog", for the /rtas/ibm,hypertas-functions property. The presence of this function set indicates support for the H_WATCHDOG hypercall. Check for this function set and, if present, set the new FW_FEATURE_WATCHDOG flag. Signed-off-by: Scott Cheloha --- arch/powerpc/include/asm/firmware.h | 3 ++- arch/powerpc/platforms/pseries/firmware.c | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 834b8ecf..398e0b5e485f 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -55,6 +55,7 @@ #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100) #define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0200) #define FW_FEATURE_ENERGY_SCALE_INFO ASM_CONST(0x0400) +#define FW_FEATURE_WATCHDOGASM_CONST(0x0800) #ifndef __ASSEMBLY__ @@ -76,7 +77,7 @@ enum { FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY | - FW_FEATURE_ENERGY_SCALE_INFO, + FW_FEATURE_ENERGY_SCALE_INFO | FW_FEATURE_WATCHDOG, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c index 09c119b2f623..080108d129ed 100644 --- a/arch/powerpc/platforms/pseries/firmware.c +++ b/arch/powerpc/platforms/pseries/firmware.c @@ -67,6 +67,7 @@ hypertas_fw_features_table[] = { {FW_FEATURE_PAPR_SCM, "hcall-scm"}, {FW_FEATURE_RPT_INVALIDATE, "hcall-rpt-invalidate"}, {FW_FEATURE_ENERGY_SCALE_INFO, "hcall-energy-scale-info"}, + {FW_FEATURE_WATCHDOG, "hcall-watchdog"}, }; /* Build up the firmware features bitmask using the contents of -- 2.27.0
[PATCH v3 3/4] powerpc/pseries: register pseries-wdt device with platform bus
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. These timers do not conform to PowerPC device conventions. They are not affixed to any extant bus, nor do they have full representation in the device tree. As a workaround we represent them as platform devices. This patch registers a single platform device, "pseries-wdt", with the platform bus if the FW_FEATURE_WATCHDOG flag is set. A driver for this device, "pseries-wdt", will be introduced in a subsequent patch. Signed-off-by: Scott Cheloha --- arch/powerpc/platforms/pseries/setup.c | 13 + 1 file changed, 13 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index ee4f1db49515..dd9d3f500cff 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -169,6 +170,18 @@ static void __init fwnmi_init(void) #endif } +/* + * Affix a device for the first timer to the platform bus if + * we have firmware support for the H_WATCHDOG hypercall. + */ +static __init int pseries_wdt_init(void) +{ + if (firmware_has_feature(FW_FEATURE_WATCHDOG)) + platform_device_register_simple("pseries-wdt", 0, NULL, 0); + return 0; +} +machine_subsys_initcall(pseries, pseries_wdt_init); + static void pseries_8259_cascade(struct irq_desc *desc) { struct irq_chip *chip = irq_desc_get_chip(desc); -- 2.27.0
[PATCH v3 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 239 ++ 4 files changed, 260 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..29153eed6689 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 0 (power off), 1 (restart), + 2 (dump and restart). (default=1) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index 32fd37698932..a2429604a4ab 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1962,6 +1962,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index c324e9d820e9..cdeb119e6e61 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -187,6 +187,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..7f53b5293409 --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,239 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * H_WATCHDOG Input + * + * R4: "flags": + * + * Bits 48-55: "operation" + */ +#define PSERIES_WDTF_OP_START 0x100UL /* start timer */ +#define PSERIES_WDTF_OP_STOP 0x200UL /* stop timer */ +#define PSERIES_WDTF_OP_QUERY 0x300UL /* query timer capabilities */ + +/* + * Bits 56-63: "timeoutAction" (for "Start Watchdog" only) + */ +#define PSERIES_WDTF_ACTION_HARD_POWEROFF 0x1UL /* poweroff */ +#define PSERIES_WDTF_ACTION_HARD_RESTART 0x2UL /* restart */ +#define PSERIES_WDTF_ACTION_DUMP_RESTART 0x3UL /* dump + restart */ + +/* + * H_WATCHDOG Output + * + * R3: Return code + * + * H_SUCCESSThe operation completed. + * + * H_BUSY The hypervisor is too busy; retry the operation. + * + * H_PARAMETER The given "flags" are somehow invalid. Either the + * "operation" or "timeoutAction" is invalid, or a + * reserved bit is set. + * + * H_P2 The given "watchdogNumber" is zero or exceeds the + * supported maximum value. + * + * H_P3 The given "timeoutInMs" is below the supported + * minimum value. + * + * H_NOOP The given "watchdogNumber" is already stopped. + * + * H_HARDWARE The operation failed for ineffable reasons. + * + * H_FUNCTION The H_WATCHDOG hypercall is not supported by this + * hypervisor. + * + * R4: + * + * - For the "Query Watchdog Capabilities" operation, a 64-bit + * structure: + */ +#define PSERIES_WDTQ_MIN_TIMEOUT(cap) (((cap) >> 48) & 0x) +#define PSERIES_WD
Re: [PATCH v3 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On Wed, Jul 13, 2022 at 01:50:14PM -0700, Guenter Roeck wrote: > On 7/13/22 13:23, Scott Cheloha wrote: > > PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits > > guest control of one or more virtual watchdog timers. The timers have > > millisecond granularity. The guest is terminated when a timer > > expires. > > > > This patch adds a watchdog driver for these timers, "pseries-wdt". > > > > pseries_wdt_probe() currently assumes the existence of only one > > platform device and always assigns it watchdogNumber 1. If we ever > > expose more than one timer to userspace we will need to devise a way > > to assign a distinct watchdogNumber to each platform device at device > > registration time. > > > > Signed-off-by: Scott Cheloha > > Acked-by: Guenter Roeck Guenter, Michael, Which tree is taking this series? watchdog-next or PPC development? -Scott