On 05/03/2017 09:22 AM, Rashmica Gupta wrote: > On 28/04/17 19:52, Anshuman Khandual wrote: >> On 04/28/2017 11:12 AM, Rashmica Gupta wrote: >>> Some powerpc hardware features may want to gain access to a chunk of >> What kind of features ? Please add specifics. >> >>> undisturbed real memory. This update provides a means to unplug said >>> memory >> Undisturbed ? Meaning part of memblock and currently inside the buddy >> allocator which we are trying to hot unplug out ? >> >>> from the kernel with a set of debugfs calls. By writing an integer >>> containing >>> the size of memory to be unplugged into >> Does the size has some constraints like aligned with memblock section >> size ? LMB size ? page block size ? etc. Please add the details. > > Will do. > >> >>> /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that >>> much >>> memory from the end of each available chip's memory space (ie each >>> memory node). >> <size> amount (I guess bytes in this case) of memory will be removed >> from the end of the NUMA node ? Whats the guarantee that they would be >> free at that time and not being pinned by some process ? If its not >> guaranteed to be freed, then interface description should state that >> clearly. > > We start looking from the end of the NUMA node but of course there is no > guarantee > that we will always be able to find some memory there that we are able > to remove.
Okay. Do we have interface for giving this memory back to the buddy allocator again when we are done with HW tracing ? If not we need to add one. > >>> In addition, the means to read out the contents of the unplugged >>> memory is also >>> provided by reading out the >>> /sys/kernel/debug/powerpc/memtrace/<chip-id>/trace >>> file. >> All of the debugfs file interfaces added here should be documented some >> where in detail. >> >>> Signed-off-by: Anton Blanchard <an...@samba.org> >>> Signed-off-by: Rashmica Gupta <rashmic...@gmail.com> >>> >>> --- >>> This requires the 'Wire up hpte_removebolted for powernv' patch. >>> >>> RFC -> v1: Added in two missing locks. Replaced the open-coded >>> flush_memory_region() with the existing >>> flush_inval_dcache_range(start, end). >>> >>> memtrace_offline_pages() is open-coded because offline_pages is >>> designed to be >>> called through the sysfs interface - not directly. >>> >>> We could move the offlining of pages to userspace, which removes some >>> of this >>> open-coding. This would then require passing info to the kernel such >>> that it >>> can then remove the memory that has been offlined. This could be done >>> using >>> notifiers, but this isn't simple due to locking (remove_memory needs >>> mem_hotplug_begin() which the sysfs interface already has). This >>> could also be >>> done through the debugfs interface (similar to what is done here). >>> Either way, >>> this would require the process that needs the memory to have >>> open-coded code >>> which it shouldn't really be involved with. >>> >>> As the current remove_memory() function requires the memory to >>> already be >>> offlined, it makes sense to keep the offlining and removal of memory >>> functionality grouped together so that a process can simply make one >>> request to >>> unplug some memory. Ideally there would be a kernel function we could >>> call that >>> would offline the memory and then remove it. >>> >>> >>> arch/powerpc/platforms/powernv/memtrace.c | 276 >>> ++++++++++++++++++++++++++++++ >>> 1 file changed, 276 insertions(+) >>> create mode 100644 arch/powerpc/platforms/powernv/memtrace.c >>> >>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c >>> b/arch/powerpc/platforms/powernv/memtrace.c >>> new file mode 100644 >>> index 0000000..86184b1 >>> --- /dev/null >>> +++ b/arch/powerpc/platforms/powernv/memtrace.c >>> @@ -0,0 +1,276 @@ >>> +/* >>> + * This program is free software; you can redistribute it and/or modify >>> + * it under the terms of the GNU General Public License as published by >>> + * the Free Software Foundation; either version 2 of the License, or >>> + * (at your option) any later version. >>> + * >>> + * This program is distributed in the hope that it will be useful, >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >>> + * GNU General Public License for more details. >>> + * >>> + * Copyright (C) IBM Corporation, 2014 >>> + * >>> + * Author: Anton Blanchard <an...@au.ibm.com> >>> + */ >>> + >>> +#define pr_fmt(fmt) "powernv-memtrace: " fmt >>> + >>> +#include <linux/bitops.h> >>> +#include <linux/string.h> >>> +#include <linux/memblock.h> >>> +#include <linux/init.h> >>> +#include <linux/moduleparam.h> >>> +#include <linux/fs.h> >>> +#include <linux/debugfs.h> >>> +#include <linux/slab.h> >>> +#include <linux/memory.h> >>> +#include <linux/memory_hotplug.h> >>> +#include <asm/machdep.h> >>> +#include <asm/debugfs.h> >>> +#include <asm/cacheflush.h> >>> + >>> +struct memtrace_entry { >>> + void *mem; >>> + u64 start; >>> + u64 size; >>> + u32 nid; >>> + struct dentry *dir; >>> + char name[16]; >>> +}; >> Little bit of description about the structure here will help. > > Something like 'this enables us to keep track of the memory removed from > each node'? Right, something like that. > >>> + >>> +static struct memtrace_entry *memtrace_array; >>> +static unsigned int memtrace_array_nr; >>> + >>> +static ssize_t memtrace_read(struct file *filp, char __user *ubuf, >>> + size_t count, loff_t *ppos) >>> +{ >>> + struct memtrace_entry *ent = filp->private_data; >>> + >>> + return simple_read_from_buffer(ubuf, count, ppos, ent->mem, >>> ent->size); >>> +} >>> + >>> +static bool valid_memtrace_range(struct memtrace_entry *dev, >>> + unsigned long start, unsigned long size) >>> +{ >>> + if ((dev->start <= start) && >> Switch the position of start and dev->start above. Will make >> it easy while reading. >> >>> + ((start + size) <= (dev->start + dev->size))) >>> + return true; >>> + >>> + return false; >>> +} >>> + >>> +static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma) >>> +{ >>> + unsigned long size = vma->vm_end - vma->vm_start; >>> + struct memtrace_entry *dev = filp->private_data; >>> + >>> + if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size)) >>> + return -EINVAL; >>> + >>> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >> Why we do this ? Its coming from real RAM not IO memory. Then the page >> protection still needs changes ? > > Once the memory is removed from the kernel mappings we want to mark it as > uncachable. Got it but why ? Uncachable marking are for pages which will be mapped to IO ranges which should not be cached just to prevent the possibility of stale data. > >>> + >>> + if (io_remap_pfn_range(vma, vma->vm_start, >>> + vma->vm_pgoff + (dev->start >> PAGE_SHIFT), >>> + size, vma->vm_page_prot)) >> You can just call remap_pfn_rang() instead though they are all the same. >> There is nothing I/O here should be explicit. > > Good point. > >>> + return -EAGAIN; >>> + >>> + return 0; >>> +} >>> + >>> +static const struct file_operations memtrace_fops = { >>> + .llseek = default_llseek, >>> + .read = memtrace_read, >>> + .mmap = memtrace_mmap, >>> + .open = simple_open, >>> +}; >>> + >>> +static int check_memblock_online(struct memory_block *mem, void *arg) >>> +{ >>> + if (mem->state != MEM_ONLINE) >>> + return -1; >>> + >>> + return 0; >>> +} >>> + >>> +static int change_memblock_state(struct memory_block *mem, void *arg) >>> +{ >>> + unsigned long state = (unsigned long)arg; >>> + >>> + mem->state = state; >>> + return 0; >>> +} >>> + >>> +static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 >>> nr_pages) >>> +{ >>> + u64 end_pfn = start_pfn + nr_pages - 1; >>> + >>> + if (walk_memory_range(start_pfn, end_pfn, NULL, >>> + check_memblock_online)) >>> + return false; >>> + >>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE, >>> + change_memblock_state); >>> + >> walk_memory_range() might be expensive, cant we just change the state >> to MEM_GOING_OFFLINE while checking the state for MEM_ONLINE during >> the first loop and bail out if any of the memblock is not in MEM_ONLINE >> in the first place. > > Good idea. > >>> + mem_hotplug_begin(); >>> + if (offline_pages(start_pfn, nr_pages)) { >>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE, >>> + change_memblock_state); >>> + mem_hotplug_done(); >> Right, this can remain as is. If we fail to offline pages, mark the >> memory blocks as MEM_ONLINE again. >> >>> + return false; >>> + } >>> + >>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE, >>> + change_memblock_state); >>> + mem_hotplug_done(); >> Right. >> >>> + >>> + /* Clear the dcache to remove any references to the memory */ >>> + flush_inval_dcache_range((u64)__va(start_pfn << PAGE_SHIFT), >>> + (u64)__va(end_pfn << PAGE_SHIFT)); >> I am wondering why this is required now when we dont do anything for >> cache flushing calls from core VM. If its really required now then >> it also should be required during memory hot unplug operations in >> general as well. > > I could not see if this was being done when removing memory so figured > that it was better to put it in than not do it. Looking at the definitions I had pointed out before which gets called from core VM, powerpc does not need to do anything specific for cache invalidation or flushing. But I am not really sure on this. So let it be. > >> >> /* >> * No cache flushing is required when address mappings are changed, >> * because the caches on PowerPCs are physically addressed. >> */ >> #define flush_cache_all() do { } while (0) >> #define flush_cache_mm(mm) do { } while (0) >> #define flush_cache_dup_mm(mm) do { } while (0) >> #define flush_cache_range(vma, start, end) do { } while (0) >> #define flush_cache_page(vma, vmaddr, pfn) do { } while (0) >> #define flush_icache_page(vma, page) do { } while (0) >> #define flush_cache_vmap(start, end) do { } while (0) >> #define flush_cache_vunmap(start, end) do { } while (0) >> >> >>> + >>> + /* Now remove memory from the mappings */ >>> + lock_device_hotplug(); >>> + remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << >>> PAGE_SHIFT); >>> + unlock_device_hotplug(); >> Right. Now we have successfully taken down the memory. >> >>> + >>> + return true; >>> +} >>> + >>> +static u64 memtrace_alloc_node(u32 nid, u64 size) >>> +{ >>> + u64 start_pfn, end_pfn, nr_pages; >>> + u64 base_pfn; >>> + >>> + if (!NODE_DATA(nid) || !node_spanned_pages(nid)) >>> + return 0; >> Why NODE_DATA check is required here ? Each node should have one >> allocated and initialized by now, else we have bigger problems. >> Is there any specific reason to check for spanned pages instead >> of present/managed pages. > > Anton wrote this check, so will need to confirm with him. I assume > we check node_spanned_pages() rather than node_present_pages() > because in arch/powerpc/mm/numa.c we set node_spanned_pages() and > not node_present_pages()? I guess any thing is okay but NODE_DATA() seems redundant though. struct pglist_data { .......................... unsigned long node_present_pages; /* total number of physical pages */ unsigned long node_spanned_pages; /* total size of physical page range, including holes */ } >> >>> + >>> + start_pfn = node_start_pfn(nid); >>> + end_pfn = node_end_pfn(nid); >>> + nr_pages = size >> PAGE_SHIFT; >>> + >>> + /* Trace memory needs to be aligned to the size */ >>> + end_pfn = round_down(end_pfn - nr_pages, nr_pages); >>> + >>> + for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= >>> nr_pages) { >>> + if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) >>> + return base_pfn << PAGE_SHIFT; >>> + } >>> + >>> + return 0; >>> +} >>> + >>> +static int memtrace_init_regions_runtime(u64 size) >>> +{ >>> + u64 m; >>> + u32 nid; >>> + >>> + memtrace_array = kzalloc(sizeof(struct memtrace_entry) * >>> + num_online_nodes(), GFP_KERNEL); >>> + if (!memtrace_array) { >>> + pr_err("Failed to allocate memtrace_array\n"); >>> + return -EINVAL; >>> + } >>> + >>> + for_each_online_node(nid) { >>> + m = memtrace_alloc_node(nid, size); >>> + /* >>> + * A node might not have any local memory, so warn but >>> + * continue on. >>> + */ >>> + if (!m) { >>> + pr_err("Failed to allocate trace memory on node %d\n", >>> + nid); >>> + } else { >>> + pr_info("Allocated trace memory on node %d at 0x%016llx\n", >>> + nid, m); >>> + >>> + memtrace_array[memtrace_array_nr].start = m; >>> + memtrace_array[memtrace_array_nr].size = size; >>> + memtrace_array[memtrace_array_nr].nid = nid; >>> + memtrace_array_nr++; >>> + } >>> + } >>> + return 0; >>> +} >> All the pr_info() and pr_err() prints should have a "memtrace :" >> before the >> actual string to make it clear that its coming from this feature. >> > Good point! >>> + >>> +static struct dentry *memtrace_debugfs_dir; >>> + >>> +static int memtrace_init_debugfs(void) >>> +{ >>> + int ret = 0; >>> + int i; >>> + >>> + for (i = 0; i < memtrace_array_nr; i++) { >>> + struct dentry *dir; >>> + struct memtrace_entry *ent = &memtrace_array[i]; >>> + >>> + ent->mem = ioremap(ent->start, ent->size); >>> + /* Warn but continue on */ >>> + if (!ent->mem) { >>> + pr_err("Failed to map trace memory at 0x%llx\n", >>> + ent->start); >>> + ret = -1; >>> + continue; >>> + } >>> + >>> + snprintf(ent->name, 16, "%08x", ent->nid); >>> + dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir); >>> + if (!dir) >>> + return -1; >>> + >>> + ent->dir = dir; >>> + debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops); >>> + debugfs_create_x64("start", 0400, dir, &ent->start); >>> + debugfs_create_x64("size", 0400, dir, &ent->size); >>> + debugfs_create_u32("node", 0400, dir, &ent->nid); >>> + } >> Oh okay, its creating all the four files. Please create corresponding >> to each of the files some where. Documentation/ABI/testing lists the >> actual system ABI on /sys/ not the sys/kernel/debug/ ones I guess. > > I'm not exactly sure what you are saying here... Seeing that there is > documentation about debugfs files in Documentation/ABI/testing, I'll > follow suit > and put it there. I meant the same. > >>> + >>> + return ret; >>> +} >>> + >>> +static u64 memtrace_size; >>> + >>> +static int memtrace_enable_set(void *data, u64 val) >>> +{ >>> + if (memtrace_size) >>> + return -EINVAL; >>> + >>> + if (!val) >>> + return -EINVAL; >>> + >>> + /* Make sure size is aligned to a memory block */ >>> + if (val & (memory_block_size_bytes()-1)) >> As I have mentioned earlier, this should be mentioned in the interface >> description some where. >> >>> + return -EINVAL; >>> + >>> + if (memtrace_init_regions_runtime(val)) >>> + return -EINVAL; >>> + >>> + if (memtrace_init_debugfs()) >>> + return -EINVAL; >>> + >>> + memtrace_size = val; >>> + >>> + return 0; >>> +} >>> + >>> +static int memtrace_enable_get(void *data, u64 *val) >>> +{ >>> + *val = memtrace_size; >>> + return 0; >>> +} >>> + >>> +DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get, >>> memtrace_enable_set, "0x%016llx\n"); >>> + >>> +static int memtrace_init(void) >>> +{ >>> + memtrace_debugfs_dir = debugfs_create_dir("memtrace", >>> powerpc_debugfs_root); >>> + if (!memtrace_debugfs_dir) >>> + return -1; >>> + >>> + debugfs_create_file("enable", 0600, memtrace_debugfs_dir, >>> + NULL, &memtrace_init_fops); >>> + >>> + return 0; >>> +} >>> +machine_device_initcall(powernv, memtrace_init); >>> + >>> >> BTW how we start the tracing process for the trace to be collected in the >> interface before we can read them ? This interface does not seem to have >> a handler. When it directs the HW to start collecting the traces ? >> >> debugfs_create_x64("start", 0400, dir, &ent->start); >> >> > I think you're asking 'what is actually going to call this code and do > the tracing'? No, when you call this interface, where is the routine to start the actual tracing invoking appropriate platform functions or HW instructions ? I dont see such a function associated with 'start' interface mentioned above. > There is some userspace code in the works to do this. The person who was > writing it > is doing something else now, and I think it's a bit gross so I'm trying > to tidy it > up a little. Okay, please do provide some pointers when the code is ready which will help in better understanding of this interface.