On 05/15/2017 09:04 AM, Rashmica Gupta wrote: > > > On 14/05/17 14:55, Anshuman Khandual wrote: >> On 05/09/2017 12:36 PM, Rashmica Gupta wrote: >>> Sorry for the late reply, I somehow missed this. >>> >>> >>> On 03/05/17 21:56, Anshuman Khandual wrote: >>>> On 05/03/2017 09:22 AM, Rashmica Gupta wrote: >>>>> On 28/04/17 19:52, Anshuman Khandual wrote: >>>>>> On 04/28/2017 11:12 AM, Rashmica Gupta wrote: >>>>>>> Some powerpc hardware features may want to gain access to a chunk of >>>>>> What kind of features ? Please add specifics. >>>>>> >>>>>>> undisturbed real memory. This update provides a means to unplug >>>>>>> said >>>>>>> memory >>>>>> Undisturbed ? Meaning part of memblock and currently inside the buddy >>>>>> allocator which we are trying to hot unplug out ? >>>>>> >>>>>>> from the kernel with a set of debugfs calls. By writing an integer >>>>>>> containing >>>>>>> the size of memory to be unplugged into >>>>>> Does the size has some constraints like aligned with memblock section >>>>>> size ? LMB size ? page block size ? etc. Please add the details. >>>>> Will do. >>>>> >>>>>>> /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that >>>>>>> much >>>>>>> memory from the end of each available chip's memory space (ie each >>>>>>> memory node). >>>>>> <size> amount (I guess bytes in this case) of memory will be removed >>>>>> from the end of the NUMA node ? Whats the guarantee that they >>>>>> would be >>>>>> free at that time and not being pinned by some process ? If its not >>>>>> guaranteed to be freed, then interface description should state that >>>>>> clearly. >>>>> We start looking from the end of the NUMA node but of course there >>>>> is no >>>>> guarantee >>>>> that we will always be able to find some memory there that we are able >>>>> to remove. >>>> Okay. Do we have interface for giving this memory back to the buddy >>>> allocator again when we are done with HW tracing ? If not we need to >>>> add one. >>> Not at the moment. Last time I spoke to Anton he said something along >>> the lines >>> of it not being too important as if you are getting the hardware traces >>> for debugging >>> purposes you are probably not worried about a bit of memory being out of >>> action. >>> >>> However I can't see why having an interface to online the memory would >>> be a bad thing, >>> so I'll look into it. >> Yes, the interface to put them back into buddy is important even if the >> amount of memory is very less for tracing. Just need to trigger hotplug >> and online procedure to put it back. >> >>>>>>> In addition, the means to read out the contents of the unplugged >>>>>>> memory is also >>>>>>> provided by reading out the >>>>>>> /sys/kernel/debug/powerpc/memtrace/<chip-id>/trace >>>>>>> file. >>>>>> All of the debugfs file interfaces added here should be documented >>>>>> some >>>>>> where in detail. >>>>>> >>>>>>> Signed-off-by: Anton Blanchard <an...@samba.org> >>>>>>> Signed-off-by: Rashmica Gupta <rashmic...@gmail.com> >>>>>>> >>>>>>> --- >>>>>>> This requires the 'Wire up hpte_removebolted for powernv' patch. >>>>>>> >>>>>>> RFC -> v1: Added in two missing locks. Replaced the open-coded >>>>>>> flush_memory_region() with the existing >>>>>>> flush_inval_dcache_range(start, end). >>>>>>> >>>>>>> memtrace_offline_pages() is open-coded because offline_pages is >>>>>>> designed to be >>>>>>> called through the sysfs interface - not directly. 
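On the point a little further up about an interface to hand the trace memory back to the buddy allocator once tracing is done: a very rough, untested sketch of such a path, assuming the generic add_memory() and device_online() helpers plus the headers the patch already pulls in (the function and callback names below are made up for illustration), might look something like:

/* Untested sketch: re-add a previously removed chunk and bring its
 * memory blocks back online.  Error handling and the interaction with
 * the per-node debugfs state are deliberately left out. */
static int online_mem_block(struct memory_block *mem, void *arg)
{
	return device_online(&mem->dev);
}

static int memtrace_free_node(struct memtrace_entry *ent)
{
	int ret;

	ret = add_memory(ent->nid, ent->start, ent->size);
	if (ret)
		return ret;

	/* Walk the freshly added range and online each memory block. */
	lock_device_hotplug();
	walk_memory_range(PFN_DOWN(ent->start),
			  PFN_UP(ent->start + ent->size - 1),
			  NULL, online_mem_block);
	unlock_device_hotplug();

	return 0;
}

Something like this could sit behind another debugfs write (e.g. clearing "enable"), but that is only one possible shape for the interface.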
>>>>>>> >>>>>>> We could move the offlining of pages to userspace, which removes >>>>>>> some >>>>>>> of this >>>>>>> open-coding. This would then require passing info to the kernel such >>>>>>> that it >>>>>>> can then remove the memory that has been offlined. This could be >>>>>>> done >>>>>>> using >>>>>>> notifiers, but this isn't simple due to locking (remove_memory needs >>>>>>> mem_hotplug_begin() which the sysfs interface already has). This >>>>>>> could also be >>>>>>> done through the debugfs interface (similar to what is done here). >>>>>>> Either way, >>>>>>> this would require the process that needs the memory to have >>>>>>> open-coded code >>>>>>> which it shouldn't really be involved with. >>>>>>> >>>>>>> As the current remove_memory() function requires the memory to >>>>>>> already be >>>>>>> offlined, it makes sense to keep the offlining and removal of memory >>>>>>> functionality grouped together so that a process can simply make one >>>>>>> request to >>>>>>> unplug some memory. Ideally there would be a kernel function we >>>>>>> could >>>>>>> call that >>>>>>> would offline the memory and then remove it. >>>>>>> >>>>>>> >>>>>>> arch/powerpc/platforms/powernv/memtrace.c | 276 >>>>>>> ++++++++++++++++++++++++++++++ >>>>>>> 1 file changed, 276 insertions(+) >>>>>>> create mode 100644 arch/powerpc/platforms/powernv/memtrace.c >>>>>>> >>>>>>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c >>>>>>> b/arch/powerpc/platforms/powernv/memtrace.c >>>>>>> new file mode 100644 >>>>>>> index 0000000..86184b1 >>>>>>> --- /dev/null >>>>>>> +++ b/arch/powerpc/platforms/powernv/memtrace.c >>>>>>> @@ -0,0 +1,276 @@ >>>>>>> +/* >>>>>>> + * This program is free software; you can redistribute it and/or >>>>>>> modify >>>>>>> + * it under the terms of the GNU General Public License as >>>>>>> published by >>>>>>> + * the Free Software Foundation; either version 2 of the >>>>>>> License, or >>>>>>> + * (at your option) any later version. >>>>>>> + * >>>>>>> + * This program is distributed in the hope that it will be useful, >>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >>>>>>> + * GNU General Public License for more details. >>>>>>> + * >>>>>>> + * Copyright (C) IBM Corporation, 2014 >>>>>>> + * >>>>>>> + * Author: Anton Blanchard <an...@au.ibm.com> >>>>>>> + */ >>>>>>> + >>>>>>> +#define pr_fmt(fmt) "powernv-memtrace: " fmt >>>>>>> + >>>>>>> +#include <linux/bitops.h> >>>>>>> +#include <linux/string.h> >>>>>>> +#include <linux/memblock.h> >>>>>>> +#include <linux/init.h> >>>>>>> +#include <linux/moduleparam.h> >>>>>>> +#include <linux/fs.h> >>>>>>> +#include <linux/debugfs.h> >>>>>>> +#include <linux/slab.h> >>>>>>> +#include <linux/memory.h> >>>>>>> +#include <linux/memory_hotplug.h> >>>>>>> +#include <asm/machdep.h> >>>>>>> +#include <asm/debugfs.h> >>>>>>> +#include <asm/cacheflush.h> >>>>>>> + >>>>>>> +struct memtrace_entry { >>>>>>> + void *mem; >>>>>>> + u64 start; >>>>>>> + u64 size; >>>>>>> + u32 nid; >>>>>>> + struct dentry *dir; >>>>>>> + char name[16]; >>>>>>> +}; >>>>>> Little bit of description about the structure here will help. >>>>> Something like 'this enables us to keep track of the memory removed >>>>> from >>>>> each node'? >>>> Right, something like that. 
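For what it's worth, the kind of comment being asked for above could be as simple as the following; the per-field notes only restate what the rest of the patch already does with them:

/*
 * One memtrace_entry is kept for each node that memory has been
 * removed from; it records the chunk taken from that node for
 * hardware tracing and the debugfs pieces used to expose it.
 */
struct memtrace_entry {
	void *mem;		/* ioremap()ed window backing the debugfs 'trace' file */
	u64 start;		/* physical address of the removed range */
	u64 size;		/* size of the removed range in bytes */
	u32 nid;		/* node the memory was removed from */
	struct dentry *dir;	/* per-node debugfs directory */
	char name[16];		/* directory name, "%08x" of the node id */
};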
>>>> >>>>>>> + >>>>>>> +static struct memtrace_entry *memtrace_array; >>>>>>> +static unsigned int memtrace_array_nr; >>>>>>> + >>>>>>> +static ssize_t memtrace_read(struct file *filp, char __user *ubuf, >>>>>>> + size_t count, loff_t *ppos) >>>>>>> +{ >>>>>>> + struct memtrace_entry *ent = filp->private_data; >>>>>>> + >>>>>>> + return simple_read_from_buffer(ubuf, count, ppos, ent->mem, >>>>>>> ent->size); >>>>>>> +} >>>>>>> + >>>>>>> +static bool valid_memtrace_range(struct memtrace_entry *dev, >>>>>>> + unsigned long start, unsigned long size) >>>>>>> +{ >>>>>>> + if ((dev->start <= start) && >>>>>> Switch the position of start and dev->start above. Will make >>>>>> it easy while reading. >>>>>> >>>>>>> + ((start + size) <= (dev->start + dev->size))) >>>>>>> + return true; >>>>>>> + >>>>>>> + return false; >>>>>>> +} >>>>>>> + >>>>>>> +static int memtrace_mmap(struct file *filp, struct vm_area_struct >>>>>>> *vma) >>>>>>> +{ >>>>>>> + unsigned long size = vma->vm_end - vma->vm_start; >>>>>>> + struct memtrace_entry *dev = filp->private_data; >>>>>>> + >>>>>>> + if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, >>>>>>> size)) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >>>>>> Why we do this ? Its coming from real RAM not IO memory. Then the >>>>>> page >>>>>> protection still needs changes ? >>>>> Once the memory is removed from the kernel mappings we want to mark >>>>> it as >>>>> uncachable. >>>> Got it but why ? Uncachable marking are for pages which will be mapped >>>> to IO ranges which should not be cached just to prevent the possibility >>>> of stale data. >>>>>>> + >>>>>>> + if (io_remap_pfn_range(vma, vma->vm_start, >>>>>>> + vma->vm_pgoff + (dev->start >> PAGE_SHIFT), >>>>>>> + size, vma->vm_page_prot)) >>>>>> You can just call remap_pfn_rang() instead though they are all the >>>>>> same. >>>>>> There is nothing I/O here should be explicit. >>>>> Good point. >>>>> >>>>>>> + return -EAGAIN; >>>>>>> + >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +static const struct file_operations memtrace_fops = { >>>>>>> + .llseek = default_llseek, >>>>>>> + .read = memtrace_read, >>>>>>> + .mmap = memtrace_mmap, >>>>>>> + .open = simple_open, >>>>>>> +}; >>>>>>> + >>>>>>> +static int check_memblock_online(struct memory_block *mem, void >>>>>>> *arg) >>>>>>> +{ >>>>>>> + if (mem->state != MEM_ONLINE) >>>>>>> + return -1; >>>>>>> + >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +static int change_memblock_state(struct memory_block *mem, void >>>>>>> *arg) >>>>>>> +{ >>>>>>> + unsigned long state = (unsigned long)arg; >>>>>>> + >>>>>>> + mem->state = state; >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 >>>>>>> nr_pages) >>>>>>> +{ >>>>>>> + u64 end_pfn = start_pfn + nr_pages - 1; >>>>>>> + >>>>>>> + if (walk_memory_range(start_pfn, end_pfn, NULL, >>>>>>> + check_memblock_online)) >>>>>>> + return false; >>>>>>> + >>>>>>> + walk_memory_range(start_pfn, end_pfn, (void >>>>>>> *)MEM_GOING_OFFLINE, >>>>>>> + change_memblock_state); >>>>>>> + >>>>>> walk_memory_range() might be expensive, cant we just change the state >>>>>> to MEM_GOING_OFFLINE while checking the state for MEM_ONLINE during >>>>>> the first loop and bail out if any of the memblock is not in >>>>>> MEM_ONLINE >>>>>> in the first place. >>>>> Good idea. 
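A sketch of the single-pass variant being suggested (the callback name is invented and this is untested). One wrinkle: if a block turns out not to be online partway through the walk, the blocks already flipped to MEM_GOING_OFFLINE would need to be put back to MEM_ONLINE before bailing out, much like the existing offline_pages() failure path does.

static int check_and_mark_going_offline(struct memory_block *mem, void *arg)
{
	/* Verify the block is online and mark it in the same pass, so
	 * the second walk_memory_range() call goes away. */
	if (mem->state != MEM_ONLINE)
		return -1;

	mem->state = MEM_GOING_OFFLINE;
	return 0;
}

memtrace_offline_pages() would then start with a single
walk_memory_range(start_pfn, end_pfn, NULL, check_and_mark_going_offline)
call instead of the current two walks.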
>>>>> >>>>>>> + mem_hotplug_begin(); >>>>>>> + if (offline_pages(start_pfn, nr_pages)) { >>>>>>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE, >>>>>>> + change_memblock_state); >>>>>>> + mem_hotplug_done(); >>>>>> Right, this can remain as is. If we fail to offline pages, mark the >>>>>> memory blocks as MEM_ONLINE again. >>>>>> >>>>>>> + return false; >>>>>>> + } >>>>>>> + >>>>>>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE, >>>>>>> + change_memblock_state); >>>>>>> + mem_hotplug_done(); >>>>>> Right. >>>>>> >>>>>>> + >>>>>>> + /* Clear the dcache to remove any references to the memory */ >>>>>>> + flush_inval_dcache_range((u64)__va(start_pfn << PAGE_SHIFT), >>>>>>> + (u64)__va(end_pfn << PAGE_SHIFT)); >>>>>> I am wondering why this is required now when we dont do anything for >>>>>> cache flushing calls from core VM. If its really required now then >>>>>> it also should be required during memory hot unplug operations in >>>>>> general as well. >>>>> I could not see if this was being done when removing memory so figured >>>>> that it was better to put it in than not do it. >>>> Looking at the definitions I had pointed out before which gets >>>> called from core VM, powerpc does not need to do anything specific >>>> for cache invalidation or flushing. But I am not really sure on >>>> this. So let it be. >>>> >>>>>> /* >>>>>> * No cache flushing is required when address mappings are >>>>>> changed, >>>>>> * because the caches on PowerPCs are physically addressed. >>>>>> */ >>>>>> #define flush_cache_all() do { } while (0) >>>>>> #define flush_cache_mm(mm) do { } while (0) >>>>>> #define flush_cache_dup_mm(mm) do { } while (0) >>>>>> #define flush_cache_range(vma, start, end) do { } while (0) >>>>>> #define flush_cache_page(vma, vmaddr, pfn) do { } while (0) >>>>>> #define flush_icache_page(vma, page) do { } while (0) >>>>>> #define flush_cache_vmap(start, end) do { } while (0) >>>>>> #define flush_cache_vunmap(start, end) do { } while (0) >>>>>> >>>>>> >>>>>>> + >>>>>>> + /* Now remove memory from the mappings */ >>>>>>> + lock_device_hotplug(); >>>>>>> + remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << >>>>>>> PAGE_SHIFT); >>>>>>> + unlock_device_hotplug(); >>>>>> Right. Now we have successfully taken down the memory. >>>>>> >>>>>>> + >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> +static u64 memtrace_alloc_node(u32 nid, u64 size) >>>>>>> +{ >>>>>>> + u64 start_pfn, end_pfn, nr_pages; >>>>>>> + u64 base_pfn; >>>>>>> + >>>>>>> + if (!NODE_DATA(nid) || !node_spanned_pages(nid)) >>>>>>> + return 0; >>>>>> Why NODE_DATA check is required here ? Each node should have one >>>>>> allocated and initialized by now, else we have bigger problems. >>>>>> Is there any specific reason to check for spanned pages instead >>>>>> of present/managed pages. >>>>> Anton wrote this check, so will need to confirm with him. I assume >>>>> we check node_spanned_pages() rather than node_present_pages() >>>>> because in arch/powerpc/mm/numa.c we set node_spanned_pages() and >>>>> not node_present_pages()? >>>> I guess any thing is okay but NODE_DATA() seems redundant though. >>> Agreed. >>> >>>> struct pglist_data { >>>> .......................... 
>>>> unsigned long node_present_pages; /* total number of physical >>>> pages */ >>>> unsigned long node_spanned_pages; /* total size of physical page >>>> range, including holes */ >>>> >>>> } >>>> >>>>>>> + >>>>>>> + start_pfn = node_start_pfn(nid); >>>>>>> + end_pfn = node_end_pfn(nid); >>>>>>> + nr_pages = size >> PAGE_SHIFT; >>>>>>> + >>>>>>> + /* Trace memory needs to be aligned to the size */ >>>>>>> + end_pfn = round_down(end_pfn - nr_pages, nr_pages); >>>>>>> + >>>>>>> + for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= >>>>>>> nr_pages) { >>>>>>> + if (memtrace_offline_pages(nid, base_pfn, nr_pages) == >>>>>>> true) >>>>>>> + return base_pfn << PAGE_SHIFT; >>>>>>> + } >>>>>>> + >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +static int memtrace_init_regions_runtime(u64 size) >>>>>>> +{ >>>>>>> + u64 m; >>>>>>> + u32 nid; >>>>>>> + >>>>>>> + memtrace_array = kzalloc(sizeof(struct memtrace_entry) * >>>>>>> + num_online_nodes(), GFP_KERNEL); >>>>>>> + if (!memtrace_array) { >>>>>>> + pr_err("Failed to allocate memtrace_array\n"); >>>>>>> + return -EINVAL; >>>>>>> + } >>>>>>> + >>>>>>> + for_each_online_node(nid) { >>>>>>> + m = memtrace_alloc_node(nid, size); >>>>>>> + /* >>>>>>> + * A node might not have any local memory, so warn but >>>>>>> + * continue on. >>>>>>> + */ >>>>>>> + if (!m) { >>>>>>> + pr_err("Failed to allocate trace memory on node %d\n", >>>>>>> + nid); >>>>>>> + } else { >>>>>>> + pr_info("Allocated trace memory on node %d at >>>>>>> 0x%016llx\n", >>>>>>> + nid, m); >>>>>>> + >>>>>>> + memtrace_array[memtrace_array_nr].start = m; >>>>>>> + memtrace_array[memtrace_array_nr].size = size; >>>>>>> + memtrace_array[memtrace_array_nr].nid = nid; >>>>>>> + memtrace_array_nr++; >>>>>>> + } >>>>>>> + } >>>>>>> + return 0; >>>>>>> +} >>>>>> All the pr_info() and pr_err() prints should have a "memtrace :" >>>>>> before the >>>>>> actual string to make it clear that its coming from this feature. >>>>>> >>>>> Good point! >>>>>>> + >>>>>>> +static struct dentry *memtrace_debugfs_dir; >>>>>>> + >>>>>>> +static int memtrace_init_debugfs(void) >>>>>>> +{ >>>>>>> + int ret = 0; >>>>>>> + int i; >>>>>>> + >>>>>>> + for (i = 0; i < memtrace_array_nr; i++) { >>>>>>> + struct dentry *dir; >>>>>>> + struct memtrace_entry *ent = &memtrace_array[i]; >>>>>>> + >>>>>>> + ent->mem = ioremap(ent->start, ent->size); >>>>>>> + /* Warn but continue on */ >>>>>>> + if (!ent->mem) { >>>>>>> + pr_err("Failed to map trace memory at 0x%llx\n", >>>>>>> + ent->start); >>>>>>> + ret = -1; >>>>>>> + continue; >>>>>>> + } >>>>>>> + >>>>>>> + snprintf(ent->name, 16, "%08x", ent->nid); >>>>>>> + dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir); >>>>>>> + if (!dir) >>>>>>> + return -1; >>>>>>> + >>>>>>> + ent->dir = dir; >>>>>>> + debugfs_create_file("trace", 0400, dir, ent, >>>>>>> &memtrace_fops); >>>>>>> + debugfs_create_x64("start", 0400, dir, &ent->start); >>>>>>> + debugfs_create_x64("size", 0400, dir, &ent->size); >>>>>>> + debugfs_create_u32("node", 0400, dir, &ent->nid); >>>>>>> + } >>>>>> Oh okay, its creating all the four files. Please create corresponding >>>>>> to each of the files some where. Documentation/ABI/testing lists the >>>>>> actual system ABI on /sys/ not the sys/kernel/debug/ ones I guess. >>>>> I'm not exactly sure what you are saying here... Seeing that there is >>>>> documentation about debugfs files in Documentation/ABI/testing, I'll >>>>> follow suit >>>>> and put it there. >>>> I meant the same. 
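Since the thread settled on documenting these files, a rough idea of what the Documentation/ABI/testing entries could look like (dates and wording are only placeholders):

What:		/sys/kernel/debug/powerpc/memtrace/enable
Date:		May 2017
Contact:	linuxppc-dev@lists.ozlabs.org
Description:	Write the number of bytes to remove from the end of each
		online node for hardware tracing; the value must be a
		multiple of the memory block size.  Reading returns the
		currently reserved size.

What:		/sys/kernel/debug/powerpc/memtrace/<chip-id>/trace
What:		/sys/kernel/debug/powerpc/memtrace/<chip-id>/start
What:		/sys/kernel/debug/powerpc/memtrace/<chip-id>/size
What:		/sys/kernel/debug/powerpc/memtrace/<chip-id>/node
Date:		May 2017
Contact:	linuxppc-dev@lists.ozlabs.org
Description:	Per-node view of the removed memory: the raw trace
		buffer, its physical start address, its size in bytes,
		and the node it was taken from.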
>>>> >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> + >>>>>>> +static u64 memtrace_size; >>>>>>> + >>>>>>> +static int memtrace_enable_set(void *data, u64 val) >>>>>>> +{ >>>>>>> + if (memtrace_size) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + if (!val) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + /* Make sure size is aligned to a memory block */ >>>>>>> + if (val & (memory_block_size_bytes()-1)) >>>>>> As I have mentioned earlier, this should be mentioned in the >>>>>> interface >>>>>> description some where. >>>>>> >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + if (memtrace_init_regions_runtime(val)) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + if (memtrace_init_debugfs()) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + memtrace_size = val; >>>>>>> + >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +static int memtrace_enable_get(void *data, u64 *val) >>>>>>> +{ >>>>>>> + *val = memtrace_size; >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get, >>>>>>> memtrace_enable_set, "0x%016llx\n"); >>>>>>> + >>>>>>> +static int memtrace_init(void) >>>>>>> +{ >>>>>>> + memtrace_debugfs_dir = debugfs_create_dir("memtrace", >>>>>>> powerpc_debugfs_root); >>>>>>> + if (!memtrace_debugfs_dir) >>>>>>> + return -1; >>>>>>> + >>>>>>> + debugfs_create_file("enable", 0600, memtrace_debugfs_dir, >>>>>>> + NULL, &memtrace_init_fops); >>>>>>> + >>>>>>> + return 0; >>>>>>> +} >>>>>>> +machine_device_initcall(powernv, memtrace_init); >>>>>>> + >>>>>>> >>>>>> BTW how we start the tracing process for the trace to be collected >>>>>> in the >>>>>> interface before we can read them ? This interface does not seem to >>>>>> have >>>>>> a handler. When it directs the HW to start collecting the traces ? >>>>>> >>>>>> debugfs_create_x64("start", 0400, dir, &ent->start); >>>>>> >>>>>> >>>>> I think you're asking 'what is actually going to call this code and do >>>>> the tracing'? >>>> No, when you call this interface, where is the routine to start the >>>> actual tracing invoking appropriate platform functions or HW >>>> instructions ? I dont see such a function associated with 'start' >>>> interface mentioned above. >>> Essentially, >>> DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get, >>> memtrace_enable_set, "0x%016llx\n"); >>> >>> means that when a number is written to the memtrace/enable file, the >>> number is >>> read as a u64 and passed to memtrace_enable_set() >> Right, then we create the following interfaces for each entry of the >> memory >> trace. By now all the memory ranges are ioremapped. >> >> + debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops); >> + debugfs_create_x64("start", 0400, dir, &ent->start); >> + debugfs_create_x64("size", 0400, dir, &ent->size); >> + debugfs_create_u32("node", 0400, dir, &ent->nid); >> >> Then what really starts the HW trace ? where we tell the HW to start >> tracing >> and put the trace details in the memory buffer allocated and ioremapped ? >> IIUC just making the memory ioremapped() wont start the trace >> automatically. >> > > Ah sorry, misunderstood you. Yes, this patch is only to remove the > memory from kernel mappings so the hardware tracing can use it - not the > actual code that invokes the tracing. That code involves some hardware > information that hasn't been made publicly accessible, so it is not my > place to share that. 
> The hardware trace will be written to the 'removed' memory directly by
> the hardware, and the debugfs files allow us to read the trace without
> having to re-online the memory.
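To make that last point concrete, a minimal userspace sketch of reading a trace back (untested; the node directory name is just an example, since the directories are named with "%08x" of the node id, node 0 shows up as "00000000"):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/powerpc/memtrace/00000000/trace";
	char buf[65536];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* memtrace_read() serves the removed region through
	 * simple_read_from_buffer(), so a plain read() loop walks the
	 * whole trace buffer; mmap() of the same file also works. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);

	close(fd);
	return 0;
}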
It makes sense now. Thanks, Rashmica.