powernv: Enable removal of memory for in memory tracing

Rashmica Gupta Sun, 14 May 2017 20:36:05 -0700


On 14/05/17 14:55, Anshuman Khandual wrote:

On 05/09/2017 12:36 PM, Rashmica Gupta wrote:

Sorry for the late reply, I somehow missed this.


On 03/05/17 21:56, Anshuman Khandual wrote:

On 05/03/2017 09:22 AM, Rashmica Gupta wrote:

On 28/04/17 19:52, Anshuman Khandual wrote:

On 04/28/2017 11:12 AM, Rashmica Gupta wrote:

Some powerpc hardware features may want to gain access to a chunk of

What kind of features ? Please add specifics.

undisturbed real memory.  This update provides a means to unplug said
memory

Undisturbed ? Meaning part of memblock and currently inside the buddy
allocator which we are trying to hot unplug out ?

from the kernel with a set of debugfs calls.  By writing an integer
containing
    the size of memory to be unplugged into

Does the size have some constraints, such as alignment to the memblock
section size, LMB size, or pageblock size? Please add the details.

Will do.
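
For reference, the constraint asked about here is enforced further down in memtrace_enable_set(), which rejects a zero size and any size that is not a multiple of memory_block_size_bytes(). A minimal user-space sketch of that test, assuming the block size is a power of two; the helper name and the 256MB value in the usage note are illustrative and do not come from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the checks in memtrace_enable_set():
 * a size is acceptable only if it is non-zero and a multiple of the
 * memory block size. The mask test assumes block_size is a power of two.
 */
static bool memtrace_size_ok(uint64_t size, uint64_t block_size)
{
    if (size == 0)
        return false;

    return (size & (block_size - 1)) == 0;
}
```

With an assumed 256MB block size, memtrace_size_ok(0x20000000, 0x10000000) holds, while a size of 0x10001000 is rejected.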

/sys/kernel/debug/powerpc/memtrace/enable, the code will remove that
much
memory from the end of each available chip's memory space (ie each
memory node).

<size> amount (I guess bytes in this case) of memory will be removed
from the end of the NUMA node? What's the guarantee that it would be
free at that time and not pinned by some process? If it's not
guaranteed to be free, then the interface description should state
that clearly.

We start looking from the end of the NUMA node but of course there is
no guarantee that we will always be able to find some memory there that
we are able to remove.

Okay. Do we have interface for giving this memory back to the buddy
allocator again when we are done with HW tracing ? If not we need to
add one.

Not at the moment. Last time I spoke to Anton he said something along
the lines of it not being too important as if you are getting the
hardware traces for debugging purposes you are probably not worried
about a bit of memory being out of action.

However I can't see why having an interface to online the memory would
be a bad thing, so I'll look into it.

Yes, the interface to put them back into buddy is important even if the
amount of memory used for tracing is very small. We just need to trigger
the hotplug and online procedure to put it back.

In addition, the means to read out the contents of the unplugged
memory is also
provided by reading out the
/sys/kernel/debug/powerpc/memtrace/<chip-id>/trace
file.

All of the debugfs file interfaces added here should be documented
somewhere in detail.

Signed-off-by: Anton Blanchard <an...@samba.org>
Signed-off-by: Rashmica Gupta <rashmic...@gmail.com>

---
This requires the 'Wire up hpte_removebolted for powernv' patch.

RFC -> v1: Added in two missing locks. Replaced the open-coded
flush_memory_region() with the existing
flush_inval_dcache_range(start, end).

memtrace_offline_pages() is open-coded because offline_pages is
designed to be called through the sysfs interface - not directly.

We could move the offlining of pages to userspace, which removes some
of this open-coding. This would then require passing info to the kernel
such that it can then remove the memory that has been offlined. This
could be done using notifiers, but this isn't simple due to locking
(remove_memory needs mem_hotplug_begin() which the sysfs interface
already has). This could also be done through the debugfs interface
(similar to what is done here). Either way, this would require the
process that needs the memory to have open-coded code which it
shouldn't really be involved with.

As the current remove_memory() function requires the memory to already
be offlined, it makes sense to keep the offlining and removal of memory
functionality grouped together so that a process can simply make one
request to unplug some memory. Ideally there would be a kernel function
we could call that would offline the memory and then remove it.


    arch/powerpc/platforms/powernv/memtrace.c | 276 ++++++++++++++++++++++++++++++
    1 file changed, 276 insertions(+)
    create mode 100644 arch/powerpc/platforms/powernv/memtrace.c

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
new file mode 100644
index 0000000..86184b1
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -0,0 +1,276 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) IBM Corporation, 2014
+ *
+ * Author: Anton Blanchard <an...@au.ibm.com>
+ */
+
+#define pr_fmt(fmt) "powernv-memtrace: " fmt
+
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/memblock.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/slab.h>
+#include <linux/memory.h>
+#include <linux/memory_hotplug.h>
+#include <asm/machdep.h>
+#include <asm/debugfs.h>
+#include <asm/cacheflush.h>
+
+struct memtrace_entry {
+    void *mem;
+    u64 start;
+    u64 size;
+    u32 nid;
+    struct dentry *dir;
+    char name[16];
+};

Little bit of description about the structure here will help.

Something like 'this enables us to keep track of the memory removed from
each node'?

Right, something like that.

+
+static struct memtrace_entry *memtrace_array;
+static unsigned int memtrace_array_nr;
+
+static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
+                 size_t count, loff_t *ppos)
+{
+    struct memtrace_entry *ent = filp->private_data;
+
+    return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size);
+}
+
+static bool valid_memtrace_range(struct memtrace_entry *dev,
+                 unsigned long start, unsigned long size)
+{
+    if ((dev->start <= start) &&

Switch the position of start and dev->start above. Will make
it easy while reading.

+        ((start + size) <= (dev->start + dev->size)))
+        return true;
+
+    return false;
+}
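
With the operands reordered as suggested above, the check reads as "start is not below the region, and the end is not past it". A stand-alone sketch of that form, using plain integers instead of struct memtrace_entry; the function name is made up for illustration and is not part of the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: returns true when [start, start + size) lies
 * entirely inside [dev_start, dev_start + dev_size).
 * Like the check in the patch, this does not guard against
 * start + size overflowing.
 */
static bool range_within(uint64_t dev_start, uint64_t dev_size,
                         uint64_t start, uint64_t size)
{
    return start >= dev_start &&
           start + size <= dev_start + dev_size;
}
```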
+
+static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+    unsigned long size = vma->vm_end - vma->vm_start;
+    struct memtrace_entry *dev = filp->private_data;
+
+    if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
+        return -EINVAL;
+
+    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

Why do we do this? It's coming from real RAM, not IO memory. Does the
page protection still need to change?

Once the memory is removed from the kernel mappings we want to mark it
as uncachable.

Got it, but why? Uncachable marking is for pages which will be mapped
to IO ranges, which should not be cached in order to prevent the
possibility of stale data.

+
+    if (io_remap_pfn_range(vma, vma->vm_start,
+                   vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
+                   size, vma->vm_page_prot))

You can just call remap_pfn_range() instead, though they are the same.
There is nothing I/O here, so the call should make that explicit.

Good point.

+        return -EAGAIN;
+
+    return 0;
+}
+
+static const struct file_operations memtrace_fops = {
+    .llseek = default_llseek,
+    .read    = memtrace_read,
+    .mmap    = memtrace_mmap,
+    .open    = simple_open,
+};
+
+static int check_memblock_online(struct memory_block *mem, void *arg)
+{
+    if (mem->state != MEM_ONLINE)
+        return -1;
+
+    return 0;
+}
+
+static int change_memblock_state(struct memory_block *mem, void *arg)
+{
+    unsigned long state = (unsigned long)arg;
+
+    mem->state = state;
+    return 0;
+}
+
+static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+{
+    u64 end_pfn = start_pfn + nr_pages - 1;
+
+    if (walk_memory_range(start_pfn, end_pfn, NULL,
+        check_memblock_online))
+        return false;
+
+    walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
+              change_memblock_state);
+

walk_memory_range() might be expensive. Can't we just change the state
to MEM_GOING_OFFLINE while checking for MEM_ONLINE during the first
walk, and bail out if any of the memory blocks is not in MEM_ONLINE in
the first place?

Good idea.
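
One way the combined walk could look, sketched as a user-space model rather than kernel code. The enum values, the array of block states and the function name are all stand-ins chosen for illustration, and the rollback of already-marked blocks is an assumption about how a failed walk would be cleaned up, not something stated in the thread:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel's memory block states (values are arbitrary). */
enum blk_state { BLK_ONLINE, BLK_GOING_OFFLINE, BLK_OFFLINE };

/*
 * Walk the blocks once: each online block is marked as going offline.
 * If a block that is not online is found, the blocks already marked
 * are restored to online and the walk reports failure.
 */
static bool mark_going_offline(enum blk_state *blocks, size_t nr)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        if (blocks[i] != BLK_ONLINE) {
            /* Undo the marking done so far, from i - 1 down to 0. */
            while (i-- > 0)
                blocks[i] = BLK_ONLINE;
            return false;
        }
        blocks[i] = BLK_GOING_OFFLINE;
    }

    return true;
}
```

The trade-off is that a single walk has to undo its own partial work on failure, whereas the two-walk version in the patch never changes state unless every block passed the check.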

+    mem_hotplug_begin();
+    if (offline_pages(start_pfn, nr_pages)) {
+        walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
+                  change_memblock_state);
+        mem_hotplug_done();

Right, this can remain as is. If we fail to offline pages, mark the
memory blocks as MEM_ONLINE again.

+        return false;
+    }
+
+    walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
+              change_memblock_state);
+    mem_hotplug_done();

Right.

+
+    /* Clear the dcache to remove any references to the memory */
+    flush_inval_dcache_range((u64)__va(start_pfn << PAGE_SHIFT),
+                   (u64)__va(end_pfn << PAGE_SHIFT));

I am wondering why this is required now, when we don't do anything for
the cache flushing calls from core VM. If it's really required now, then
it should also be required during memory hot unplug operations in
general.

I could not see if this was being done when removing memory so figured
that it was better to put it in than not do it.

Looking at the definitions I had pointed out before, which get
called from core VM, powerpc does not need to do anything specific
for cache invalidation or flushing. But I am not really sure about
this. So let it be.

/*
    * No cache flushing is required when address mappings are changed,
    * because the caches on PowerPCs are physically addressed.
    */
#define flush_cache_all()            do { } while (0)
#define flush_cache_mm(mm)            do { } while (0)
#define flush_cache_dup_mm(mm)            do { } while (0)
#define flush_cache_range(vma, start, end)    do { } while (0)
#define flush_cache_page(vma, vmaddr, pfn)    do { } while (0)
#define flush_icache_page(vma, page)        do { } while (0)
#define flush_cache_vmap(start, end)        do { } while (0)
#define flush_cache_vunmap(start, end)        do { } while (0)

+
+    /* Now remove memory from the mappings */
+    lock_device_hotplug();
+    remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
+    unlock_device_hotplug();

Right. Now we have successfully taken down the memory.

+
+    return true;
+}
+
+static u64 memtrace_alloc_node(u32 nid, u64 size)
+{
+    u64 start_pfn, end_pfn, nr_pages;
+    u64 base_pfn;
+
+    if (!NODE_DATA(nid) || !node_spanned_pages(nid))
+        return 0;

Why is the NODE_DATA check required here? Each node should have one
allocated and initialized by now, else we have bigger problems.
Is there any specific reason to check for spanned pages instead
of present/managed pages?

Anton wrote this check, so will need to confirm with him. I assume
we check node_spanned_pages() rather than node_present_pages()
because in arch/powerpc/mm/numa.c we set node_spanned_pages() and
not node_present_pages()?

I guess anything is okay, but NODE_DATA() seems redundant.

Agreed.

struct pglist_data {
          ..........................
     unsigned long node_present_pages; /* total number of physical pages */
     unsigned long node_spanned_pages; /* total size of physical page
                          range, including holes */

}

+
+    start_pfn = node_start_pfn(nid);
+    end_pfn = node_end_pfn(nid);
+    nr_pages = size >> PAGE_SHIFT;
+
+    /* Trace memory needs to be aligned to the size */
+    end_pfn = round_down(end_pfn - nr_pages, nr_pages);
+
+    for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
+        if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true)
+            return base_pfn << PAGE_SHIFT;
+    }
+
+    return 0;
+}
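
The search order in memtrace_alloc_node() can be modelled outside the kernel as below. This is only a sketch: round_down_pow2() stands in for the kernel's round_down() under the assumption that nr_pages is a power of two, the try_offline callback stands in for memtrace_offline_pages(), and the two example callbacks exist purely to exercise the loop:

```c
#include <stdint.h>

/* Round x down to a multiple of y; y must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t y)
{
    return x & ~(y - 1);
}

/*
 * Hypothetical model of the search: start from the highest
 * size-aligned slot below end_pfn and step down by nr_pages until
 * try_offline() accepts a slot. Returns the base PFN of the slot,
 * or 0 if no slot could be taken.
 */
static uint64_t find_trace_slot(uint64_t start_pfn, uint64_t end_pfn,
                                uint64_t nr_pages,
                                int (*try_offline)(uint64_t base_pfn))
{
    uint64_t base_pfn;

    if (nr_pages == 0 || end_pfn < nr_pages)
        return 0;

    for (base_pfn = round_down_pow2(end_pfn - nr_pages, nr_pages);
         base_pfn > start_pfn; base_pfn -= nr_pages) {
        if (try_offline(base_pfn))
            return base_pfn;
    }

    return 0;
}

/* Example callbacks: accept every slot, or only slots below PFN 0xe00. */
static int accept_any(uint64_t base_pfn) { (void)base_pfn; return 1; }
static int accept_low(uint64_t base_pfn) { return base_pfn < 0xe00; }
```

Because the loop condition is a strict base_pfn > start_pfn, a slot that begins exactly at the start of the node is never tried in this model, which matches the loop as written in the patch.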
+
+static int memtrace_init_regions_runtime(u64 size)
+{
+    u64 m;
+    u32 nid;
+
+    memtrace_array = kzalloc(sizeof(struct memtrace_entry) *
+                num_online_nodes(), GFP_KERNEL);
+    if (!memtrace_array) {
+        pr_err("Failed to allocate memtrace_array\n");
+        return -EINVAL;
+    }
+
+    for_each_online_node(nid) {
+        m = memtrace_alloc_node(nid, size);
+        /*
+         * A node might not have any local memory, so warn but
+         * continue on.
+         */
+        if (!m) {
+            pr_err("Failed to allocate trace memory on node %d\n",
+                 nid);
+        } else {
+            pr_info("Allocated trace memory on node %d at 0x%016llx\n",
+                 nid, m);
+
+            memtrace_array[memtrace_array_nr].start = m;
+            memtrace_array[memtrace_array_nr].size = size;
+            memtrace_array[memtrace_array_nr].nid = nid;
+            memtrace_array_nr++;
+        }
+    }
+    return 0;
+}

All the pr_info() and pr_err() prints should have a "memtrace: " prefix
before the actual string to make it clear that it is coming from this
feature.

Good point!

+
+static struct dentry *memtrace_debugfs_dir;
+
+static int memtrace_init_debugfs(void)
+{
+    int ret = 0;
+    int i;
+
+    for (i = 0; i < memtrace_array_nr; i++) {
+        struct dentry *dir;
+        struct memtrace_entry *ent = &memtrace_array[i];
+
+        ent->mem = ioremap(ent->start, ent->size);
+        /* Warn but continue on */
+        if (!ent->mem) {
+            pr_err("Failed to map trace memory at 0x%llx\n",
+                 ent->start);
+            ret = -1;
+            continue;
+        }
+
+        snprintf(ent->name, 16, "%08x", ent->nid);
+        dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
+        if (!dir)
+            return -1;
+
+        ent->dir = dir;
+        debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
+        debugfs_create_x64("start", 0400, dir, &ent->start);
+        debugfs_create_x64("size", 0400, dir, &ent->size);
+        debugfs_create_u32("node", 0400, dir, &ent->nid);
+    }

Oh okay, it's creating all four files. Please create documentation
corresponding to each of the files somewhere. Documentation/ABI/testing
lists the actual system ABI under /sys/, not the /sys/kernel/debug/
ones, I guess.

I'm not exactly sure what you are saying here... Seeing that there is
documentation about debugfs files in Documentation/ABI/testing, I'll
follow suit and put it there.

I meant the same.

+
+    return ret;
+}
+
+static u64 memtrace_size;
+
+static int memtrace_enable_set(void *data, u64 val)
+{
+    if (memtrace_size)
+        return -EINVAL;
+
+    if (!val)
+        return -EINVAL;
+
+    /* Make sure size is aligned to a memory block */
+    if (val & (memory_block_size_bytes()-1))

As I have mentioned earlier, this should be mentioned in the interface
description somewhere.

+        return -EINVAL;
+
+    if (memtrace_init_regions_runtime(val))
+        return -EINVAL;
+
+    if (memtrace_init_debugfs())
+        return -EINVAL;
+
+    memtrace_size = val;
+
+    return 0;
+}
+
+static int memtrace_enable_get(void *data, u64 *val)
+{
+    *val = memtrace_size;
+    return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get, memtrace_enable_set, "0x%016llx\n");
+
+static int memtrace_init(void)
+{
+    memtrace_debugfs_dir = debugfs_create_dir("memtrace", powerpc_debugfs_root);
+    if (!memtrace_debugfs_dir)
+        return -1;
+
+    debugfs_create_file("enable", 0600, memtrace_debugfs_dir,
+                NULL, &memtrace_init_fops);
+
+    return 0;
+}
+machine_device_initcall(powernv, memtrace_init);
+

BTW, how do we start the tracing process so that the trace is collected
in the interface before we can read it? This interface does not seem to
have a handler. When does it direct the HW to start collecting the
traces?

debugfs_create_x64("start", 0400, dir, &ent->start);

I think you're asking 'what is actually going to call this code and do
the tracing'?

No. When you call this interface, where is the routine that starts the
actual tracing by invoking the appropriate platform functions or HW
instructions? I don't see such a function associated with the 'start'
interface mentioned above.

Essentially,
DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get,
memtrace_enable_set, "0x%016llx\n");

means that when a number is written to the memtrace/enable file, the
number is read as a u64 and passed to memtrace_enable_set()
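
As a rough user-space analogue of that write path, the sketch below turns the text written to the file into a 64-bit value before it would be handed to a setter. It is not the kernel's parsing code, which is not shown in this patch; parse_size() is a hypothetical name, and accepting one trailing newline is an assumption made so that input from `echo` parses cleanly:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative only: convert the written text into a 64-bit value.
 * Base 0 lets strtoull() accept decimal, octal (leading 0) and
 * hexadecimal (leading 0x) input.
 * Returns 0 on success, -EINVAL on malformed input.
 */
static int parse_size(const char *buf, uint64_t *val)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(buf, &end, 0);
    if (errno || end == buf)
        return -EINVAL;

    /* Tolerate a single trailing newline, then require end of string. */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    *val = v;
    return 0;
}
```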

Right, then we create the following interfaces for each entry of the memory
trace. By now all the memory ranges are ioremapped.

+               debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
+               debugfs_create_x64("start", 0400, dir, &ent->start);
+               debugfs_create_x64("size", 0400, dir, &ent->size);
+               debugfs_create_u32("node", 0400, dir, &ent->nid);

Then what really starts the HW trace? Where do we tell the HW to start
tracing and put the trace details in the memory buffer that was
allocated and ioremapped? IIUC, just ioremapping the memory won't start
the trace automatically.

Ah sorry, misunderstood you. Yes, this patch is only to remove the
memory from kernel mappings so the hardware tracing can use it - not the
actual code that invokes the tracing. That code involves some hardware
information that hasn't been made publicly accessible, so it is not my
place to share that. The hardware trace will be written to the 'removed'
memory directly by the hardware and the debugfs files allow us to read
the trace without having to reonline the memory.
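
For the read side, a small sketch of how a user-space tool might locate the per-node files. It assumes debugfs is mounted at /sys/kernel/debug and relies on the "%08x" format that the patch uses for ent->name; the helper name and the macro are illustrative and not part of the patch:

```c
#include <stdio.h>

/* Assumed location; depends on where debugfs is mounted. */
#define MEMTRACE_DEBUGFS_DIR "/sys/kernel/debug/powerpc/memtrace"

/*
 * Build the path of a per-node file such as "trace", "start", "size"
 * or "node". The directory name follows the "%08x" format the patch
 * passes to snprintf() for ent->name.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int memtrace_file_path(char *buf, size_t len, unsigned int nid,
                              const char *file)
{
    int n = snprintf(buf, len, "%s/%08x/%s", MEMTRACE_DEBUGFS_DIR, nid, file);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting path can then be opened read-only and consumed with read() or mmap(), since memtrace_fops in the patch provides both.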

Re: [PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing
