On 08/20/2013 09:41 PM, Andrew Jones wrote:
> 
> 
> ----- Original Message -----
>> Set the guest NUMA nodes' memory policies using the mbind(2)
>> system call, node by node.
>> After this patch, we are able to set guest node memory policies
>> through the QEMU options; this aims to solve the guest cross-node
>> memory access performance issue.
>> And as you all know, if PCI passthrough is used, the
>> direct-attached device uses DMA transfers between the device and the
>> QEMU process, so all pages of the guest will be pinned by
>> get_user_pages().
>>
>> KVM_ASSIGN_PCI_DEVICE ioctl
>>   kvm_vm_ioctl_assign_device()
>>     => kvm_assign_device()
>>       => kvm_iommu_map_memslots()
>>         => kvm_iommu_map_pages()
>>           => kvm_pin_pages()
>>
>> So, with a direct-attached device, every guest page's page count is
>> raised by one and page migration will not work; AutoNUMA won't work
>> either.
>>
>> Therefore we should set the guest nodes' memory allocation policies
>> before the pages are really mapped.
>>
>> Signed-off-by: Andre Przywara <andre.przyw...@amd.com>
>> Signed-off-by: Wanlong Gao <gaowanl...@cn.fujitsu.com>
>> ---
>>  numa.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 89 insertions(+)
>>
>> diff --git a/numa.c b/numa.c
>> index 436b8e0..b2c0048 100644
>> --- a/numa.c
>> +++ b/numa.c
>> @@ -28,6 +28,16 @@
>>  #include "qapi-visit.h"
>>  #include "qapi/opts-visitor.h"
>>  #include "qapi/dealloc-visitor.h"
>> +#include "exec/memory.h"
>> +
>> +#ifdef CONFIG_NUMA
>> +#include <numa.h>
>> +#include <numaif.h>
>> +#ifndef MPOL_F_RELATIVE_NODES
>> +#define MPOL_F_RELATIVE_NODES (1 << 14)
>> +#define MPOL_F_STATIC_NODES   (1 << 15)
>> +#endif
>> +#endif
>>  
>>  QemuOptsList qemu_numa_opts = {
>>      .name = "numa",
>> @@ -209,6 +219,78 @@ void set_numa_nodes(void)
>>      }
>>  }
>>  
>> +#ifdef CONFIG_NUMA
>> +static int node_parse_bind_mode(unsigned int nodeid)
>> +{
>> +    int bind_mode;
>> +
>> +    switch (numa_info[nodeid].policy) {
>> +    case NUMA_NODE_POLICY_MEMBIND:
>> +        bind_mode = MPOL_BIND;
>> +        break;
>> +    case NUMA_NODE_POLICY_INTERLEAVE:
>> +        bind_mode = MPOL_INTERLEAVE;
>> +        break;
>> +    case NUMA_NODE_POLICY_PREFERRED:
>> +        bind_mode = MPOL_PREFERRED;
>> +        break;
>> +    case NUMA_NODE_POLICY_DEFAULT:
>> +    default:
>> +        bind_mode = MPOL_DEFAULT;
>> +        return bind_mode;
>> +    }
>> +
>> +    bind_mode |= numa_info[nodeid].relative ?
>> +        MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
>> +
>> +    return bind_mode;
>> +}
>> +#endif
>> +
>> +static int set_node_mem_policy(int nodeid)
>> +{
>> +#ifdef CONFIG_NUMA
>> +    void *ram_ptr;
>> +    RAMBlock *block;
>> +    ram_addr_t len, ram_offset = 0;
>> +    int bind_mode;
>> +    int i;
>> +
>> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>> +        if (!strcmp(block->mr->name, "pc.ram")) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (block->host == NULL) {
>> +        return -1;
>> +    }
>> +
>> +    ram_ptr = block->host;
>> +    for (i = 0; i < nodeid; i++) {
>> +        len = numa_info[i].node_mem;
>> +        ram_offset += len;
>> +    }
>> +
>> +    len = numa_info[i].node_mem;
>> +    bind_mode = node_parse_bind_mode(i);
>> +
>> +    /* This is a workaround for a long standing bug in Linux'
>> +     * mbind implementation, which cuts off the last specified
>> +     * node. To stay compatible should this bug be fixed, we
>> +     * specify one more node and zero this one out.
>> +     */
>> +    clear_bit(numa_num_configured_nodes() + 1, numa_info[i].host_mem);
>> +    if (mbind(ram_ptr + ram_offset, len, bind_mode,
>> +              numa_info[i].host_mem, numa_num_configured_nodes() + 1, 0)) {
>> +        perror("mbind");
>> +        return -1;
>> +    }
> 
> From my quick read of this patch series, I think these two calls of
> numa_num_configured_nodes() are the only places where libnuma is used.
> Is it really worth the new dependency? Actually, libnuma will only calculate
> what it returns from numa_num_configured_nodes() once, because it simply
> counts bits in a bitmask that it only initializes at library load time. So
> it would be more robust with respect to node onlining/offlining to avoid
> libnuma and just fetch the information from sysfs as needed anyway. In this
> particular code, though, I think replacing numa_num_configured_nodes() with
> a maxnode, where
> 
> unsigned long maxnode = find_last_bit(numa_info[i].host_mem, MAX_CPUMASK_BITS)

Sorry, I can't understand this: numa_num_configured_nodes() is about the host,
so why could we replace it with the last bit of the guest's setting?

Thanks,
Wanlong Gao

> 
> would work the best.
> 
> Another comment I have on this function is that I'd prefer to see something
> like
> 
> unsigned long *nodes = numa_info[nodeid].host_mem;
> 
> at the top, and then use that shorter name, rather than abusing the fact
> that i == nodeid after the loop, presumably just to keep the name short.
> 
> drew
> 
>> +#endif
>> +
>> +    return 0;
>> +}
>> +
>>  void set_numa_modes(void)
>>  {
>>      CPUState *cpu;
>> @@ -221,4 +303,11 @@ void set_numa_modes(void)
>>              }
>>          }
>>      }
>> +
>> +    for (i = 0; i < nb_numa_nodes; i++) {
>> +        if (set_node_mem_policy(i) == -1) {
>> +            fprintf(stderr,
>> +                    "qemu: can not set host memory policy for node%d\n", i);
>> +        }
>> +    }
>>  }
>> -- 
>> 1.8.4.rc1.21.gfb56570
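
For what it's worth, a minimal sketch of set_node_mem_policy() with both of
drew's suggestions above folded in might look like the following: a nodes
alias declared at the top instead of reusing the loop index, and a maxnode
taken from the configured host mask with find_last_bit() rather than from
libnuma. It assumes the numa_info[] layout used by this series and QEMU's
find_last_bit()/MAX_CPUMASK_BITS helpers; it only illustrates the review
comments and is not code posted in the thread.

static int set_node_mem_policy(int nodeid)
{
#ifdef CONFIG_NUMA
    /* Alias the host node mask once, rather than reusing the loop index
     * i after the loop just to keep the name short. */
    unsigned long *nodes = numa_info[nodeid].host_mem;
    RAMBlock *block = NULL;
    void *ram_ptr;
    ram_addr_t len, ram_offset = 0;
    unsigned long maxnode;
    int bind_mode;
    int i;

    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
        if (!strcmp(block->mr->name, "pc.ram")) {
            break;
        }
    }
    if (!block || block->host == NULL) {
        return -1;
    }

    ram_ptr = block->host;
    for (i = 0; i < nodeid; i++) {
        ram_offset += numa_info[i].node_mem;
    }
    len = numa_info[nodeid].node_mem;
    bind_mode = node_parse_bind_mode(nodeid);

    /* Highest host node requested for this guest node, taken from the
     * user-supplied mask rather than from libnuma.  find_last_bit()
     * returns the size argument if no bit is set. */
    maxnode = find_last_bit(nodes, MAX_CPUMASK_BITS);
    if (maxnode == MAX_CPUMASK_BITS) {
        return -1;    /* no host nodes specified for this guest node */
    }

    /* "+ 2": "+ 1" turns the bit index into a node count, and one more
     * node covers the mbind() quirk described in the patch comment. */
    if (mbind(ram_ptr + ram_offset, len, bind_mode, nodes, maxnode + 2, 0)) {
        perror("mbind");
        return -1;
    }
#endif

    return 0;
}

A side effect of deriving maxnode from the mask itself is that the clear_bit()
workaround becomes unnecessary: the bit just past the last set bit is clear by
definition, so the extra node passed for the mbind() quirk is already zero.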