Re: [Qemu-devel] [PATCH v2] target-i386: present virtual L3 cache info for vcpus

Longpeng (Mike) Thu, 01 Sep 2016 18:53:15 -0700

Hi Michael,

On 2016/9/1 21:27, Michael S. Tsirkin wrote:


> On Thu, Sep 01, 2016 at 02:58:05PM +0800, l00371263 wrote:
>> From: "Longpeng(Mike)" <longpe...@huawei.com>
>>
>> Some software algorithms are based on the hardware's cache info. For example,
>> in the x86 Linux kernel, when cpu1 wants to wake up a task on cpu2, cpu1
>> triggers a resched IPI and tells cpu2 to do the wakeup if they don't share a
>> last-level cache (LLC). Conversely, cpu1 accesses cpu2's runqueue directly if
>> they share an LLC. The relevant Linux kernel code is as below:
>>
>>      static void ttwu_queue(struct task_struct *p, int cpu)
>>      {
>>              struct rq *rq = cpu_rq(cpu);
>>              ......
>>              if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>>                      ......
>>                      ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>>                      return;
>>              }
>>              ......
>>              ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>>              ......
>>      }
>>
>> On real hardware, the cpus on the same socket share the L3 cache, so one won't
>> trigger a resched IPI when waking up a task on another. But QEMU doesn't
>> present virtual L3 cache info to the VM, so the Linux guest triggers lots of
>> RES IPIs under some workloads even if the virtual cpus belong to the same
>> virtual socket.
>>
>> For KVM, this degrades performance, because there will be lots of vmexits
>> caused by the guest sending IPIs.
>>
>> The workload is a SAP HANA testsuite. We ran it for one round (about 40
>> minutes) and observed the number of RES IPIs triggered in the (Suse11sp3)
>> guest during that period:
>>
>>              No-L3           With-L3 (this patch applied)
>> cpu0:        363890          44582
>> cpu1:        373405          43109
>> cpu2:        340783          43797
>> cpu3:        333854          43409
>> cpu4:        327170          40038
>> cpu5:        325491          39922
>> cpu6:        319129          42391
>> cpu7:        306480          41035
>> cpu8:        161139          32188
>> cpu9:        164649          31024
>> cpu10:       149823          30398
>> cpu11:       149823          32455
>> cpu12:       164830          35143
>> cpu13:       172269          35805
>> cpu14:       179979          33898
>> cpu15:       194505          32754
>> avg:         268963.6        40129.8
>>
>> The VM's topology is "1*socket 8*cores 2*threads".
>> After presenting virtual L3 cache info to the VM, the number of RES IPIs in
>> the guest drops by 85%.
>>
>> We also tested the overall system performance when the vcpus actually run on
>> separate physical sockets. With L3 cache, the performance improves by
>> 7.2%~33.1% (avg: 15.7%).
> 
> Any idea why?  I'm guessing that on bare metal, it is
> sometimes cheaper to send IPIs with a separate cache, but on KVM,
> it is always cheaper to use memory, as this reduces the # of exits.
> Is this it?

Yeah, I think so. A vmexit caused by a vcpu sending an IPI is expensive.
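
To make the effect concrete, here is a toy model of the wakeup decision (not
kernel code). As far as I understand, cpus_share_cache() boils down to
comparing the last-level-cache domain of the two cpus. The struct, the toy_*
names, and the assumption that without an L3 the LLC is the L2 shared by the
threads of one core are all mine, for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical guest topology; all names here are illustrative only. */
struct toy_topology {
    int cores_per_socket;
    int threads_per_core;
    bool has_l3;            /* does the guest see a socket-wide L3? */
};

/* Id of the last-level-cache domain that `cpu` belongs to. */
static int toy_llc_id(const struct toy_topology *t, int cpu)
{
    int cpus_per_socket = t->cores_per_socket * t->threads_per_core;

    assert(cpu >= 0);
    /*
     * Assumption: with an L3 the LLC spans the whole socket; without it
     * the LLC is the L2, shared only by the threads of one core.
     */
    return t->has_l3 ? cpu / cpus_per_socket : cpu / t->threads_per_core;
}

/* Toy stand-in for the kernel's cpus_share_cache(). */
static bool toy_cpus_share_cache(const struct toy_topology *t, int a, int b)
{
    return toy_llc_id(t, a) == toy_llc_id(t, b);
}

/*
 * Count ordered (waker, wakee) pairs of distinct cpus for which the
 * waker would have to send a RES IPI instead of touching the remote
 * runqueue directly.
 */
static int toy_count_ipi_pairs(const struct toy_topology *t, int nr_cpus)
{
    int pairs = 0;

    for (int waker = 0; waker < nr_cpus; waker++) {
        for (int wakee = 0; wakee < nr_cpus; wakee++) {
            if (waker != wakee && !toy_cpus_share_cache(t, waker, wakee)) {
                pairs++;
            }
        }
    }
    return pairs;
}
```

For the "1*socket 8*cores 2*threads" guest, this model needs an IPI for 224 of
the 240 possible (waker, wakee) pairs without an L3 and for none of them with
an L3. The real reduction depends on the workload, of course.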

> 
> It's worth listing here so that e.g. if it ever becomes possible to send
> IPIs without exits, we know we need to change this code.
> 

OK! I will add this in the next version.

>> Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
>> ---
>> Hi Eduardo, 
>>
>> Changes since v1:
>>   - fix the compat problem: set compat_props on PC_COMPAT_2_7.
>>   - fix an "intentionally introduced bug": make Intel's and AMD's behavior
>>     consistent.
>>   - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
>>   - test the performance when vcpus run on separate sockets: with L3 cache,
>>     the performance improves 7.2%~33.1% (avg: 15.7%).
>> ---
>>  include/hw/i386/pc.h |  8 ++++++++
>>  target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>>  target-i386/cpu.h    |  3 +++
>>  3 files changed, 55 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 74c175c..6072625 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>>  int e820_get_num_entries(void);
>>  bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>>  
>> +#define PC_COMPAT_2_7 \
>> +    {\
>> +        .driver   = TYPE_X86_CPU,\
>> +        .property = "compat-cache",\
>> +        .value    = "on",\
>> +    },
>> +
>>  #define PC_COMPAT_2_6 \
>> +    PC_COMPAT_2_7 \
>>      HW_COMPAT_2_6 \
>>      {\
>>          .driver   = "fw_cfg_io",\
> 
> 
> Could this get a more informative name?
> E.g. l3-cache-shared ?

Thanks! I will take your suggestion in the next version.

> 
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 6a1afab..224d967 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -57,6 +57,7 @@
>>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
>> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>>  
>>  
>>  /* CPUID Leaf 4 constants: */
>> @@ -131,11 +132,18 @@
>>  #define L2_LINES_PER_TAG       1
>>  #define L2_SIZE_KB_AMD       512
>>  
>> -/* No L3 cache: */
>> +/* Level 3 unified cache: */
>>  #define L3_SIZE_KB             0 /* disabled */
>>  #define L3_ASSOCIATIVITY       0 /* disabled */
>>  #define L3_LINES_PER_TAG       0 /* disabled */
>>  #define L3_LINE_SIZE           0 /* disabled */
>> +#define L3_N_LINE_SIZE         64
>> +#define L3_N_ASSOCIATIVITY     16
>> +#define L3_N_SETS           16384
>> +#define L3_N_PARTITIONS         1
>> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
>> +#define L3_N_LINES_PER_TAG      1
>> +#define L3_N_SIZE_KB_AMD    16384
>>  
>>  /* TLB definitions: */
>>  
>> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>  {
>>      X86CPU *cpu = x86_env_get_cpu(env);
>>      CPUState *cs = CPU(cpu);
>> +    uint32_t pkg_offset;
>>  
>>      /* test if maximum index reached */
>>      if (index & 0x80000000) {
>> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          }
>>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>>          *ebx = 0;
>> -        *ecx = 0;
>> +        if (cpu->enable_compat_cache) {
>> +            *ecx = 0;
>> +        } else {
>> +            *ecx = L3_N_DESCRIPTOR;
>> +        }
>>          *edx = (L1D_DESCRIPTOR << 16) | \
>>                 (L1I_DESCRIPTOR <<  8) | \
>>                 (L2_DESCRIPTOR);
>> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>                  *ecx = L2_SETS - 1;
>>                  *edx = CPUID_4_NO_INVD_SHARING;
>>                  break;
>> +            case 3: /* L3 cache info */
>> +                if (cpu->enable_compat_cache) {
>> +                    *eax = 0;
>> +                    *ebx = 0;
>> +                    *ecx = 0;
>> +                    *edx = 0;
>> +                    break;
>> +                }
>> +                *eax |= CPUID_4_TYPE_UNIFIED | \
>> +                        CPUID_4_LEVEL(3) | \
>> +                        CPUID_4_SELF_INIT_LEVEL;
>> +                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
>> +                *eax |= ((1 << pkg_offset) - 1) << 14;
>> +                *ebx = (L3_N_LINE_SIZE - 1) | \
>> +                       ((L3_N_PARTITIONS - 1) << 12) | \
>> +                       ((L3_N_ASSOCIATIVITY - 1) << 22);
>> +                *ecx = L3_N_SETS - 1;
>> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
>> +                break;
>>              default: /* end of info */
>>                  *eax = 0;
>>                  *ebx = 0;
>> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
>> -        *edx = ((L3_SIZE_KB/512) << 18) | \
>> -               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> -               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> +        if (cpu->enable_compat_cache) {
>> +            *edx = ((L3_SIZE_KB / 512) << 18) | \
>> +                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> +                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> +        } else {
>> +            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
>> +                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
>> +                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
>> +        }
>>          break;
>>      case 0x80000007:
>>          *eax = 0;
>> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
>> +    DEFINE_PROP_BOOL("compat-cache", X86CPU, enable_compat_cache, false),
>>      DEFINE_PROP_END_OF_LIST()
>>  };
>>  
>> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
>> index 65615c0..61ef4e3 100644
>> --- a/target-i386/cpu.h
>> +++ b/target-i386/cpu.h
>> @@ -1202,6 +1202,9 @@ struct X86CPU {
>>       */
>>      bool enable_lmce;
>>  
>> +    /* Compatibility bits for old machine types */
>> +    bool enable_compat_cache;
>> +
>>      /* Compatibility bits for old machine types: */
>>      bool enable_cpuid_0xb;
>>  
>> -- 
>> 1.8.3.1
>>
> 
> .
> 
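
As a sanity check of the CPUID.(EAX=4, ECX=3) values in the patch, the
following standalone sketch recomputes the sharing field EAX[25:14] and
decodes the cache size the way a guest would. The sketch_* helpers are
hypothetical names of mine, not QEMU functions; sketch_pkg_offset() only
mirrors what I assume apicid_pkg_offset() returns (SMT bits plus core bits):

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to number `count` items (0 when there is a single item). */
static unsigned sketch_bitwidth_for_count(unsigned count)
{
    unsigned width = 0;

    assert(count >= 1);
    while ((1u << width) < count) {
        width++;
    }
    return width;
}

/* Assumed equivalent of apicid_pkg_offset(): core bits plus thread bits. */
static unsigned sketch_pkg_offset(unsigned nr_cores, unsigned nr_threads)
{
    return sketch_bitwidth_for_count(nr_cores) +
           sketch_bitwidth_for_count(nr_threads);
}

/*
 * Value stored in EAX[25:14]: maximum number of addressable logical
 * processor ids sharing the cache, minus one.
 */
static uint32_t sketch_l3_sharing_field(unsigned nr_cores, unsigned nr_threads)
{
    return (1u << sketch_pkg_offset(nr_cores, nr_threads)) - 1;
}

/* Build EBX of leaf 4 from line size, partitions and ways. */
static uint32_t sketch_leaf4_ebx(uint32_t line_size, uint32_t partitions,
                                 uint32_t ways)
{
    return (line_size - 1) | ((partitions - 1) << 12) | ((ways - 1) << 22);
}

/* Decode the cache size in bytes from EBX and ECX of leaf 4. */
static uint64_t sketch_leaf4_cache_bytes(uint32_t ebx, uint32_t ecx)
{
    uint64_t line_size  = (ebx & 0xfff) + 1;          /* EBX[11:0]  */
    uint64_t partitions = ((ebx >> 12) & 0x3ff) + 1;  /* EBX[21:12] */
    uint64_t ways       = ((ebx >> 22) & 0x3ff) + 1;  /* EBX[31:22] */
    uint64_t sets       = (uint64_t)ecx + 1;          /* ECX[31:0]  */

    return ways * partitions * line_size * sets;
}
```

With 8 cores and 2 threads the package offset is 4, so EAX[25:14] becomes 15
(16 addressable ids share the L3). With L3_N_LINE_SIZE = 64,
L3_N_PARTITIONS = 1, L3_N_ASSOCIATIVITY = 16 and L3_N_SETS = 16384, the
decoded size is 16 MiB, which matches L3_N_SIZE_KB_AMD and the
CPUID_2_L3_16MB_16WAY_64B descriptor.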


-- 
Regards,
Longpeng(Mike)
