On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:56, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 14:45, Gleb Natapov wrote:
>>>>
>>>>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>>>>>>
>>>>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>>>>>>>
>>>>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>>>>>>>
>>>>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>>>>>>>> do a page walk and emulation. But with EPT, it does not have to be:
>>>>>>>>>>> we know the address from the VMCS, so if the address is unique, we
>>>>>>>>>>> can look up the eventfd directly, bypassing emulation.
>>>>>>>>>>>
>>>>>>>>>>> Add an interface for userspace to specify this per address; we can
>>>>>>>>>>> use this e.g. for virtio.
>>>>>>>>>>>
>>>>>>>>>>> The implementation adds a separate bus internally. This serves two
>>>>>>>>>>> purposes:
>>>>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>>>>>>>> - minimize disruption in other code (since we don't know the length,
>>>>>>>>>>>   devices on the MMIO bus only get a valid address in write; this
>>>>>>>>>>>   way we don't need to touch all devices to teach them to handle
>>>>>>>>>>>   an invalid length)
>>>>>>>>>>>
>>>>>>>>>>> At the moment, this optimization is only supported for EPT on x86
>>>>>>>>>>> and silently ignored for NPT and MMU, so everything works correctly
>>>>>>>>>>> but slowly.
>>>>>>>>>>>
>>>>>>>>>>> TODO: NPT, MMU and non-x86 architectures.
>>>>>>>>>>>
>>>>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>>>>>>>> pre-review and suggestions.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>>>>>>>>>>
>>>>>>>>>> This still uses page fault intercepts, which are orders of magnitude
>>>>>>>>>> slower than hypercalls. Why don't you just create a PV MMIO
>>>>>>>>>> hypercall that the guest can use to invoke MMIO accesses towards the
>>>>>>>>>> host based on physical addresses with explicit length encodings?
>>>>>>>>>>
>>>>>>>>> It is slower, but not an order of magnitude slower. It becomes faster
>>>>>>>>> with newer HW.
>>>>>>>>>
>>>>>>>>>> That way you simplify and speed up all code paths, exceeding the
>>>>>>>>>> speed of PIO exits even. It should also be quite easily portable, as
>>>>>>>>>> all other platforms have hypercalls available as well.
>>>>>>>>>>
>>>>>>>>> We are trying to avoid PV as much as possible (well, this is also PV,
>>>>>>>>> but not guest visible).
>>>>>>>>
>>>>>>>> Also, how is this not guest visible? Who sets
>>>>>>>> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates
>>>>>>>> that the guest does so, so it is guest visible.
>>>>>>>>
>>>>>>> QEMU sets it.
>>>>>>
>>>>>> How does QEMU know?
>>>>>>
>>>>> Knows what? When to create such an eventfd? The virtio device knows.
>>>>
>>>> Where does it know from?
>>>>
>>> It does it always.
>>>
>>>>>
>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
>>>>>>>> + * are writes of specified length, starting at the specified address.
>>>>>>>> + * If not - it's a Guest bug.
>>>>>>>> + * Can not be used together with either PIO or DATAMATCH.
>>>>>>>> + */
>>>>>>>>
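For concreteness, this is roughly what "QEMU sets it" means on the userspace side: a minimal sketch, not the actual patch or QEMU code, of registering a guest-physical kick address as an ioeventfd with the proposed flag. KVM_IOEVENTFD_FLAG_PV_MMIO exists only in the patch under discussion, and the helper name, address argument and 2-byte length are illustrative assumptions.

    /* Hypothetical userspace helper: ask KVM to signal an eventfd on writes
     * to kick_gpa instead of emulating them. */
    #include <string.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int register_pv_mmio_kick(int vm_fd, __u64 kick_gpa)
    {
        struct kvm_ioeventfd ioev;
        int efd = eventfd(0, 0);

        if (efd < 0)
            return -1;

        memset(&ioev, 0, sizeof(ioev));
        ioev.addr  = kick_gpa;  /* guest-physical address of the notify register */
        ioev.len   = 2;         /* guest promises 16-bit writes here (assumption) */
        ioev.fd    = efd;
        ioev.flags = KVM_IOEVENTFD_FLAG_PV_MMIO;  /* no PIO, no DATAMATCH */

        if (ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0)
            return -1;

        return efd;             /* the device model polls/reads this fd for kicks */
    }

With EPT, the faulting guest-physical address is available in the VMCS, so KVM can match it against such a registration and signal the eventfd without decoding the instruction; that is the speedup being debated below.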
>>>>>>> Virtio spec will state that access to a kick register needs to be of a
>>>>>>> specific length. This is a reasonable thing for HW to ask.
>>>>>>
>>>>>> This is a spec change. So the guest would have to indicate that it
>>>>>> adheres to a newer spec. Thus it's a guest visible change.
>>>>>>
>>>>> There is no virtio spec yet that has a kick register in MMIO. The spec is
>>>>> in the works AFAIK. Actually, PIO will not be deprecated and my suggestion
>>>>
>>>> So the guest would indicate that it supports a newer revision of the spec
>>>> (in your case, that it supports MMIO). How is that any different from
>>>> exposing that it supports a PV MMIO hcall?
>>>>
>>> Guest will indicate nothing. A new driver will use MMIO if the PIO bar is
>>> not configured. An old driver will not work for virtio devices with an MMIO
>>> bar but no PIO bar.
>>
>> I can't parse that, sorry :).
>
> It's simple. The driver does iowrite16 or whatever is appropriate for the OS.
> QEMU tells KVM which address the driver uses, to make exits faster. This is
> not different from how eventfd works. For example, if exits to QEMU suddenly
> become very cheap we can remove eventfd completely.
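To make the "driver does iowrite16" point concrete, here is a minimal guest-side sketch; notify_base and queue_index are placeholders, not names from any spec text. The driver performs an ordinary 16-bit MMIO write, and whether the resulting exit goes through full instruction emulation or is matched directly against a registered address is entirely the host's business.

    /* Guest-side sketch: kick the device by writing the queue index to its
     * notify register. Nothing here is KVM-aware. */
    #include <linux/io.h>
    #include <linux/types.h>

    static void virtqueue_mmio_kick(void __iomem *notify_base, u16 queue_index)
    {
        iowrite16(queue_index, notify_base);
    }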
>
>>>
>>>>> is to move to MMIO only when PIO address space is exhausted. For PCI it
>>>>> will be never, for PCI-e it will be after ~16 devices.
>>>>
>>>> Ok, let's go back a step here. Are you actually able to measure any speedup
>>>> in performance with this patch applied and without, when going through MMIO
>>>> kicks?
>>>>
>>> That's the question for MST. I think he did only micro-benchmarks till
>>> now and he already posted his results here:
>>>
>>> mmio-wildcard-eventfd:pci-mem   3529
>>> mmio-pv-eventfd:pci-mem         1878
>>> portio-wildcard-eventfd:pci-io  1846
>>>
>>> So the patch speeds up MMIO by almost 100% and it is almost the same as PIO.
>>
>> Those numbers don't align at all with what I measured.
>
> Yep. But why?
> Could be different hardware. My laptop is an i7; what did you measure on?
>
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 42
> model name      : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
> stepping        : 7
> microcode       : 0x28
> cpu MHz         : 2801.000
> cache size      : 4096 KB

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 8
model name      : Six-Core AMD Opteron(tm) Processor 8435
stepping        : 0
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 6
core id         : 0
cpu cores       : 6
apicid          : 8
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save pausefilter
bogomips        : 5199.87
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

> Or could be different software; this is on top of 3.9.0-rc5, what
> did you try?

3.0 plus kvm-kmod of whatever was current back in autumn :).

>
>> MST, could you please do a real-world latency benchmark with virtio-net and
>>
>> * normal ioeventfd
>> * mmio-pv eventfd
>> * hcall eventfd
>
> I can't do this right away, sorry. For MMIO we are discussing the new
> layout on the virtio mailing list; guest and qemu need a patch for this
> too. My hcall patches are stale and would have to be brought up to
> date.
>
>> to give us some idea how much performance we would gain from each approach?
>> Throughput should be completely unaffected anyway, since virtio just
>> coalesces kicks internally.
>
> Latency is dominated by the scheduling latency.
> This means virtio-net is not the best benchmark.

So what is a good benchmark? Is there any difference in speed at all? I
strongly doubt it. One of virtio's main points is to reduce the number of
kicks.

>> I'm also slightly puzzled why the wildcard eventfd mechanism is so
>> significantly slower, while it was only a few percent on my test system.
>> What are the numbers you're listing above? Cycles? How many cycles do you
>> execute in a second?
>>
>>
>> Alex
>
> It's the TSC delta divided by the number of iterations. kvm unittest reports
> this value; here's what it does (removed some dead code):
>
> #define GOAL (1ull << 30)
>
>     do {
>         iterations *= 2;
>         t1 = rdtsc();
>
>         for (i = 0; i < iterations; ++i)
>             func();
>         t2 = rdtsc();
>     } while ((t2 - t1) < GOAL);
>     printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));

So it's the number of cycles per run. That means, translated, my numbers are:

MMIO:  4307
PIO:   3658
HCALL: 1756

MMIO - PIO = 649, which aligns roughly with your PV MMIO callback.

My MMIO benchmark was to poke the LAPIC version register. That does go through
instruction emulation, no?


Alex
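For scale, here is a rough conversion of the cycle counts quoted in the thread into wall-clock time, assuming the TSC on the 2.8 GHz i7 above ticks at its nominal rate (an assumption; the Opteron numbers would need that machine's own TSC rate). This is illustrative arithmetic only, not part of the benchmark:

    /* Convert cycles-per-exit figures from this thread into approximate
     * nanoseconds at a 2.8 GHz TSC. */
    #include <stdio.h>

    int main(void)
    {
        const double tsc_hz = 2.80e9;
        const struct { const char *name; int cycles; } results[] = {
            { "mmio-wildcard-eventfd",   3529 },
            { "mmio-pv-eventfd",         1878 },
            { "portio-wildcard-eventfd", 1846 },
        };

        for (unsigned int i = 0; i < sizeof(results) / sizeof(results[0]); i++)
            printf("%-25s %4d cycles ~= %.0f ns per exit\n",
                   results[i].name, results[i].cycles,
                   results[i].cycles / tsc_hz * 1e9);

        return 0;
    }

At that rate the wildcard MMIO exit costs about 1.26 us, while the PV MMIO path (about 670 ns) lands essentially on top of PIO (about 660 ns), which is the comparison Gleb draws above.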