2015-12-16 16:46 GMT+01:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> On 16/12/2015 15:25, Vincenzo Maffione wrote:
>>> vhost-net actually had better performance, so virtio-net dataplane
>>> was never committed. As Michael mentioned, in practice on Linux you
>>> use vhost, and on non-Linux hypervisors you do not use QEMU. :)
>>
>> Yes, I understand. However, another possible use case would be using
>> QEMU + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed
>> packet generators or packet processors, where very high packet rates
>> are common), where it is not possible to use vhost.
>
> Yes, of course. That was tongue in cheek. Another possibility for your
> use case is to interface with netmap through vhost-user, but I'm happy
> if you choose to improve virtio.c instead!
>
>>> The main optimization that vring.c has is to cache the translation of
>>> the rings. Using address_space_map/unmap for rings in virtio.c would
>>> be a noticeable improvement, as your numbers for patch 3 show.
>>> However, by caching translations you also conveniently "forget" to
>>> promptly mark the pages as dirty. As you pointed out, this is
>>> obviously an issue for migration. You can then add a notifier for
>>> runstate changes. When entering RUN_STATE_FINISH_MIGRATE or
>>> RUN_STATE_SAVE_VM the rings would be unmapped, and then remapped the
>>> next time the VM starts running again.
>>
>> Ok, so it seems feasible with a bit of care. The numbers we've been
>> seeing in various experiments have always shown that this optimization
>> could easily double the 2 Mpps packet rate bottleneck.
>
> Cool. Bonus points for nicely abstracting it so that virtio.c is just a
> user.
>
>>> You also guessed right that there are consistency issues; for these
>>> you can add a MemoryListener that invalidates all mappings.
>>
>> Yeah, but I don't know exactly what kind of inconsistencies there can
>> be. Maybe the memory we are mapping may be hot-unplugged?
>
> Yes. Just blow away all mappings in the MemoryListener commit callback.
>
>>> That said, I'm wondering where the cost of address translation
>>> lies---is it cache-unfriendly data structures, locked operations, or
>>> simply too much code to execute? It was quite surprising to me that
>>> on virtio-blk benchmarks we were spending 5% of the time doing
>>> memcpy! (I have just extracted from my branch the patches to remove
>>> that, and sent them to qemu-devel).
>>
>> I feel it's just too much code (but I may be wrong).
>
> That is likely to be a good guess, but notice that the fast path doesn't
> actually have _that much_ code, because a lot of the "if"s are almost
> always false.
>
> Looking at a profile would be useful. Is it flat, or does something
> (e.g. address_space_translate) actually stand out?
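
Before getting to the profile: just to check that I understood your
suggestion about caching the ring translation, below is roughly what I
have in mind. It is only a rough, untested sketch: VRingCache and the
vring_cache_*() helpers are names I made up, and I am assuming that
qemu_add_vm_change_state_handler() is the runstate notifier you were
referring to; address_space_map()/unmap() and the MemoryListener commit
callback are the hooks you mentioned.

/*
 * Rough, untested sketch. VRingCache and the vring_cache_*() helpers
 * are made-up names; only address_space_map()/unmap(), the
 * MemoryListener commit callback and qemu_add_vm_change_state_handler()
 * are existing APIs.
 */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "sysemu/sysemu.h"

typedef struct VRingCache {
    AddressSpace *as;
    hwaddr ring_pa;          /* guest-physical address of the ring */
    hwaddr ring_len;         /* bytes covered by the mapping */
    void *ring;              /* cached host pointer, NULL when unmapped */
    MemoryListener listener;
} VRingCache;

/* Drop the cached mapping; unmapping with is_write=true also marks the
 * pages dirty, which is what migration needs. */
static void vring_cache_invalidate(VRingCache *c)
{
    if (c->ring) {
        address_space_unmap(c->as, c->ring, c->ring_len, true, c->ring_len);
        c->ring = NULL;
    }
}

/* Return the host pointer for the ring, remapping lazily if needed. */
static void *vring_cache_map(VRingCache *c)
{
    hwaddr len = c->ring_len;

    if (!c->ring) {
        c->ring = address_space_map(c->as, c->ring_pa, &len, true);
        if (c->ring && len < c->ring_len) {
            /* Ring straddles two RAM blocks: use the slow path instead. */
            address_space_unmap(c->as, c->ring, len, true, 0);
            c->ring = NULL;
        }
    }
    return c->ring;
}

/* The memory map changed (e.g. hot-unplug): blow away the mapping. */
static void vring_cache_commit(MemoryListener *listener)
{
    vring_cache_invalidate(container_of(listener, VRingCache, listener));
}

/* Unmap before migration/savevm walk the dirty bitmap; the next call to
 * vring_cache_map() re-maps once the VM runs again. */
static void vring_cache_vm_state(void *opaque, int running, RunState state)
{
    if (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_SAVE_VM) {
        vring_cache_invalidate(opaque);
    }
}

static void vring_cache_init(VRingCache *c, AddressSpace *as,
                             hwaddr ring_pa, hwaddr ring_len)
{
    c->as = as;
    c->ring_pa = ring_pa;
    c->ring_len = ring_len;
    c->ring = NULL;
    c->listener = (MemoryListener) { .commit = vring_cache_commit };
    memory_listener_register(&c->listener, as);
    qemu_add_vm_change_state_handler(vring_cache_vm_state, c);
}

virtio.c would then call vring_cache_map() instead of translating on
every access, and fall back to the current slow path whenever it returns
NULL. Does this match what you had in mind?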
I'm so sorry, I forgot to answer your question about the profile. This
is what perf top shows while running the experiment:

 12.35%  qemu-system-x86_64  [.] address_space_map
 10.87%  qemu-system-x86_64  [.] vring_desc_read.isra.0
  7.50%  qemu-system-x86_64  [.] address_space_lduw_le
  6.32%  qemu-system-x86_64  [.] address_space_translate
  5.84%  qemu-system-x86_64  [.] address_space_translate_internal
  5.75%  qemu-system-x86_64  [.] phys_page_find
  5.74%  qemu-system-x86_64  [.] qemu_ram_block_from_host
  4.04%  qemu-system-x86_64  [.] address_space_stw_le
  4.02%  qemu-system-x86_64  [.] address_space_write
  3.33%  qemu-system-x86_64  [.] virtio_should_notify

So it seems most of the time is spent doing translations.

>
>> I'm not sure whether you are thinking that 5% is too much or too
>> little. To me it's too little, showing that most of the overhead is
>> somewhere else (e.g. memory translation, or backend processing). In
>> an ideal transmission system, most of the overhead should be spent on
>> copying, because it means that you successfully managed to suppress
>> notifications and translation overhead.
>
> On copying data, though---not on copying virtio descriptors. 5% for
> those is entirely wasted time.
>
> Also, note that I'm looking at disk I/O rather than networking, where
> there should be no copies at all.

In the experiment I'm doing there is a per-packet copy from the guest
memory to the netmap backend.

Cheers,
  Vincenzo

>
> Paolo
>
>>> Examples of missing optimizations in exec.c include:
>>>
>>> * caching enough information in RAM MemoryRegions to avoid the calls
>>>   to qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock
>>>   pointer);
>>>
>>> * adding an MRU cache to address_space_lookup_region.
>>>
>>> In particular, the former should be easy if you want to give it a
>>> try---easier than caching ring translations in virtio.c.
>>
>> Thank you so much for the insights :)

--
Vincenzo Maffione
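
P.S.: about the MRU cache for address_space_lookup_region, what I would
try first is as simple as the toy below. lookup_slow(), Section and
lookup_cached() are just placeholders to illustrate the idea, not the
real exec.c types or functions.

/* Toy illustration only: a one-entry MRU cache in front of a slow
 * lookup. Section and lookup_slow() are placeholders standing in for
 * the real MemoryRegionSection lookup in exec.c, not actual QEMU code. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

typedef struct Section {
    hwaddr start;        /* first address covered by this section */
    hwaddr size;         /* length of the section in bytes */
} Section;

/* Stand-in for the existing full lookup (the phys_page_find() path). */
Section *lookup_slow(hwaddr addr);

static Section *mru;     /* most recently returned section */

Section *lookup_cached(hwaddr addr)
{
    /* Hit: the address falls inside the cached section (the unsigned
     * subtraction also rejects addr < mru->start by wrapping around). */
    if (mru && addr - mru->start < mru->size) {
        return mru;
    }
    /* Miss: do the full lookup and remember the result. */
    mru = lookup_slow(addr);
    return mru;
}

/* Must be called whenever the memory map changes, e.g. from the same
 * MemoryListener commit callback as in the sketch above. */
void lookup_cache_invalidate(void)
{
    mru = NULL;
}

In exec.c the cached pointer would presumably have to live in the
AddressSpaceDispatch rather than in a global, and be dropped whenever
the dispatch is rebuilt, but the fast path itself should stay this small.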