On 16/12/2015 15:25, Vincenzo Maffione wrote:
>> vhost-net actually had better performance, so virtio-net dataplane
>> was never committed. As Michael mentioned, in practice on Linux you
>> use vhost, and with non-Linux hypervisors you do not use QEMU. :)
>
> Yes, I understand. However, another possible use case would be using
> QEMU + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed
> packet generators or packet processors, where very high packet rates
> are common), where it is not possible to use vhost.
Yes, of course. That was tongue in cheek. Another possibility for your
use case is to interface with netmap through vhost-user, but I'm happy
if you choose to improve virtio.c instead!

>> The main optimization that vring.c has is to cache the translation of
>> the rings. Using address_space_map/unmap for the rings in virtio.c
>> would be a noticeable improvement, as your numbers for patch 3 show.
>> However, by caching translations you also conveniently "forget" to
>> promptly mark the pages as dirty. As you pointed out, this is
>> obviously an issue for migration. You can then add a notifier for
>> runstate changes. When entering RUN_STATE_FINISH_MIGRATE or
>> RUN_STATE_SAVE_VM the rings would be unmapped, and then remapped the
>> next time the VM starts running again.
>
> Ok, so it seems feasible with a bit of care. The numbers we've been
> seeing in various experiments have always shown that this optimization
> could easily double the 2 Mpps packet rate bottleneck.

Cool. Bonus points for nicely abstracting it so that virtio.c is just a
user.

>> You also guessed right that there are consistency issues; for these
>> you can add a MemoryListener that invalidates all mappings.
>
> Yeah, but I don't know exactly what kind of inconsistencies there can
> be. Maybe the memory we are mapping may be hot-unplugged?

Yes. Just blow away all mappings in the MemoryListener commit callback.
(A rough sketch of both the runstate notifier and the MemoryListener is
at the end of this message.)

>> That said, I'm wondering where the cost of address translation lies:
>> is it cache-unfriendly data structures, locked operations, or simply
>> too much code to execute? It was quite surprising to me that on
>> virtio-blk benchmarks we were spending 5% of the time doing memcpy!
>> (I have just extracted from my branch the patches to remove that, and
>> sent them to qemu-devel.)
>
> I feel it's just too much code (but I may be wrong).

That is likely to be a good guess, but notice that the fast path does
not actually have _that much_ code, because a lot of the "if"s are
almost always false. Looking at a profile would be useful. Is it flat,
or does something (e.g. address_space_translate) actually stand out?

> I'm not sure whether you are thinking that 5% is too much or too
> little. To me it's too little, showing that most of the overhead is
> somewhere else (e.g. memory translation, or backend processing). In an
> ideal transmission system, most of the overhead should be spent on
> copying, because it means that you successfully managed to suppress
> notifications and translation overhead.

On copying data, though, not on copying virtio descriptors. 5% for the
descriptors is entirely wasted time. Also, note that I'm looking at
disk I/O rather than networking, where there should be no copies at
all.

Paolo

>> Examples of missing optimizations in exec.c include:
>>
>> * caching enough information in RAM MemoryRegions to avoid the calls
>>   to qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock
>>   pointer);
>>
>> * adding an MRU cache to address_space_lookup_region.
>>
>> In particular, the former should be easy if you want to give it a
>> try; it is easier than caching ring translations in virtio.c.
>
> Thank you so much for the insights :)
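To make the cached-translation idea concrete, here is a rough sketch,
for illustration only: the VRingCache structure and the vring_cache_*
helpers are invented names, not existing QEMU code. Only
address_space_map/unmap, qemu_add_vm_change_state_handler and
memory_listener_register are real interfaces, and their exact argument
lists have changed slightly across QEMU versions.

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "sysemu/sysemu.h"

/* One cached guest-physical -> host mapping for a vring. */
typedef struct VRingCache {
    AddressSpace *as;
    hwaddr gpa;                 /* guest-physical address of the ring */
    hwaddr len;                 /* size of the ring in bytes */
    void *hva;                  /* cached host pointer, NULL if unmapped */
    MemoryListener listener;
} VRingCache;

/* Map the ring once and keep the translation around. */
static void *vring_cache_get(VRingCache *c)
{
    if (!c->hva) {
        hwaddr len = c->len;
        void *p = address_space_map(c->as, c->gpa, &len, true);

        if (!p) {
            return NULL;        /* caller falls back to the slow path */
        }
        if (len < c->len) {
            /* Ring straddles regions; do not cache, use the slow path. */
            address_space_unmap(c->as, p, len, true, 0);
            return NULL;
        }
        c->hva = p;
    }
    return c->hva;
}

/* Unmapping marks the pages dirty, so migration sees ring updates. */
static void vring_cache_drop(VRingCache *c)
{
    if (c->hva) {
        address_space_unmap(c->as, c->hva, c->len, true, c->len);
        c->hva = NULL;
    }
}

/* Runstate notifier: unmap before the VM's memory is saved. */
static void vring_cache_runstate_cb(void *opaque, int running,
                                    RunState state)
{
    VRingCache *c = opaque;

    if (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_SAVE_VM) {
        vring_cache_drop(c);
    }
    /* The first vring_cache_get() after the VM resumes remaps the ring. */
}

/* Memory map changed (e.g. hot-unplug): blow away the cached mapping. */
static void vring_cache_commit(MemoryListener *listener)
{
    VRingCache *c = container_of(listener, VRingCache, listener);

    vring_cache_drop(c);
}

static void vring_cache_init(VRingCache *c, AddressSpace *as,
                             hwaddr gpa, hwaddr len)
{
    c->as = as;
    c->gpa = gpa;
    c->len = len;
    c->hva = NULL;
    c->listener = (MemoryListener) { .commit = vring_cache_commit };
    memory_listener_register(&c->listener, as);
    qemu_add_vm_change_state_handler(vring_cache_runstate_cb, c);
}

virtio.c would then call vring_cache_get() in its hot path and fall
back to the existing address_space_* accessors whenever it returns
NULL.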
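For the exec.c side, the MRU idea for address_space_lookup_region()
could look roughly like the fragment below. It is only a sketch meant
to live inside exec.c: the mru_section field and the lookup_slow()
helper are invented names standing in for the existing radix-tree walk,
and a real version would also have to account for lookups running
concurrently under RCU.

/* Does this section contain addr?  A section whose size does not fit
 * in 64 bits is taken to cover the entire address space. */
static bool section_covers(const MemoryRegionSection *s, hwaddr addr)
{
    if (int128_gethi(s->size)) {
        return true;
    }
    return addr >= s->offset_within_address_space &&
           addr - s->offset_within_address_space < int128_getlo(s->size);
}

/* MRU-cached lookup: reuse the last section if the address still falls
 * inside it, otherwise fall back to the existing tree walk. */
static MemoryRegionSection *lookup_region_mru(AddressSpaceDispatch *d,
                                              hwaddr addr)
{
    MemoryRegionSection *section = d->mru_section;  /* hypothetical field */

    if (!section || !section_covers(section, addr)) {
        section = lookup_slow(d, addr);             /* hypothetical tree walk */
        d->mru_section = section;
    }
    return section;
}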