On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> The emulated devices can run simultaneously with the guest, so
> we need to be careful with ordering of load and stores done by
> them to the guest system memory, which need to be observed in
> the right order by the guest operating system.
>
> This adds a barrier call to the basic DMA read/write ops which
> is currently implemented as a smp_mb(), but could be later
> improved for more fine grained control of barriers.
>
> Additionally, a _relaxed() variant of the accessors is provided
> to easily convert devices who would be performance sensitive
> and negatively impacted by the change.
>
> Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> ---
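For readers who don't have the patch in front of them, the change boils
down to roughly this shape (a sketch reconstructed from the description
above; names approximate, not the literal diff):

/*
 * Sketch only: the ordered accessor issues a barrier before touching
 * guest memory; the _relaxed() variant keeps the old unordered
 * behaviour for devices shown to be performance sensitive.  smp_mb()
 * is the full barrier mentioned in the changelog.
 */
static inline void dma_barrier(void)
{
    smp_mb();   /* full barrier for now, could be refined later */
}

static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
                                void *buf, dma_addr_t len,
                                DMADirection dir)
{
    dma_barrier();
    return dma_memory_rw_relaxed(dma, addr, buf, len, dir);
}

The point is simply that the ordered path is the default and a driver
has to opt out explicitly to skip the barrier.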
(Note to Rusty: The number I told you on ST is wrong, see below)

So I tried to do some performance measurements with that patch using
netperf on an x86 laptop (x220 with core i7). It's a bit tricky.

For example, if I just create a tap interface, give it a local IP on
the laptop and a different IP on the guest (ie, talking to a netserver
on the host basically from the guest via tap), the performance is
pretty poor and the numbers seem useless with and without the barrier.

So I did tests involving talking to a server on our gigabit network
instead. The baseline is the laptop without KVM talking to the server.
The TCP_STREAM test results are:

(The "*" at the beginning of the lines is something I added to
distinguish multi-line results on some tests)

MIGRATED TCP STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

* 87380  16384  16384    10.02     933.02
* 87380  16384  16384    10.03     908.64
* 87380  16384  16384    10.03     926.78
* 87380  16384  16384    10.03     919.73

It's a bit noisy; ideally I should do a point-to-point setup to an
otherwise idle machine, whereas here I'm picking up some general lab
network noise, but it gives us a pretty good baseline to begin with.

I have not managed to get any sensible result out of UDP_STREAM in
that configuration (ie, just the host laptop) for some reason; the
numbers look insane:

MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

*229376   65507   10.00      270468      0    14173.84
 126976           10.00          44                2.31
*229376   65507   10.00      266526      0    13967.32
 126976           10.00          41                2.15

So we don't have a good comparison baseline there, but we can still
compare KVM against itself with and without the barrier.

Now KVM. This is x86_64 running an ubuntu precise guest (I had the ISO
laying around), using the default setup, which appears to be an
emulated e1000. I've done some tests with slirp just to see how bad it
was, and it's bad enough to be irrelevant. The numbers below were thus
taken using a tap interface bridged to the host ethernet (which also
happens to be some kind of e1000).

For each test I've done 3 series of numbers:

 - Without the barrier added
 - With the barrier added to dma_memory_rw
 - With the barrier added to dma_memory_rw -and- dma_memory_map

First TCP_STREAM. The numbers are a bit noisy (I suspect somebody was
hammering the server machine while I was doing one of the tests), but
here's what I got once things appeared to have stabilized:

MIGRATED TCP STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

no barrier
* 87380  16384  16384    10.01     880.31
* 87380  16384  16384    10.01     876.73
* 87380  16384  16384    10.01     880.73
* 87380  16384  16384    10.01     878.63

barrier
* 87380  16384  16384    10.01     869.39
* 87380  16384  16384    10.01     864.99
* 87380  16384  16384    10.01     886.13
* 87380  16384  16384    10.01     872.90

barrier + map
* 87380  16384  16384    10.01     867.45
* 87380  16384  16384    10.01     868.51
* 87380  16384  16384    10.01     888.94
* 87380  16384  16384    10.01     888.19

As far as I can tell, it's all in the noise. I was about to concede a
small (1% ?) loss to the barrier until I ran the last 2 tests, and
then I stopped caring :-)
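(In code terms, the only difference between the last two sets of
numbers is whether the same barrier also sits in front of the
map-based path, i.e. roughly the following; this is a sketch only, and
dma_memory_map_unordered() is a made-up stand-in for the existing
mapping logic, not a real function:)

static inline void *dma_memory_map(DMAContext *dma, dma_addr_t addr,
                                   dma_addr_t *len, DMADirection dir)
{
    dma_barrier();   /* same smp_mb() as on the dma_memory_rw() path */
    return dma_memory_map_unordered(dma, addr, len, dir);
}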
With UDP_STREAM, we get something like this:

MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

no barrier
*229376   65507   10.00        5208      0      272.92
 126976           10.00        5208             272.92
*229376   65507   10.00        5447      0      285.44
 126976           10.00        5447             285.44
*229376   65507   10.00        5119      0      268.22
 126976           10.00        5119             268.22

barrier
*229376   65507   10.00        5326      0      279.06
 126976           10.00        5326             279.06
*229376   65507   10.00        5072      0      265.75
 126976           10.00        5072             265.75
*229376   65507   10.00        5282      0      276.78
 126976           10.00        5282             276.78

barrier + map
*229376   65507   10.00        5512      0      288.83
 126976           10.00        5512             288.83
*229376   65507   10.00        5571      0      291.94
 126976           10.00        5571             291.94
*229376   65507   10.00        5195      0      272.23
 126976           10.00        5195             272.23

So I think here too we're in the noise. In fact, that makes me want to
stick the barrier in map() as well (though see my other email about
using a flag to implement "relaxed" to avoid an explosion of
accessors).

Now, I suspect somebody needs to re-run those tests on HW that is
known to be more sensitive to memory barriers; it could be that my SB
i7 in 64-bit mode is just the best case scenario, and that some old
core 1 or 2 using a 32-bit lock instruction will suck a lot more.

In any case, it looks like the performance loss is minimal, if
measurable at all, and if there's a real concern with a given driver
we can always fix -that- driver to use more relaxed accessors.

Cheers,
Ben.
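P.S. For completeness, the flag idea mentioned above is roughly the
following shape (entirely a sketch, names made up): instead of growing
a parallel foo()/foo_relaxed() family for every accessor, the caller
passes an attribute and the barrier is skipped when relaxed ordering
is requested.

enum {
    DMA_ATTR_NONE    = 0,
    DMA_ATTR_RELAXED = (1 << 0),   /* caller takes care of ordering */
};

static inline int dma_memory_rw_attr(DMAContext *dma, dma_addr_t addr,
                                     void *buf, dma_addr_t len,
                                     DMADirection dir, unsigned attrs)
{
    if (!(attrs & DMA_ATTR_RELAXED)) {
        dma_barrier();
    }
    return dma_memory_rw_relaxed(dma, addr, buf, len, dir);
}

That way a device that turns out to be sensitive can be switched over
by changing its calls rather than by adding yet more accessors.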