On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> The emulated devices can run simultaneously with the guest, so
> we need to be careful with ordering of load and stores done by
> them to the guest system memory, which need to be observed in
> the right order by the guest operating system.
>
> This adds a barrier call to the basic DMA read/write ops which
> is currently implemented as a smp_mb(), but could be later
> improved for more fine grained control of barriers.
>
> Additionally, a _relaxed() variant of the accessors is provided
> to easily convert devices who would be performance sensitive
> and negatively impacted by the change.
>
> Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> ---
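For readers who don't have the patch in front of them, the change boils
down to roughly this shape (a sketch reconstructed from the description
above; names approximate, not the literal diff):

/*
 * Sketch only: the ordered accessor issues a barrier before touching
 * guest memory; the _relaxed() variant keeps the old unordered
 * behaviour for devices shown to be performance sensitive.  smp_mb()
 * is the full barrier mentioned in the changelog.
 */
static inline void dma_barrier(void)
{
    smp_mb();   /* full barrier for now, could be refined later */
}

static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
                                void *buf, dma_addr_t len,
                                DMADirection dir)
{
    dma_barrier();
    return dma_memory_rw_relaxed(dma, addr, buf, len, dir);
}

The point is simply that the ordered path is the default and a driver
has to opt out explicitly to skip the barrier.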
(Note to Rusty: The number I told you on ST is wrong, see below)

So I tried to do some performance measurements with that patch using
netperf on an x86 laptop (x220 with core i7). It's a bit tricky.

For example, if I just create a tap interface, give it a local IP on
the laptop and a different IP on the guest (ie, talking to a netserver
on the host basically from the guest via tap), the performance is
pretty poor and the numbers seem useless with and without the barrier.

So I did tests involving talking to a server on our gigabit network
instead. The baseline is the laptop without KVM talking to the server.
The TCP_STREAM test results are:

(The "*" at the beginning of the lines is something I added to
distinguish multi-line results on some tests)

MIGRATED TCP STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

* 87380  16384  16384    10.02     933.02
* 87380  16384  16384    10.03     908.64
* 87380  16384  16384    10.03     926.78
* 87380  16384  16384    10.03     919.73

It's a bit noisy; ideally I should do a point-to-point setup to an
otherwise idle machine, whereas here I'm picking up some general lab
network noise, but it gives us a pretty good baseline to begin with.

I have not managed to get any sensible result out of UDP_STREAM in
that configuration (ie, just the host laptop) for some reason; the
numbers look insane:

MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

*229376   65507   10.00      270468      0    14173.84
 126976           10.00          44                2.31
*229376   65507   10.00      266526      0    13967.32
 126976           10.00          41                2.15

So we don't have a good comparison baseline there, but we can still
compare KVM against itself with and without the barrier.

Now KVM. This is x86_64 running an ubuntu precise guest (I had the ISO
laying around), using the default setup, which appears to be an
emulated e1000. I've done some tests with slirp just to see how bad it
was, and it's bad enough to be irrelevant. The numbers below were thus
taken using a tap interface bridged to the host ethernet (which also
happens to be some kind of e1000).

For each test I've done 3 series of numbers:

 - Without the barrier added
 - With the barrier added to dma_memory_rw
 - With the barrier added to dma_memory_rw -and- dma_memory_map

First TCP_STREAM. The numbers are a bit noisy (I suspect somebody was
hammering the server machine while I was doing one of the tests), but
here's what I got once things appeared to have stabilized:

MIGRATED TCP STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

no barrier
* 87380  16384  16384    10.01     880.31
* 87380  16384  16384    10.01     876.73
* 87380  16384  16384    10.01     880.73
* 87380  16384  16384    10.01     878.63

barrier
* 87380  16384  16384    10.01     869.39
* 87380  16384  16384    10.01     864.99
* 87380  16384  16384    10.01     886.13
* 87380  16384  16384    10.01     872.90

barrier + map
* 87380  16384  16384    10.01     867.45
* 87380  16384  16384    10.01     868.51
* 87380  16384  16384    10.01     888.94
* 87380  16384  16384    10.01     888.19

As far as I can tell, it's all in the noise. I was about to concede a
small (1% ?) loss to the barrier until I ran the last 2 tests, and
then I stopped caring :-)
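(In code terms, the only difference between the last two sets of
numbers is whether the same barrier also sits in front of the
map-based path, i.e. roughly the following; this is a sketch only, and
dma_memory_map_unordered() is a made-up stand-in for the existing
mapping logic, not a real function:)

static inline void *dma_memory_map(DMAContext *dma, dma_addr_t addr,
                                   dma_addr_t *len, DMADirection dir)
{
    dma_barrier();   /* same smp_mb() as on the dma_memory_rw() path */
    return dma_memory_map_unordered(dma, addr, len, dir);
}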
With UDP_STREAM, we get something like this:

MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

no barrier
*229376   65507   10.00        5208      0      272.92
 126976           10.00        5208             272.92
*229376   65507   10.00        5447      0      285.44
 126976           10.00        5447             285.44
*229376   65507   10.00        5119      0      268.22
 126976           10.00        5119             268.22

barrier
*229376   65507   10.00        5326      0      279.06
 126976           10.00        5326             279.06
*229376   65507   10.00        5072      0      265.75
 126976           10.00        5072             265.75
*229376   65507   10.00        5282      0      276.78
 126976           10.00        5282             276.78

barrier + map
*229376   65507   10.00        5512      0      288.83
 126976           10.00        5512             288.83
*229376   65507   10.00        5571      0      291.94
 126976           10.00        5571             291.94
*229376   65507   10.00        5195      0      272.23
 126976           10.00        5195             272.23

So I think here too we're in the noise. In fact, that makes me want to
stick the barrier in map() as well (though see my other email about
using a flag to implement "relaxed" to avoid an explosion of
accessors).

Now, I suspect somebody needs to re-run those tests on HW that is
known to be more sensitive to memory barriers; it could be that my SB
i7 in 64-bit mode is just the best case scenario, and that some old
core 1 or 2 using a 32-bit lock instruction will suck a lot more.

In any case, it looks like the performance loss is minimal, if
measurable at all, and if there's a real concern with a given driver
we can always fix -that- driver to use more relaxed accessors.

Cheers,
Ben.
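P.S. For completeness, the flag idea mentioned above is roughly the
following shape (entirely a sketch, names made up): instead of growing
a parallel foo()/foo_relaxed() family for every accessor, the caller
passes an attribute and the barrier is skipped when relaxed ordering
is requested.

enum {
    DMA_ATTR_NONE    = 0,
    DMA_ATTR_RELAXED = (1 << 0),   /* caller takes care of ordering */
};

static inline int dma_memory_rw_attr(DMAContext *dma, dma_addr_t addr,
                                     void *buf, dma_addr_t len,
                                     DMADirection dir, unsigned attrs)
{
    if (!(attrs & DMA_ATTR_RELAXED)) {
        dma_barrier();
    }
    return dma_memory_rw_relaxed(dma, addr, buf, len, dir);
}

That way a device that turns out to be sensitive can be switched over
by changing its calls rather than by adding yet more accessors.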