On 11/04/2013 16:57, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
> 
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
> 
> This analysis is not taking into account the cycles it takes to
> map the pages before they are checked for duplicate bytes,

Do you mean the TLB misses?

> regardless whether or not very little of the page is actually
> cached on the processor.
> 
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().

is_dup_page() is memory-bound, not CPU-bound.  Note that is_dup_page()
only reads about 1% of the memory it scans (one 32-byte cache line out
of each 4096-byte page), since a non-duplicate page is normally rejected
within its first cache line.  Scanning at 30 GB/s thus only requires
reading about 250 MB/s from memory over the FSB.
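
To make the arithmetic explicit, here is a rough sketch.  The function
name is made up for illustration and is not part of QEMU; the 32-byte
line and 4096-byte page sizes are the figures used above.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a QEMU function): bytes per second that must
 * actually be read from memory when each scanned page is rejected after
 * touching only its first cache line.
 */
static double early_exit_read_rate(double scan_bytes_per_sec,
                                   double line_bytes,
                                   double page_bytes)
{
    assert(page_bytes > 0.0 && line_bytes <= page_bytes);
    /* Only line_bytes out of every page_bytes scanned are really read. */
    return scan_bytes_per_sec * (line_bytes / page_bytes);
}
```

With 30e9 bytes/s, 32 and 4096 this gives 234375000 bytes/s, i.e. a bit
under the 250 MB/s quoted above, and 32/4096 is about 0.8%.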

> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.

These are the real world scenarios that I was talking about.  Do you
have profiles of these, with the latest QEMU code, that show
is_dup_page() to be expensive?

We could try prefetching the first cache line *of the next page* before
running is_dup_page().  There are a lot of things to test before giving
up and inventing a new API.
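
Something along these lines is what I have in mind.  This is only an
untested sketch: the sketch_* names are invented, the duplicate check is
a simplified stand-in for the real is_dup_page(), the pages are assumed
to be contiguous and 4096 bytes each, and __builtin_prefetch assumes GCC
or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed page size */

/* Simplified stand-in for is_dup_page(): true if every byte of the
 * page equals its first byte. */
static int sketch_is_dup_page(const uint8_t *page)
{
    uint64_t first, word;
    size_t i;

    memset(&first, page[0], sizeof(first));
    for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(word)) {
        memcpy(&word, page + i, sizeof(word));
        if (word != first) {
            return 0;   /* early exit at the first mismatching word */
        }
    }
    return 1;
}

/* Count duplicate pages in a contiguous buffer, prefetching the start
 * of the next page before checking the current one. */
static size_t sketch_count_dup_pages(const uint8_t *buf, size_t npages)
{
    size_t i, dups = 0;

    assert(buf != NULL || npages == 0);
    for (i = 0; i < npages; i++) {
        const uint8_t *page = buf + i * SKETCH_PAGE_SIZE;

        if (i + 1 < npages) {
            /* Hint only: read access, low temporal locality. */
            __builtin_prefetch(page + SKETCH_PAGE_SIZE, 0, 0);
        }
        dups += sketch_is_dup_page(page);
    }
    return dups;
}
```

Whether the prefetch helps at all is exactly what would need to be
measured; it is a hint and the hardware prefetcher may already cover
the sequential case.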

Paolo
