On 11/04/2013 16:57, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
>
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
>
> This analysis is not factoring into account the cycles it takes to
> map the pages before they are checked for duplicate bytes,

Do you mean the TLB misses?

> regardless whether or not very little of the page is actually
> cached on the processor.
>
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().

is_dup_page() is memory-bound, not CPU-bound.  Note that is_dup_page()
only needs about 1% of the bandwidth it scans (32 bytes for a cache line
out of 4096 bytes/page).  Scanning 30 GB/s only requires reading roughly
250 MB/s from memory to the FSB.

> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.
> These are the real world scenarios that I was talking about.

Do you have profiles of these, with the latest QEMU code, that show
is_dup_page() to be expensive?

We could try prefetching the first cache line *of the next page* before
running is_dup_page().  There are a lot of things to test before giving
up and inventing a new API.

Paolo
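
To make the two points above concrete, here is a rough sketch.  It is
not QEMU's code: page_is_dup() is a simplified stand-in for
is_dup_page(), and first_chunk_traffic(), count_dup_pages() and the two
constants are hypothetical names made up for illustration.  It assumes
4096-byte pages, 32 bytes read before a non-duplicate page is rejected,
and a GCC/Clang compiler for __builtin_prefetch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096u /* assumed page size */
#define FIRST_CHUNK_BYTES 32u   /* assumed bytes read before early exit */

/* Memory traffic (bytes/s) needed to scan scan_rate bytes/s of pages
 * when only the first FIRST_CHUNK_BYTES of each page are read. */
static uint64_t first_chunk_traffic(uint64_t scan_rate)
{
    return scan_rate * FIRST_CHUNK_BYTES / PAGE_SIZE_BYTES;
}

/* Simplified stand-in for is_dup_page(): true if every byte of the
 * page equals the first byte.  Exits at the first mismatch. */
static bool page_is_dup(const uint8_t *page)
{
    for (size_t i = 1; i < PAGE_SIZE_BYTES; i++) {
        if (page[i] != page[0]) {
            return false;
        }
    }
    return true;
}

/* Count duplicate pages in buf, hinting the CPU to fetch the start of
 * the next page before the current page is checked. */
static size_t count_dup_pages(const uint8_t *buf, size_t npages)
{
    size_t dups = 0;

    assert(buf != NULL || npages == 0);
    for (size_t i = 0; i < npages; i++) {
        const uint8_t *page = buf + (size_t)i * PAGE_SIZE_BYTES;

        if (i + 1 < npages) {
            /* Read prefetch, low temporal locality (GCC/Clang builtin). */
            __builtin_prefetch(page + PAGE_SIZE_BYTES, 0, 0);
        }
        if (page_is_dup(page)) {
            dups++;
        }
    }
    return dups;
}
```

With these assumptions, first_chunk_traffic() for a 30 GB/s scan rate
gives 234375000 bytes/s, i.e. the "roughly 250 MB/s" figure.  Whether
the prefetch actually helps is exactly what would need profiling.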