> pages will be sent. Before that, during the migration setup,
> ioctl(KVM_GET_DIRTY_LOG) is called once, so the kernel begins to produce
> the dirty bitmap from that moment. When pages that haven't been sent yet
> are written, the kernel marks them as dirty. However, I don't think this
> is correct, because these pages will be sent with the same content during
> this and the next iteration (if they are not written again after they are
> sent). It only makes sense to mark a page as dirty when it is written
> after it has already been sent within an iteration.
>
> Am I right about this? If so, is there some advice on how to improve it?

> > > > > > I think you're right that this can happen; to clarify, I think
> > > > > > the case you're talking about is:
> > > > > >
> > > > > > Iteration 1
> > > > > >     sync bitmap
> > > > > >     start sending pages
> > > > > >     page 'n' is modified - but hasn't been sent yet
> > > > > >     page 'n' gets sent
> > > > > > Iteration 2
> > > > > >     sync bitmap
> > > > > >     page 'n' is shown as modified
> > > > > >     send page 'n' again
> > > > >
> > > > > Yes, this is exactly the case I am talking about.
> > > > >
> > > > > > So you're right that it's wasteful; I guess it's more wasteful
> > > > > > on big VMs with slow networks, where the length of each
> > > > > > iteration is large.
> > > > >
> > > > > I think this is *very* wasteful. Assume the workload dirties pages
> > > > > randomly within the guest address space and the transfer speed is
> > > > > constant. Intuitively, nearly half of the dirty pages produced in
> > > > > Iteration 1 are not really dirty. This means Iteration 2 takes
> > > > > roughly twice as long as it would to send only the really dirty
> > > > > pages.
> > > >
> > > > It makes sense; can you get some perf numbers to show what kinds of
> > > > workloads get impacted the most?
> > > > That would also help us to figure out what kinds of speed
> > > > improvements we can expect.
> > > >
> > > > 		Amit
> > >
> > > I have picked up 6 workloads and got the following statistics for
> > > every iteration (except the last stop-copy one) during precopy.
> > > These numbers were obtained with basic precopy migration, without
> > > capabilities like xbzrle or compression, etc. The network for the
> > > migration is exclusive, with a separate network for the workloads;
> > > both are gigabit Ethernet. I use qemu-2.5.1.
> > >
> > > Three of them (booting, idle, web server) converged to the stop-copy
> > > phase with the given bandwidth and default downtime (300ms), while
> > > the other three (kernel compilation, zeusmp, memcached) did not.
> > >
> > > A page is "not-really-dirty" if it is written first and sent later
> > > (and not written again after that) during one iteration. I guess
> > > this happens much more often during the 1st iteration than during
> > > the others, because all the pages of the VM are sent to the dest
> > > node during the 1st iteration, while during the others only part of
> > > the pages are sent. So I think the "not-really-dirty" pages are
> > > produced mainly during the 1st iteration, and perhaps very few
> > > during the other iterations.
> > >
> > > If we could avoid resending the "not-really-dirty" pages,
> > > intuitively the time spent on Iteration 2 would be halved. This is a
> > > chain reaction, because the dirty pages produced during Iteration 2
> > > are halved, which in turn halves the time spent on Iteration 3, then
> > > Iteration 4, 5...
> >
> > Yes; these numbers don't show how many of them are false dirty,
> > though.
> >
> > One problem is thinking about pages that have been redirtied: if the
> > page is dirtied after the sync but before the network write, then
> > it's the false-dirty you're describing.
> > However, if the page is written a few times, so that it would also
> > have been written after the network write, then it isn't a false
> > dirty.
> >
> > You might be able to figure that out with some kernel tracing of when
> > the dirtying happens, but it might be easier to write the fix!
> >
> > Dave
>
> Hi, I have made some new progress now.
>
> To tell exactly how many false dirty pages there are in each iteration,
> I malloc a buffer as big as the whole VM memory. When a page is
> transferred to the dest node, it is copied into the buffer; during the
> next iteration, if a page is transferred again, it is compared to the
> old copy in the buffer, and the old copy is replaced for the next
> comparison if the page is really dirty. Thus we can now get the exact
> number of false dirty pages.
>
> This time, I use 15 workloads to get the statistics. They are:
>
> 1. 11 benchmarks picked from the cpu2006 benchmark suite. They are all
>    scientific computing workloads such as Quantum Chromodynamics, Fluid
>    Dynamics, etc. I picked these 11 because, compared to the others,
>    they have a larger memory footprint and a higher memory dirty rate;
>    thus most of them could not converge to stop-and-copy at the default
>    migration speed (32MB/s).
> 2. kernel compilation
> 3. idle VM
> 4. Apache web server serving static content
>
> (The above workloads all run in a VM with 1 vcpu and 1GB memory, and
> the migration speed is the default 32MB/s.)
>
> 5. Memcached. The VM has 6 cpu cores and 6GB memory, of which 4GB are
>    used as the cache. After filling the 4GB cache, a client writes the
>    cache at a constant speed during migration. This time the migration
>    speed has no limit and is up to the capability of the 1Gbps
>    Ethernet.
>
> Summarize the results first (the precise numbers are below):
>
> 1.
>    4 of these 15 workloads have a big proportion (>60%, even >80%
>    during some iterations) of false dirty pages out of all the dirty
>    pages from iteration 2 on (and the big proportion persists through
>    the following iterations). They are cpu2006.zeusmp, cpu2006.bzip2,
>    cpu2006.mcf, and memcached.
> 2. 2 workloads (idle, webserver) spend most of the migration time on
>    iteration 1; even though the proportion of false dirty pages is big
>    from iteration 2 on, the room to optimize is small.
> 3. 1 workload (kernel compilation) only has a big proportion during
>    iteration 2, not in the other iterations.
> 4. 8 workloads (the other 8 cpu2006 benchmarks) have a small proportion
>    of false dirty pages from iteration 2 on, so the room to optimize
>    them is also small.
>
> Now I want to say a little more about why false dirty pages are
> produced. The first reason is what we discussed before: the mechanism
> used to track dirty pages. I have also come up with another reason.
> Consider this situation: a write to a memory page happens, but it does
> not change any content of the page. It is a "write but not dirty", yet
> the kernel still marks the page dirty. Someone in our lab has run
> experiments with the cpu2006 benchmark suite to figure out the
> proportion of "write but not dirty" operations. According to his
> results, most workloads have a small proportion (<10%) of "write but
> not dirty" out of all write operations, while a few workloads have a
> higher proportion (one even as high as 50%). We are not yet sure why
> "write but not dirty" happens; it just does.
>
> So these two reasons contribute to the false dirty pages. To optimize,
> I compute and store the SHA1 hash before transferring each page. Next
> time, if a page needs retransmission, its SHA1 hash is computed again
> and compared to the old hash.
If the hash is > the same, it's a > false dirty page, and we just skip this page; Otherwise, the page is > transferred, and the new > hash replaces the old one for next comparison. > The reason to use SHA1 hash but not byte-by-byte comparison is the > memory overheads. One SHA1 > hash is 20 bytes. So we need extra 20/4096 (<1/200) memory space of the > whole VM memory, which > is relatively small. > As far as I know, SHA1 hash is widely used in the scenes of deduplication for > backup systems. > They have proven that the probability of hash collision is far smaller than > disk > hardware fault, > so it's secure hash, that is, if the hashes of two chunks are the same, the > content must be the > same. So I think the SHA1 hash could replace byte-to-byte comparison in the > VM memory scenery. > > Then I do the same migration experiments using the SHA1 hash. For the 4 > workloads which have > big proportions of false dirty pages, the improvement is remarkable. Without > optimization, > they either can not converge to stop-and-copy, or take a very long time to > complete. With the > SHA1 hash method, all of them now complete in a relatively short time. > For the reason I have talked above, the other workloads don't get notable > improvements from the > optimization. So below, I only show the exact number after optimization for > the 4 workloads with > remarkable improvements. > > Any comments or suggestions? >
It seems the existing XBZRLE feature could be used to solve the false
dirty issue, no?

Liang