On Sun, Nov 04, 2012 at 08:17:29PM +0100, Juan Quintela wrote: > David Gibson <d...@au1.ibm.com> wrote: > > On Fri, Nov 02, 2012 at 11:58:32AM +0100, Juan Quintela wrote: > >> David Gibson <d...@au1.ibm.com> wrote: > >> > On Wed, Oct 31, 2012 at 01:08:16PM +0200, Orit Wasserman wrote: > >> >> On 10/31/2012 05:43 AM, David Gibson wrote: > >> > >> Reviewed-by: Juan Quintela <quint...@redhat.com> > >> > >> Good catch, I missunderstood the function when fixing a different bug, > >> and never undrestood why it fixed it. > > > > Actually.. it just occurred to me that I think there has to be another > > bug here somewhere.. > > I am at KVM Forum and LinuxCon for this week, so I can't test anything. > > For some reason, I missunderstood bitmap_set() and though this was the > value that we "initiliazed" the bitmap with. So, I changed from 0 to 1, > and ..... I was sending half the pages over the wire. Yes, that is > right, just half of them. So clearly we have some bug somewhere > else :-()
Ah. > > I haven't actually observed any effects from the memory corruption - > > though it's certainly a real bug. I found this because another effect > > of this bug is that migration_dirty_pages count was set to 1 more than > > the actual number of dirty bits in the bitmap. That meant the dirty > > pages count was never reaching zero and so the migration/savevm never > > terminated. > > I wonder what is on page 0 on an x86, problably some BIOS data that > never changes? No clue about pseries. On pseries it will be exception vectors. So it will change on the firmware->kernel transition. And I think there may be a very few special variables in there, so it will change later. Note that the migration_dirty_pages count will remain mismatched, whether or not page 0 gets touched. > > Except.. that every so often the migration *did* terminate (maybe 1 > > time in 5). Also I kind of hope somebody would have noticed this > > earlier if migrations never terminated on x86 too. But as far as I > > can tell, if initially mismatched like this it ought to be impossible > > for the dirty page count to ever reach zero. Which suggests there is > > another bug with the dirty count tracking :(. > > We use the dirty bitmap count to know how many pages are dirty, but once > that the number is low enough, we just sent "the rest" of the pages. > So, it would always converge (or not) independent of that bug by one. > We never test for zero dirty pages, we test "are we able to send this > many pages" over "max_downtime". So this explains why it works for you > sometimes. Hrm. It explains why it works sometimes (phew the bug I though must be there needn't be). It doesn't explain why it mostly didn't - I actually saw it spinning there on ram_save_iterate(), not completing with one dirty page remaining. But t > > It's possible the memory corruption could account for this, of course > > - since that in theory at least, could have almost any strange effect > > on the program's behavior. But that doesn't seem particularly likely > > to me. > > This depends on _what_ is on page zero, if that is differente for > whatever we put there during boot, and it we ever wrote to that page > again, we would mark that pge dirty anyways, so I would put that > "corruption" problem as highly improbable. Not that we shouldn't fix > the bug, but I doubt that you are getting memory corruption due to this > bug. > > The only way that you can get memory corruption is if you write to that > page just before you do migration, and then you never wrote to it again. > What is on hardware page zero on pseries? or it is just a normal page? You misunderstand. I wasn't talking about potential corruption of guest memory because we fail to transmit page 0 (though that is also a bug). I'm talking about corruption of qemu internal memory because the bitmap_set() writes one 1 bit beyond the end of the space allocated for the migration bitmap - it has no kind of bounds checking. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson